Handling imbalanced object dataset using SMOTE technique

1 year ago

#384768

Zeyad Tarek

I have an SDF file containing my training data. Each sample has 3 fields: two input features and, last, the output label.

I read the dataset with this function:

import numpy as np
from tqdm import tqdm

def read_sdf(file):
   with open(file, 'r') as rf:
       content = rf.read()
   samples = content.split('$$$$')

   def parse_sample(s):
       lines = s.splitlines()
       links = []
       nodes = []
       label = 0
       for l in lines:
           if l.strip() == '1.0':
               label = 1
           if l.strip() == '0.0':
               label = 0
           if l.startswith('    '):
               feature = l.split()
               node = feature[3]
               nodes.append(node)
           elif l.startswith(' '):
               lnk = l.split()
               # edge: (from, to,) (1-based index)
               if int(lnk[0]) - 1 < len(nodes):
                   links.append((
                       int(lnk[0])-1, 
                       int(lnk[1])-1, # zero-based index
                       # int(lnk[2]) ignore edge weight
                   ))
       return nodes, np.array(links), label

   # Skip empty records (e.g. the text after the final '$$$$')
   return [parse_sample(s) for s in tqdm(samples) if s.strip()]

training_set = np.array(read_sdf('../input/gcn-data/train.sdf'),dtype=object)

# Print the first sample from the dataset
print(training_set[0])

The output was:

[list(['S', 'O', 'O', 'O', 'O', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'])
 array([[ 0,  8],
        [ 0, 14],
        [ 1, 10],
        [ 2, 11],
        [ 3,  7],
        [ 4,  7],
        [ 5,  9],
        [ 5, 14],
        [ 6, 14],
        [ 6, 17],
        [ 7, 22],
        [ 8,  9],
        [ 8, 10],
        [ 9, 11],
        [10, 12],
        [11, 13],
        [12, 13],
        [12, 15],
        [13, 16],
        [15, 18],
        [16, 19],
        [17, 20],
        [17, 21],
        [18, 19],
        [20, 23],
        [21, 24],
        [22, 23],
        [22, 24]]), 0]
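To make the layout concrete: each sample is a triple of atom symbols, a zero-based edge list, and a label. The following is a minimal sketch, in plain Python, that rebuilds the first sample printed above and turns its edge list into an adjacency list. The helper name `build_adjacency` is hypothetical, and treating the bonds as undirected is an assumption.

```python
from collections import defaultdict

def build_adjacency(n_nodes, edges):
    """Hypothetical helper: map each node index to its neighbours.

    Assumes edges are undirected and use zero-based indices.
    """
    adjacency = defaultdict(list)
    for a, b in edges:
        if not (0 <= a < n_nodes and 0 <= b < n_nodes):
            raise ValueError(f"edge ({a}, {b}) points outside the node list")
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency

# First sample from the printed output: 25 atoms, 28 bonds, label 0.
nodes = ["S"] + ["O"] * 4 + ["N"] * 3 + ["C"] * 17
edges = [
    (0, 8), (0, 14), (1, 10), (2, 11), (3, 7), (4, 7), (5, 9),
    (5, 14), (6, 14), (6, 17), (7, 22), (8, 9), (8, 10), (9, 11),
    (10, 12), (11, 13), (12, 13), (12, 15), (13, 16), (15, 18),
    (16, 19), (17, 20), (17, 21), (18, 19), (20, 23), (21, 24),
    (22, 23), (22, 24),
]
label = 0

adjacency = build_adjacency(len(nodes), edges)
# Number of bonds attached to each atom.
degrees = {i: len(adjacency[i]) for i in range(len(nodes))}
print(len(nodes), len(edges), degrees[0])
```

Because the node list and edge list have different lengths from sample to sample, there is no fixed-length numeric vector per sample here, which matters for any method that interpolates between samples.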

My problem is that this dataset is imbalanced: it has 23806 samples with label 0 and only 1218 samples with label 1.
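For scale, the class counts above work out to roughly 19.5 negatives per positive. A small sketch of the arithmetic, including how many extra positive samples full balancing would require (the variable names are illustrative only):

```python
n_negative = 23806  # samples with label 0
n_positive = 1218   # samples with label 1

total = n_negative + n_positive
imbalance_ratio = n_negative / n_positive  # negatives per positive
positive_fraction = n_positive / total     # share of the minority class
n_to_add = n_negative - n_positive         # extra positives for a 1:1 balance

print(f"total samples:      {total}")
print(f"imbalance ratio:    {imbalance_ratio:.2f} : 1")
print(f"positive fraction:  {positive_fraction:.2%}")
print(f"positives to add:   {n_to_add}")
```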

So I tried to fix this with the SMOTE technique:

from imblearn.over_sampling import SMOTE

oversample = SMOTE()
training_set[:,0:-1],training_set[:,-1] = oversample.fit_resample(training_set[:,0:-1],training_set[:,-1])

But then I got the error below. I think it is because the two input features are of object type.

ValueError: Unknown label type: 'unknown'
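The message is consistent with the label column itself having dtype object, since slicing a column out of an object array keeps that dtype. The sketch below reproduces this with a made-up `toy_set` (all values are invented) and shows that casting the labels yields an integer array. This is only an assumption about the cause; even with integer labels, SMOTE would still need numeric, fixed-length features, which the node lists and edge arrays are not.

```python
import numpy as np

# Made-up stand-in for training_set: (atom symbols, edges, label) per row.
rows = [
    (["C", "O"], [(0, 1)], 0),
    (["C", "N", "C"], [(0, 1), (1, 2)], 1),
    (["S", "C"], [(0, 1)], 0),
]

# Fill an object array cell by cell so each cell holds one Python object.
toy_set = np.empty((len(rows), 3), dtype=object)
for i, row in enumerate(rows):
    for j, value in enumerate(row):
        toy_set[i, j] = value

labels_obj = toy_set[:, -1]          # still dtype=object after slicing
labels_int = labels_obj.astype(int)  # explicit cast to an integer dtype

print(labels_obj.dtype, labels_int.dtype)
```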

Is there a way to oversample this dataset?

Edit 1: There is no need to read through the read_sdf function. It only reads the SDF file, and it is not the source of the problem.
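One possible workaround, sketched under assumptions: random oversampling with replacement, which is not SMOTE. It duplicates minority samples instead of interpolating between them, so it does not need numeric features and can work directly on the (nodes, links, label) tuples that read_sdf returns. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until every class matches the largest.

    `samples` is assumed to be a list of (nodes, links, label) tuples.
    This is plain random oversampling, not SMOTE: no new graphs are created.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[-1], []).append(sample)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing number of samples with replacement.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy data: 6 negatives and 2 positives.
toy = [(["C"], [], 0)] * 6 + [(["N"], [], 1)] * 2
balanced = random_oversample(toy)
counts = Counter(sample[-1] for sample in balanced)
print(counts)
```

If a validation split is used, it is usually safer to oversample only the training part after splitting, so that duplicated samples do not appear on both sides.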

python

python-3.x

machine-learning

deep-learning

imbalanced-data

0 Answers
