1 year ago
#384768
Zeyad Tarek
Handling an imbalanced object dataset using the SMOTE technique
I have an SDF file containing my training data. Each sample has 3 features, and the last one is my output label.
I read the dataset with this function:
import numpy as np
from tqdm import tqdm

def read_sdf(file):
    with open(file, 'r') as rf:
        content = rf.read()
    samples = content.split('$$$$')

    def parse_sample(s):
        lines = s.splitlines()
        links = []
        nodes = []
        label = 0
        for l in lines:
            if l.strip() == '1.0':
                label = 1
            if l.strip() == '0.0':
                label = 0
            if l.startswith('    '):  # atom line (indent width assumed)
                feature = l.split()
                node = feature[3]
                nodes.append(node)
            elif l.startswith(' '):  # bond line
                lnk = l.split()
                # edge: (from, to,) (1-based index)
                if int(lnk[0]) - 1 < len(nodes):
                    links.append((
                        int(lnk[0])-1,
                        int(lnk[1])-1,  # zero-based index
                        # int(lnk[2]) ignore edge weight
                    ))
        return nodes, np.array(links), label

    return [parse_sample(s) for s in tqdm(samples) if len(s[0]) > 0]

training_set = np.array(read_sdf('../input/gcn-data/train.sdf'), dtype=object)
# print the first sample from the dataset
print(training_set[0])
The output was:
[list(['S', 'O', 'O', 'O', 'O', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'])
array([[ 0, 8],
[ 0, 14],
[ 1, 10],
[ 2, 11],
[ 3, 7],
[ 4, 7],
[ 5, 9],
[ 5, 14],
[ 6, 14],
[ 6, 17],
[ 7, 22],
[ 8, 9],
[ 8, 10],
[ 9, 11],
[10, 12],
[11, 13],
[12, 13],
[12, 15],
[13, 16],
[15, 18],
[16, 19],
[17, 20],
[17, 21],
[18, 19],
[20, 23],
[21, 24],
[22, 23],
[22, 24]]), 0]
My problem is that this dataset is imbalanced: it has 23806 samples of class 0 and 1218 samples of class 1.
So I tried to solve this using the SMOTE technique:
from imblearn.over_sampling import SMOTE

oversample = SMOTE()
training_set[:,0:-1],training_set[:,-1] = oversample.fit_resample(training_set[:,0:-1],training_set[:,-1])
But then I got the following error, and I think it's because the two input features are of object type:
ValueError: Unknown label type: 'unknown'
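One way to narrow the error down is to look at the label column on its own. In an object-dtype array like `training_set`, the slice `training_set[:, -1]` is itself object-dtype, and scikit-learn's target-type check (which imbalanced-learn relies on) appears to report object-dtype labels that are not strings as 'unknown'. The sketch below uses a small made-up `toy_set` (hypothetical, not the real data) to show the dtype of that column and the effect of an integer cast.

```python
import numpy as np

# Toy stand-in for training_set (hypothetical data):
# each row is (nodes, links, label), stored as Python objects.
rows = [
    (['C', 'O'], np.array([[0, 1]]), 0),
    (['N', 'C', 'C'], np.array([[0, 1], [1, 2]]), 1),
    (['C', 'C'], np.array([[0, 1]]), 0),
]
toy_set = np.empty((len(rows), 3), dtype=object)
for i, row in enumerate(rows):
    for j, value in enumerate(row):
        toy_set[i, j] = value

labels = toy_set[:, -1]
print(labels.dtype)             # object

# Casting gives the labels a plain integer dtype.
int_labels = labels.astype(int)
print(int_labels.dtype.kind)    # i
print(np.bincount(int_labels))  # [2 1]
```

Casting the labels would likely clear this particular message, but SMOTE interpolates between fixed-length numeric feature vectors, so it still would not apply directly to the variable-length `nodes` and `links` columns.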
Is there any way to oversample this dataset?
Edit 1: Don't bother reading and understanding the read_sdf function. It only reads the SDF file, and there isn't any problem with it.
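Because SMOTE needs numeric feature vectors, a simpler option for graph-like samples may be random oversampling, which duplicates minority-class samples instead of synthesizing new ones. The following is a minimal sketch of that swapped-in technique, not SMOTE itself. The function name `random_oversample` and the toy data are hypothetical, and each sample is assumed to keep its label in the last position, as in `training_set`.

```python
import random
from collections import Counter

def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until every class matches the largest one."""
    rng = random.Random(seed)

    # Group samples by their label (assumed to be the last field).
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[-1], []).append(sample)

    target = max(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies with replacement to reach the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): 6 samples of class 0, 2 samples of class 1.
toy = [(['C', 'O'], [(0, 1)], 0)] * 6 + [(['N', 'C', 'C'], [(0, 1), (1, 2)], 1)] * 2
balanced = random_oversample(toy)
counts = Counter(sample[-1] for sample in balanced)
print(counts[0], counts[1])  # 6 6
```

Duplicated samples add no new information, so this should be applied to the training split only, after any validation split has been set aside.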
python
python-3.x
machine-learning
deep-learning
imbalanced-data
0 Answers