1 year ago

#332431


JPJW

Deduplication and Replacement Using Fuzzywuzzy

I'm trying to count how many times each organization was cited, but I keep running into this problem:

('The Regents Of The University Of California', 468), (' The Regents Of The University Of California', 64)

The two entries are clearly the same organization (the second differs only by a leading space), but I can't clean the data well enough to merge their counts. I came across the fuzzywuzzy library, but I can't get it to work, and I don't understand how it processes the data.

I need to replace the variant spellings with a single name so that the counts come out correct.
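For the specific pair shown above, fuzzy matching may not even be needed, since the only difference is whitespace. A minimal sketch, assuming the names are plain strings; the sample list and the `normalize` helper are hypothetical stand-ins, not part of my real data:

```python
from collections import Counter

# Hypothetical sample standing in for the real assignee column.
raw_assignee = (
    ["The Regents Of The University Of California"] * 3
    + [" The Regents Of The University Of California"] * 2
    + ["Acme Inc"]
)

def normalize(name):
    """Trim both ends and collapse internal runs of whitespace."""
    return " ".join(name.split())

# Count occurrences after normalizing, so whitespace variants merge.
counts = Counter(normalize(name) for name in raw_assignee)
print(counts.most_common())
```

Variants that differ by more than whitespace (abbreviations, typos) would still need some form of fuzzy matching.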

The code I have thus far:

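from fuzzywuzzy import process  # import was missing from the snippet

# clean1 is presumably a pandas DataFrame; column 1 holds the assignee names
raw_assignee = list(clean1.iloc[:, 1])
assignee_list_example = ['The Regents of the University of California', 'llc', 'ltd', 'inc']  # not used yet
deduplicated_assignee = process.dedupe(raw_assignee, threshold=80)
print(deduplicated_assignee)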
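As far as I can tell, `process.dedupe` returns only a list of deduplicated names, not a mapping from each raw name to its replacement, so on its own it cannot merge the counts. Below is a rough sketch of the replacement step I think I need. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for a fuzzywuzzy scorer (so the scores will not match fuzzywuzzy's exactly), and the `similarity` and `canonicalize` helpers, the sample list, and the threshold of 80 are all assumptions for illustration:

```python
from collections import Counter
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0-100 similarity score, ignoring case and outer whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    return 100 * SequenceMatcher(None, a, b).ratio()

def canonicalize(names, threshold=80):
    """Map each raw name to the first earlier name scoring >= threshold."""
    canonical = []  # representative names found so far
    mapping = {}    # raw name -> representative name
    for name in names:
        if name in mapping:
            continue
        match = next(
            (c for c in canonical if similarity(name, c) >= threshold), None
        )
        if match is None:
            # No close match yet: this name becomes a new representative.
            match = name.strip()
            canonical.append(match)
        mapping[name] = match
    return mapping

# Hypothetical sample standing in for the real assignee column.
raw_assignee = [
    "The Regents Of The University Of California",
    " The Regents Of The University Of California",
    "The Regents of the University of California",
    "Acme Inc",
]

mapping = canonicalize(raw_assignee)
counts = Counter(mapping[name] for name in raw_assignee)
print(counts.most_common())
```

This greedy approach compares every new name against every representative found so far, so it may be slow on a large column, and the result depends on the order of the names and on the threshold chosen.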

I'm new to data science and have no idea what's going on - appreciate your help!

python

data-science

data-cleaning

fuzzy-search

fuzzywuzzy

0 Answers
