1 year ago
#332431
JPJW
Deduplication and Replacement Using Fuzzywuzzy
I'm trying to count how many times each organization was cited, but I keep running into this problem:
('The Regents Of The University Of California', 468), (' The Regents Of The University Of California', 64)
The two organizations are clearly the same, but I can't clean the data well enough to merge their counts. I came across the fuzzywuzzy library, but I can't get it to work, and I don't understand how it processes the data.
I need to replace the text so that the counts come out right.
The code I have thus far:
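from fuzzywuzzy import process
# The second column of the DataFrame holds the assignee names
raw_assignee = list(clean1.iloc[:, 1])
assignee_list_example = ['The Regents of the University of California', 'llc', 'ltd', 'inc']
# dedupe returns a list of names with fuzzy duplicates collapsed
deduplicated_assignee = process.dedupe(raw_assignee, threshold=80)
print(deduplicated_assignee)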
I'm new to data science and have no idea what's going on. I'd appreciate your help!
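One possible direction, offered as a sketch under assumptions rather than a definitive fix: the two entries shown above differ only by a leading space, so normalizing whitespace before counting may already merge them. For spellings that differ in other small ways, a similarity score can fold near-duplicates into the most frequent spelling. The example below uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzywuzzy's scorers so that it runs without extra installs. The helper names (`normalize`, `similarity`, `merge_counts`), the threshold of 90, and the sample data are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize(name):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(name.split())


def similarity(a, b):
    """Case-insensitive similarity score from 0 to 100 (stand-in for a fuzzywuzzy scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_counts(names, threshold=90):
    """Count names, folding near-duplicates into the most frequent spelling."""
    counts = Counter(normalize(n) for n in names)
    merged = Counter()
    # Visit the most frequent spellings first so they become the canonical form.
    for name, count in counts.most_common():
        match = next((c for c in merged if similarity(name, c) >= threshold), None)
        merged[match if match is not None else name] += count
    return merged


# Hypothetical sample data mimicking the problem in the question.
raw_assignee = (
    ["The Regents Of The University Of California"] * 3
    + [" The Regents Of The University Of California"] * 2
    + ["The Regents of the University of California"]
    + ["Acme Widgets Llc"]
)
result = merge_counts(raw_assignee)
print(result.most_common())
```

With fuzzywuzzy installed, `fuzz.ratio` could replace the `similarity` helper, since it also returns a score from 0 to 100. The pairwise comparison is quadratic in the number of distinct names, so a large list may need blocking or a faster matcher. A threshold that is too low will merge organizations that are genuinely different, so it is worth checking the merged groups by hand.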
python
data-science
data-cleaning
fuzzy-search
fuzzywuzzy
0 Answers