1 year ago

#262015

roelmetgevoel

Merging named entities with spaCy's Matcher module

import csv
from collections import Counter

import pandas as pd
from spacy import displacy
from spacy.tokens import Span
from spacy.util import filter_spans

# nlp, matcher, filepath, cleaned_posts and the preprocessing helpers
# (remove_images, remove_forwards, remove_links, delete_breaks) are defined
# elsewhere in the script and are not shown here.


def match_patterns(cleanests_post):
    mark_rutte = [
        [{"LOWER": "mark", "OP": "?"}, {"LOWER": "rutte", "OP": "?"}],
        [{"LOWER": "markie"}],
    ]
    matcher.add("Mark Rutte", mark_rutte, on_match=add_person_ent)

    hugo_dejonge = [
        [{"LOWER": "hugo", "OP": "?"}, {"LOWER": "de jonge", "OP": "?"}],
    ]
    matcher.add("Hugo de Jonge", hugo_dejonge, on_match=add_person_ent)

    adolf_hitler = [
        [{"LOWER": "adolf", "OP": "?"}, {"LOWER": "hitler", "OP": "?"}],
    ]
    matcher.add("Adolf Hitler", adolf_hitler, on_match=add_person_ent)

    matches = matcher(cleanests_post)
    matches.sort(key=lambda x: x[1])

    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = cleanests_post[start:end]  # The matched span
        # print('matches', match_id, string_id, start, end, span.text)

    return cleanests_post



def add_person_ent(matcher, cleanests_post, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(cleanests_post, start, end, label="PERSON")

    # When spans overlap, the (first) longest span is preferred over shorter spans.
    filtered = filter_spans(cleanests_post.ents)
    filtered += (entity,)
    cleanests_post = filtered

    return cleanests_post

 

with open(filepath, encoding='latin-1') as csvfile:
    reader = csv.reader(csvfile, delimiter=';')

    next(reader, None) # Skip first row (= header) of the csv file

    dict_from_csv = {rows[0]:rows[2] for rows in reader} # creates a dictionary with 'date' as keys and 'text' as values
    #print (dict_from_csv)

    values = dict_from_csv.values()
    values_list = list(values)
    #print ('values_list:', values_list)

    people = []

    for post in values_list:  # iterate over each post
        # Do some preprocessing here
        clean_post = remove_images(post)
        cleaner_post = remove_forwards(clean_post)
        cleanest_post = remove_links(cleaner_post)
        cleanests_post = delete_breaks(cleanest_post)
        cleaned_posts.append(cleanests_post)

        cleanests_post = nlp(cleanests_post)
        cleanests_post = match_patterns(cleanests_post)

        if cleanests_post.ents:
            show_results = displacy.render(cleanests_post, style='ent')

        # GET PEOPLE
        for named_entity in cleanests_post.ents:
            if named_entity.label_ == "PERSON":
                # print('NE PERSON:', named_entity)
                people.append(named_entity.text)

    people_tally = Counter(people)

    df = pd.DataFrame(people_tally.most_common(), columns=['character', 'count'])
    print('people:', df)



I'm using spaCy to extract named entities mentioned in a range of Telegram groups. My data are CSV files with the columns 'date' and 'text' (a string with the content of each post).

To optimize my output I'd like to merge entities such as 'Mark', 'Rutte', 'Mark Rutte', 'Markie' (and their lowercase forms), as they all refer to the same person. My approach is to use spaCy's built-in Matcher to merge these entities.

In my code, match_patterns() defines patterns such as mark_rutte, and add_person_ent() is the on_match callback that should append each match as an entity to doc.ents (in my case cleanests_post.ents).

The order of the script is this:

  • open the CSV file with the Telegram data in a with open(...) block
  • iterate over each post (= a string with text of the post) individually and do some preprocessing
  • call the spaCy pipeline nlp() on each of the posts to extract named entities
  • call my own match_patterns() function on each of these posts to merge the entities I defined in patterns mark_rutte, hugo_dejonge and adolf_hitler
  • finally, loop over the entities in cleanests_post.ents and append all the PERSON entities to people (= list) and use Counter() and pandas to generate a ranking of each of the persons identified
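
A side note on the first step, as a hedged sketch rather than a diagnosis: building a dictionary keyed on 'date' keeps only one post per distinct date value, so posts that share a timestamp would silently overwrite each other. Collecting the text column into a list avoids this. The helper name read_posts(), the sample rows and the middle column name 'sender' are hypothetical; only the ';' delimiter and the column positions 0 and 2 come from the script above.

```python
import csv
import io


def read_posts(csvfile, text_column=2):
    """Return the text of every post, in file order, skipping the header."""
    reader = csv.reader(csvfile, delimiter=";")
    next(reader, None)  # skip the header row
    # Keep one entry per row; rows that are too short are ignored.
    return [row[text_column] for row in reader if len(row) > text_column]


# Hypothetical sample: two posts that share the same 'date' value.
sample = io.StringIO(
    "date;sender;text\n"
    "2021-01-01;a;Mark Rutte\n"
    "2021-01-01;b;markie\n"
)
posts = read_posts(sample)
print(posts)
```

With the dictionary comprehension from the script, the same sample would leave a single entry for '2021-01-01'.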

What goes wrong: it seems as if match_patterns() and add_person_ent() do not work. My output is exactly the same as when I do not call match_patterns(), i.e. 'Mark', 'mark', 'Rutte', 'rutte', 'Mark Rutte', 'MARK RUTTE', 'markie' are still categorised as separate entities. It seems as if something goes wrong with overwriting cleanests_post.ents. In add_person_ent() I have tried using spaCy's filter_spans() to solve the problem, but without success.
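
A minimal sketch of one likely cause and a possible fix, not a confirmed answer. Inside add_person_ent() the line cleanests_post = filtered only rebinds a local name, and the Matcher ignores the callback's return value, so doc.ents is never changed. Two further points look relevant: "de jonge" is two tokens, so a single LOWER check cannot match it, and calling matcher.add() inside match_patterns() registers the patterns again for every post. Even with the callback fixed, the entity texts ('Mark', 'rutte', 'markie') would still differ, so the sketch stores the match key as the entity's kb_id and counts on that. It assumes spaCy v3; spacy.blank("nl") (tokenizer only, no statistical NER), the rewritten patterns and count_people() are illustrative choices, not part of the original script.

```python
from collections import Counter

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: a blank Dutch pipeline (tokenizer only) keeps the sketch
# self-contained; the original script uses its own loaded pipeline.
nlp = spacy.blank("nl")
matcher = Matcher(nlp.vocab)


def add_person_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    # kb_id carries the match key (e.g. "Mark Rutte") as a canonical name.
    entity = Span(doc, start, end, label="PERSON", kb_id=match_id)
    # Assign to doc.ents; rebinding the local name `doc` would change nothing.
    # filter_spans resolves overlaps in favour of the longest span.
    doc.ents = filter_spans(list(doc.ents) + [entity])


# Register the patterns once, not once per post.
matcher.add(
    "Mark Rutte",
    [
        [{"LOWER": "mark"}, {"LOWER": "rutte"}],
        [{"LOWER": {"IN": ["mark", "rutte", "markie"]}}],
    ],
    on_match=add_person_ent,
)
matcher.add(
    "Hugo de Jonge",
    [
        [{"LOWER": "hugo"}, {"LOWER": "de"}, {"LOWER": "jonge"}],
        [{"LOWER": "de"}, {"LOWER": "jonge"}],
    ],
    on_match=add_person_ent,
)


def count_people(texts):
    """Count PERSON entities, grouping matched variants under their match key."""
    people = []
    for text in texts:
        doc = nlp(text)
        matcher(doc)  # the callback updates doc.ents in place
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                # Fall back to the surface text for entities without a key.
                people.append(ent.kb_id_ or ent.text)
    return Counter(people).most_common()


example = "Mark zei dat Rutte en markie en MARK RUTTE en Hugo de Jonge kwamen"
print(count_people([example]))
```

With a pipeline that includes a statistical NER component, model-predicted PERSON entities that no pattern covers have an empty kb_id_ and are counted under their own text, while any entity that overlaps a longer pattern match is replaced by that match.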

python

merge

spacy

entities

matcher

0 Answers
