Efficient string similarity search for huge corpora

I am doing a similarity search between a 256-character string and a corpus of 9,000 entries, each about 1,000 words long.

I used locality-sensitive hashing (LSH), following https://github.com/Jmkernes/Locality-sensitive-hashing-tutorial/blob/main/LocalitySensitiveHashing.ipynb. It produces candidate pairs, which I then filtered.
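
For illustration, here is a minimal sketch of that candidate-generation step using the `datasketch` library instead of the notebook's from-scratch code (the library choice, `threshold`, and `num_perm` values are my placeholders):

```python
# Sketch only: datasketch is a substitute for the notebook's from-scratch LSH.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of hash permutations; placeholder value

def minhash(text: str) -> MinHash:
    """Build a MinHash signature from the text's whitespace tokens."""
    m = MinHash(num_perm=NUM_PERM)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

corpus = ["first entry ...", "second entry ..."]  # placeholder entries

# Jaccard threshold 0.5 is illustrative; it controls which pairs survive.
lsh = MinHashLSH(threshold=0.5, num_perm=NUM_PERM)
for i, entry in enumerate(corpus):
    lsh.insert(str(i), minhash(entry))

# query() returns keys of candidate entries, not a ranked result.
candidates = lsh.query(minhash("the 256-character query string"))
```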

One problem is that the document store holds each entry's roughly 1,000 words, so the whole corpus has to stay in memory, which makes the search inefficient. In general, it is very slow.

The goal is to quickly output the index of the corpus entry whose content is most similar to the 256-character string.
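
Since LSH only returns unranked candidates, something like an exact ranking pass seems necessary for a top-1 result. A minimal sketch of that idea with scikit-learn TF-IDF and cosine similarity (this setup is my assumption, with character n-grams because the query is only 256 characters):

```python
# Sketch only: TF-IDF + cosine similarity is an assumed ranking step.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["first entry ...", "second entry ..."]  # placeholder entries
query = "the 256-character query string"

# Character n-grams (3-5) match a short query against long entries better
# than whole-word features; the range is a placeholder to tune.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
corpus_matrix = vectorizer.fit_transform(corpus)  # sparse matrix
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, corpus_matrix).ravel()
best_index = int(np.argmax(scores))  # index of the most similar entry
print(best_index, scores[best_index])
```

A sparse TF-IDF matrix over 9,000 entries should be far cheaper to hold in memory than the raw texts, so the argmax over cosine scores is fast.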

My thinking is that the entries need to be reduced to a compact representation and serialized to a file for fast loading.
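
As a sketch of that idea, the fitted vectorizer and sparse corpus matrix from above could be built once offline and reloaded at query time (`joblib` and the file name are my assumptions):

```python
# Sketch only: joblib and the file name are assumptions.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["first entry ...", "second entry ..."]  # placeholder entries

# One-time offline step: build the index and persist it to disk.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
corpus_matrix = vectorizer.fit_transform(corpus)
joblib.dump((vectorizer, corpus_matrix), "corpus_index.joblib")

# At query time: reload the prebuilt index instead of re-vectorizing
# all 9,000 entries.
vectorizer, corpus_matrix = joblib.load("corpus_index.joblib")
```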

Which paper or implementation do you recommend?

Tags: python, nlp, cosine-similarity, sentence-similarity, locality-sensitive-hash
