Per Bock
Efficient string similarity search for huge corpora
I am doing a similarity search between a 256-character string and a corpus of 9000 entries, each containing about 1000 words.
I used LocalitySensitiveHashing (see https://github.com/Jmkernes/Locality-sensitive-hashing-tutorial/blob/main/LocalitySensitiveHashing.ipynb). It creates candidate pairs, which I then filtered.
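For illustration, here is a minimal sketch of the kind of MinHash/LSH pipeline I mean, written with the `datasketch` library rather than the notebook's code; the corpus, query, `threshold`, and `num_perm` values are placeholders:

```python
# Minimal MinHash/LSH sketch using the datasketch library.
# Illustrative only: corpus, query, threshold and num_perm are placeholders.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # hash permutations per signature

def minhash(text):
    """Build a MinHash signature from the whitespace tokens of a text."""
    m = MinHash(num_perm=NUM_PERM)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

corpus = ["first entry ...", "second entry ..."]  # ~9000 entries in reality
signatures = [minhash(entry) for entry in corpus]

# Index all signatures; keys are the corpus indices as strings.
lsh = MinHashLSH(threshold=0.5, num_perm=NUM_PERM)
for i, sig in enumerate(signatures):
    lsh.insert(str(i), sig)

query_sig = minhash("the 256 character query string ...")
candidates = lsh.query(query_sig)  # candidate keys above the threshold
# Rank the candidates by estimated Jaccard similarity, keep the best index.
if candidates:
    best = max(candidates, key=lambda k: query_sig.jaccard(signatures[int(k)]))
    print(int(best))
```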
One problem here is that `documents` holds each entry in full (about 1000 words), so the whole corpus has to stay in memory, which makes the search inefficient. In general, it is very slow.
The goal is to quickly output the index of the corpus entry whose content is most similar to the 256-character string.
My thoughts are: the entries need to be simplified (e.g., reduced to compact signatures) and serialized to a file for quick loading.
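As a sketch of what I mean (reusing the `signatures` list from the snippet above; the file name is arbitrary, and I believe `datasketch` also offers `LeanMinHash` for more compact serialization):

```python
# Sketch: precompute the compact signatures once, then reload them on later
# runs instead of keeping the raw 1000-word entries around.
# Assumption: default MinHash objects pickle cleanly; LeanMinHash would be
# smaller if file size matters.
import pickle

with open("signatures.pkl", "wb") as f:  # arbitrary file name
    pickle.dump(signatures, f)

# Later / in another process: load the signatures and rebuild the LSH index
# from them, without re-reading or re-tokenizing the corpus.
with open("signatures.pkl", "rb") as f:
    signatures = pickle.load(f)
```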
Which paper or implementation do you recommend?
python
nlp
cosine-similarity
sentence-similarity
locality-sensitive-hash