How to make text processing operations on big corpus work (more effici - Enhance your coding expertise with Lucas Johnston on @onlycoders.net

1 year ago

#369378

Lucas Johnston

How to make text processing operations on big corpus work (more efficiently)

How to get my Python script to work (efficiently) on a 33-million large set of documents?

Creating a Topic Model from 33-million PubMed abstracts. At this point I have a binary pickle file that contains a Python list with the 33m abstracts (pubmed_abstracts.bin) as strings. I am trying to unpickle this list, tokenize the documents and then pickle the result, so I can use that in my next steps.

What I'd like to achieve is that I could tokenize these 33m documents with a 100% CPU-load (hence the chunking and asynchronous processing), while using less than 100% RAM and that it finishes without errors (these last two I cannot figure out yet). Feel free to download the pubmed abstracts pickled 6k and 1m test sets. Full 33m corpus available upon request.

This script below works fine on smaller dataset (n=6k/6MB, or n=1m/600MB). But it doesn't work on the full n=33m/19GB dataset. In one case the script stopped because the memory ran out. In another case I'm getting "Process SpawnProcess-7:Process SpawnProcess-8:Process SpawnProcess-9:etc..". I'm working with i5/32GB, Win11, Anaconda, Python v3, Spyder v5.

## TOKENIZED CORPUS TO BINARY
## Spyder 5.1.5
## Python 3.9.7 64-bit


# %%
# Essentials
import pickle
import time
import concurrent.futures # Async script execution
import math

# Gensim
import gensim
from gensim.utils import simple_preprocess


# %%
## Set environment
if __name__ == '__main__':
    is_test = False # Set to false to process full datasets (can take long)
    if is_test:
        file_location = 'C:/.../datasets_test/'; # Small dataset (n=5000) for demo
    else:
        file_location = 'C:/.../datasets_full/'; # Big dataset (n=33,000,000)
    
    
    # %%
    ## Retrieve binary file
    documents = pickle.load(open(file_location + "pubmed_abstracts.bin", 'rb'))
    #print(documents[55]) # Debugging
    
    
    # %%
    ## Separate full corpus into chunks
    big_chunks = []
    documents_per_big_chunk = 1000
    documents_per_small_chunk = 100
    number_of_big_chunks = math.ceil(len(documents) / documents_per_big_chunk) # Define enough chunks to contain entire corpus
    number_of_small_chunks = math.ceil(documents_per_big_chunk / documents_per_small_chunk) # Equally divide big chunk up into small chunks
    range_from = 0
    range_to = documents_per_small_chunk
    for _ in range(number_of_big_chunks):
        big_chunk = []
        for _ in range(number_of_small_chunks):
            if len(documents[range_from:range_to]) > 0:
                big_chunk.append(documents[range_from:range_to+1]) # If it isn't range_to+1, every 100th document is omitted. So range_to+1.
            print(f"range_from = {range_from}, range_to = {range_to}")
            range_from = range_to + 1
            range_to = range_to + documents_per_small_chunk
        big_chunks.append(big_chunk)
    del documents # Free up some much needed memory
    #print(len(big_chunks[0][0])) # Debugging
    #print(big_chunks[0][0][55]) # Debugging
    
    
    # %%
    ## 
    def tokenize_documents(documents):
        tokenized_documents = [gensim.utils.simple_preprocess(document, deacc=True) for document in documents]
        return tokenized_documents
    
    
    # %%
    ## Asynchronously tokenize documents
    ## https://www.youtube.com/watch?v=fKl2JW_qrso
    ## https://www.youtube.com/watch?v=8OKTAedgFYg
    tokenized_documents = []
    time_start = time.perf_counter()
    for i in range(len(big_chunks)):
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(tokenize_documents, big_chunks[i])
            for result in results:
                [tokenized_documents.append(document) for document in result]
    time_spent = str(round(time.perf_counter() - time_start, 2))
    print (f"time_spent = {time_spent}")
    
    # %%
    ## Pickle the resulting tokenized corpus
    pickle.dump(tokenized_documents, open(file_location + "pubmed_abstracts_tokenized.bin", 'wb'))

What I am looking for by posting this on SO is that someone may help me point at my mistakes and help me make a few steps in the right direction. I have made steps by myself like adding if _name_ == '_main_': and chunking my data and processing it asynchronously. I just figure that you, experienced folks on SO may help me save precious masters thesis time by taking a look at my code. What mistakes can you identify and what changes would you suggest?

python

nlp

processing-efficiency

memory-efficient

0 Answers

Your Answer

Posts

Questions

Blogs

Jobs