How can I find the optimal number of topics in LDA with scikit-learn?

I'm computing topic models with scikit-learn using the script below (I'm starting from a DataFrame "df" that has one document per row in the column "Text"):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Applying LDA
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=int(0.9*len(df)), min_df=int(0.01*len(df)),
                             token_pattern=r'\w+|\$[\d\.]+|\S+')

# apply transformation
tf = vectorizer.fit_transform(df.Text).toarray()

# tf_feature_names tells us what word each column in the matrix represents
# (get_feature_names_out replaces the deprecated get_feature_names)
tf_feature_names = vectorizer.get_feature_names_out()

number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)
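
For reference, the fitted model can be inspected by pairing components_ with tf_feature_names to list each topic's top words (a minimal sketch using the variable names from the script above; n_top_words is an illustrative choice):

import numpy as np

n_top_words = 10
for topic_idx, topic in enumerate(model.components_):
    # components_[k] holds the (unnormalized) word weights for topic k
    top_idx = np.argsort(topic)[::-1][:n_top_words]
    print(f"Topic {topic_idx}: " + " ".join(tf_feature_names[i] for i in top_idx))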

I'm interested in comparing models with different numbers of topics (roughly 2 to 20) using a coherence measure. How can I do that?
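
scikit-learn's LatentDirichletAllocation has no built-in coherence metric (it only exposes score() for approximate log-likelihood and perplexity()), so one option is to compute a coherence such as UMass by hand from the document-term matrix and compare it across topic counts. A minimal sketch, assuming the tf matrix from the script above; umass_coherence is a hypothetical helper, not a scikit-learn function:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def umass_coherence(model, doc_term_matrix, top_n=10):
    # Mean UMass coherence over all topics, from document co-occurrence counts.
    presence = (doc_term_matrix > 0).astype(int)   # doc-word presence, (n_docs, n_words)
    doc_freq = presence.sum(axis=0)                # D(w): number of docs containing word w
    co_doc_freq = presence.T @ presence            # D(w_i, w_j): docs containing both words
    scores = []
    for topic in model.components_:
        top = np.argsort(topic)[::-1][:top_n]      # top_n word indices, highest weight first
        score = 0.0
        for m in range(1, len(top)):
            for l in range(m):
                # log((D(w_m, w_l) + 1) / D(w_l)), with w_l ranked above w_m
                score += np.log((co_doc_freq[top[m], top[l]] + 1) / doc_freq[top[l]])
        scores.append(score)
    return float(np.mean(scores))

coherences = {}
for k in range(2, 21):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(tf)
    coherences[k] = umass_coherence(lda, tf)

best_k = max(coherences, key=coherences.get)
print(f"Best number of topics by UMass coherence: {best_k}")

UMass coherence is negative by construction, and values closer to zero indicate topics whose top words co-occur more often, so the k with the highest mean score is kept. An alternative that avoids hand-rolling the metric is to bridge to gensim's CoherenceModel, which also implements the often-preferred c_v measure.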

Tags: optimization, scikit-learn, lda, topic-modeling, hyperparameters
