Himan
How can I find the optimal number of topics in LDA with scikit-learn?
I'm computing topic models with scikit-learn using the script below (I'm starting from a DataFrame df that has one document per row in the column "Text"):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Applying LDA
# The vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=int(0.9*len(df)), min_df=int(0.01*len(df)),
                             token_pattern=r'\w+|\$[\d\.]+|\S+')
# Apply the transformation
tf = vectorizer.fit_transform(df.Text).toarray()
# tf_feature_names tells us what word each column in the matrix represents
tf_feature_names = vectorizer.get_feature_names_out()

number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)
I'd like to compare models with different numbers of topics (roughly from 2 to 20) using a coherence measure. How can I do that?
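For reference, here is a minimal sketch of one common approach. scikit-learn's LatentDirichletAllocation has no built-in coherence measure, so this sketch scores each fitted scikit-learn model with gensim's CoherenceModel instead; the gensim dependency, the c_v measure, and the helper name coherence_cv are all assumptions on top of the original script, not part of it.

from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from sklearn.decomposition import LatentDirichletAllocation

# Tokenize with the same analyzer the CountVectorizer used, so gensim
# sees the same vocabulary as the document-term matrix.
analyzer = vectorizer.build_analyzer()
texts = [analyzer(doc) for doc in df.Text]
dictionary = Dictionary(texts)

def coherence_cv(model, top_n=10):
    """c_v coherence of a fitted sklearn LDA model (hypothetical helper)."""
    # Top-weighted words per topic, read off the components_ matrix.
    topics = [
        [tf_feature_names[i] for i in comp.argsort()[:-top_n - 1:-1]]
        for comp in model.components_
    ]
    return CoherenceModel(topics=topics, texts=texts,
                          dictionary=dictionary, coherence='c_v').get_coherence()

# Sweep the candidate topic counts and record a coherence score for each.
scores = {}
for k in range(2, 21):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(tf)
    scores[k] = coherence_cv(lda)

best_k = max(scores, key=scores.get)  # for c_v, higher is better
print(f"Highest c_v coherence at n_components={best_k}")

If you want to stay entirely inside scikit-learn, lda.perplexity(tf) is a built-in alternative you can track instead (lower is better), though it is computed here on the training data and is generally considered a weaker guide to topic quality than coherence.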
optimization
scikit-learn
lda
topic-modeling
hyperparameters