user3668129
How to ignore numbers and use min_df when using TfidfVectorizer?
I'm trying to run a simple TfidfVectorizer example with two properties:
- Ignore numbers
- Use min_df (ignore terms that have a document frequency strictly lower than the given threshold)
But I can't get the right results:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import pandas as pd
import nltk
import re
nltk.download('stopwords')
data = fetch_20newsgroups(subset='all')['data']
english_stop_words = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(stop_words=english_stop_words,
                             max_features=5000,
                             min_df=200,
                             # token_pattern=u'(?u)\b\w*[a-zA-Z]\w*\b'
                             )
tfidf = vectorizer.fit_transform(data)
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(df_tfidf.head())
Results:
00 000 01 02 03 04 05 10 100 1000 ... wrote \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 ... 0.000000
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 ... 0.000000
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 ... 0.047383
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.024252 0.0 0.0 ... 0.000000
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 ... 0.000000
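I assume the digit-only columns (00, 000, 01, ...) show up because the default token_pattern, r"(?u)\b\w\w+\b", also matches tokens made entirely of digits. A quick check, as a sketch using the names from the code above:

# sketch: list vocabulary terms that are purely numeric
numeric_terms = [t for t in vectorizer.get_feature_names_out() if t.isdigit()]
print(len(numeric_terms), numeric_terms[:10])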
When I uncomment the line token_pattern=u'(?u)\b\w*[a-zA-Z]\w*\b'
I get this error:
ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.
So I have to comment out the line min_df=200,
and then I still get strange values:
a b d e f i k l n o p r \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
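My guess is that the problem is the string literal itself: in a plain (non-raw) Python string, \b is the backspace character (\x08), not a regex word boundary, so the pattern only matches text wrapped in literal backspaces. That would explain why almost nothing survives min_df=200 and why only single letters are left without it. A small sketch of what I mean:

import re

# '\b' in a normal string literal is backspace, not a word boundary
print(repr('\b'))    # '\x08'
print(repr(r'\b'))   # '\\b'

# the non-raw pattern matches nothing in ordinary text...
print(re.findall('(?u)\b\\w*[a-zA-Z]\\w*\b', 'hello world 123'))  # []
# ...while the raw-string version matches words that contain a letter
print(re.findall(r'(?u)\b\w*[a-zA-Z]\w*\b', 'hello world 123'))   # ['hello', 'world']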
I have tried the answer from this post: How can I prevent TfidfVectorizer to get numbers as vocabulary, but it didn't work.
How can I use TfidfVectorizer so that it both ignores numbers and uses min_df?
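For reference, this is the variant I would try next, assuming the raw-string token_pattern is the right fix (I'm not sure it is the idiomatic way to exclude numbers):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import pandas as pd
import nltk

nltk.download('stopwords')

data = fetch_20newsgroups(subset='all')['data']
english_stop_words = set(stopwords.words('english'))

vectorizer = TfidfVectorizer(stop_words=english_stop_words,
                             max_features=5000,
                             min_df=200,
                             # raw string: \b is now a word boundary, and every
                             # token must contain at least one letter, so
                             # digit-only tokens like "00" or "1000" are dropped
                             token_pattern=r'(?u)\b\w*[a-zA-Z]\w*\b')

tfidf = vectorizer.fit_transform(data)

# sanity check: no purely numeric terms should remain in the vocabulary
assert not any(t.isdigit() for t in vectorizer.get_feature_names_out())

# note: toarray() on the full 20newsgroups set is memory-heavy
df_tfidf = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(df_tfidf.head())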
python
scikit-learn
tfidfvectorizer