How to ignore numbers and use min_df when using TfidfVectorizer?

1 year ago

#366551

user3668129

I'm trying to run a simple TfidfVectorizer example with two requirements:

Ignore numbers
Use min_df (ignore terms that have a document frequency strictly lower than the given threshold)

But I can't get the right results:

from sklearn.datasets                import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus                     import stopwords

import pandas as pd
import nltk
import re

nltk.download('stopwords')

data               = fetch_20newsgroups(subset='all')['data']

english_stop_words = set(stopwords.words('english'))
vectorizer         = TfidfVectorizer(stop_words=english_stop_words, 
                                     max_features=5000, 
                                     min_df=200,
                                     #token_pattern=u'(?u)\b\w*[a-zA-Z]\w*\b'
                                     )
tfidf              = vectorizer.fit_transform(data)
df_tfidf           = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())

print(df_tfidf.head())

Results:

    00  000   01   02   03   04   05        10  100  1000  ...     wrote  \
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0   0.0  ...  0.000000   
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0   0.0  ...  0.000000   
2  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0   0.0  ...  0.047383   
3  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.024252  0.0   0.0  ...  0.000000   
4  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.000000  0.0   0.0  ...  0.000000

When I uncomment the line token_pattern=u'(?u)\b\w*[a-zA-Z]\w*\b' I get this error:

ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.

So I have to comment out the line min_df=200, and even then I get strange feature names:

  a b d e f i k l n o p r  \
0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
2   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
3   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
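
One likely cause of both the error and the odd feature names, offered as an inference rather than something verified on the 20 newsgroups data: the commented-out pattern is a normal string literal, so Python converts each `\b` to a backspace character (`\x08`) before the regex engine ever sees it. The pattern then matches only tokens wrapped in literal backspaces, which would be rare enough for min_df=200 to prune everything. The sample strings and variable names below are made up for illustration, and the sketch uses only the standard library.

```python
import re

# Same string value as the pattern in the question. "\\w" is spelled out only
# to avoid Python's invalid-escape warning; "\b" is left as written and
# therefore becomes a backspace character.
non_raw = '(?u)\b\\w*[a-zA-Z]\\w*\b'

# Raw-string version: "\b" reaches the regex engine as a word boundary.
raw = r'(?u)\b\w*[a-zA-Z]\w*\b'

text = "Call 911 now, room 42b is on fire"  # hypothetical sample text

print(non_raw.count('\x08'))      # 2 -> the pattern contains two backspaces
print(re.findall(non_raw, text))  # [] -> nothing matches ordinary text
print(re.findall(raw, text))      # ['Call', 'now', 'room', '42b', 'is', 'on', 'fire']

# Only text that really contains backspace characters can match the
# non-raw pattern, and the matched token includes those backspaces.
with_backspaces = "x\x08a\x08y"
print(re.findall(non_raw, with_backspaces))  # ['\x08a\x08']
```

With the raw string, purely numeric tokens such as 911 are skipped, while mixed tokens such as 42b are kept because they contain at least one letter.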

I tried the answer from this post: "How can I prevent TfidfVectorizer to get numbers as vocabulary", but it didn't work.

How can I use TfidfVectorizer so that it ignores numbers and still applies min_df?

python

scikit-learn

tfidfvectorizer

0 Answers
