December 19, 2024
Implementing a Keyword Extractor in Python with TF-IDF
How do search engines and content recommendation systems find the most important words in a document? One common answer is keyword extraction: automatically identifying the terms that best describe a text. This post explains TF-IDF, one of the simplest and most widely used keyword extraction techniques, and shows how to build a keyword extractor with it in Python.
Understanding TF-IDF
TF-IDF is a statistical measure of how relevant a word is to one document relative to a collection of documents (a "corpus"). The TF-IDF score for each word combines two ideas:
- Term Frequency (TF): How often a word appears in a document. Words that appear more often in a document score higher.
- Inverse Document Frequency (IDF): How rare a word is across the corpus. Words that appear in fewer documents are more specific and informative, so they score higher.
Multiplying the two, TF-IDF gives high scores to words that are common in one document but rare in the rest of the corpus. This is what lets it pick out keywords that reflect a document's main subjects.
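To make the two ideas concrete, here is a minimal from-scratch sketch using the textbook definitions: TF as a word's count divided by the document length, and IDF as log(N / df), where N is the number of documents and df is the number of documents containing the word. The function name tf_idf and the toy token lists are my own, for illustration only. scikit-learn uses a slightly different, smoothed formula, so its numbers will not match these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # TF (relative frequency) times IDF (textbook log(N / df))
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical, already-tokenized toy corpus
toy_docs = [
    ["python", "popular", "language"],
    ["python", "web", "data"],
    ["python", "machine", "learning", "data"],
]
toy_scores = tf_idf(toy_docs)
print(toy_scores[0])
```

With this definition, "python" scores 0 everywhere because it appears in every document, while a word unique to one document, such as "popular", gets the highest score there.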
Setting Up the Environment
Our keyword extractor requires scikit-learn, pandas, and nltk for text processing. To install them, use this command:
pip install scikit-learn pandas nltk
Step-by-Step Implementation of Keyword Extraction with TF-IDF
Data Preparation
Start by creating or loading a text dataset; this example uses three short sentences. To keep the extracted keywords meaningful, stop words (very common words such as "is" and "a") and punctuation will be removed.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
documents = [
"Python is a popular programming language.",
"Python can be used for web development, data science, and more.",
"Machine learning with Python has become essential for data scientists.",
]
# Stop words are removed later, when the TF-IDF vectorizer is built
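TfidfVectorizer takes care of this cleanup internally, but it can help to see roughly what happens to a sentence. The sketch below is an approximation under stated assumptions: preprocess and TOY_STOP_WORDS are hypothetical names, the stop word set is a tiny hand-picked sample rather than NLTK's full list, and the regular expression follows scikit-learn's default token pattern (words of two or more characters).

```python
import re

# Assumption: a tiny hand-picked stop word set for illustration only;
# the article itself uses NLTK's much longer English list.
TOY_STOP_WORDS = {"is", "a", "can", "be", "for", "and", "with", "has", "more"}

def preprocess(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, keep runs of two or more word characters, drop stop words."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in stop_words]

cleaned = preprocess("Python is a popular programming language.")
print(cleaned)
```

Punctuation disappears because the pattern only matches word characters, and single-letter words like "a" are dropped by the two-character minimum even before the stop word filter runs.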
Calculating TF-IDF
The scikit-learn TfidfVectorizer class calculates a TF-IDF score for every word in every document. It handles tokenization, stop word removal, and scoring in one step.
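# Recent scikit-learn versions expect stop_words as a list (or the string 'english'), not a set
vectorizer = TfidfVectorizer(stop_words=list(stop_words))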
tfidf_matrix = vectorizer.fit_transform(documents)
# Display the TF-IDF matrix as a pandas DataFrame for readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_df)
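This code prints a matrix with one row per document and one column per term; each cell holds that term's TF-IDF score in that document. The higher the score, the more characteristic the term is of that document, and a score of 0 means the term does not appear in it.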
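You may notice that "python" still gets a non-zero score even though it appears in all three documents. That is because TfidfVectorizer, by default, uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and then scales each row to unit length (L2 normalization). The sketch below imitates those documented defaults in plain Python so the numbers are easier to reason about. It is an illustration, not the library's implementation: sklearn_style_tfidf is a hypothetical name, and the token lists are hand-written approximations of what the sample sentences look like after stop word removal.

```python
import math
from collections import Counter

def sklearn_style_tfidf(docs):
    """docs: list of token lists. Imitates TfidfVectorizer's documented
    defaults (smooth_idf=True, norm='l2'); not the library's own code."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Raw count times smoothed IDF: ln((1 + N) / (1 + df)) + 1
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalize so each document vector has length 1 (guard empty docs)
        norm = math.sqrt(sum(value * value for value in raw.values())) or 1.0
        rows.append({term: value / norm for term, value in raw.items()})
    return rows

# Assumption: hand-tokenized versions of the three sample sentences
tokens = [
    ["python", "popular", "programming", "language"],
    ["python", "used", "web", "development", "data", "science"],
    ["machine", "learning", "python", "become", "essential", "data", "scientists"],
]
weights = sklearn_style_tfidf(tokens)
print(weights[0])
```

In the first document, "python" ends up with a lower weight than "popular", "programming", and "language", but not zero. The normalization also explains why every score in the matrix falls between 0 and 1.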
Extracting Keywords
To extract keywords, pick the terms with the highest TF-IDF scores in each document:
def extract_keywords(tfidf_row, terms, top_n=3):
    # argsort sorts ascending, so take the last top_n indices and reverse them
    sorted_indices = tfidf_row.argsort()[-top_n:][::-1]
    return [terms[i] for i in sorted_indices]
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf_matrix.toarray()):
    keywords = extract_keywords(row, terms)
    print(f"Top keywords for document {i+1}: {keywords}")
This function picks the top N keywords from each document's TF-IDF scores. Try different top_n values to return more or fewer keywords per document.
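One thing to watch for: if a document has fewer than top_n distinct terms, picking the top indices will also return terms whose score is 0, meaning words that never appear in that document. A possible variant, sketched below, skips zero scores and returns each keyword together with its score. The name extract_keywords_with_scores and the example row are hypothetical; the function accepts any sequence of scores, including a row of the TF-IDF matrix.

```python
def extract_keywords_with_scores(tfidf_row, terms, top_n=3):
    """Return up to top_n (term, score) pairs, skipping terms with a zero score."""
    scored = [
        (term, float(score))
        for term, score in zip(terms, tfidf_row)
        if score > 0
    ]
    # Highest score first; ties keep their original (alphabetical) column order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical vocabulary and scores, standing in for one row of the matrix
example_terms = ["data", "language", "popular", "python"]
example_row = [0.0, 0.62, 0.62, 0.37]
top_pairs = extract_keywords_with_scores(example_row, example_terms)
print(top_pairs)
```

Returning the scores alongside the terms also makes it easier to apply a minimum-score cutoff later instead of a fixed top_n.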
Conclusion and Applications
And that's it: a simple TF-IDF keyword extractor! This fast, interpretable method works well for text summarization, content tagging, and SEO. When you need to capture meaning and context rather than just word counts, NLP methods such as word embeddings or transformer models (e.g., BERT) can improve on it. Until then, this is a solid starting point for exploring text processing and keyword extraction.