
December 19, 2024

Implementing a Keyword Extractor in Python with TF-IDF


 

How do search engines and content recommendation systems find the most important words in a document? One common answer is keyword extraction: automatically picking out the terms that best describe a text. This post explains TF-IDF, one of the simplest and most widely used keyword extraction techniques, and shows how to build a keyword extractor with it in Python.

 

Understanding TF-IDF

TF-IDF is a statistical measure of how relevant a word is to one document relative to a collection of documents (a "corpus"). The TF-IDF score for each word combines two ideas:

  • Term Frequency (TF): Measures how often a word occurs in a document. Words that occur frequently in a document score higher.
  • Inverse Document Frequency (IDF): Measures how rare a word is across the corpus. Words that appear in fewer documents are more specific and informative, so they score higher.

Multiplying the two, TF-IDF gives high scores to words that are frequent in one document but rare in the rest of the corpus. That is what lets it surface keywords that reflect a document's main subjects.
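
To make the two ideas concrete, here is a minimal pure-Python sketch of the score. It assumes the textbook definitions (tf as count divided by document length, idf as the natural log of corpus size over document frequency); the function name tf_idf and the pre-tokenized toy_docs are made up for illustration, and libraries such as scikit-learn use a slightly different variant.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumes the textbook definitions:
    tf = count / document length, idf = ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical, already-tokenized documents with stop words removed
toy_docs = [
    ["python", "popular", "programming", "language"],
    ["python", "web", "development", "data", "science"],
    ["machine", "learning", "python", "data", "scientists"],
]
toy_scores = tf_idf(toy_docs)
print(toy_scores[0])
```

Because "python" appears in every document, its idf is ln(3/3) = 0 and its score vanishes, while a word unique to one document, such as "popular", keeps a positive score.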

 

Setting Up the Environment

Our keyword extractor uses scikit-learn to compute TF-IDF, pandas to display the results, and nltk for its stop word list. To install them, use this command:

pip install scikit-learn pandas nltk

 

Step-by-Step Implementation of Keyword Extraction with TF-IDF

 

Data Preparation

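Start by creating or loading a text dataset; a short three-sentence sample is enough for this example. To improve the results, stop words and punctuation are removed: stop words through NLTK's English list, and punctuation by the vectorizer's tokenizer in the next step.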

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

documents = [
    "Python is a popular programming language.",
    "Python can be used for web development, data science, and more.",
    "Machine learning with Python has become essential for data scientists.",
]

# Stop words are removed in the next step, when TfidfVectorizer computes the scores
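
To see what removing stop words and punctuation does to a sentence, here is a small standalone sketch. It is a simplified stand-in rather than what TfidfVectorizer does internally, and both simple_tokenize and the tiny SAMPLE_STOP_WORDS set are hypothetical names chosen for this illustration.

```python
import string

# A tiny hand-picked stop word set, for illustration only;
# NLTK's English list is much longer.
SAMPLE_STOP_WORDS = {"is", "a", "can", "be", "for", "and", "with", "has"}

def simple_tokenize(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in stop_words]

tokens = simple_tokenize("Python is a popular programming language.")
print(tokens)  # ['python', 'popular', 'programming', 'language']
```

Only the content-bearing words survive, which is the input TF-IDF works best on.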

 

Calculating TF-IDF

scikit-learn's TfidfVectorizer class calculates a TF-IDF score for every word in every document. It handles tokenization, stop word removal, and scoring in one step.

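# stop_words must be a list (or the string 'english'); recent scikit-learn versions reject a set
vectorizer = TfidfVectorizer(stop_words=list(stop_words))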
tfidf_matrix = vectorizer.fit_transform(documents)

# Display the TF-IDF matrix as a pandas DataFrame for readability
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_df)

 

This code prints a matrix with one row per document and one column per term; each cell holds that term's TF-IDF score in that document. Within a row, the terms with the highest scores are the most relevant to that document.
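
The printed values will not match a hand calculation with the plain tf × idf formula, because TfidfVectorizer's documented defaults (smooth_idf=True, norm='l2') add smoothing to the idf and scale each row to unit length. The sketch below reproduces that variant in pure Python as I understand it from the scikit-learn documentation; smoothed_tfidf_row and the pre-tokenized corpus are hypothetical names for illustration.

```python
import math
from collections import Counter

def smoothed_tfidf_row(doc, docs):
    """Score one tokenized document against the corpus `docs`.

    Assumed variant: idf = ln((1 + N) / (1 + df)) + 1, raw counts as tf,
    then the row is scaled to unit Euclidean (L2) length.
    """
    n_docs = len(docs)
    # Document frequency of each term across the corpus
    df = Counter(term for d in docs for term in set(d))
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in Counter(doc).items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    # An empty document has no terms and therefore nothing to normalize
    return {term: v / norm for term, v in raw.items()} if norm else raw

# Hypothetical, already-tokenized documents with stop words removed
corpus = [
    ["python", "popular", "programming", "language"],
    ["python", "web", "development", "data", "science"],
    ["machine", "learning", "python", "data", "scientists"],
]
row = smoothed_tfidf_row(corpus[0], corpus)
print(row)
```

With this variant a word that occurs in every document, such as "python", still gets a small positive score instead of zero, but it ranks below the words that are unique to the document.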

 

Extracting Keywords

To extract the top terms from each document, pick the terms with the highest TF-IDF scores in its row:

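def extract_keywords(tfidf_row, terms, top_n=3):
    # Indices of the top_n highest scores, best first
    sorted_indices = tfidf_row.argsort()[-top_n:][::-1]
    # Skip zero scores: those terms do not occur in this document
    return [terms[i] for i in sorted_indices if tfidf_row[i] > 0]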

terms = vectorizer.get_feature_names_out()

for i, row in enumerate(tfidf_matrix.toarray()):
    keywords = extract_keywords(row, terms)
    print(f"Top keywords for document {i+1}: {keywords}")

 

This function picks the top N keywords from each document's TF-IDF scores. Try different top_n values to return more or fewer keywords per document.
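
When experimenting with top_n, it can help to see the scores next to the keywords. The following standalone sketch is one way to do that; top_keywords is a hypothetical helper that works on a plain dict of scores, and the numbers in example_scores are made up for illustration rather than computed from the sample documents.

```python
def top_keywords(scores, top_n=3):
    """Return up to top_n (term, score) pairs, highest score first.

    Terms with a zero score are skipped; ties are broken alphabetically
    so the output is deterministic.
    """
    ranked = sorted(
        ((term, score) for term, score in scores.items() if score > 0),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked[:top_n]

# Hypothetical scores for one document, made up for illustration
example_scores = {"python": 0.32, "popular": 0.55, "language": 0.55, "the": 0.0}

for n in (1, 2, 5):
    print(n, top_keywords(example_scores, top_n=n))
```

Asking for more keywords than the document has non-zero terms simply returns a shorter list.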

 

Conclusion and Applications

That's all it takes to build a simple TF-IDF keyword extractor. The method is fast and interpretable, which makes it a good fit for text summarization, content tagging, and SEO. More advanced NLP methods, such as word embeddings or context-aware transformer models (e.g., BERT), can improve keyword extraction further. Use this as a starting point for exploring text processing and keyword extraction.
