November 20, 2024
Building a Text Summarizer Using Machine Learning
Tired of scanning and summarizing long documents by hand? Machine learning can automate the job for long papers and articles. Text summarization saves you time by condensing long pieces of text into short ones. In this article, I'll show you how to use Python and machine learning to create a simple text summarizer. Let's dig in:
Step-by-Step Guide to Building a Text Summarizer
Step 1: Data Collection
Before diving into the implementation, you need to collect a dataset of long articles, research papers, or documents. As a starting point, you can use the CNN/DailyMail dataset, which contains articles paired with summaries. Alternatively, you can create your own dataset from scratch.
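If you build your own dataset, one simple option is to store each article together with its reference summary in a JSON Lines file. The sketch below is only an illustration: the file name `my_dataset.jsonl`, the `article` and `summary` field names, and the two helper functions are assumptions, not a required format.

```python
import json
import tempfile
from pathlib import Path

def save_pairs(pairs, path):
    """Write (article, summary) pairs to a JSON Lines file, one record per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for article, summary in pairs:
            f.write(json.dumps({"article": article, "summary": summary}) + "\n")

def load_pairs(path):
    """Read the records back as a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example usage in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_dataset.jsonl"  # hypothetical file name
    save_pairs(
        [("Breaking news: AI is transforming industries across the globe.",
          "AI is transforming industries.")],
        path,
    )
    records = load_pairs(path)
    print(records[0]["summary"])
```

One record per line keeps the file easy to append to and to stream, which helps once the dataset grows.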
Step 2: Text Preprocessing
In this phase, you clean the collected text so it can be fed to a machine learning model. This includes removing stop words and punctuation and converting everything to lowercase. Here's a code snippet for data preprocessing:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Build the stop-word set once; set lookups are faster than scanning a list per word
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Replace anything that is not a letter (punctuation, digits, symbols) with a space
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize on whitespace
    tokens = text.split()
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)

# Example usage
document = "Breaking news: AI is transforming industries across the globe."
print(preprocess_text(document))
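One thing to watch: the cleaning above strips punctuation, so sentence boundaries are lost. Since the later steps work on individual sentences, it may help to split the text first and clean each sentence separately, keeping the original sentence for the final summary. Here is a minimal sketch, assuming a naive regex splitter and a tiny hand-picked stop-word set (both are stand-ins for illustration; NLTK's `sent_tokenize` and its stop-word list would be more robust):

```python
import re

# Tiny illustrative stop-word set (an assumption; use NLTK's list in practice)
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}

def split_sentences(text):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def clean_sentence(sentence):
    """Lowercase, keep letters only, and drop stop words."""
    words = re.sub(r"[^a-z\s]", " ", sentence.lower()).split()
    return " ".join(w for w in words if w not in STOP_WORDS)

text = "AI is transforming industries. Is the change permanent? Many experts think so!"
original = split_sentences(text)                 # kept for the final summary
cleaned = [clean_sentence(s) for s in original]  # used for feature extraction

for raw, clean in zip(original, cleaned):
    print(f"{raw!r} -> {clean!r}")
```

Keeping `original` and `cleaned` aligned by index means you can score the cleaned text and still output readable sentences. The regex splitter will trip over abbreviations such as "Dr.", so treat it as a starting point only.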
Step 3: Feature Extraction
Once you're done with preprocessing, convert your cleaned text into numerical form using TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings like Word2Vec or GloVe. Here's a code example of feature extraction using TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
# Example documents
documents = ["AI is transforming industries", "Machine learning is the future of technology"]
# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=100)
X = tfidf_vectorizer.fit_transform(documents).toarray()
# Display feature matrix
print(X)
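To see what those numbers mean, here is a from-scratch sketch of the textbook TF-IDF weighting: term frequency multiplied by the log of inverse document frequency. It is a simplified variant. By default, scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row, so its values will differ from these. The helper name `tfidf` is made up for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, matrix) with tf = count / doc length and idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    df = Counter(word for doc in tokenized for word in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        row = [(counts[w] / length) * math.log(n_docs / df[w]) for w in vocab]
        matrix.append(row)
    return vocab, matrix

documents = ["AI is transforming industries", "Machine learning is the future of technology"]
vocab, matrix = tfidf(documents)

# Print one line per word with its weight in each document
for word, weights in zip(vocab, zip(*matrix)):
    print(f"{word:>12}: " + "  ".join(f"{w:.3f}" for w in weights))
```

In this sketch, a word such as "is" that appears in every document gets a weight of 0, while words unique to one document get the highest weights. That is the intuition behind IDF: common words carry little information about any particular document.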
Step 4: Model Selection for Extractive Summarization Using K-Means Clustering
Extractive summarization builds a summary by selecting the most important and representative sentences from the source text rather than writing new ones. Here, K-Means clustering groups similar sentences together, and you then pick one representative sentence from each cluster.
import numpy as np
from sklearn.cluster import KMeans

# Example sentences
sentences = [
    "AI is revolutionizing the tech industry.",
    "Natural Language Processing is a subfield of AI.",
    "Machine learning models are widely used."
]

# Extract features using TF-IDF (reuses tfidf_vectorizer from Step 3)
X = tfidf_vectorizer.fit_transform(sentences).toarray()

# K-Means clustering; random_state makes the result reproducible
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
kmeans.fit(X)

# Assign cluster labels to sentences
labels = kmeans.labels_

# Print the sentence closest to each cluster centroid as its representative
for i in range(n_clusters):
    members = np.where(labels == i)[0]
    distances = np.linalg.norm(X[members] - kmeans.cluster_centers_[i], axis=1)
    representative = sentences[members[np.argmin(distances)]]
    print(f"Cluster {i+1} summary: {representative}")
Step 5: Sentence Ranking Using TextRank
This is the last phase. To get better results, you can use TextRank, a graph-based approach that ranks sentences by importance: each sentence is a node, edges are weighted by sentence similarity, and PageRank scores the nodes. The summary is built from the top-ranked sentences.
Code Example for TextRank Using networkx:
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity
# Compute similarity matrix
similarity_matrix = cosine_similarity(X)
# Build similarity graph
nx_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(nx_graph)
# Rank sentences by score
ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse=True)
# Extract top-ranked sentences for summary
summary = " ".join([ranked_sentences[i][1] for i in range(2)])
print("Summary:", summary)
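For intuition about what the PageRank call is doing, here is a dependency-free sketch of the same idea using power iteration over a weighted similarity matrix. The similarity values, the damping factor of 0.85, the fixed iteration count, and the function name `textrank_scores` are assumptions chosen for illustration. This is not networkx's implementation, which checks convergence and handles edge cases more carefully; unlike the code above, the sketch also ignores each sentence's similarity to itself.

```python
def textrank_scores(similarity, damping=0.85, iterations=50):
    """Score nodes by power iteration on a weighted, undirected similarity graph."""
    n = len(similarity)
    # Ignore self-similarity so a sentence cannot vote for itself
    weights = [[0.0 if i == j else similarity[i][j] for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional to the edge weight
            incoming = sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Made-up pairwise similarities for three sentences (symmetric, 1.0 on the diagonal)
similarity = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
scores = textrank_scores(similarity)
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print("Sentence indices, most important first:", ranked)
```

The sentence that is most similar to the others collects the most score, which is why TextRank tends to surface sentences that reflect the document's central topic.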
Conclusion
As you saw above, techniques like TF-IDF, K-Means clustering, and TextRank can be combined into a simple extractive text summarizer. I hope the code examples help you create your own summarizing tool. Happy coding!