November 20, 2024

Building a Text Summarizer Using Machine Learning

machine learning

python

text summarizer

Lily Chang

@lily-chang


Tired of scanning and summarizing long documents by hand? Machine learning can automate the job for long papers and articles. Text summarization saves time by condensing long pieces of text into short ones. In this article, I'll show you how to use Python and machine learning to build a simple extractive text summarizer. Let's dig in:

Step-by-Step Guide to Building a Text Summarizer

Step 1: Data Collection

Before diving into the implementation, you need a dataset of long articles, research papers, or documents. As a sample, you can use the CNN/DailyMail dataset, which contains news articles paired with summaries. Alternatively, you can build your own dataset from scratch.
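If you export your data to a CSV file, a small loader like the one below is one way to get (article, summary) pairs into Python. This is only a sketch: the function name `load_pairs` is made up for this post, and the column names `article` and `highlights` are assumptions, so adjust them to whatever your file actually uses.

```python
import csv
import io

def load_pairs(csv_file, text_col="article", summary_col="highlights"):
    """Read (article, summary) pairs from an open CSV file object.

    The column names are assumptions; change them to match your file.
    """
    reader = csv.DictReader(csv_file)
    pairs = []
    for row in reader:
        article = (row.get(text_col) or "").strip()
        summary = (row.get(summary_col) or "").strip()
        if article:  # skip rows that have no article text
            pairs.append((article, summary))
    return pairs

# Tiny in-memory stand-in for a real file such as
# open("my_dataset.csv", newline="", encoding="utf-8")  (hypothetical path)
sample = io.StringIO(
    "article,highlights\n"
    "\"AI is transforming industries across the globe.\",\"AI transforms industries.\"\n"
    ",\"orphan summary\"\n"
)
pairs = load_pairs(sample)
print(len(pairs), "pair(s) loaded")
```

Rows without article text are dropped, because there is nothing to summarize in them.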

Step 2: Text Preprocessing

Next, clean the collected text before feeding it to a machine learning model. This includes removing stop words and punctuation and converting everything to lowercase. Here's a code snippet for data preprocessing:

import re
import nltk
from nltk.corpus import stopwords

# Download the stop-word list (only needed the first time)
nltk.download('stopwords')

# Load the stop words once, as a set, instead of on every token lookup
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Replace punctuation and other non-word characters with spaces
    text = re.sub(r'\W', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenization (simple whitespace split)
    tokens = text.split()
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)

# Example usage
document = "Breaking news: AI is transforming industries across the globe."
print(preprocess_text(document))
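Steps 4 and 5 work on individual sentences, so at some point the raw document has to be split into sentences. Here's a minimal sketch using only a regular expression. The helper name `split_sentences` is my own, and the rule (split after `.`, `!` or `?` followed by whitespace) is a simplification that will trip over abbreviations like "Dr." or "e.g.".

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

article = (
    "AI is revolutionizing the tech industry. Is NLP a subfield of AI? "
    "Yes! Machine learning models are widely used."
)
sentences = split_sentences(article)
for s in sentences:
    print(s)
```

For real documents, a trained tokenizer such as `nltk.sent_tokenize` is likely a better choice. Keep the original sentences around as well: you cluster and rank the cleaned versions, but the summary should show the untouched text.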

Step 3: Feature Extraction

Once you're done with preprocessing, convert your cleaned text into numerical form using TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings like Word2Vec or GloVe. Here's a code example of feature extraction using TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

# Example documents
documents = ["AI is transforming industries", "Machine learning is the future of technology"]

# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=100)
X = tfidf_vectorizer.fit_transform(documents).toarray()

# Display feature matrix
print(X)
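To see what those numbers mean, here's a rough pure-Python sketch of the calculation. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The function name `tfidf` and the whitespace tokenization are my own simplifications, so results can differ from `TfidfVectorizer` on text with punctuation or one-letter words.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

documents = ["AI is transforming industries", "Machine learning is the future of technology"]
vocab, rows = tfidf(documents)

# Show the non-zero weights of the first document
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

Notice that "is", which appears in both documents, gets a lower weight than words that appear in only one. That is the whole point of the IDF part: common words count for less.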

Step 4: Model Selection for Extractive Summarization Using K-Means Clustering

Extractive summarization picks the most important and representative sentences directly from the source text. Here, K-Means clustering groups similar sentences together, and you then extract a representative sentence from each group.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

# Example sentences
sentences = [
    "AI is revolutionizing the tech industry.",
    "Natural Language Processing is a subfield of AI.",
    "Machine learning models are widely used."
]

# Extract features using TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=100)
X = tfidf_vectorizer.fit_transform(sentences).toarray()

# K-Means clustering (fixed seed so the result is reproducible)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)

# For each cluster centre, find the index of the closest sentence
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)

# Print the representative sentence from each cluster
for i, idx in enumerate(closest):
    print(f"Cluster {i+1} summary: {sentences[idx]}")

Step 5: Sentence Ranking Using TextRank

As a final step, you can often get better results with TextRank, a graph-based approach that ranks sentences by importance. Each sentence becomes a node, edges are weighted by how similar two sentences are, and PageRank scores the nodes. The top-ranked sentences form the summary.

Code Example for TextRank Using networkx:

import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarity matrix
similarity_matrix = cosine_similarity(X)

# Build similarity graph
nx_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(nx_graph)

# Pick the indices of the two highest-scoring sentences
top_n = 2
top_indices = sorted(scores, key=scores.get, reverse=True)[:top_n]

# Join them in their original order so the summary reads naturally
summary = " ".join(sentences[i] for i in sorted(top_indices))
print("Summary:", summary)
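If you're curious what the PageRank call is doing, here's a simplified pure-Python sketch of the idea using power iteration. It is not networkx's implementation: the function name `textrank_scores` and the toy similarity matrix are made up for illustration, and I zero out the diagonal so a sentence doesn't vote for itself, which is a design choice of this sketch. The damping factor 0.85 matches the default `alpha` in `nx.pagerank`.

```python
def textrank_scores(similarity, damping=0.85, iterations=100, tol=1e-6):
    """Score sentences by power iteration over a weighted similarity graph."""
    n = len(similarity)
    # Ignore self-similarity (assumption of this sketch)
    weights = [[0.0 if i == j else similarity[i][j] for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if out_sums[j] > 0:
                    # Sentence j passes on its score in proportion to edge weight
                    rank += weights[j][i] / out_sums[j] * scores[j]
                else:
                    # A sentence with no links spreads its score evenly
                    rank += scores[j] / n
            new_scores.append((1 - damping) / n + damping * rank)
        converged = sum(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Toy similarity matrix for three sentences (made-up numbers)
similarity = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
scores = textrank_scores(similarity)
best = max(range(len(scores)), key=scores.__getitem__)
print("Scores:", [round(s, 3) for s in scores])
print("Most central sentence index:", best)
```

The middle sentence is the most similar to the other two, so it ends up with the highest score. That is the intuition behind TextRank: a sentence is important if it resembles many other sentences.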

Conclusion

Techniques like TF-IDF, K-Means clustering, and TextRank can be combined into a simple but useful extractive text summarizer, as you saw above. I hope the code examples help you build your own summarizing tool. Happy coding!
