November 19, 2024
Developing a Spam Email Classifier Using NLP and Machine Learning
Tired of spam emails filling your inbox? A spam email classifier can help. Whether you're building one for personal use or adding it to a company's larger system, creating a spam email classifier using NLP and machine learning is simpler than you might expect.
In this blog post, I'll walk through the whole process step by step using Python, NLP, and machine learning techniques.
Step-by-Step Guide to Building a Spam Email Classifier
Step 1: Data Collection
In this first step, collect a dataset of emails that contains both spam and non-spam messages. You can use the Enron Email Dataset or the SpamAssassin Public Corpus. If neither suits your needs, you can build your own email dataset instead.
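As a rough sketch of what loading such a dataset might look like, the snippet below reads emails from two folders and labels them. The `spam/` and `ham/` folder names, the `.txt` extension, and the `load_emails` helper are assumptions made for illustration; the public corpora mentioned above use their own layouts, so adjust the paths to match whatever you download.

```python
from pathlib import Path
import tempfile

def load_emails(root):
    """Read emails from root/spam and root/ham (hypothetical layout).

    Returns (texts, labels) where 1 = spam and 0 = non-spam.
    """
    texts, labels = [], []
    for folder, label in (("spam", 1), ("ham", 0)):
        for path in sorted(Path(root, folder).glob("*.txt")):
            # Real corpora contain odd encodings; replace undecodable bytes
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(label)
    return texts, labels

# Tiny demo on a throwaway directory so the sketch runs as-is
with tempfile.TemporaryDirectory() as tmp:
    samples = (("spam", "Win big money now"), ("ham", "Meeting at 3PM tomorrow"))
    for folder, body in samples:
        Path(tmp, folder).mkdir()
        Path(tmp, folder, "example.txt").write_text(body, encoding="utf-8")
    texts, labels = load_emails(tmp)
    print(texts, labels)
```

Keeping the texts and labels in two parallel lists makes it easy to pass them to the preprocessing and feature extraction steps that follow.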
Step 2: Text Preprocessing
Before feeding your dataset to a model, you have to preprocess it. Preprocessing means cleaning the email text by removing stop words, punctuation marks, extra spaces, and other unnecessary elements.
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# Build the stop-word set once; set lookups are much faster than list lookups
STOP_WORDS = set(stopwords.words('english'))

def preprocess_email(text):
    # Remove special characters and numbers (keep only letters)
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenization (split() also collapses repeated whitespace)
    tokens = text.split()
    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return ' '.join(tokens)

# Example usage
email = "Win $1000 by clicking this link now!"
print(preprocess_email(email))
Step 3: Feature Extraction
After preprocessing your dataset, it's time to convert the text into numerical features that your machine learning model can process. Two common methods for text feature extraction are Bag of Words (BoW) and TF-IDF (Term Frequency-Inverse Document Frequency). I'll use TF-IDF in the following example:
from sklearn.feature_extraction.text import TfidfVectorizer
# Example emails
emails = ["Win big money now", "Meeting at 3PM tomorrow", "Earn cash from home"]
# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=500)
X = tfidf_vectorizer.fit_transform(emails).toarray()
# Print the feature matrix
print(X)
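For comparison, here is a minimal sketch of the Bag of Words approach mentioned above, using scikit-learn's `CountVectorizer` on the same three example emails. Instead of TF-IDF weights, each cell simply holds how many times a word appears in an email. The variable names `bow_vectorizer` and `X_bow` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same example emails as in the TF-IDF snippet
emails = ["Win big money now", "Meeting at 3PM tomorrow", "Earn cash from home"]

# Bag of Words: one column per vocabulary word, raw counts as values
bow_vectorizer = CountVectorizer()
X_bow = bow_vectorizer.fit_transform(emails)

# Show the learned vocabulary (in column order) and the count matrix
print(sorted(bow_vectorizer.vocabulary_))
print(X_bow.toarray())
```

BoW is simple and often works well as a baseline, while TF-IDF additionally down-weights words that appear in many emails.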
Step 4: Model Selection
After feature extraction, choose your machine learning model. Popular choices are Naive Bayes, which is commonly used and efficient for spam classification, SVM (Support Vector Machine), and Random Forest. In this example, I'll use the Naive Bayes model for its effectiveness with text data.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
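# Sample dataset (X - features, y - labels; 1 = spam, 0 = not spam)
# Note: three emails are only enough to show the API, not to measure real accuracy
y = [1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)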
# Initialize Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
# Predictions
y_pred = nb_classifier.predict(X_test)
# Accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
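To try the other models mentioned in this step, one option is to wrap the vectorizer and the classifier in a scikit-learn pipeline and swap the classifier. The sketch below does that with a small made-up training set; the example emails, the `candidates` dictionary, and the hyperparameters are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data (1 = spam, 0 = not spam); use a real corpus in practice
train_texts = [
    "win big money now", "earn cash from home",
    "free prize claim now", "cheap loans win cash",
    "meeting at 3pm tomorrow", "lunch with the team today",
    "project report attached", "see you at the meeting",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Candidate classifiers to compare
candidates = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

new_email = ["win free cash now"]
predictions = {}
for name, clf in candidates.items():
    # The pipeline vectorizes the text and then trains/predicts in one object
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_labels)
    predictions[name] = int(model.predict(new_email)[0])
    print(f"{name}: {predictions[name]}")
```

A pipeline also guarantees that new emails go through exactly the same vectorizer that was fitted on the training data, which is easy to get wrong when the steps are kept separate.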
Step 5: Model Evaluation
Once your model is trained, it's time to test it and evaluate its metrics: accuracy, precision, recall, and F1-score. These metrics tell you how well your model performs.
from sklearn.metrics import classification_report
# Print classification report
print(classification_report(y_test, y_pred))
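To make these metrics less abstract, here is a small worked sketch that computes precision, recall, and F1-score by hand from a confusion matrix and checks them against scikit-learn. The labels in `y_true` and `y_hat` are invented for illustration and are not results from the model above.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Invented labels for illustration (1 = spam, 0 = not spam)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 0, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the real spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Manual:  precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"sklearn: precision={precision_score(y_true, y_hat):.2f} "
      f"recall={recall_score(y_true, y_hat):.2f} f1={f1_score(y_true, y_hat):.2f}")
```

For a spam filter, precision is often the one to watch closely, since a false positive means a legitimate email lands in the spam folder.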
Conclusion
So, you've used NLP and machine learning to create a spam email classifier in a few steps: preprocessing text, extracting features, and training and evaluating a model. From here, you can try other datasets, models, and feature extraction methods to improve your spam classifier's accuracy and efficiency.