December 09, 2024
Building a Text Classifier with BERT in Python in Under 50 Lines of Code
Hi Python lovers! Are you excited to learn how to use AI to classify text with just a few lines of code? BERT, Google's advanced NLP model, makes building a text classifier easy even for machine learning newcomers. In this post, I'll show you how to use BERT and Python to create a basic text classifier in under 50 lines. You'll learn how to classify reviews, tag emails, and organize your documents.
Setting Up the Environment
First, you need to set up the environment. You will need two libraries: Hugging Face's transformers and PyTorch (torch). Together they handle everything BERT-related, so you can integrate complex NLP without extensive setup.
To install these libraries, run the following command:
pip install transformers torch
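If you want to confirm everything installed correctly, a quick (and optional) version check looks like this:
import torch
import transformers
# Print the installed versions to confirm the setup
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)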
Loading a Pre-trained BERT Model
BERT has already been pre-trained on a huge amount of text, so you do not have to start from scratch. Hugging Face's transformers library loads a pre-trained BERT model along with its tokenizer, which converts raw text into a BERT-friendly format.
from transformers import BertTokenizer, BertForSequenceClassification
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
The code above loads BERT with a classification head for binary classification. The model now distinguishes two categories (0 or 1), which is ideal for tasks like sentiment analysis and spam detection.
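To see that classification head for yourself, you can inspect the loaded model. For bert-base, the head is a small linear layer that maps BERT's 768-dimensional pooled output to the two classes:
# Inspect the classification head and its configured number of labels
print(model.classifier)         # a Linear layer: 768 inputs -> 2 output classes
print(model.config.num_labels)  # 2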
Preparing the Data
Let us create some sample data first. For simplicity, we will use a small collection of texts paired with class labels (0 or 1). After tokenization, BERT can work with this data directly.
import torch
# Sample data
texts = ["This product is great!", "I did not like this at all."]
labels = [1, 0]  # 1 = Positive, 0 = Negative
# Tokenize data
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor(labels)
The tokenizer takes the raw text and turns it into tensors that BERT can use. The tokenized text now lives in the inputs variable, so it is time to train.
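If you are curious what the tokenizer actually produced, you can print the keys and tensor shapes as an optional sanity check:
# The tokenizer returns input IDs plus an attention mask (and token type IDs for BERT)
print(inputs.keys())
print(inputs["input_ids"].shape)  # (number of texts, padded sequence length)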
Building and Training the Classifier
Let us write a simple training loop. We will fine-tune BERT on our small dataset so it performs better on our classification task.
from torch.optim import Adam
# Set up optimizer
optimizer = Adam(model.parameters(), lr=1e-5)
# Training loop
model.train()
for epoch in range(2):
    optimizer.zero_grad()
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")
In this loop, we feed the tokenized data to BERT, compute the loss, and backpropagate to update the model weights. For a simple classification task like this, running the loop for just a few epochs is enough for BERT to adapt.
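Our toy dataset fits in a single batch, but for anything bigger you would normally iterate over mini-batches. Here is a minimal sketch using PyTorch's DataLoader, assuming the same tokenizer, model, and optimizer from above:
from torch.utils.data import DataLoader, TensorDataset

# Wrap the tokenized tensors and labels so we can batch them
dataset = TensorDataset(inputs["input_ids"], inputs["attention_mask"], labels)
loader = DataLoader(dataset, batch_size=8, shuffle=True)

model.train()
for epoch in range(2):
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        outputs.loss.backward()
        optimizer.step()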
Evaluating the Model
To test your classifier, pass it some new text and check its predictions. Here's a short piece of code:
model.eval()
test_texts = ["Amazing quality!", "Not what I expected."]
test_inputs = tokenizer(test_texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    test_outputs = model(**test_inputs)
    predictions = test_outputs.logits.argmax(dim=1)
print("Predictions:", predictions)
Conclusion
Congratulations! You created a BERT-based text classifier in around 50 lines. BERT makes complex NLP tasks possible with minimal setup. Try bigger datasets or multi-class classification to take things further.
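As a starting point for the multi-class case, the only required changes are the number of labels and the label values themselves. Here is a quick sketch with three made-up classes:
# Hypothetical three-class setup: 0 = negative, 1 = neutral, 2 = positive
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
labels = torch.tensor([2, 0, 1])  # one label per training text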