Text classification using TextCNN

Hussain Wali
4 min read · Mar 17, 2023

If you’re looking to classify text data, TextCNN is a popular and effective convolutional architecture for the job. In this article, we’ll cover the basics of TextCNN and learn how to implement it in PyTorch to classify addresses into categories such as educational institutes, commercial institutes, sports complexes, and more.

Text classification is an essential task in natural language processing (NLP), where the goal is to assign a label or category to a piece of text. With the increasing amount of text data generated every day, it’s crucial to develop efficient algorithms for text classification. TextCNN is one such algorithm: it applies one-dimensional convolutions over word embeddings and has proven to be a strong baseline for sentence-level classification.

To apply TextCNN to our problem of classifying addresses, we need a labeled dataset of addresses along with their corresponding categories. Let’s assume we have a dataset consisting of 10,000 addresses and four categories: educational institute, commercial institute, sports complex, and other.
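For concreteness, the code below assumes this dataset lives in a CSV file (we’ll call it text.csv) with an address column and a category column. The rows here are made-up examples just to show the expected layout:

address,category
"45 College Road, Lahore",educational institute
"Plot 7, Blue Area, Islamabad",commercial institute
"National Stadium Road, Karachi",sports complex
"123 Main Street",other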

The first step in implementing TextCNN is to preprocess the text data. This involves tokenizing the text, converting all words to lowercase, and removing punctuation marks (for many NLP tasks you would also remove stop words, though addresses contain few of them). We’ll then represent each address as a fixed-length sequence of token indices by padding or truncating the tokens to a fixed length. This fixed-length representation is what we feed into the convolutional neural network.
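As a minimal sketch of this step (the max_len of 20 and the <pad> placeholder are illustrative choices here; the dataset class further down handles padding numerically instead of with a placeholder token):

import string

def preprocess(text, max_len=20):
    # Lowercase, strip punctuation, and split on whitespace
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()
    # Truncate, then pad with a placeholder so every address has max_len tokens
    tokens = tokens[:max_len]
    return tokens + ['<pad>'] * (max_len - len(tokens))

print(preprocess('123 Main St., Springfield'))
# ['123', 'main', 'st', 'springfield', '<pad>', '<pad>', ...]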

The TextCNN architecture consists of three main components: the input layer, the convolutional layer, and the output layer. The input layer takes the fixed-length vector representation of the address as input. The convolutional layer applies several filters of varying widths over the input and generates feature maps. The output layer uses a softmax function to generate probabilities for each category.

In PyTorch, we can define the TextCNN architecture as follows:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import pandas as pd

# Define the TextCNN architecture
class TextCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_classes, kernel_sizes, num_filters):
        super(TextCNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # One Conv1d per kernel size, each producing num_filters feature maps
        self.convs = nn.ModuleList([
            nn.Conv1d(embedding_dim, num_filters, kernel_size)
            for kernel_size in kernel_sizes
        ])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):
        x = self.embedding(x)    # (batch, seq_len, embedding_dim)
        x = x.permute(0, 2, 1)   # Conv1d expects (batch, channels, seq_len)
        x = [torch.relu(conv(x)) for conv in self.convs]
        # Max-pool each feature map over the sequence dimension
        x = [torch.max_pool1d(feat, feat.shape[2]).squeeze(2) for feat in x]
        x = torch.cat(x, dim=1)  # (batch, num_filters * len(kernel_sizes))
        return self.fc(x)



# Define a custom dataset for loading the text data
class TextDataset(Dataset):
    def __init__(self, file_path, tokenizer, vocab, label_map, max_len=20):
        self.data = pd.read_csv(file_path)
        self.tokenizer = tokenizer
        self.vocab = vocab          # word -> integer index; 0 is reserved for padding
        self.label_map = label_map  # category name -> class index
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]['address']
        label = self.label_map[self.data.iloc[idx]['category']]
        # Numericalize, then pad/truncate to a fixed length so batches stack cleanly
        ids = [self.vocab.get(tok, 0) for tok in self.tokenizer(text)][:self.max_len]
        ids += [0] * (self.max_len - len(ids))
        return {'text': torch.tensor(ids, dtype=torch.long),
                'label': torch.tensor(label, dtype=torch.long)}


# Define a function to train the model
def train(model, train_loader, optimizer, criterion):
    model.train()
    epoch_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        text = batch['text']
        label = batch['label']
        output = model(text)
        loss = criterion(output, label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(train_loader)


# Define a function to evaluate the model
def evaluate(model, val_loader, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for batch in val_loader:
            text = batch['text']
            label = batch['label']
            output = model(text)
            loss = criterion(output, label)
            epoch_loss += loss.item()
    return epoch_loss / len(val_loader)


# Load the data, build the vocabulary, and create the dataloaders
tokenizer = lambda x: x.lower().split()  # Simple whitespace tokenizer

df = pd.read_csv('text.csv')
vocab = {}
for address in df['address']:
    for token in tokenizer(address):
        vocab.setdefault(token, len(vocab) + 1)  # index 0 is reserved for padding
label_map = {cat: i for i, cat in enumerate(sorted(df['category'].unique()))}

# For simplicity we read the same file twice; in practice, use separate train/validation splits
train_dataset = TextDataset('text.csv', tokenizer=tokenizer, vocab=vocab, label_map=label_map)
val_dataset = TextDataset('text.csv', tokenizer=tokenizer, vocab=vocab, label_map=label_map)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)


# Instantiate the model and define the optimizer and loss function
model = TextCNN(vocab_size=len(vocab) + 1, embedding_dim=50, num_classes=len(label_map),
                kernel_sizes=[3, 4, 5], num_filters=100)
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()


# Train the model for several epochs
for epoch in range(10):
    train_loss = train(model, train_loader, optimizer, criterion)
    val_loss = evaluate(model, val_loader, criterion)
    print(f"Epoch {epoch+1}: Training Loss = {train_loss:.4f}, Validation Loss = {val_loss:.4f}")


# Use the model for inference
test_text = "123 Main Street"
ids = [vocab.get(tok, 0) for tok in tokenizer(test_text)][:20]
ids += [0] * (20 - len(ids))  # pad to the same max_len the dataset used
input_tensor = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
model.eval()
with torch.no_grad():
    prediction = model(input_tensor)
predicted_class = prediction.argmax(dim=1).item()
print(f"Predicted class: {predicted_class}")

This implementation uses an embedding layer to convert each word into a dense vector representation. We use three different kernel sizes in the convolutional layer so the filters capture n-grams of different widths. We max-pool each feature map over the sequence, concatenate the pooled outputs, and feed the result into a fully connected output layer.
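To make the shapes concrete, here is an illustrative walk-through of one batch in the forward pass, assuming batch_size=32, max_len=20, and the hyperparameters used above (embedding_dim=50, num_filters=100, kernel_sizes=[3, 4, 5]):

# Illustrative only: random token indices standing in for a real batch
x = torch.randint(0, model.embedding.num_embeddings, (32, 20))  # (batch, seq_len)
emb = model.embedding(x)                # (32, 20, 50)
emb = emb.permute(0, 2, 1)              # (32, 50, 20) - Conv1d wants channels first
feat = torch.relu(model.convs[0](emb))  # kernel_size=3 -> (32, 100, 18)
pooled = torch.max_pool1d(feat, feat.shape[2]).squeeze(2)  # (32, 100)
# Concatenating the three pooled outputs gives (32, 300); fc maps that to (32, 4)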

Once we have defined the model, we can train it on our labeled dataset using cross-entropy loss and an optimizer such as Adam (used above) or stochastic gradient descent (SGD). We can evaluate the model using metrics such as accuracy, precision, recall, and F1-score.
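As a sketch of that evaluation, assuming scikit-learn is installed (macro averaging is one reasonable choice for a four-class problem):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for batch in val_loader:
        preds = model(batch['text']).argmax(dim=1)
        all_preds.extend(preds.tolist())
        all_labels.extend(batch['label'].tolist())

acc = accuracy_score(all_labels, all_preds)
prec, rec, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='macro')
print(f"Accuracy: {acc:.3f}, Precision: {prec:.3f}, Recall: {rec:.3f}, F1: {f1:.3f}")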

TextCNN is a powerful architecture for text classification that can be used to sort addresses and other types of text data into various categories. By following the steps outlined in this article, you can implement TextCNN in PyTorch and achieve accurate results.

In our example, we’re using a simple whitespace tokenizer, but you could use a more advanced tokenizer such as spaCy or NLTK, depending on your specific use case. You’ll also need to tune the model’s hyperparameters and architecture to achieve optimal performance on your specific task.
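For instance, swapping in NLTK’s word tokenizer (assuming nltk is installed and its punkt tokenizer models have been downloaded) only requires replacing the tokenizer function; everything else stays the same:

import nltk
nltk.download('punkt')  # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

tokenizer = lambda x: [tok.lower() for tok in word_tokenize(x)]
print(tokenizer('123 Main St., Springfield'))
# ['123', 'main', 'st.', ',', 'springfield']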

If you found this article helpful, please consider clapping and following our publications for more informative content on NLP and machine learning.
