Transforming Your Text Data with PyTorch

Hussain Wali
3 min read · Mar 17, 2023

In natural language processing (NLP), text data usually needs preprocessing before it can be fed to a model. Transformations are a powerful PyTorch tool that let you preprocess your data on the fly, without modifying the original data. This article explains what transformations are and how they work in PyTorch.

What are Transformations?

A transformation is a function that takes an input and returns an output. In PyTorch, transformations are typically applied to tensors (multi-dimensional arrays) representing your data. For example, a tensor could represent a sentence of words encoded as integers. A transformation could then convert the integers into word embeddings, a common technique in NLP. The original tensor remains unchanged, and the transformed tensor is used as input for your model.
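
To make this concrete, here is a minimal sketch (the token IDs and embedding sizes are illustrative, not from any real vocabulary) of mapping an integer-encoded sentence to word embeddings with `torch.nn.Embedding`, leaving the original tensor untouched:

```python
import torch
import torch.nn as nn

# A sentence encoded as integer token IDs (values are illustrative)
sentence = torch.tensor([[4, 2, 1, 3, 0]])  # shape: (batch=1, seq_len=5)

# A transformation: look up an 8-dimensional embedding for each token ID
embed = nn.Embedding(num_embeddings=10, embedding_dim=8)
embedded = embed(sentence)

print(sentence.shape)   # torch.Size([1, 5])
print(embedded.shape)   # torch.Size([1, 5, 8])
# `sentence` itself is unchanged; `embedded` is a new tensor
```

The transformed tensor is what you feed to the rest of the model; the integer-encoded input is left as-is.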

Transformation in PyTorch:

PyTorch provides the torchvision.transforms module for image data, but for text we'll use torchtext. Its torchtext.transforms module and related utilities (torchtext.data.utils, torchtext.vocab) cover several commonly used NLP preprocessing steps, such as tokenization, numericalization, and padding.

Tokenization:

Tokenization is the process of splitting text into individual tokens, usually words or subwords. The torchtext.data.utils.get_tokenizer() function provides several tokenizers, including the simple basic_english tokenizer and library-backed options such as spaCy and NLTK's toktok. Here's an example:

import torchtext
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
text = "This is a sentence."
tokens = tokenizer(text)
print(tokens)
# ['this', 'is', 'a', 'sentence', '.']

Numericalization:

Numericalization is the process of converting tokens into numerical IDs. This is necessary because machine learning models cannot operate directly on text data. The torchtext.vocab.build_vocab_from_iterator() function provides a convenient way to build a vocabulary from your tokens, which maps each unique token to a unique integer ID. Note that looking up a token that is not in the vocabulary raises an error unless you set a fallback with vocab.set_default_index(). Here's an example:

vocab = torchtext.vocab.build_vocab_from_iterator([tokens])
ids = [vocab[token] for token in tokens]
print(ids)
# e.g. [4, 2, 1, 3, 0] -- with a vocabulary built from this one sentence,
# the IDs are small; their exact order depends on how the vocabulary sorts tokens

Padding:

Padding is the process of appending a special padding value (often 0) to the end of shorter sequences so that every sequence in a batch has the same length. This is necessary because models generally require uniform-length inputs. In recent versions of torchtext, the torchtext.transforms.PadTransform class pads sequences to a specified length. Here's an example:

import torch
from torchtext.transforms import PadTransform
pad = PadTransform(max_length=10, pad_value=0)
padded_ids = pad(torch.tensor(ids))
print(padded_ids)
# the five token IDs followed by enough 0s to reach length 10
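
When you only need every sequence in a batch to match the longest one, rather than a fixed length, torch.nn.utils.rnn.pad_sequence from core PyTorch is a common alternative. A minimal sketch (IDs are illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two numericalized sequences of different lengths (IDs are illustrative)
seqs = [torch.tensor([5, 2, 7]), torch.tensor([3, 1])]

# Pad to the length of the longest sequence, filling with pad ID 0
batch = pad_sequence(seqs, batch_first=True, padding_value=0)
print(batch)
# tensor([[5, 2, 7],
#         [3, 1, 0]])
```

This is handy inside a DataLoader collate function, where each batch can be padded to its own maximum length instead of a global one.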

Let’s say you have a dataset of movie reviews, where each review is a string of text. You want to train a sentiment analysis model on this data, but first you need to preprocess it using transformations. Here’s how you could apply the three transformations we’ve discussed:

import torch
import torchtext
from torchtext.transforms import PadTransform

# Load data
reviews = ['This movie was great!', 'This movie was terrible.']
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
# Reserve a dedicated <pad> token in the vocabulary
vocab = torchtext.vocab.build_vocab_from_iterator(
    (tokenizer(review) for review in reviews),
    specials=['<pad>'],
)
# Apply transformations
tokenized_reviews = [tokenizer(review) for review in reviews]
numericalized_reviews = [[vocab[token] for token in review]
                         for review in tokenized_reviews]
pad = PadTransform(max_length=10, pad_value=vocab['<pad>'])
padded_reviews = [pad(torch.tensor(review)) for review in numericalized_reviews]
# Use transformed data in model
model_input = torch.stack(padded_reviews)

Transformations are a powerful tool for preprocessing text data in NLP. torchtext provides convenient, composable transforms (in recent versions, via the torchtext.transforms module) for common steps, and we've demonstrated tokenization, numericalization, and padding on a small movie-review example. By using transformations, you can easily prepare your data for use in a machine learning model. If you enjoyed this article, please give it a clap and follow for more NLP guides!


Hussain Wali

Software Engineer by profession. Data Scientist by heart. MS Data Science at National University of Science and Technology Islamabad.