Deep Learning with Keras – Part 6: Textual Data Preprocessing


Congratulations on making it this far in the Keras tutorial series. After working with numeric, categorical and image data, it is time to handle textual data. Deep learning has proven very effective for Natural Language Processing (NLP) tasks. Before we dive into building NLP models, let us learn how to handle and preprocess textual data with Keras.

We already know that any machine learning model needs its data to be represented in a numeric format. Therefore, in this article we will learn how to load, tokenize, convert and pad textual data into numeric sequences.


In the following coding exercises we will work with the Spam SMS dataset. The dataset is stored in a CSV file with two fields: Label and SMS. SMS is the message text, while Label indicates whether the message is spam or not. Download it from here, and load it as shown in the following code snippet:

import pandas as pd
data = pd.read_csv('../data/sms.csv')

Here is a look at the data:

Initial dataset


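Normally the first step in textual data preprocessing is splitting sentences into words/tokens. This can easily be done with the Keras Tokenizer class.

from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
# Build the vocabulary (word -> index) from the messages before converting them
tokenizer.fit_on_texts(data['SMS'])
# Replace each word with its integer index
data['Tokens'] = tokenizer.texts_to_sequences(data['SMS'])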

Let us check the results:


See how the words got replaced by numbers! Each word now has an index that represents it in a sentence.
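To make the word-to-index mapping less of a black box, here is a minimal pure-Python sketch of the idea: count the words, then give the most frequent word index 1, the next one index 2, and so on. It only roughly mimics what the Keras tokenizer does (the real class has its own filtering rules and options), and the helper names `build_word_index` and `texts_to_sequences` as well as the toy messages are made up for illustration.

```python
import re
from collections import Counter

TOKEN_PATTERN = r"[a-z0-9']+"  # assumption: a simple stand-in for the tokenizer's filtering


def build_word_index(texts):
    """Assign each word an integer index, most frequent word first (starting at 1)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(TOKEN_PATTERN, text.lower()))
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are dropped."""
    return [
        [word_index[w] for w in re.findall(TOKEN_PATTERN, text.lower()) if w in word_index]
        for text in texts
    ]


toy_sms = ["Free entry now", "Call me now", "Free call"]  # made-up messages
word_index = build_word_index(toy_sms)
sequences = texts_to_sequences(toy_sms, word_index)

print(word_index)  # {'free': 1, 'now': 2, 'call': 3, 'entry': 4, 'me': 5}
print(sequences)   # [[1, 4, 2], [3, 5, 2], [1, 3]]
```

Note that indexing starts at 1: the value 0 is left free, which becomes handy once sequences are padded with zeros.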

To find out which word an index stands for, we can use the tokenizer's index_word attribute:

tokenizer.index_word[49]
# output: go

We can go the other way around with the word_index attribute:

tokenizer.word_index['go']
# output: 49
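The same lookup can be used to turn a whole sequence back into readable text, which is a quick sanity check on the tokenization. The sketch below is plain Python and assumes a small hypothetical vocabulary; with Keras, `word_index` would be `tokenizer.word_index`. The `decode` helper is an illustrative name, not part of Keras.

```python
# Hypothetical toy vocabulary; with Keras this would be tokenizer.word_index
word_index = {"free": 1, "now": 2, "call": 3, "entry": 4, "me": 5}

# Invert the mapping: index -> word
index_word = {index: word for word, index in word_index.items()}


def decode(sequence, index_word):
    """Turn a list of indices back into a space-separated string.

    Indices with no entry in the vocabulary (such as 0) are skipped.
    """
    return " ".join(index_word[i] for i in sequence if i in index_word)


decoded = decode([0, 0, 3, 5, 2], index_word)
print(decoded)  # call me now
```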

We have one more thing to take care of. Each SMS has a different number of words, so the sequences have different lengths, which can be a problem for models that expect fixed-size inputs. The solution is simple: padding.

Sequence Padding

Padding means adding zero values to the left or the right of a sequence. We simply choose the maximum sequence length and the side on which to pad. But what if a sequence is already longer than the maximum length? In that case we just truncate it. In the end, all sequences have the same length. Here is how to do it:

from keras.preprocessing.sequence import pad_sequences
sequences = pad_sequences(data['Tokens'], maxlen=40, padding='pre', truncating='post')
data['Padded'] = sequences.tolist()

Simple, right? Here are the final results:

Padded sequences
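For intuition, the padding and truncation logic can be sketched in a few lines of plain Python. This is only an illustration of the behaviour described above for a single sequence, not the Keras implementation; the `pad_sequence` helper and the toy sequences are hypothetical, with parameter names chosen to mirror the `pad_sequences` call.

```python
def pad_sequence(seq, maxlen, padding="pre", truncating="post", value=0):
    """Return a copy of seq that has exactly maxlen items."""
    seq = list(seq)
    if len(seq) > maxlen:
        if truncating == "post":
            seq = seq[:maxlen]               # keep the beginning, cut the end
        else:
            seq = seq[len(seq) - maxlen:]    # keep the end, cut the beginning
    fill = [value] * (maxlen - len(seq))     # empty when seq is already full length
    return fill + seq if padding == "pre" else seq + fill


toy_sequences = [[1, 4, 2], [3, 5, 2, 1, 4, 2], [1]]  # made-up token indices
padded = [pad_sequence(s, maxlen=4) for s in toy_sequences]

print(padded)  # [[0, 1, 4, 2], [3, 5, 2, 1], [0, 0, 0, 1]]
```

With `padding='pre'` the zeros go in front, so the actual words sit at the end of each sequence, and with `truncating='post'` an over-long message loses its last words rather than its first ones.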


Well, you are all set! These are the basic tools for handling textual data in Keras. Next we will learn about a new deep learning architecture for NLP and sequential data in general. We will put this processed data into action while building our first spam classifier. Stay tuned…

I am a Data Scientist specialized in Deep Learning, Machine Learning and Big Data (Storage, Processing and Analysis). I have a strong research and professional background with a Ph.D. degree in Computer Science from Université Paris Saclay and VEDECOM institute. I practice my skills through R&D, consultancy and by giving data science training.

๐Ÿ Join the Fastest Growing AI Research Newsletter Read by Researchers from Google + NVIDIA + Meta + Stanford + MIT + Microsoft and many others...