Data Pre-processing for Deep Learning models (Deep Learning with Keras – Part 2)

Link to Part 1.

Dealing with Data

Motivation

Training deep learning models requires data… A lot of data! Unfortunately, in most cases data arrives messy, and our models are very sensitive to this. Therefore, we need to prepare our data carefully to achieve the best results.

Before jumping directly into building and training models, we thought it would be better to give you a complete overview of data preprocessing techniques. In this post, we will show you how to deal with numeric, categorical, and image datasets.

Get your laptops ready: we have a lot of preprocessing to do…

Working with Numerical Data

Numerical values are the most frequent data type you will deal with. Even though they are already in a suitable format for calculations, we still need to do some work.

The main problem with numerical data is that each feature comes on a different scale. Consider a housing prices dataset with information about house size, number of bedrooms, construction year, and price. Suppose our goal is to predict the price given the house size, number of bedrooms, and construction year. Each of these features lives on a different scale: a house size may range, say, between 100 and 500 square meters; the construction year is a four-digit number that may go back a couple of centuries; and the number of bedrooms is usually between 1 and 4.

The issue here is that a model may give more attention to a feature simply because its values are larger. The house size could end up dominating just because it is measured in hundreds, while other important features such as the number of bedrooms may be neglected because their values are small.

Well, do not panic! We have two simple solutions to this problem: Normalization and Standardization.

Normalization

Normalization simply rescales the values to the range [0, 1]. To apply it to a dataset, you just have to subtract the minimum value of each feature and divide by the feature's range (max - min).

x_normalized = (x - min) / (max - min)
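To make the formula concrete, here is a minimal NumPy sketch applying it by hand to a single made-up feature (the values are purely illustrative):

import numpy as np

sizes = np.array([100., 250., 500.])  # house sizes in square meters
sizes_normalized = (sizes - sizes.min()) / (sizes.max() - sizes.min())
print(sizes_normalized)

# Output
[0.    0.375 1.   ]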

Standardization

Standardization, on the other hand, transforms the data to have zero mean and unit standard deviation.

x_standardized = (x - mean) / standard_deviation
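Again, a minimal NumPy sketch with made-up construction years:

import numpy as np

years = np.array([1980., 2000., 2020.])  # construction years
years_standardized = (years - years.mean()) / years.std()
print(years_standardized)

# Output
[-1.22474487  0.          1.22474487]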

Implementation

Applying the above techniques is easier than you think. We will show you an example using the Boston Housing dataset, which can be loaded directly from Keras.

from keras.datasets import boston_housing
# data is returned as a tuple for the training and the testing datasets
(X_train, y_train), (X_test, y_test) = boston_housing.load_data()
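A quick sanity check on the shapes: the dataset has 13 features per example, with a roughly 80/20 train/test split.

print(X_train.shape)  # (404, 13): 404 training examples, 13 features
print(X_test.shape)   # (102, 13): 102 testing examples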

Let us look at the first example in the training dataset:

print(X_train[0])
# Output
[  1.23247   0.        8.14      0.        0.538     6.142    91.7
   3.9769    4.      307.       21.      396.9      18.72   ]

See the different scales? To solve this, we will use the popular Scikit-Learn library.

Use the MinMaxScaler for data normalization:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X_train)
print(X_normalized[0])
# Output
[0.01378163 0.         0.28152493 0.         0.31481481 0.49980635
 0.91452111 0.29719123 0.13043478 0.22753346 0.89361702 1.
 0.46881898]

OR, use the StandardScaler to standardize:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
print(X_scaled[0])
# Output
[-0.27224633 -0.48361547 -0.43576161 -0.25683275 -0.1652266  -0.1764426
  0.81306188  0.1166983  -0.62624905 -0.59517003  1.14850044  0.44807713
  0.8252202 ]
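One important caveat: the scaler should be fit on the training data only, and then reused to transform the test data, so that no information from the test set leaks into the preprocessing:

# Reuse the statistics learned from the training set;
# call transform (not fit_transform) on the test set.
X_test_scaled = scaler.transform(X_test)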

Great! Let us see how we handle categorical data now…

Working with Categorical Data

Categorical data needs special treatment, because it cannot be fed to a neural network in its raw format: neural networks only accept numerical inputs.

We will introduce two main techniques for handling categorical data: Indexing and OneHotEncoding.

Indexing

Indexing simply replaces each category name with an index (a number).

import numpy as np
from sklearn.preprocessing import LabelEncoder

data = np.array(['small', 'medium', 'small', 'large', 'xlarge', 'large'])

encoder = LabelEncoder()
data_encoded = encoder.fit_transform(data)
print(data_encoded)

# Output
[2 1 2 0 3 0]

The values ‘small’, ‘medium’, ‘large’, and ‘xlarge’ were replaced by numbers from 0 to 3, assigned in alphabetical order of the categories. To control which number each category gets, you may refer to OrdinalEncoder, which lets you specify the category order explicitly.
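For example, here is a minimal sketch that enforces the natural size order (small < medium < large < xlarge) so the assigned numbers follow it. Note that OrdinalEncoder expects a 2D input, hence the reshape:

from sklearn.preprocessing import OrdinalEncoder

# Explicitly order the categories instead of relying on alphabetical order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large', 'xlarge']])
data_ordinal = encoder.fit_transform(data.reshape(-1, 1))
print(data_ordinal.ravel())

# Output
[0. 1. 0. 2. 3. 2.]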

OneHotEncoding

OneHotEncoding replaces each element with a list of binary values that has a 1 at the index of the element's category and 0 everywhere else.

import numpy as np
from sklearn.preprocessing import LabelBinarizer

data = np.array(['red', 'blue', 'orange', 'white', 'red', 'orange', 'white', 'red'])

encoder = LabelBinarizer()
data_encoded = encoder.fit_transform(data)
print(data_encoded)

# Output
[[0 0 1 0]
 [1 0 0 0]
 [0 1 0 0]
 [0 0 0 1]
 [0 0 1 0]
 [0 1 0 0]
 [0 0 0 1]
 [0 0 1 0]]

Instead of replacing each color with a single number, it was replaced with a list. The positions in the list represent is_blue, is_orange, is_red, and is_white (classes are sorted alphabetically). A 1 is placed at the index of the element's category, and 0 everywhere else. Example: the color red is represented by [0 0 1 0]. This technique avoids the spurious ordinal relation between categories that indexing introduces.
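Since we are working with Keras, it is worth noting that Keras ships its own utility for one-hot encoding labels that are already integer-indexed; a minimal sketch using the indices from the previous section:

from keras.utils import to_categorical

# Each integer label becomes a one-hot row
print(to_categorical([2, 1, 2, 0, 3, 0]))

# Output
[[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]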

Working with Images

We will discuss only two simple image manipulations in this post. More advanced techniques will be introduced later.

In the first phase of this tutorial, we will deal with images as simple arrays of flat, consecutive pixels. But images are normally two-dimensional, with 1 or 3 color channels. Therefore, we need to reshape each image into a flat array before we use it. This is done via the reshape function in NumPy.

Suppose we have a list of 500 images, each with 28 × 28 pixels and 3 color channels (RGB). This list needs to be reshaped into (500, 2352) in order to be fed to the network; 2352 here is the size of each image after flattening (28 * 28 * 3).

[Note: later we will work with images in their raw format using kernels]

Reshaping this list is very easy using Numpy:

data_reshaped = data.reshape(500, 28 * 28 * 3)  # new shape: (500, 2352)
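If you want a fully self-contained example, here is a sketch with random dummy images (the data is made up purely for illustration):

import numpy as np

# 500 dummy RGB images of 28 x 28 pixels, values in [0, 255]
data = np.random.randint(0, 256, size=(500, 28, 28, 3))
data_reshaped = data.reshape(500, 28 * 28 * 3)
print(data_reshaped.shape)

# Output
(500, 2352)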

Simple! Now, since our pixels are numeric values, we need to scale them as well. One simple scaling technique for images is to divide each pixel by 255 (the maximum possible pixel value).

images = images / 255.  # pixel values now lie in [0, 1]

That is it for images till now…

Conclusion

In this post we learned how to deal with data for deep learning models. We are now ready to handle numeric, categorical and image datasets. These techniques will be crucial for our next tutorial where we build our first neural network! See you then!



I am a Data Scientist specializing in Deep Learning, Machine Learning, and Big Data (storage, processing, and analysis). I have a strong research and professional background, with a Ph.D. in Computer Science from Université Paris Saclay and the VEDECOM institute. I practice my skills through R&D, consultancy, and by giving data science training.
