Exploratory Data Analysis Tutorial: Analyzing the Food Culture of Bangalore

Image by Gerd Altmann from Pixabay

Exploratory Data Analysis (EDA) is a method of uncovering important relationships between variables using graphs, plots, and tables. It is especially useful when you are working with a large, unfamiliar dataset: it lets you investigate interesting relationships between the variables and study different subsets of the data to uncover patterns.

In this blog post, we will discuss how to perform exploratory data analysis on a real-world dataset by creating awesome visualizations with matplotlib and seaborn.

Import Libraries

For data visualization, we will use these two libraries:

  • matplotlib – Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
  • seaborn – Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
#import the libraries
import matplotlib.pyplot as plt
import seaborn as sns
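#display graphs inline in the Jupyter notebook (the comment sits on its own line because a trailing comment can be parsed as an argument to the magic command)
%matplotlib inline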

DataSet

For this analysis, we will use the Zomato Bangalore Restaurants dataset available on Kaggle. The dataset contains details of the restaurants listed on the Zomato website as of 15th March 2019.

About Zomato

Zomato is an Indian restaurant search and discovery service founded in 2008 by Deepinder Goyal and Pankaj Chaddah. It currently operates in 24 countries. It provides information and reviews of restaurants, including images of menus where the restaurant does not have its own website, and also offers online delivery.

Source: Zomato

Data Context

The basic idea of analyzing the Zomato dataset is to get a fair idea about the factors affecting the establishment of different types of restaurants at different places in Bengaluru. This Zomato data aims at analyzing demography of the location. Most importantly it will help new restaurants in deciding their theme, menus, cuisine, cost, etc for a particular location. It also aims at finding similarity between neighborhoods of Bengaluru on the basis of food.

1. Load the Data

We will use pandas to read the dataset.

import pandas as pd
#load the data
zomato_data = pd.read_csv("../input/zomato.csv")
zomato_data.head() #looking at first five rows of the data

2. Basic Data Understanding

Let’s start with basic data understanding by checking the data types of the columns we are interested in working with.

#get the datatypes of the columns
zomato_data.dtypes

Only the variable votes is read as an integer; the remaining 16 columns are read as objects. Variables such as rate and approx_cost(for two people) therefore need to be converted to numeric types before we can perform any numerical analysis on them.

If you want to get the list of all the columns present in the dataset:

zomato_data.columns #get the list of all the columns

3. Data Cleaning & Data Manipulation

In this section, we will discuss some of the basic data cleaning techniques like checking for duplicate values & handling missing values. Apart from data cleaning, we will also discuss some of the manipulation techniques like changing the data type of the variables, dropping unwanted variables and renaming the columns for convenience.

#check for any duplicate values
zomato_data.duplicated().sum()

There are no duplicate values present in this dataset.

#check for missing values
pd.DataFrame(round(zomato_data.isnull().sum()/zomato_data.shape[0] * 100,3), columns = ["Missing"])
Missing Data in Percentages.
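
The same percentages can also be computed a little more compactly. Here is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not taken from the Zomato data); it relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import pandas as pd

# Toy stand-in for zomato_data; the values are made up for illustration only.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "dish_liked": [None, "Pasta", None, None],
    "votes": [10, 25, 0, 7],
})

# isnull() gives a boolean frame; the mean of booleans is the share of True values,
# so multiplying by 100 gives the percentage of missing entries per column.
missing_pct = (toy.isnull().mean() * 100).round(3).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns with the most missing data at the top, which makes it easier to decide what to drop.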

The variable dish_liked has more than 54% missing data. If we dropped every row with missing data, we would lose more than 50% of the dataset. To simplify the analysis, we will instead drop some columns that are not very useful, such as url, address and phone.

zomato_data.drop(["url", "address",  "phone"], axis = 1, inplace = True)

Renaming a few columns for convenience:

zomato_data.rename(columns={"approx_cost(for two people)": "cost_two", "listed_in(type)":"service_type", "listed_in(city)":"serve_to"}, inplace = True)

As we saw earlier, the variable cost_two has the data type object, which we need to convert to integer before we can analyze it.

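#converting the cost_two variable to int.
#rows with a missing cost are dropped first, since NaN has no replace() method and cannot be cast to int
zomato_data = zomato_data.dropna(subset=["cost_two"])
zomato_data.cost_two = zomato_data.cost_two.apply(lambda x: int(x.replace(',','')))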

To convert the variable to an integer we could normally just use astype('int'), but that does not work here because the numbers contain a thousands separator, e.g. 2,500. To avoid this problem, we use a lambda with the replace function to strip the comma (,) and then convert the result to an integer.
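
If you prefer a vectorized approach that also tolerates missing values, something along these lines may work. This is only a sketch on made-up samples: it assumes that raw costs look like "2,500" and that raw ratings look like "4.1/5", with placeholders such as "NEW" or "-" for unrated restaurants. Check the actual values in your copy of the dataset before relying on it.

```python
import pandas as pd

# Made-up samples; the raw formats ("2,500", "4.1/5", "NEW", "-") are assumptions.
raw_cost = pd.Series(["800", "2,500", None, "1,200"])
raw_rate = pd.Series(["4.1/5", "3.8 /5", "NEW", "-", None])

# Strip the thousands separator, then let to_numeric handle the cast.
# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
cost_two = pd.to_numeric(raw_cost.str.replace(",", "", regex=False), errors="coerce")

# Keep only the part before "/" and coerce placeholders such as "NEW" to NaN.
rate = pd.to_numeric(raw_rate.str.split("/").str[0].str.strip(), errors="coerce")

print(cost_two.tolist())
print(rate.tolist())
```

Note that the result is a float column rather than an integer one, because NaN cannot be stored in a plain integer column. That is usually fine for plotting and for computing correlations.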

4. Visualization

In this section, we will analyze the data by creating multiple visualizations using seaborn and matplotlib. The entire code discussed in the article is available in this Kaggle kernel.

a. Count Plot

Countplot is essentially the same as the barplot except that it shows the count of observations in each category bin using bars. In our dataset, let’s check the count of each rating category present.

#plot the count of rating.
plt.rcParams['figure.figsize'] = 14,7
sns.countplot(x=zomato_data["rate"], palette="Set1") #pass the column as x=; recent seaborn versions no longer accept it positionally
plt.title("Count plot of rate variable")
plt.show()

The rate variable follows a near-normal distribution with a mean of about 3.7. The ratings of most restaurants lie in the range 3.5-4.2. Very few restaurants (~350) are rated higher than 4.8.

b. Joint Plot

Jointplot allows us to compare the two different variables and see if there is any relationship between these two variables. By using the Joint plot we can do both bivariate and univariate analysis by plotting the scatterplot (bivariate) and distribution plot (univariate) of two different variables in a single plotting grid.

#joint plot for 'rate' and 'votes'
sns.jointplot(x = "rate", y = "votes", data = zomato_data, height=8, ratio=4, color="g")
plt.show()

From the scatter plot, we can infer that restaurants with higher ratings tend to have more votes. The distribution plot of the variable votes on the right side indicates that the majority of vote counts lie in the 1000-2500 bucket.
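
A visual impression like this can be backed up with a number. The sketch below computes two correlation coefficients between rating and votes on a made-up toy frame (the values are invented, so the output says nothing about the real data). It assumes rate has already been converted to a numeric column.

```python
import pandas as pd

# Made-up values for illustration; they are not taken from the Zomato dataset.
toy = pd.DataFrame({
    "rate":  [3.1, 3.6, 3.9, 4.2, 4.6, None],
    "votes": [12, 80, 150, 900, 2400, 0],
})

# Drop rows where either value is missing so that both columns line up.
pairs = toy.dropna(subset=["rate", "votes"])

# Pearson measures linear association between the two columns.
pearson = pairs["rate"].corr(pairs["votes"])

# Pearson computed on ranks is Spearman's rank correlation, which only asks
# whether votes increase whenever the rating increases (a monotonic relationship).
spearman = pairs["rate"].rank().corr(pairs["votes"].rank())

print(round(pearson, 3), round(spearman, 3))
```

Because vote counts are heavily skewed, the rank-based coefficient is often the more robust of the two for this kind of question.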

c. Bar Plot

Barplot is one of the most commonly used graphics for representing data. It represents data as rectangular bars, with the length of each bar proportional to the value of the variable. We will analyze the variable location and see in which areas of Bangalore most of the restaurants are located.

#analyze the number of restaurants in a location
zomato_data.location.value_counts().nlargest(10).plot(kind = "barh")
plt.title("Number of restaurants by location")
plt.xlabel("Count")
plt.show()

Most of the restaurants are located in the BTM Layout area, which is one of the most popular residential and commercial places in Bangalore.

d. Correlation Heatmap

Correlation describes how strongly a pair of variables are related to each other.

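#seaborn heatmap function to plot the correlation grid
#select_dtypes keeps only the numeric columns; recent pandas versions raise an error if corr() sees text columns
sns.heatmap(zomato_data.select_dtypes(include="number").corr(), annot = True, cmap = "viridis",linecolor='white',linewidths=1)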
plt.show()

The corr function calculates the Pearson correlation between the numeric variables. The coefficient takes a value between +1 and −1, where +1 is a total positive linear correlation, 0 is no linear correlation, and −1 is a total negative linear correlation.

  • Restaurants with an online order facility tend to have a lower average cost for two.
  • Restaurants that offer advance table booking tend to have a higher average cost.
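
For flags such as online_order and book_table to show up in a correlation grid, they need to be numeric first. Here is a minimal sketch of one way to do that, assuming the raw columns hold the strings "Yes" and "No" (an assumption worth checking against your data). The toy values are made up purely to show the mechanics.

```python
import pandas as pd

# Made-up rows for illustration; only the "Yes"/"No" encoding step matters here.
toy = pd.DataFrame({
    "online_order": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "book_table":   ["No", "No", "Yes", "Yes", "No", "Yes"],
    "cost_two":     [300, 400, 1500, 1200, 350, 1800],
})

# Map the text flags to 1/0 so that they count as numeric columns.
flag_map = {"Yes": 1, "No": 0}
for col in ["online_order", "book_table"]:
    toy[col] = toy[col].map(flag_map)

# Restrict to numeric columns before computing the Pearson correlation matrix.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

The resulting matrix can be passed straight to sns.heatmap. Any value other than "Yes" or "No" becomes NaN under map, so it is worth checking for unexpected values first.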

Further Analysis

In the previous section, we have seen how to perform basic data analysis by creating simple visualizations. Let’s do some further analysis based on the data context.

Restaurant Listed in

Let’s see in which areas most of the restaurants are listed or deliver to.

#restaurants serve to
zomato_data.serve_to.value_counts().nlargest(10).plot(kind = "barh")
plt.title("Number of restaurants listed in a particular location")
plt.xlabel("Count")
plt.show()

As expected, most of the restaurants are listed in (deliver to) BTM Layout, because this area is home to over 4750 restaurants. Even though Koramangala 7th Block doesn’t have many restaurants, it ranks second in the number of restaurants that deliver to this location.

Online Order

Analyzing the restaurants based on availability of online order facility

#count plot for online_order analysis
sns.countplot(x=zomato_data["online_order"], palette = "Set2") #pass the column as x=; recent seaborn versions no longer accept it positionally
plt.show()

More than 60% of the restaurants listed on Zomato provide an online order option; the remaining restaurants offer dine-in only.
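
A percentage like this can be read off directly instead of being estimated from the bars. Here is a small sketch on a made-up series (the five values are invented for illustration), assuming the column holds "Yes"/"No" strings.

```python
import pandas as pd

# Made-up flags for illustration; not the real distribution.
online_order = pd.Series(["Yes", "No", "Yes", "Yes", "No"], name="online_order")

# normalize=True returns shares instead of raw counts; multiply by 100 for percentages.
share = online_order.value_counts(normalize=True).mul(100).round(1)
print(share)
```

On the real data the same call would be zomato_data["online_order"].value_counts(normalize=True).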

Does the online order facility impact the rating of the restaurant?

sns.countplot(hue = zomato_data["online_order"], palette = "Set1", x = zomato_data["rate"])
plt.title("Distribution of restaurant rating over online order facility")
plt.show()

Restaurants that provide an online order facility have better ratings than traditional restaurants. This makes sense because many software employees live in Bangalore and they tend to order a lot of food online.
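
To complement the plot, the two groups can be compared numerically. The sketch below uses a made-up toy frame (the values are invented, so the numbers carry no meaning for the real data) and assumes rate is already numeric.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "online_order": ["Yes", "Yes", "No", "No", "Yes"],
    "rate": [4.0, 3.8, 3.5, None, 4.2],
})

# Group by the flag and summarise the rating; missing ratings are skipped automatically.
summary = toy.groupby("online_order")["rate"].agg(["count", "mean", "median"])
print(summary)
```

Looking at the count next to the mean helps you judge whether a difference between the groups rests on enough restaurants to be worth reporting.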

Biggest Restaurant Chain and Best Restaurant Chain

plt.rcParams['figure.figsize'] = 14,7
plt.subplot(1,2,1)
zomato_data.name.value_counts().head().plot(kind = "barh", color = sns.color_palette("hls", 5))
plt.xlabel("Number of restaurants")
plt.title("Biggest Restaurant Chain (Top 5)")

plt.subplot(1,2,2)
zomato_data[zomato_data['rate']>=4.5]['name'].value_counts().nlargest(5).plot(kind = "barh", color = sns.color_palette("Paired"))
plt.xlabel("Number of restaurants")
plt.title("Best Restaurant Chain (Top 5) - Rating More than 4.5")
plt.tight_layout()

The Cafe Coffee Day chain has over 90 cafes across the city that are listed on Zomato. On the other hand, Truffles, a burger chain, has the best fast food restaurants (rated 4.5 or higher out of 5): quality over quantity.

Next time you visit Bangalore, or if you want to check out a good restaurant over a weekend, don’t forget to try the food at Truffles, Hammered and Mainland China.

The code discussed in the article is available in this Kaggle kernel. Fork this kernel and try to create awesome visualizations on the same dataset or another dataset.


Conclusion

In this article, we have discussed how to use the matplotlib and seaborn APIs to create beautiful visualizations for exploring the relationships between variables. Apart from that, we learned about a few different types of plots that can be used to present your findings to the stakeholders in a project discussion. If you have any issues or doubts while implementing the above code, feel free to ask them in the comment section below or send me a message on LinkedIn citing this article.


Note: This is a guest post, and opinion in this article is of the guest writer. If you have any issues with any of the articles posted at www.marktechpost.com please contact at asif@marktechpost.com  
