Chicago Restaurant Week (hereafter, CRW) is an annual event during which diners explore prix-fixe menus at restaurants throughout Chicago. This year's CRW ran from January 24 to February 9, 2020.
Although information on all participating restaurants can be found on the official website, an in-depth data analysis of the restaurants that participated in this event is interesting in many ways. On a macro level, it helps us understand Chicago's food culture, shedding light on the restaurants' geographical distribution and its potential social implications. On a micro level, it enables customized recommendations for a specific user who wants to find a restaurant.
This project consists of four parts:
# Load modules
import pandas as pd
import numpy as np
import re
import os
import json
import matplotlib.pyplot as plt
import seaborn as sns
The data were crawled from the website using Python (code). Since the website was available only during CRW, the crawling code no longer works (but maybe it will work again in 2021!).
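For reference, the general approach is sketched below, assuming the index page served static HTML. The URL and CSS selectors here are illustrative placeholders, not the ones used in the actual (linked) crawling code.
import requests
from bs4 import BeautifulSoup

# Hypothetical URL and selectors -- the real ones only worked while the
# CRW site was online.
INDEX_URL = "https://www.choosechicago.com/chicago-restaurant-week/"

resp = requests.get(INDEX_URL, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for card in soup.select("div.restaurant-card"):  # selector is an assumption
    rows.append({
        "name": card.select_one("h3").get_text(strip=True),
        "url": card.select_one("a")["href"],
    })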
The data come from three sources: 1) the index page of CRW (code), 2) the detail page of each restaurant (code), and 3) the Yelp API (code). Below is a detailed introduction to each data source and its role.
1) The index page of CRW. From the index page of the official website we can get basic information about each restaurant, such as its name, address, cuisine style, and the URL of its detail page. In total, 444 restaurants participated in CRW.
Below are some examples:
df_Rindex = pd.read_csv("Rindex.txt", sep = "|")
print(df_Rindex.shape)
df_Rindex.head(n=5)
2) The detail page of each restaurant. Each restaurant has a detail page, linked from the index page, which provides more detailed information. Two columns (neighborhood 'neighbor' and description 'des') are incorporated into the index data frame, and the rest are stored separately (under the "./Details/" directory) because they vary widely across restaurants.
Below are some examples:
df_Rdes = pd.read_csv("Rindex_w_des.txt", sep = "|")
print(df_Rdes.shape)
df_Rdes.head(n=5)
All of these restaurants provided neighborhood information, while 16 (4%) did not provide a description:
print(df_Rdes.loc[df_Rdes['neighbor'].isnull()].shape)
print(df_Rdes.loc[df_Rdes['des'].isnull()].shape)
As for the detailed information, different restaurants provided very different sets of attributes, ranging from 0 to 55 features. This piece of data is stored but not analyzed further due to its complexity.
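The 0-to-55 range can be checked by tallying the rows of each detail file. A quick sketch, assuming every file under "./Details/" uses the same pipe-separated format with one row per feature:
# Count how many detail features are stored for each restaurant
n_features = {}
for fname in os.listdir("./Details/"):
    path = os.path.join("./Details/", fname)
    try:
        n_features[fname] = pd.read_csv(path, sep="|").shape[0]
    except pd.errors.EmptyDataError:  # restaurants with no features at all
        n_features[fname] = 0
print(min(n_features.values()), max(n_features.values()))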
ex1 = pd.read_csv("./Details/18.txt", sep = "|")
print(ex1.shape)
ex1.sample(n=5)
ex2 = pd.read_csv("./Details/307.txt", sep = "|")
print(ex2.shape)
ex2
3) The Yelp API. The goal is to get more information about the restaurants, especially their everyday, non-CRW profile (e.g. price level, rating, and review count). If a restaurant is not found on Yelp, its fields are filled with a fixed set of placeholder values.
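The actual queries are in the linked code; below is a minimal sketch of one lookup, assuming the Yelp Fusion business-match endpoint was used (the API key and the city/state defaults are placeholders).
import requests

API_KEY = "YOUR_YELP_API_KEY"  # placeholder
MATCH_URL = "https://api.yelp.com/v3/businesses/matches"

def yelp_best_match(name, address):
    """Return Yelp's best business match for a name/address pair (sketch)."""
    params = {"name": name, "address1": address,
              "city": "Chicago", "state": "IL", "country": "US"}
    headers = {"Authorization": f"Bearer {API_KEY}"}
    resp = requests.get(MATCH_URL, params=params, headers=headers, timeout=30)
    resp.raise_for_status()
    businesses = resp.json().get("businesses", [])
    return businesses[0] if businesses else None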
with open("yelp_best_match_json.txt") as rf1:
dic = json.load(rf1)
df_yelp = pd.DataFrame.from_dict(dic, orient = 'index')
print(df_yelp.shape)
df_yelp.head(n=5)
Combining these three sources of data, we get the final dataset.
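Conceptually, this is a left join of the CRW index/detail frame with the Yelp frame on a shared restaurant key. A sketch, assuming 'ind' (the index column used throughout the analysis below) is that key; the saved result is loaded right after.
# Sketch: JSON object keys load as strings, hence the cast before joining
df_yelp.index = df_yelp.index.astype(int)
df_final = df_Rdes.merge(df_yelp, left_on="ind", right_index=True, how="left")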
with open("final_data.txt") as rf1:
dic = json.load(rf1)
df = pd.DataFrame.from_dict(dic, orient = 'index')
print(df.shape)
df.sample(n=5)
A "ratio_name" value and a "ratio_address" value are calculated to show to what extent the restaurant's information from the official website matches that from Yelp. These values range from 0 to 100, and higher values indicate better match. We consider a match as failed only when both values are below 50.
Two restaurants failed to find a match on Yelp; we exclude them from the analyses that rely on Yelp fields (e.g. those involving price and rating).
df.loc[(df['ratio_address'] < 50) & (df['ratio_name'] < 50)]
# The distribution of ratio_name and ratio_address
grid = sns.JointGrid(x=df['ratio_name'], y=df['ratio_address'])
grid.plot_joint(sns.scatterplot, color="g")
grid.plot_marginals(sns.rugplot, height=1, color="g")
df[["latitude","longitude"]] = df["coordinates"].apply(pd.Series)
df_in_Yelp = df.loc[(df['ratio_address'] >= 50) | (df['ratio_name'] >= 50)]
# Visualize restaurant locations, colored by neighborhood
g = sns.scatterplot(x="longitude",
                    y="latitude",
                    hue='neighbor',
                    marker='o',
                    data=df_in_Yelp)
g.legend_.remove()
from sklearn.covariance import EllipticEnvelope

# Fit a robust Gaussian envelope to the coordinates; points outside it
# (about 10%, per the contamination setting) are flagged as outliers
X = df_in_Yelp[["longitude", "latitude"]].values
clf = EllipticEnvelope(contamination=0.1)
clf.fit(X)
y_pred = clf.predict(X)  # 1 = inlier, -1 = outlier

# Scatter: outliers in blue, inliers in orange
colors = np.array(['#377eb8', '#ff7f00'])
plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])

# Contour: the decision boundary of the fitted envelope
xx, yy = np.meshgrid(np.linspace(-88.4, -87.5, 150),
                     np.linspace(41.7, 42.3, 150))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='black')
# Zoom in on the inliers, i.e. the downtown cluster
df_downtown = df_in_Yelp.loc[y_pred == 1]
g = sns.scatterplot(x="longitude",
                    y="latitude",
                    hue='neighbor',
                    marker='o',
                    data=df_downtown)
g.legend_.remove()
# Which neighborhoods are excluded when we zoom in on downtown?
dfm1 = (df_downtown.groupby(['neighbor'])['ind'].count()
        .reset_index()
        .sort_values(by='ind', ascending=False)
        .rename(columns={'ind': 'count_dt'}))
dfm2 = (df_in_Yelp.groupby(['neighbor'])['ind'].count()
        .reset_index()
        .sort_values(by='ind', ascending=False)
        .rename(columns={'ind': 'count_yelp'}))
dfm_diff = pd.merge(dfm2, dfm1, how='left').fillna(0)
dfm_diff['diff'] = dfm_diff['count_yelp'] - dfm_diff['count_dt']
dfm_diff.loc[dfm_diff['diff'] > 0]
dfm = (df_downtown.groupby(['neighbor'])['ind'].count().reset_index()
       .sort_values(by='ind', ascending=False)
       .rename(columns={'ind': 'count'}))
dfm
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(dfm['neighbor'], dfm['count'], 'o-', color='#6A5ACD')
plt.xlabel("Neighborhood", fontsize=14, fontweight="bold")
plt.ylabel("Count", fontsize=14, fontweight="bold")
plt.xticks(fontsize=12, rotation=90)
plt.yticks(fontsize=12)
plt.title("# Restaurants by neighborhood", fontsize=14, fontweight="bold")
plt.grid(True)
# Annotate each point with its count
for i in range(dfm.shape[0]):
    plt.annotate(str(dfm.loc[i]['count']),
                 (dfm.loc[i]['neighbor'], dfm.loc[i]['count'] * 1.02),
                 fontsize=10)
Three neighborhoods have many restaurants participating in CRW: River North (93), West Loop (46), and the Loop (45).
print(df_in_Yelp.groupby(['price'])['name'].count().reset_index().sort_values(by='name',ascending=False))
I care about restaurants that are usually expensive. There are (at least) 18 such restaurants. Let's check and see what these restaurants are.
(df_in_Yelp.loc[df_in_Yelp['price'] == "$$$$",
                ['name', 'cuisines', 'neighbor', 'rating', 'review_count', 'meal_option']]
 .sort_values(by=['rating', 'review_count'], ascending=False))
This gives us a clear idea of which restaurants to go to if we care only about price.
Since a restaurant may have more than one cuisine style, we spread the 'cuisines' column into one row per (restaurant, cuisine) pair.
df_cuisine = pd.concat(
    [pd.Series(row['ind'],
               row['cuisines'].lstrip('[').rstrip(']').replace(" ", "").split(","))
     for _, row in df_in_Yelp.iterrows()]).reset_index()
df_cuisine.columns = ["cuisines", "ind"]
df_cuisine = df_cuisine.drop_duplicates()
print(df_cuisine.shape)
df_cuisine.groupby(['cuisines'])['ind'].count().reset_index().sort_values(by='ind',ascending=False)
American cuisines dominate the restaurants!
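As an aside, the same spreading can be written more concisely with pandas' explode (available since pandas 0.25). This sketch assumes the 'cuisines' strings always look like the bracketed lists above:
# Alternative to the pd.concat approach: split, then explode one cuisine per row
df_cuisine_alt = (
    df_in_Yelp.assign(cuisines=df_in_Yelp["cuisines"]
                      .str.strip("[]")
                      .str.replace(" ", "")
                      .str.split(","))
    .explode("cuisines")[["cuisines", "ind"]]
    .drop_duplicates())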
Most restaurants have a rating around 4, with a few restaurants having ratings as low as 2.0.
print(df_in_Yelp.groupby(['rating'])['ind'].count().reset_index().sort_values(by='rating',ascending=False))
df_in_Yelp.groupby(['rating'])['ind'].count().reset_index().plot.bar(x='rating', y='ind', rot=0)
!jupyter nbconvert --execute --to html CRWnotebook.ipynb