Background

Chicago Restaurant Week (hereafter, CRW) is an annual event during which diners explore prix-fixe menus at restaurants throughout Chicago. This year's CRW ran from January 24 to February 9, 2020.

Although information about every restaurant can be found on the official website, an in-depth analysis of the restaurants that participated in this event is interesting in several ways. On a macro level, it helps us understand Chicago's food culture, shedding light on the geographic distribution of restaurants and its potential social implications. On a micro level, it enables customized recommendations for a user who wants to find a restaurant.

This project consists of 4 parts:

  1. The data: Crawl restaurants' data from the official website
  2. A bird's-eye view of the data: descriptive statistics and visualization
  3. Aspect 1: What's new in the menu?: Analyzing menu texts (TODO)
  4. Aspect 2: Which restaurant to go to? Price, cuisine, and recommendation (TODO)
In [1]:
# Load modules
import pandas as pd
import numpy as np
import re, os
import json
import matplotlib.pyplot as plt
import seaborn as sns

1. The data

The data were crawled from the official website using Python (code). Because the website was available only during CRW, the crawler no longer works (though it may work again in 2021!).

The data come from three sources: 1) the index page of CRW (code), 2) the detail page of each restaurant (code), and 3) Yelp API (code). Below is a detailed introduction to each data source and its role.

1) The index of CRW. From the index page of the official website we can get basic information about each restaurant, such as its name, address, cuisine style, and the URL of its detail page. In total, 444 restaurants participated in CRW.

Below are some examples:

In [2]:
df_Rindex = pd.read_csv("Rindex.txt", sep = "|")

print(df_Rindex.shape)
df_Rindex.head(n=5)
(444, 8)
Out[2]:
page pg_id name cuisines address url alt_option meal_option
0 1 1 1776 Restaurant ['American'] 397 W Virginia Street https://www.choosechicago.com/listing/1776-res... ['Gluten-free'] ['$36 Dinner']
1 1 2 20 East ['American'] 20 E. Delaware Pl. https://www.choosechicago.com/listing/20-east/ [] ['$36 Dinner']
2 1 3 312 Chicago ['Italian'] 136 N. LaSalle St. https://www.choosechicago.com/listing/312-chic... [] ['$36 Dinner']
3 1 4 676 Restaurant & Bar ['American'] 676 N. Michigan Ave. https://www.choosechicago.com/listing/676-rest... ['Vegetarian', 'Gluten-free'] ['$24 Lunch', '$36 Dinner', '$48 Dinner']
4 1 5 90th Meridian ['American Contemporary'] 231 S. LaSalle St. Ste. 108 https://www.choosechicago.com/listing/90th-mer... [] ['$24 Lunch']

2) The detail page of each restaurant. Each restaurant has a detail page, linked from the index page, that provides more information. Two columns (neighborhood 'neighbor' and description 'des') are incorporated into the index data frame; the rest are stored separately (under the "./Details/" directory) because they vary widely among restaurants.

Below are some examples:

In [3]:
df_Rdes = pd.read_csv("Rindex_w_des.txt", sep = "|")

print(df_Rdes.shape)
df_Rdes.head(n=5)
(444, 10)
Out[3]:
page pg_id name cuisines address url alt_option meal_option neighbor des
0 1 1 1776 Restaurant ['American'] 397 W Virginia Street https://www.choosechicago.com/listing/1776-res... ['Gluten-free'] ['$36 Dinner'] Northwest Suburbs We believe fine dining is more than the very b...
1 1 2 20 East ['American'] 20 E. Delaware Pl. https://www.choosechicago.com/listing/20-east/ [] ['$36 Dinner'] River North At 20 East, creative menus featuring quality-d...
2 1 3 312 Chicago ['Italian'] 136 N. LaSalle St. https://www.choosechicago.com/listing/312-chic... [] ['$36 Dinner'] Loop NaN
3 1 4 676 Restaurant & Bar ['American'] 676 N. Michigan Ave. https://www.choosechicago.com/listing/676-rest... ['Vegetarian', 'Gluten-free'] ['$24 Lunch', '$36 Dinner', '$48 Dinner'] The Magnificent Mile We are an American Contemporary restaurant wit...
4 1 5 90th Meridian ['American Contemporary'] 231 S. LaSalle St. Ste. 108 https://www.choosechicago.com/listing/90th-mer... [] ['$24 Lunch'] Loop Our menu has something for all diets and occas...

All of these restaurants provided neighborhood information, while 16 (4%) of them did not provide a description:

In [4]:
print(df_Rdes.loc[df_Rdes['neighbor'].isnull()].shape)
print(df_Rdes.loc[df_Rdes['des'].isnull()].shape)
(0, 10)
(16, 10)

As for the remaining detail information, restaurants differ widely in what they provide, ranging from 0 to 55 features. These data are stored but not analyzed because of their complexity.

In [5]:
ex1 = pd.read_csv("./Details/18.txt", sep = "|")
print(ex1.shape)
ex1.sample(n=5)
(49, 3)
Out[5]:
section feature value
10 Facility Meeting Space Details Max Capacity, Classroom Style 60.0
16 Facility Meeting Space Details On-Site Catering NaN
7 Facility Meeting Space Details Private Meeting Space Available NaN
11 Facility Meeting Space Details Max Capacity, Reception Style (Standing) 100.0
23 Facility Meeting Space Details Number of Meeting Rooms NaN
In [6]:
ex2 = pd.read_csv("./Details/307.txt", sep = "|")
print(ex2.shape)
ex2
(1, 3)
Out[6]:
section feature value
0 Certifications / Ratings Zagat Rated NaN
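
The spread in the number of features per restaurant could be checked with something like the sketch below. It assumes every file under "./Details/" is pipe-delimited with the section/feature/value header shown above; the helper name count_features is hypothetical and not part of the original crawler.

```python
from pathlib import Path

import pandas as pd


def count_features(details_dir):
    """Map each detail file's stem (e.g. "18") to its number of feature rows."""
    counts = {}
    for path in sorted(Path(details_dir).glob("*.txt")):
        try:
            counts[path.stem] = len(pd.read_csv(path, sep="|"))
        except pd.errors.EmptyDataError:
            # Assumption: a completely empty file means "no features".
            counts[path.stem] = 0
    return pd.Series(counts, dtype="int64")
```

Calling count_features("./Details/").describe() would then summarize the minimum, maximum, and typical number of features per restaurant.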

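3) Yelp API. The goal is to get more information about the restaurants, especially their everyday (non-CRW) information such as price level, rating, and review count. If a restaurant is not found on Yelp, its Yelp fields are filled with placeholder values ("unknown" for text fields, -1 for rating and review count, and 0 for both coordinates).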
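
A minimal sketch of the fill-in step for restaurants that are not found. The placeholder values are the ones visible in the final dataset; the helper to_record and the assumption that a Yelp result arrives as a dict with name, location, price, rating, coordinates, and review_count keys are illustrative, not taken from the original crawler.

```python
import copy

# Placeholder values as they appear in the final dataset for an unmatched restaurant.
PLACEHOLDER = {
    "Yname": "unknown",
    "Yaddress": "unknown",
    "price": "unknown",
    "rating": -1.0,
    "coordinates": {"latitude": 0, "longitude": 0},
    "review_count": -1.0,
}


def to_record(business):
    """Return the six Yelp columns for one restaurant (hypothetical helper)."""
    if business is None:  # nothing came back from Yelp
        return copy.deepcopy(PLACEHOLDER)
    return {
        "Yname": business.get("name", "unknown"),
        "Yaddress": business.get("location", "unknown"),
        "price": business.get("price", "unknown"),  # some listings have no price level
        "rating": business.get("rating", -1.0),
        "coordinates": business.get("coordinates", {"latitude": 0, "longitude": 0}),
        "review_count": business.get("review_count", -1.0),
    }
```

Using sentinel values rather than NaN keeps every row in the JSON file complete, at the cost of having to filter the sentinels out before any numeric analysis.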

In [7]:
with open("yelp_best_match_json.txt") as rf1:
    dic = json.load(rf1)
    
df_yelp = pd.DataFrame.from_dict(dic, orient = 'index')
print(df_yelp.shape)
df_yelp.head(n=5)
(444, 6)
Out[7]:
Yname Yaddress price rating coordinates review_count
0 1776 Restaurant {'address1': '397 W Virginia St', 'address2': ... $$$ 4.0 {'latitude': 42.233316, 'longitude': -88.335639} 144
1 20 East {'address1': '20 E Delaware Pl', 'address2': '... $$ 4.0 {'latitude': 41.8994, 'longitude': -87.6275} 63
10 Ada Street {'address1': '1664 N Ada St', 'address2': '', ... $$$ 4.0 {'latitude': 41.9124416, 'longitude': -87.6620... 515
100 Coda Di Volpe {'address1': '3335 N Southport Ave', 'address2... $$ 4.0 {'latitude': 41.94265, 'longitude': -87.66354} 276
101 Cold Storage {'address1': '1000 W Fulton Market', 'address2... $$ 4.0 {'latitude': 41.88724, 'longitude': -87.65277} 181

Combining these three sources of data, we get the final dataset.
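
The combination step can be sketched as a left join on each restaurant's row position. This is an illustration on two toy rows, assuming (as the 'ind' column in the final data suggests) that the Yelp records are keyed by the restaurant's position in the index table; the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the index table with neighborhood added.
df_site = pd.DataFrame(
    {
        "name": ["1776 Restaurant", "20 East"],
        "neighbor": ["Northwest Suburbs", "River North"],
    }
)
df_site["ind"] = df_site.index  # position of each restaurant in the index

# Toy stand-in for the Yelp JSON file; JSON object keys are always strings.
df_yelp_toy = pd.DataFrame.from_dict(
    {
        "0": {"Yname": "1776 Restaurant", "rating": 4.0},
        "1": {"Yname": "20 East", "rating": 4.0},
    },
    orient="index",
)
df_yelp_toy.index = df_yelp_toy.index.astype(int)  # align key type with 'ind'

# A left join keeps every CRW restaurant, matched on Yelp or not.
combined = df_site.join(df_yelp_toy, on="ind", how="left")
print(combined.shape)
```

Converting the string keys to integers before joining matters: without it, no row would align and every Yelp column would come back as NaN.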

In [8]:
with open("final_data.txt") as rf1:
    dic = json.load(rf1)

df = pd.DataFrame.from_dict(dic, orient = 'index')
print(df.shape)
df.sample(n=5)
(444, 20)
Out[8]:
page pg_id name cuisines address url alt_option meal_option neighbor des ind Yname Yaddress price rating coordinates review_count ratio_name Yaddress1 ratio_address
43 1 44 Berghoff Restaurant ['American', 'German'] 17 W. Adams St. https://www.choosechicago.com/listing/berghoff... ['Gluten-free'] ['$24 Lunch', '$36 Dinner'] Loop unknown 43 The Berghoff Restaurant {'address1': '17 W Adams St', 'address2': '', ... $$ 3.5 {'latitude': 41.8793326, 'longitude': -87.6286... 1047.0 90 17 W Adams St 93
428 9 29 Vivere ['Italian'] 71 W. Monroe St. https://www.choosechicago.com/listing/vivere/ [] ['$24 Lunch', '$48 Dinner'] Loop This is where we get very contemporary at the ... 428 Vivere {'address1': '71 W Monroe St', 'address2': Non... $$$ 3.5 {'latitude': 41.8803520202637, 'longitude': -8... 142.0 100 71 W Monroe St 93
425 9 26 Victory Tap Chicago ['Italian'] 1416 S. Michigan Ave. https://www.choosechicago.com/listing/victory-... ['Vegetarian'] ['$36 Dinner', '$48 Dinner'] South Loop A sophisticated sibling to the larger Armand’s... 425 Victory Tap Chicago {'address1': '1416 S Michigan Ave', 'address2'... $$ 4.0 {'latitude': 41.8635, 'longitude': -87.62454} 270.0 100 1416 S Michigan Ave 95
364 8 15 Taste 222 ['American'] 222 N. Canal St. https://www.choosechicago.com/listing/taste-222/ ['Vegetarian', 'Gluten-free'] ['$24 Lunch', '$48 Dinner'] West Loop Taste 222 is an intimate, upscale-chic restaur... 364 Taste 222 {'address1': '222 N Canal St', 'address2': '',... $$ 4.0 {'latitude': 41.886363, 'longitude': -87.64001} 85.0 100 222 N Canal St 93
218 5 19 McCormick & Schmick’s Seafood & Steaks – Rosemont ['Seafood'] 5320 N. River Rd. https://www.choosechicago.com/listing/mccormic... [] ['$24 Lunch', '$48 Dinner'] Northwest Suburbs McCormick and Schmicks is the Nation’s premier... 218 McCormick & Schmick's Seafood & Steaks {'address1': '5320 N River Rd', 'address2': ''... $$$ 3.0 {'latitude': 41.973998, 'longitude': -87.862742} 296.0 85 5320 N River Rd 94

A "ratio_name" value and a "ratio_address" value are calculated to show how closely the restaurant's name and address on the official website match those returned by Yelp. Both values range from 0 to 100, and higher values indicate a better match. We consider a match failed only when both values are below 50.
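
One way to produce such 0 to 100 scores is sketched below with the standard library's difflib. This is an assumption for illustration: the original crawler may have used a different string-similarity library, so the exact numbers need not agree with the ones in the dataset. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def match_failed(ratio_name, ratio_address, threshold=50):
    """A match fails only when BOTH scores fall below the threshold."""
    return ratio_name < threshold and ratio_address < threshold


# Example: website address vs. Yelp address for the same restaurant.
score = ratio("17 W. Adams St.", "17 W Adams St")
print(score, match_failed(score, score))
```

Requiring both scores to be low is deliberately lenient: a restaurant whose Yelp name differs (for example, a chain with a location suffix) still counts as matched as long as the address agrees.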

Two restaurants failed to find a match on Yelp, and we exclude them from analyses that rely on Yelp data (e.g. those involving price and rating).

In [9]:
df.loc[(df['ratio_address'] < 50) & (df['ratio_name'] < 50)]
Out[9]:
page pg_id name cuisines address url alt_option meal_option neighbor des ind Yname Yaddress price rating coordinates review_count ratio_name Yaddress1 ratio_address
139 3 40 Fogo de Chão ['Brazilian'] 1824 Abriter Ct. Ste. K200 https://www.choosechicago.com/listing/fogo-de-... [] ['$48 Dinner'] Southwest Suburbs TBD 139 Fogo de Chao Brazilian Steakhouse {'address1': '661 N Lasalle Blvd', 'address2':... $$$ 4.0 {'latitude': 41.89418, 'longitude': -87.63251} 1622.0 49 661 N Lasalle Blvd 27
207 5 8 Macello Cucina di Puglia ['Italian'] 1235 W. Lake St. https://www.choosechicago.com/listing/macello-... ['Vegetarian', 'Gluten-free'] ['$24 Lunch', '$48 Dinner'] West Loop unknown 207 unknown unknown unknown -1.0 {'latitude': 0, 'longitude': 0} -1.0 13 unknown 9
In [10]:
# The distribution of ratio_name and ratio_address
grid = sns.JointGrid(x=df['ratio_name'], y=df['ratio_address'])
grid.plot_joint(sns.scatterplot, color="g")
grid.plot_marginals(sns.rugplot, height=1, color="g")
Out[10]:
<seaborn.axisgrid.JointGrid at 0x10fe12d30>

2. A bird's-eye view of the data: descriptive statistics and visualization

Where are the restaurants?: Geographic distribution

In [11]:
df[["latitude","longitude"]] = df["coordinates"].apply(pd.Series)

df_in_Yelp = df.loc[(df['ratio_address'] >= 50) | (df['ratio_name'] >= 50)]

# Visualize
g = sns.scatterplot(x = "longitude",
                y = "latitude",
                hue = 'neighbor',
                marker = 'o',
                data = df_in_Yelp)
g.legend_.remove()
In [12]:
from sklearn.covariance import EllipticEnvelope

X = df_in_Yelp[["longitude","latitude"]].values
clf = EllipticEnvelope(contamination = 0.1)
clf.fit(X)
y_pred = clf.predict(X)

# scatter
colors = np.array(['#377eb8', '#ff7f00'])
plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])

# contour
xx, yy = np.meshgrid(np.linspace(-88.4, -87.5, 150),
                     np.linspace(41.7, 42.3, 150))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='black')
Out[12]:
<matplotlib.contour.QuadContourSet at 0x113734fd0>
In [13]:
# zoom in
df_downtown = df_in_Yelp.loc[y_pred == 1]
g = sns.scatterplot(x = "longitude",
                y = "latitude",
                hue = 'neighbor',
                marker = 'o',
                data = df_downtown)

g.legend_.remove()
In [14]:
# Which neighborhoods are excluded?
dfm1 = df_downtown.groupby(['neighbor'])['ind'].count().reset_index().sort_values(by='ind',ascending=False).rename(columns = {'ind':'count_dt'})   
dfm2 = df_in_Yelp.groupby(['neighbor'])['ind'].count().reset_index().sort_values(by='ind',ascending=False).rename(columns = {'ind':'count_yelp'})   
dfm_diff = pd.merge(dfm2, dfm1, how = 'left').fillna(0)
dfm_diff['diff'] = dfm_diff['count_yelp'] - dfm_diff['count_dt'] 
dfm_diff.loc[dfm_diff['diff'] > 0]
Out[14]:
neighbor count_yelp count_dt diff
7 Northwest Suburbs 17 0.0 17.0
11 West Suburbs 15 2.0 13.0
13 North Suburbs 13 3.0 10.0
16 O'Hare 4 0.0 4.0
17 Southwest Suburbs 4 3.0 1.0
In [15]:
dfm = df_downtown.groupby(['neighbor'])['ind'].count().reset_index().sort_values(by='ind',ascending=False).rename(columns = {'ind':'count'})   
dfm 
Out[15]:
neighbor count
19 River North 93
28 West Loop 46
13 Loop 45
25 Streeterville 27
9 Lincoln Park 24
5 Gold Coast 22
8 Lakeview 20
31 Wicker Park / Bucktown 16
30 West Town 16
22 South Loop 16
26 The Magnificent Mile 14
12 Logan Square 12
11 Little Italy / University Village 4
15 Old Town 3
24 Southwest Suburbs 3
32 Wrigleyville 3
14 North Suburbs 3
10 Lincoln Square 3
7 Irving Park 3
21 Roscoe Village 3
27 Uptown 2
29 West Suburbs 2
16 Pilsen 2
1 Andersonville 2
6 Hyde Park 2
4 Edgewater 2
3 Bridgeport 2
2 Avondale 2
23 South Suburbs 1
20 Rogers Park 1
18 Ravenswood 1
17 Portage Park 1
0 Albany Park 1
In [16]:
fig, ax = plt.subplots(figsize=(9,5))
ax.plot(dfm['neighbor'], dfm['count'], 'o-', color = '#6A5ACD')
plt.xlabel("Neighborhood", fontsize = 14, fontweight="bold")
plt.ylabel("count", fontsize = 14, fontweight="bold")
plt.xticks(fontsize=12, rotation = 90)
plt.yticks(fontsize=12)
plt.title("# Restaurants by neighborhood", fontsize = 14, fontweight="bold")
plt.grid(True)

# Label each point with its count, placed slightly above the marker
for nb, cnt in zip(dfm['neighbor'], dfm['count']):
    plt.annotate(str(cnt), (nb, cnt * 1.02), fontsize = 10)

Three neighborhoods stand out for the number of restaurants participating in CRW: River North (93), West Loop (46), and Loop (45). Together they account for 184 of the 397 restaurants in the zoomed-in set (about 46%).

Price

In [17]:
print(df_in_Yelp.groupby(['price'])['name'].count().reset_index().sort_values(by='name',ascending=False))
     price  name
0       $$   240
1      $$$   142
3  unknown    42
2     $$$$    18

I care about restaurants that are usually expensive. There are (at least) 18 such restaurants, those with a Yelp price level of $$$$. Let's see which ones they are.

In [18]:
df_in_Yelp.loc[df_in_Yelp['price'] == "$$$$", ['name', 'cuisines', 'neighbor', 'rating', 'review_count', 'meal_option']].sort_values(by = ['rating', 'review_count'], ascending=False)
Out[18]:
name cuisines neighbor rating review_count meal_option
317 RPM Steak ['Steakhouse'] River North 4.5 1407.0 ['$24 Lunch', '$48 Dinner']
288 Prime & Provisions ['Steakhouse'] Loop 4.5 1106.0 ['$24 Brunch', '$24 Lunch', '$48 Dinner']
376 The Capital Grille – Rosemont ['American', 'Steakhouse'] O'Hare 4.5 452.0 ['$24 Lunch', '$36 Dinner']
426 Vie ['American Contemporary'] West Suburbs 4.5 305.0 ['$48 Dinner']
215 Mastro’s Steakhouse ['Steakhouse'] River North 4.0 1252.0 ['$48 Dinner']
250 NoMI Kitchen ['French'] The Magnificent Mile 4.0 727.0 ['$24 Lunch', '$36 Dinner', '$48 Dinner']
236 Morton’s The Steakhouse – Chicago (The Original) ['Steakhouse'] Gold Coast 4.0 300.0 ['$48 Dinner']
179 Katana ['Asian Fusion'] River North 4.0 297.0 ['$24 Lunch', '$48 Dinner']
158 GT Prime ['Steakhouse'] River North 4.0 273.0 ['$48 Dinner']
255 Odyssey Lake Michigan ['American'] Streeterville 4.0 265.0 ['$48 Dinner']
42 Bellemore ['American Contemporary'] West Town 4.0 227.0 ['$24 Lunch', '$48 Dinner']
239 Morton’s The Steakhouse – Rosemont ['Steakhouse'] Northwest Suburbs 4.0 195.0 ['$48 Dinner']
240 Morton’s The Steakhouse – Schaumburg ['Steakhouse'] Northwest Suburbs 4.0 164.0 ['$48 Dinner']
61 Brindille ['French'] River North 4.0 156.0 ['$48 Dinner']
412 Topolobampo ['Mexican'] River North 3.5 997.0 ['$24 Lunch']
84 Chicago Chop House ['Steakhouse'] River North 3.5 745.0 ['$48 Dinner']
253 Oceanique Restaurant ['Seafood'] North Suburbs 3.5 236.0 ['$48 Dinner']
238 Morton’s The Steakhouse – Northbrook ['Steakhouse'] North Suburbs 3.5 132.0 ['$48 Dinner']

This gives a clear idea of which restaurants to go to if we care only about price.

Cuisines

Since a restaurant may have more than one cuisine style, we expand the cuisines column into one row per (restaurant, cuisine) pair.

In [19]:
df_cuisine = pd.concat([pd.Series(row['ind'], row['cuisines'].lstrip('[').rstrip("]").replace(" ","").split(","))              
                    for _, row in df_in_Yelp.iterrows()]).reset_index()
df_cuisine.columns = ["cuisines","ind"]
df_cuisine = df_cuisine.drop_duplicates()
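
The cell above splits the stored string by hand, which also removes the spaces inside multi-word cuisine names and keeps the surrounding quotes. An alternative sketch, assuming the cuisines column always holds a valid Python list literal (as in the rows shown earlier), parses the string with ast.literal_eval and uses DataFrame.explode; the toy frame and the name df_long are illustrative only.

```python
import ast

import pandas as pd

# Two rows copied from the data shown earlier (list stored as a string).
toy = pd.DataFrame(
    {
        "ind": [43, 4],
        "cuisines": ["['American', 'German']", "['American Contemporary']"],
    }
)

df_long = (
    toy.assign(cuisines=toy["cuisines"].apply(ast.literal_eval))  # str -> list
    .explode("cuisines")  # one row per (restaurant, cuisine) pair
    .drop_duplicates()
    .reset_index(drop=True)
)
print(df_long)
```

With this approach 'American Contemporary' keeps its space and loses the quotes, so the labels would differ from those in the table below even though the counts should be the same.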
In [20]:
print(df_cuisine.shape)
df_cuisine.groupby(['cuisines'])['ind'].count().reset_index().sort_values(by='ind',ascending=False)    
(470, 2)
Out[20]:
cuisines ind
1 'American' 94
22 'Italian' 64
2 'AmericanContemporary' 61
36 'Steakhouse' 47
32 'Seafood' 28
27 'Mexican' 24
15 'French' 21
23 'Japanese/Sushi' 16
16 'GastroTavern/Pub' 13
26 'Mediterranean' 9
25 'Latin' 9
4 'BBQ/Ribs' 8
31 'Pizza' 8
35 'Spanish/Tapas' 7
3 'AsianFusion' 6
19 'Indian' 4
39 'WineBar' 4
9 'Chinese' 4
17 'German' 3
18 'Greek' 3
34 'Southern' 3
11 'Eclectic' 3
7 'Brazilian' 3
38 'Vietnamese' 3
5 'Bakery/Café/Deli/Diner' 3
29 'PanAsian' 2
37 'Vegetarian/Vegan' 2
33 'SoulFood' 2
20 'InternationalFusion' 2
21 'Irish' 2
14 'FoodHall' 2
12 'Filipino' 2
28 'MiddleEastern' 1
30 'Peruvian' 1
24 'Korean' 1
13 'Fondue' 1
10 'DessertBar' 1
8 'Cajun/Creole' 1
6 'Bistro' 1
0 'African' 1

American cuisines dominate: 'American' (94) and 'American Contemporary' (61) are two of the three most common styles!

Ratings

Most restaurants have a rating around 4.0 (240 of 442), and only one has a rating as low as 2.0.

In [21]:
print(df_in_Yelp.groupby(['rating'])['ind'].count().reset_index().sort_values(by='rating',ascending=False))
   rating  ind
6     5.0    1
5     4.5   65
4     4.0  240
3     3.5  105
2     3.0   25
1     2.5    5
0     2.0    1
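
As a quick check on the counts printed above, the share of restaurants rated 4.0 or higher and the average rating can be worked out directly. This sketch hard-codes the printed counts rather than reading df_in_Yelp, so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Rating -> number of restaurants, copied from the printed table.
counts = pd.Series({5.0: 1, 4.5: 65, 4.0: 240, 3.5: 105, 3.0: 25, 2.5: 5, 2.0: 1})

total = counts.sum()
share_at_least_4 = counts[counts.index >= 4.0].sum() / total
# Count-weighted average of the rating values.
mean_rating = (counts.index.to_numpy() * counts.to_numpy()).sum() / total

print(total, round(share_at_least_4, 3), round(mean_rating, 2))
```

The total equals the 444 restaurants minus the two without a Yelp match, which is a useful sanity check that the filter on the ratio columns was applied.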
In [22]:
df_in_Yelp.groupby(['rating'])['ind'].count().reset_index().plot.bar(x='rating', y='ind', rot=0)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x113a7aac8>
In [ ]:
!jupyter nbconvert --execute --to html CRWnotebook.ipynb