
Data Collection:

Social media platforms now let consumers share their experiences and opinions publicly, and online reviews have become one of the most influential factors in restaurant selection. Yelp, a popular restaurant review site, is the primary source of data about restaurants in large cities. In addition, location-based information about each restaurant is web scraped from City-Data and retrieved from the Google Maps Places API.

Part 1: Yelp API Data

The Yelp Fusion API provides access to Yelp data containing detailed business information and user reviews for many types of businesses, such as restaurants and hotels. Using the Business Search option in the Yelp Fusion API, data on about 5,500 restaurants is collected from 7 large US cities and metropolitan areas: the New York City area, Washington DC, Boston, Seattle, Houston, San Francisco, and Los Angeles. The data is collected in batches by neighborhood, covering over 700 neighborhoods across those cities. The batches contain many duplicates, which are removed during the data collection phase by keeping only unique Yelp Business IDs. Table 0-1 shows the distribution of the data across cities, and Figure 0-1 shows a snapshot of the data returned by the Yelp API.

Table 0-1: Distribution of Restaurant Data in Big Cities
Figure 0-1: Yelp API Data at a Glance
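
The de-duplication step described above can be sketched as follows. This is a minimal illustration rather than the project's actual collection code: the batch contents are made up, and the `id` key is assumed to hold the Yelp Business ID of each record.

```python
def dedupe_by_business_id(batches):
    """Merge neighborhood batches, keeping the first record seen per business ID."""
    unique = {}
    for batch in batches:
        for business in batch:
            # setdefault keeps the first occurrence and ignores later duplicates
            unique.setdefault(business["id"], business)
    return list(unique.values())


# Hypothetical batches for two neighboring areas that overlap on one restaurant
batch_a = [{"id": "abc", "name": "Cafe One"}, {"id": "def", "name": "Diner Two"}]
batch_b = [{"id": "def", "name": "Diner Two"}, {"id": "ghi", "name": "Bistro Three"}]

restaurants = dedupe_by_business_id([batch_a, batch_b])
print(len(restaurants))
```

Keying a dictionary on the business ID keeps the merge linear in the number of records, which matters when hundreds of neighborhood batches overlap.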

Part 2: Yelp Web Data

The Yelp API data gives a general overview of each restaurant, but it does not describe details such as service or environment. Additional information about the restaurants (such as payment methods and parking availability) is web scraped from the Yelp website. Figure 0-2 shows a snapshot of the Yelp website and the list of information available.

Figure 0-2: Yelp Business Details

From: https://www.yelp.com/biz/il-canale-washington-2?osq=Restaurants

The scraping program uses an HTML parser to extract information from the Yelp page of every restaurant retrieved from the Yelp API. Table 0-2 summarizes the features available from the scraping program.

Table 0-2: Additional Information Gathered from Web
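
As a rough illustration of the parsing step, the sketch below pulls label/value pairs out of a page fragment with Python's standard `html.parser` module. The markup in `SAMPLE_HTML` is hypothetical and does not reflect the real structure of a Yelp page, and the class name `AttributeParser` is introduced only for this example; the parser used in the project is not specified in this section.

```python
from html.parser import HTMLParser


class AttributeParser(HTMLParser):
    """Collect <dt>label</dt><dd>value</dd> pairs into a dictionary."""

    def __init__(self):
        super().__init__()
        self.attributes = {}
        self._tag = None     # tag currently being read, if it is dt or dd
        self._label = None   # most recent label waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._tag == "dt":
            self._label = text
        elif self._tag == "dd" and self._label is not None:
            self.attributes[self._label] = text
            self._label = None


# Hypothetical fragment standing in for the business details section of a page
SAMPLE_HTML = """
<dl>
  <dt>Accepts Credit Cards</dt><dd>Yes</dd>
  <dt>Parking</dt><dd>Street</dd>
</dl>
"""

parser = AttributeParser()
parser.feed(SAMPLE_HTML)
details = parser.attributes
print(details)
```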

Part 3: Google Maps Places API

The Google Maps Places API retrieves data on places near each restaurant. Using the latitude and longitude pulled from the Yelp API, a circle with a radius of 500 meters is drawn around each restaurant, and all places of interest within that circle are listed. The places of interest include bus stops, train stations, supermarkets, and so on. The raw data is retrieved as a tall table, which is then pivoted into a wide table for joining. The transformed data contains the count of each type of place near each restaurant. Figures 0-3 and 0-4 show the raw data and the pivoted data retrieved from the Google Maps Places API, respectively.

Figure 0-3: Raw Data From Google Maps Places API
Figure 0-4: Pivoted Data From Google Maps API
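
The tall-to-wide pivot described above can be sketched as follows. The rows, business IDs, and place type names are invented for illustration, and the helper `pivot_place_counts` is hypothetical; the project may well have used a dataframe library for this step.

```python
from collections import Counter, defaultdict

# Tall table: one row per (restaurant, nearby place); values are made up
tall_rows = [
    ("biz-1", "bus_station"),
    ("biz-1", "bus_station"),
    ("biz-1", "supermarket"),
    ("biz-2", "train_station"),
]


def pivot_place_counts(rows):
    """Return {business_id: {place_type: count}} with one column per place type."""
    counts = defaultdict(Counter)
    for business_id, place_type in rows:
        counts[business_id][place_type] += 1
    # Every restaurant gets every column; a missing type counts as 0
    columns = sorted({place_type for _, place_type in rows})
    return {
        business_id: {column: counter[column] for column in columns}
        for business_id, counter in counts.items()
    }


wide = pivot_place_counts(tall_rows)
print(wide["biz-1"])
```

Filling absent place types with zero gives every restaurant the same set of columns, which is what makes the wide table straightforward to join on the business ID.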

Part 4: Data Joining and Additional Information

The three datasets are joined to assess data cleanliness. The join key is the Yelp Business ID, a unique identifier for each restaurant. After joining, the data has 5,133 rows and 59 columns. Of the 59 columns, 50 are treated as feature columns, and 24 of those 50 are numeric before data cleaning and feature engineering.
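
A join on the business ID might look like the sketch below. It assumes an inner join, which keeps only restaurants present in all three sources; the section does not state which join type was actually used. The table contents, column names, and the helper `inner_join_on_id` are hypothetical.

```python
def inner_join_on_id(*tables):
    """Join dictionaries keyed by business ID, keeping IDs present in every table."""
    shared_ids = set(tables[0])
    for table in tables[1:]:
        shared_ids &= set(table)
    joined = {}
    for business_id in sorted(shared_ids):
        row = {}
        for table in tables:
            row.update(table[business_id])  # merge the columns from each source
        joined[business_id] = row
    return joined


# Hypothetical slices of the three sources, keyed by Yelp Business ID
yelp_api = {"biz-1": {"rating": 4.5}, "biz-2": {"rating": 3.0}}
yelp_web = {"biz-1": {"parking": "Street"}, "biz-2": {"parking": "Garage"}}
places = {"biz-1": {"bus_station": 2}}

combined = inner_join_on_id(yelp_api, yelp_web, places)
print(combined)
```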

Part 5: Demographic data from City-Data

Another dataset containing demographic information by restaurant zip code is web scraped from City-Data and joined to the combined dataset obtained previously. The profiles collected for each zip code are the distributions of population, race, education level, and employment.

View the Data

The complete dataset after cleaning can be viewed in the Bokeh app below.
