Social media now let consumers share their experiences and opinions publicly on dedicated platforms, and online reviews have become one of the most influential factors in restaurant selection. Yelp, a popular restaurant review site, is the primary source of data about restaurants in large cities. In addition, location-based information about each restaurant is scraped from City-Data and retrieved from the Google Maps Places API.
The Yelp Fusion API provides access to Yelp data containing detailed business information and user reviews for many types of businesses, such as restaurants and hotels. Using the Business Search option of the Yelp Fusion API, data on about 5,500 restaurants were collected from seven large US cities and metropolitan areas: the New York City area, Washington DC, Boston, Seattle, Houston, San Francisco, and Los Angeles. The data were collected in batches by neighborhood, covering over 700 neighborhoods across those cities. The batches contain many duplicates, which were removed during the data collection phase by keeping one entry per unique Yelp Business ID. Table 0-1 shows the distribution of the data across cities, and Figure 1 shows a snapshot of the data returned by the Yelp Fusion API.
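The deduplication step described above can be sketched as follows. This is a minimal illustration rather than the collection code itself: the function name dedupe_businesses and the sample records are hypothetical, and it assumes each record carries its Yelp Business ID under the key "id".

```python
def dedupe_businesses(batches):
    """Merge per-neighborhood batches, keeping the first record seen per business ID."""
    unique = {}
    for batch in batches:
        for business in batch:
            # setdefault keeps the first occurrence and ignores later duplicates
            unique.setdefault(business["id"], business)
    return list(unique.values())


# Made-up batches for two adjacent neighborhoods that share one restaurant
batches = [
    [{"id": "a1", "name": "Cafe A"}, {"id": "b2", "name": "Bistro B"}],
    [{"id": "b2", "name": "Bistro B"}, {"id": "c3", "name": "Diner C"}],
]
restaurants = dedupe_businesses(batches)
```

Keying a dictionary on the business ID keeps the pass linear in the number of records and preserves the order in which restaurants were first seen.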
The Yelp API data give a general overview of a restaurant; however, they provide no further details about its service, environment, and so on. Additional information about the restaurants (such as payment methods and parking availability) is scraped from the Yelp website. Figure 0-2 shows a snapshot of the Yelp website and the list of information available.
From: https://www.yelp.com/biz/il-canale-washington-2?osq=Restaurants
The scraping program uses an HTML parser to extract information from the Yelp page of every restaurant retrieved from the Yelp Fusion API. A summary of all the features available from the scraping program is shown in Table 0-2.
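As a rough illustration of this kind of extraction, the sketch below uses Python's standard html.parser module to collect label/value pairs. The markup in sample_html is hypothetical and is not the actual structure of a Yelp page; the class name AttributeParser is likewise an assumption made for this example.

```python
from html.parser import HTMLParser


class AttributeParser(HTMLParser):
    """Collect label/value pairs from <dt>/<dd> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._current_tag = None  # "dt" or "dd" while inside one of those elements
        self._label = None        # most recent <dt> text awaiting its <dd> value
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._current_tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current_tag == "dt":
            self._label = text
        elif self._current_tag == "dd" and self._label is not None:
            self.attributes[self._label] = text
            self._label = None


# Made-up page fragment; a real page would need selectors matched to its markup
sample_html = """
<dl>
  <dt>Accepts Credit Cards</dt><dd>Yes</dd>
  <dt>Parking</dt><dd>Street</dd>
</dl>
"""
parser = AttributeParser()
parser.feed(sample_html)
```

After feeding the fragment, parser.attributes holds one entry per label, which maps naturally onto one feature column per restaurant.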
The Google Maps Places API provides comprehensive functions for retrieving data on places near the restaurants. Using the latitude and longitude pulled from the Yelp API, a circle with a 500-meter radius is drawn around each restaurant, and all places of interest within that circle are listed. The places of interest include bus stops, train stations, supermarkets, and so on. The retrieved raw data are initially in the form of a tall table, which is then pivoted into a wide table for joining. The transformed data contain the number of places of each type near each restaurant. Figures 3 and 4 show the raw data and the pivoted data retrieved from the Google Maps Places API, respectively.
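The tall-to-wide pivot can be sketched as below. This is an assumption-laden illustration using only the standard library: the function name pivot_place_counts, the (business ID, place type) row layout, and the sample rows are hypothetical and do not reproduce the actual API response.

```python
from collections import Counter, defaultdict


def pivot_place_counts(tall_rows):
    """Turn (business_id, place_type) rows into one count-per-type row per business."""
    counts = defaultdict(Counter)
    for business_id, place_type in tall_rows:
        counts[business_id][place_type] += 1
    # Every business gets the same columns; types it lacks are filled with 0
    place_types = sorted({place_type for _, place_type in tall_rows})
    return {
        business_id: {t: counter[t] for t in place_types}
        for business_id, counter in counts.items()
    }


# Made-up tall table: one row per nearby place found within the search radius
tall_rows = [
    ("a1", "bus_station"),
    ("a1", "bus_station"),
    ("a1", "supermarket"),
    ("b2", "train_station"),
]
wide = pivot_place_counts(tall_rows)
```

Filling missing types with zero gives every restaurant the same set of columns, which is what makes the wide table straightforward to join on the business ID.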
The three datasets are joined for data cleanliness assessment. The join key is the Yelp Business ID, a unique identifier for each restaurant. The joined dataset has 5,133 rows and 59 columns. Of the 59 columns, 50 are considered feature columns, and 24 of those 50 are numeric before data cleaning and feature engineering.
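A join on the business ID might look like the following sketch. It assumes an inner join (only restaurants present in all three datasets are kept), which the text does not state explicitly; the function name join_on_business_id and the sample tables are hypothetical.

```python
def join_on_business_id(*tables):
    """Inner-join tables given as {business_id: {column: value}} dictionaries."""
    shared_ids = set(tables[0])
    for table in tables[1:]:
        shared_ids &= set(table)  # keep only IDs present in every table
    joined = {}
    for business_id in sorted(shared_ids):
        row = {}
        for table in tables:
            row.update(table[business_id])
        joined[business_id] = row
    return joined


# Made-up rows standing in for the API, scraped, and nearby-places datasets
api_data = {"a1": {"rating": 4.5}, "b2": {"rating": 3.0}}
scraped_data = {"a1": {"parking": "Street"}, "b2": {"parking": "Garage"}}
places_data = {"a1": {"bus_station": 2}}

combined = join_on_business_id(api_data, scraped_data, places_data)
```

In this example the restaurant "b2" is dropped because it has no nearby-places record; a left join would keep it with missing values instead.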
Another dataset, containing demographic information by restaurant zip code, is scraped from City-Data and joined to the combined dataset obtained previously. The profiles targeted for each zip code are the distributions of population, race, education level, and employment.
The complete dataset after cleaning can be viewed in the Bokeh app below.