Exploratory Analysis 4

Association Rules Insights:

Association rule mining finds items that frequently occur together in the same transactions, which makes it useful for discovering co-occurrence relationships in a dataset. To explore the relationships between various factors and a restaurant’s popularity and rating, this frequent itemset mining method is used to identify the factors that matter. The Apriori algorithm keeps the search efficient by building itemsets from the bottom up and discarding any candidate that contains an infrequent subset.

This analysis helps interpret relationships among different features of the data. Binning is used as a pre-processing step to convert numeric columns into categorical columns, because Apriori operates on discrete items rather than continuous values. By varying the minimum support and minimum confidence values, rules of differing strength are generated. Analyzing the most important rules reveals meaningful patterns, and some practical advice for restaurant owners can be given based on the analysis.

Data Binning:

In the data binning process, several numeric factors, including the number of ATMs, the number of banks, the percentage of the white population and so on, are categorized so that values within the same range are treated as the same level.

More specifically, each column is divided into four parts according to its quartiles: values lower than Q1 are replaced by ‘small values’; values between Q1 and Q2 are replaced by ‘median values’; values between Q2 and Q3 are replaced by ‘large values’; and values larger than Q3 are replaced by ‘super large values’. As a result, some columns of Tables 3-1 and 3-2 are binned into four levels, which are used in the later association rule mining process.
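The quartile binning described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the analysis. The function name `quartile_bin`, the sample values, the choice of the "inclusive" quantile method, and the rule that a value falling exactly on a quartile goes to the lower bin are all assumptions.

```python
from bisect import bisect_left
from statistics import quantiles

# Labels follow the four levels named in the text.
LABELS = ["small", "median", "large", "super large"]


def quartile_bin(values):
    """Replace each numeric value by the label of its quartile range."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    cuts = quantiles(values, n=4, method="inclusive")
    # bisect_left counts how many cut points are strictly below the value,
    # so a value equal to a cut point stays in the lower bin (assumption).
    return [LABELS[bisect_left(cuts, v)] for v in values]


# Hypothetical column, e.g. number of ATMs near each restaurant.
atm_counts = [1, 2, 3, 4, 5, 6, 7, 8]
atm_levels = quartile_bin(atm_counts)
```

Because every column is cut at its own quartiles, each of the four levels covers roughly a quarter of the rows, which keeps any single level from being too rare to pass the minimum support threshold.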

Furthermore, categorization is also applied to the ‘category_x’ factor, since it contains many distinct values that share the same meaning. To ensure that features are not filtered out because of low frequency, values with the same or similar meanings are merged and assigned a single value.
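Merging similar category values can be sketched with a simple lookup table. The mapping below is purely hypothetical, since the actual values of ‘category_x’ are not listed in this report; only the idea of collapsing near-synonyms into one label comes from the text.

```python
from collections import Counter

# Hypothetical mapping from raw category strings to a merged label.
CATEGORY_MAP = {
    "cafes": "cafe",
    "coffee & tea": "cafe",
    "coffee shop": "cafe",
    "pizza": "italian",
    "pasta": "italian",
}


def merge_category(raw):
    """Normalize a raw category and map it to its merged label, if any."""
    key = raw.strip().lower()
    # Categories without a mapping are kept under their normalized name.
    return CATEGORY_MAP.get(key, key)


raw_values = ["Cafes", "Coffee & Tea", "coffee shop ", "Pizza", "Sushi"]
merged = [merge_category(v) for v in raw_values]
counts = Counter(merged)
```

In this toy input, three labels that each occur once become a single label that occurs three times, which is what lets the merged item clear a minimum support threshold.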

Applying Association Rules Analysis:

When applying association rule mining, different minimum support values (msvs) and minimum confidence values (mcvs) are tried for the Apriori method. We start with msvs=0.02 and mcvs=0.5, then increase the minimum support threshold to 0.03, 0.04 and 0.05. As a result, four association rule datasets are obtained. Table 4-1 is a snapshot of the association rules obtained with msvs=0.04 and mcvs=0.5.
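To make the roles of the two thresholds concrete, here is a small self-contained sketch of Apriori followed by rule generation. It is not the implementation used for this report (which is not shown); the function names, the item names and the toy transactions are assumptions, and the toy thresholds are much higher than 0.02 to 0.05 because the example has only four transactions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)
    # Level 1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for basket in baskets for item in basket}
    frequent = {}
    k = 1
    while candidates:
        # Keep the candidates whose share of transactions is high enough.
        level = {}
        for cand in candidates:
            support = sum(1 for basket in baskets if cand <= basket) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        k += 1
        # Join step: merge frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            cand for cand in joined
            if all(frozenset(sub) in level for sub in combinations(cand, k - 1))
        }
    return frequent


def generate_rules(frequent, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for left in combinations(sorted(itemset), size):
                antecedent = frozenset(left)
                consequent = itemset - antecedent
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append({
                        "antecedent": antecedent,
                        "consequent": consequent,
                        "support": support,
                        "confidence": confidence,
                        "lift": confidence / frequent[consequent],
                    })
    return rules


# Toy transactions: one set of binned attributes per restaurant (hypothetical).
transactions = [
    {"noise_quiet", "rating_4"},
    {"noise_quiet", "rating_4", "gym_many"},
    {"noise_loud", "rating_3"},
    {"noise_quiet", "rating_4"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
rules = generate_rules(frequent, min_confidence=0.5)
```

Raising the minimum support shrinks the set of frequent itemsets, so each higher threshold yields a smaller rule dataset than the one before it.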

Table 4-1: Association Rules Dataset with support = 0.04

In the dataset, the rating and the review count represent the quality and popularity of a restaurant. Since the goal is to find factors that lead to a good restaurant, only rules that associate some factor with the rating or the review count are worth keeping; all other rules are filtered out.
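The filtering step can be sketched as a simple predicate over the rules. The item-name prefixes `rating` and `review_count`, the rule layout, and the choice to keep a rule when a target item appears on either side are assumptions made for illustration.

```python
# Hypothetical prefixes of the items that describe rating and popularity.
TARGET_PREFIXES = ("rating", "review_count")


def mentions_target(rule):
    """True if the rule involves a rating or review-count item."""
    items = rule["antecedent"] | rule["consequent"]
    return any(item.startswith(TARGET_PREFIXES) for item in items)


# Hypothetical rules, each with an antecedent and a consequent itemset.
rules = [
    {"antecedent": {"gym_super_large"}, "consequent": {"rating_4"}},
    {"antecedent": {"atm_small"}, "consequent": {"bank_small"}},
    {"antecedent": {"review_count_large", "price_20"}, "consequent": {"rating_4"}},
]
relevant = [rule for rule in rules if mentions_target(rule)]
```

Here the rule linking ATMs to banks is dropped because neither side says anything about rating or popularity.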

Table 4-2: Filtered Association Rules Dataset with support = 0.03

Table 4-2 is a snapshot of the filtered association rules dataset with msvs=0.03 and mcvs=0.5. It highlights several useful relationships.

This table only contains rules relevant to restaurants’ rating and popularity. For example, the second rule indicates that when a restaurant has a large number of reviews and an average price of 20, it is very likely to be rated 4. This suggests that people are more inclined to write reviews for good restaurants, and that a lower price may attract more popularity. Moreover, rules 5, 6 and 7 suggest that neighbourhood facilities have a notable impact on a restaurant’s score, since restaurants with a very large number of bars, beauty salons and gyms nearby are more likely to have a rating of 4.

To find the most important internal and external factors that may lead a restaurant to receive a good rating and high popularity, the rules are filtered further.

Table 4-3: Filtered Association Rules For Internal Factors

Table 4-3 shows the rules that remain after filtering for internal factors. The internal factor ‘low noise level’ consistently appears together with ‘high restaurant rating’ in these rules. The rules are all strong: they have high confidence values and their lift is greater than 1, meaning the two items occur together more often than would be expected if they were independent. This suggests that noise level may be strongly correlated with restaurant rating. Correlation analysis is therefore applied to these two variables in the next step to verify this.
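As a reminder of how the strength of a rule A -> B is measured, the short sketch below computes support, confidence and lift from raw counts. The counts are invented for illustration and are not taken from Table 4-3.

```python
def rule_metrics(n_total, n_a, n_b, n_both):
    """Support, confidence and lift of the rule A -> B from raw counts."""
    support = n_both / n_total           # share of rows containing A and B
    confidence = n_both / n_a            # share of A rows that also contain B
    lift = confidence / (n_b / n_total)  # confidence relative to B's base rate
    return support, confidence, lift


# Hypothetical counts: 1000 restaurants, 100 quiet, 400 highly rated,
# 80 both quiet and highly rated.
support, confidence, lift = rule_metrics(n_total=1000, n_a=100, n_b=400, n_both=80)
```

With these made-up counts, 80% of quiet restaurants are highly rated against a base rate of 40%, so the lift is about 2. A lift of 1 would mean that knowing A tells us nothing about B.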

Plotly 4-1: Heatmap For Noise Level and Rating

Plotly 4-2: Linear Regression Graph For Noise Level and Rating

Plotly 4-1 and 4-2 are the heatmap and the linear regression plot for noise level and restaurant rating. Both graphs show that the two variables are clearly negatively correlated: when noise levels are low, restaurants tend to have good ratings. This indicates that the internal factor ‘noise level’ does have a significant impact on restaurant rating.
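One way to quantify the direction of this relationship is a Pearson correlation coefficient on an ordinal encoding of noise level. The sketch below assumes four noise levels with the numeric codes shown and uses invented observations; the report does not state which correlation measure or encoding was actually used.

```python
from math import sqrt

# Assumed ordinal encoding of the noise level categories.
NOISE_ORDER = {"quiet": 1, "average": 2, "loud": 3, "very_loud": 4}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical (noise level, rating) observations.
observations = [
    ("quiet", 4.5), ("quiet", 4.0), ("average", 4.0),
    ("average", 3.5), ("loud", 3.0), ("very_loud", 2.5),
]
noise_codes = [NOISE_ORDER[level] for level, _ in observations]
ratings = [rating for _, rating in observations]
r = pearson(noise_codes, ratings)
```

A coefficient below zero corresponds to the downward slope of the regression line: ratings fall as the noise level rises.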

This result may be somewhat unexpected, since noise level is not usually considered a very important factor when customers rate restaurants. However, the analysis above suggests that noise level is a key factor in customer satisfaction and that customers pay a lot of attention to a restaurant’s environment. For restaurant owners, creating a quiet dining environment is therefore an important way to earn good ratings.

Table 4-4: Filtered Association Rules For External Factors

For external factors, Table 4-4 shows the rules that remain after filtering. ‘Super many_gym’ and ‘Many_gym’ consistently co-occur with high restaurant ratings, and these rules are also strong. Correlation analysis is applied again to examine the relationship between the number of gyms and the rating.

Plotly 4-3: Heatmap For Gym Number and Rating

Plotly 4-4: Linear Regression Graph For Gym Number and Rating

Plotly 4-3 and 4-4 show that the number of gyms is positively correlated with restaurant rating. As the heatmap shows, a large number of nearby gyms is consistently associated with a high restaurant rating. This suggests that the number of gyms is a key external factor associated with a good restaurant rating.
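A heatmap of this kind is typically drawn from a table of co-occurrence counts. The sketch below builds such a table from (gym level, rating) pairs; the pairs are invented and the function name `cross_tab` is an assumption, so this only illustrates the kind of data behind the plot.

```python
from collections import Counter


def cross_tab(pairs):
    """Count how often each (row value, column value) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    # Missing combinations are reported as 0 by Counter.
    table = [[counts[(row, col)] for col in cols] for row in rows]
    return rows, cols, table


# Hypothetical (gym level, rating) pairs, one per restaurant.
pairs = [
    ("small", 3), ("small", 3), ("small", 4),
    ("super large", 4), ("super large", 4), ("super large", 5),
]
rows, cols, table = cross_tab(pairs)
```

In this toy table the counts for the ‘super large’ row sit in the higher rating columns, which is the pattern a positive association produces in a heatmap.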

This conclusion may surprise many people, since working out and eating are usually considered opposites. The result can be interpreted from several perspectives. One possible reason is that people have a better appetite after exercise, so they tend to find the food more delicious. Another is that most people who go to the gym belong to middle- or high-income groups and prefer higher-end restaurants. It may therefore be a good idea for restaurant owners to open their restaurants in areas with many gyms.

By applying association rule mining to the dataset, rules capturing important relationships are discovered. To find the most important internal and external factors, additional analyses are conducted. For internal factors, noise level has a considerable impact on restaurant rating. For external factors, the number of gyms around a restaurant is clearly positively correlated with restaurant rating. Two pieces of practical advice are given to restaurant owners based on these analyses.
