Based on the exploratory data analysis and heuristics, several hypotheses have been proposed to extract more insights from the dataset:
First, two of the most significant attributes in the restaurant dataset are review count and price. The former is a good indicator of a restaurant's popularity, and the latter is an important metric by which customers judge the restaurant. The first hypothesis, therefore, concerns whether the review count distributions are the same among different price groups. The null hypothesis is that the distribution of review counts is the same among all four price groups, namely '$', '$$', '$$$' and '$$$$'. To test this hypothesis, an Analysis of Variance (ANOVA) will be performed, since the price feature has multiple levels; t-tests will also be applied to further support the conclusion.
Second, intuitively, the popularity and rating of a restaurant should be strongly associated. Popularity can be measured by review count. The second hypothesis is that a restaurant's review count and rating have a linear relationship, which can be tested using a linear regression model.
The third hypothesis states that good-rating, moderate-rating, and poor-rating restaurants can be well separated using the features listed in the dataset. In other words, there is a clear decision boundary between the different restaurant classes. Testing this hypothesis is the process of building a multi-class classification model. To verify it, logistic regression and other data-driven machine learning models will be applied.
One of the most important tasks prior to any supervised classification task is to make sure the data is properly labeled. The objective for this dataset is to predict whether a restaurant is good, and the most direct metric is the rating provided by Yelp. However, one issue with this label is that the number of reviews potentially influences the rating. To compensate for the bias introduced by review count, a new fusion metric has been introduced: the original rating minus 0.5 serves as the base score, and a review-count term adjusts the score within the restaurant's original rating group. For example, if a restaurant has a rating of 4, its base score is 3.5, and its final score is the base score plus 0.5 times the ratio of the restaurant's review count to the maximum review count among restaurants whose rating is also 4.
After the final score has been calculated, the class label is generated by binning the final score. Three classes have been created, namely 0, 1 and 2, representing poor-rating, moderate-rating and good-rating restaurants respectively. The objective of the classification task is to correctly assign each restaurant to the appropriate class using supervised classification techniques. Table 5-1 shows the class distribution of the class label.
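The fusion score and binning described above can be sketched as follows. This is a minimal illustration on toy data: the column names (`rating`, `review_count`) and the bin edges are assumptions, not the values used in the actual report.

```python
import pandas as pd

# Toy data standing in for the Yelp restaurant table (column names assumed).
df = pd.DataFrame({
    "rating": [4.0, 4.0, 3.0, 3.0, 5.0],
    "review_count": [200, 50, 400, 100, 80],
})

# Base score: the original rating minus 0.5.
df["base_score"] = df["rating"] - 0.5

# Bonus: 0.5 times the ratio of the restaurant's review count to the
# maximum review count within its original rating group.
max_in_group = df.groupby("rating")["review_count"].transform("max")
df["final_score"] = df["base_score"] + 0.5 * df["review_count"] / max_in_group

# Bin the final score into three classes: 0 = poor, 1 = moderate, 2 = good.
# These cut points are illustrative, not the ones used in the report.
df["class"] = pd.cut(df["final_score"], bins=[0, 2.5, 3.5, 5.0],
                     labels=[0, 1, 2])
```

A restaurant with rating 4 and the highest review count in its rating group keeps a full final score of 4.0, while a less-reviewed peer in the same group is pushed lower, which is exactly the compensation the metric is designed to provide.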
An ANOVA test is used to test the first hypothesis, which states that there is no significant difference in the number of reviews (review_counts) across price ranges. Figure 5-1 illustrates the two-way ANOVA test results, and Figure 5-2 shows the Quantile-Quantile plot of the theoretical versus sample quantiles for the test.
As illustrated above, there is a large sum of squared errors, an F score of 38.37, and a near-zero p-value. These statistics provide strong evidence against the null hypothesis; therefore, the null hypothesis is rejected: there are significant differences in review counts among the price groups.
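For reference, a one-way version of this test can be sketched with `scipy.stats.f_oneway`. The data below is synthetic (the report's test is run on the actual Yelp review counts, and Figure 5-1 reports a two-way design), so the statistics printed here will not match the figures.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic review counts for the four price groups (real values come from Yelp).
groups = {
    "$":    rng.poisson(120, 200),
    "$$":   rng.poisson(250, 200),
    "$$$":  rng.poisson(180, 80),
    "$$$$": rng.poisson(90, 40),
}

# One-way ANOVA: H0 says all price groups share the same mean review count.
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```

A large F statistic with a near-zero p-value, as in the report, indicates the group means differ far more than within-group noise would explain.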
To further verify the results, t-tests are conducted pairwise among the price groups. Figure 5-3 shows an example of price group '$' against the other price groups. Again, this is evidence against the null hypothesis.
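The pairwise comparisons can be sketched as below, again on synthetic data. Welch's t-test (`equal_var=False`) is one reasonable choice here since the price groups have unequal sizes and spreads; the report does not specify which variant was used.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic review counts per price group (illustrative only).
counts = {
    "$":   rng.normal(120, 30, 150),
    "$$":  rng.normal(250, 60, 150),
    "$$$": rng.normal(180, 50, 60),
}

# Run a t-test for every pair of price groups and collect the p-values.
pairwise_p = {}
for a, b in itertools.combinations(counts, 2):
    t_stat, p = stats.ttest_ind(counts[a], counts[b], equal_var=False)
    pairwise_p[(a, b)] = p
```

Small p-values for every pair, as reported in Figure 5-3 for '$' against the other groups, reinforce the ANOVA conclusion.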
Linear regression is one of the most effective ways to examine the linear relationship between two variables. To test the second hypothesis, a linear model was fitted to the data; Figure 5-4 shows the results of the linear regression model. The R² score obtained by this model is 0.16, which suggests that the linear relationship between these two variables is weak.
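The fit and its R² score can be sketched as follows. The data is synthetic, constructed so that review count only weakly drives rating, mirroring the weak relationship found in the real data; the resulting R² will not equal the report's 0.16.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Synthetic stand-in: ratings are mostly noise with a small review-count effect.
review_count = rng.uniform(0, 1000, 500).reshape(-1, 1)
rating = 3.5 + 0.0004 * review_count.ravel() + rng.normal(0, 0.5, 500)

model = LinearRegression().fit(review_count, rating)
r2 = model.score(review_count, rating)  # coefficient of determination R^2
```

An R² this far below 1 means review count explains only a small fraction of the variance in ratings, which is the basis for rejecting a strong linear relationship.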
In the next portion, six data-driven predictive models are applied to test the third hypothesis. All six methods use the same training and testing data. K-fold cross-validation is used to test the robustness of each model, and an ROC-AUC plot and confusion matrix for each model are generated separately.
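The shared evaluation setup can be sketched as below: one train/test split reused by all six models, with k-fold cross-validation on the training set. The synthetic three-class data stands in for the actual restaurant features, and the hyperparameters are defaults rather than the report's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the restaurant feature table.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# A single split shared by all six models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm_rbf": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(random_state=0),
}

# 5-fold cross-validation on the training data to gauge robustness.
cv_scores = {name: cross_val_score(clf, X_train, y_train, cv=5).mean()
             for name, clf in models.items()}
```

Holding the split and folds fixed across models keeps the comparison fair: any difference in scores then reflects the models, not the data they happened to see.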
Logistic regression is a widely used statistical model that uses the logistic function to model a binary or multiclass dependent variable, in this case the restaurant classes. Figures 5-5 and 5-6 illustrate the Receiver Operating Characteristic (ROC) curves for all class labels and the confusion matrix heat map.
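Computing the per-class ROC-AUC and the confusion matrix behind such figures can be sketched as follows, again on synthetic three-class data rather than the actual restaurant features. The one-vs-rest (`ovr`) setting produces one score per class label, matching an ROC plot with one curve per class.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, confusion_matrix

# Synthetic 3-class data standing in for the restaurant feature table.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# One-vs-rest ROC-AUC over all class labels.
auc = roc_auc_score(y_test, proba, multi_class="ovr")

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))
```

The confusion matrix is what the heat maps in Figures 5-6 through 5-16 visualize: the diagonal holds the correctly classified restaurants for each class.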
The decision tree classifier is one of the easiest classification techniques to understand; the fact that its results can be readily interpreted has made it popular. Figures 5-7 and 5-8 illustrate the Receiver Operating Characteristic (ROC) curves for all class labels and the confusion matrix heat map.
The Naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem under the assumption that the features are independent. Gaussian Naïve Bayes is used here to handle the continuous values in this dataset. Figures 5-9 and 5-10 illustrate the ROC curves for all class labels and the confusion matrix heat map.
As a lazy learner, the KNN classifier is an instance-based learning technique that applies the majority-vote principle among a data point's neighbors, where the neighborhood is obtained from a pairwise distance measure. Figures 5-11 and 5-12 illustrate the ROC curves for all class labels and the confusion matrix heat map.
The Support Vector Machine (SVM) is a popular supervised technique for non-linear classification and regression problems. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a gap that is as wide as possible. In this case, a radial basis function (RBF) kernel is applied. Figures 5-13 and 5-14 illustrate the ROC curves for all class labels and the confusion matrix heat map.
Employing the bagging principle, the random forest is an ensemble tree-based machine learning technique for classification and regression problems. Figures 5-15 and 5-16 illustrate the ROC curves for all class labels and the confusion matrix heat map.
One important output of the random forest model is feature importance, which describes how each feature contributes to the final model and quantifies this contribution by the average information gain across the trees in the forest. The following graph shows the feature importances on a subset of the data.
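Extracting and ranking the importances can be sketched as below. The feature names echo the report's columns but the data is synthetic, so the ranking here is illustrative and will not reproduce the plot.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names echoing columns mentioned in the report.
feature_names = ["average_price", "noise_level", "gym",
                 "review_count", "wifi", "parking"]

# Synthetic 3-class data standing in for the restaurant subset.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances are the mean impurity decrease per tree, normalized to sum to 1.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
```

Because the importances sum to one, a flat ranking with no dominant feature, as observed in the report, means the decision boundary is shaped by many weak signals rather than a single strong one.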
Based on the plot above, the top three features are average price, noise_level and gym. These top features correspond well with the association rule mining and exploratory data analysis results. However, another observation is that there is no single strong determining factor in drawing the classification decision boundary.