
Cluster Analysis Insights:

Clustering is an essential technique for discovering the distribution of data: machine learning methods identify how data points are related or unrelated. This section describes three cluster analyses conducted on our dataset: hierarchical clustering, the K-means method, and the DBSCAN method. Together, these methods support a comprehensive interpretation of the data distribution.

Data Binning:

One essential limitation of clustering algorithms is that they can only be applied to numeric data; thus, columns with categorical values in our dataset must be binned into numeric values before clustering. Table 3-1 and Table 3-2 show a snapshot of the cleaned dataset. Features such as ‘price’, ‘category_x’, ‘Alcohol’, and ‘Parking’ are categorical, so these columns need to be binned.

For the ‘price’ column, the number of ‘$’ symbols represents the average price level of a restaurant. Values in this column are binned from 1 to 4, corresponding to ‘$’ through ‘$$$$’. For the ‘category_x’, ‘Alcohol’, and ‘Parking’ columns, values represent the richness of a restaurant’s offerings. For example, if a restaurant’s category is ‘breweries, trad American, beer bar’, it provides three kinds of food. These three columns are therefore binned by the number of values each cell contains. If a restaurant’s ‘Parking’ feature has three values, such as ‘Garage, Street, Validated’, the binned value is 3, representing the richness of parking options it provides.

After binning the categorical columns, all values in the dataset are numeric and clustering algorithms can be applied.
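The binning rules above can be sketched in pandas. The mini-frame here is a hypothetical stand-in for the cleaned dataset, using column names from the text; the actual data and loading code differ.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the cleaned dataset;
# column names follow the text ('price', 'category_x', 'Parking').
df = pd.DataFrame({
    "price":      ["$", "$$$", "$$$$"],
    "category_x": ["breweries, trad American, beer bar", "cafes", "pizza, italian"],
    "Parking":    ["Garage, Street, Validated", "Street", ""],
})

# 'price': the count of '$' signs maps directly to the bins 1..4.
df["price"] = df["price"].str.len()

# Richness columns: count the comma-separated values in each cell
# (an empty cell contributes 0).
for col in ["category_x", "Parking"]:
    df[col] = df[col].apply(lambda s: len([v for v in s.split(",") if v.strip()]))

print(df)
```

Counting separators rather than matching category names keeps the rule uniform across all richness columns.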

Table 3-1: Full Dataset Part 1
Table 3-2: Full Dataset Part 2

Selecting Homogenous Properties:

At first, the three clustering algorithms are applied to the whole dataset. However, the silhouette score for each algorithm is very low, indicating poor clustering performance. Figures 3-1 to 3-6 show the clustering results for the three algorithms applied to all valuable columns. It is also hard to interpret the meaning of the clustering results from these graphs.

Figure 3-1: 2D graph for k-means method for All Valuable Columns with n=6
Figure 3-2: 3D graph for k-means method for All Valuable Columns with n=6
Figure 3-3: 2D graph for hierarchical method for All Valuable Columns with n=6
Figure 3-4: 3D graph for hierarchical method for All Valuable Columns with n=6
Figure 3-5: 2D graph for dbscan method for All Valuable Columns
Figure 3-6: 3D graph for dbscan method for All Valuable Columns

This result illustrates that applying clustering algorithms to the whole dataset is not meaningful, since different features have very different value ranges. Neighborhood features such as ‘bank’ and ‘school’ range from about 10 to 30, while most values of features such as ‘review count’ are larger than 100. Clustering across such features is therefore misleading, since their values represent different meanings.

The next step is to select features that share the same meaning. By examining the dataset, several groups of homogeneous features can be identified. Five features represent the population composition near the restaurants: ‘White population’, ‘Black population’, ‘American Indian population’, ‘Asian population’, and ‘Hispanic or Latino population’. Eight features measure the surrounding facilities near the restaurants: ‘atm’, ‘bank’, ‘bar’, ‘beauty_salon’, ‘bus_station’, ‘cafe’, ‘gym’, and ‘school’. Features such as ‘Accept_Credit_Card’, ‘Outdoor_Seating’, ‘Take_out’, ‘Takes_Reservations’, and ‘WIFI’ represent the internal factors of restaurants. ‘High school or higher’, ‘Graduate or professional degree’, and ‘Unemployed’ represent the education level of people in the neighborhood of the restaurants. Finally, ‘category_x’, ‘Alcohol’, and ‘Parking’ represent the internal richness of restaurants.

According to the above analysis, clustering algorithms are applied to five subsets of the full dataset: ‘Population composition’, ‘Neighborhood’, ‘Internal Factors’, ‘Internal Richness’, and ‘Education Level’.
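The grouping above amounts to a mapping from subset name to column list, which each clustering run can then select from the full frame. The column names follow the text; the exact names in the real dataset may differ slightly.

```python
# Hypothetical column groups matching the five subsets named in the text.
subsets = {
    "Population composition": ["White population", "Black population",
                               "American Indian population", "Asian population",
                               "Hispanic or Latino population"],
    "Neighborhood": ["atm", "bank", "bar", "beauty_salon",
                     "bus_station", "cafe", "gym", "school"],
    "Internal Factors": ["Accept_Credit_Card", "Outdoor_Seating",
                         "Take_out", "Takes_Reservations", "WIFI"],
    "Internal Richness": ["category_x", "Alcohol", "Parking"],
    "Education Level": ["High school or higher",
                        "Graduate or professional degree", "Unemployed"],
}

# Each subset is then clustered independently, e.g.:
# for name, cols in subsets.items():
#     X = df[cols].to_numpy()
for name, cols in subsets.items():
    print(name, len(cols))
```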

Applying Clustering Algorithms for Neighborhood Subset:

To find the most accurate clustering result, three different n values are tried for the k-means and hierarchical algorithms. For the DBSCAN method, three different pairs of eps and minimum-sample values are tried. The best method for each subset of the dataset is then selected according to the silhouette score.

Take the ‘Neighbourhood’ subset as an example: both the k-means and hierarchical algorithms are run with n = 4, 6, and 8. For k-means, the average silhouette score rises from 0.1759 to 0.1775 and then decreases to 0.1708, so n = 6 is the most suitable value. For the hierarchical algorithm, the average silhouette score drops from 0.1374 to 0.1370 and finally to 0.1271, so n = 4 fits this algorithm best.
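The model-selection loop just described can be sketched with scikit-learn. The random matrix here is only a stand-in for the eight-column ‘Neighbourhood’ subset, so the printed scores will not match the figures reported above.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the eight-column 'Neighbourhood' subset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

# Try n = 4, 6, 8 for both algorithms and compare silhouette scores;
# the n with the highest score is kept for each algorithm.
for n in (4, 6, 8):
    km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(X)
    hc = AgglomerativeClustering(n_clusters=n).fit(X)
    print(n,
          round(silhouette_score(X, km.labels_), 4),
          round(silhouette_score(X, hc.labels_), 4))
```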

Figure 3-7 and 3-8 show the clustering results for the k-means method with n=6. Figure 3-9 and 3-10 show the clustering results for the hierarchical method with n=4.

Figure 3-7: 2D graph for k-means method for Neighbourhood with n=6
Figure 3-8: 3D graph for k-means method for Neighbourhood with n=6
Figure 3-9: 2D graph for hierarchical method for Neighbourhood with n=4
Figure 3-10: 3D graph for hierarchical method for Neighbourhood with n=4

For the DBSCAN method, the average silhouette score is calculated for different pairs of eps and msdv (minimum sample) values. When eps is 0.2 and msdv is 100, the average silhouette score equals 0.0281; when eps is 0.25 and msdv is 100, it rises to 0.1880; when eps is 0.3 and msdv is 100, it increases to 0.2582. The third pair therefore gives the best clustering result. Figures 3-11 and 3-12 show the clustering result for DBSCAN with eps = 0.3 and msdv = 100.
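The (eps, msdv) sweep can be sketched the same way. Again the data is a synthetic stand-in, so scores will differ from those above; note that `silhouette_score` requires at least two distinct labels, so runs where DBSCAN marks everything as noise are skipped.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))  # stand-in for the Neighbourhood subset

# Sweep the (eps, min_samples) pairs from the text and compare scores.
for eps, msdv in [(0.2, 100), (0.25, 100), (0.3, 100)]:
    labels = DBSCAN(eps=eps, min_samples=msdv).fit_predict(X)
    if len(set(labels)) > 1:
        print(eps, msdv, round(silhouette_score(X, labels), 4))
    else:
        print(eps, msdv, "single cluster / all noise")
```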

Figure 3-11: 2D graph for dbscan method for Neighbourhood
Figure 3-12: 3D graph for dbscan method for Neighbourhood

Comparing the results of the three algorithms, the most reasonable and effective result comes from k-means. Since each sample contains eight columns, the number of clusters should not be as small as two or four. Meanwhile, with six groups the restaurants are divided into distinct levels, which makes this an effective way to evaluate the neighborhood environment.

Moreover, the same algorithms are applied to the remaining subsets: ‘population composition’, ‘internal factors’, ‘internal richness’, and ‘education level’. A detailed version of those analyses is attached in a document named ‘clustering results.doc’.

Applying three different clustering algorithms to five subsets of the dataset provides a comprehensive, multi-angle interpretation of the data distribution. To get the best clustering result, three different n values are tried for the k-means and hierarchical algorithms, and three eps values are tried for the DBSCAN algorithm. 2D and 3D PCA graphs are plotted to visualize the clustering result for each algorithm.
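The PCA projection behind those 2D/3D graphs can be sketched as follows; the random matrix is a stand-in for a clustered subset, and the plotting call itself (e.g. a matplotlib scatter coloured by label) is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))  # stand-in for a clustered subset
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# Project the eight-dimensional points onto 2 principal components
# (use n_components=3 for the 3D graphs); each projected point is
# then drawn coloured by its cluster label.
coords = PCA(n_components=2).fit_transform(X)
print(coords.shape)  # (300, 2)
```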
