Data cleaning and feature engineering are essential steps prior to any statistical testing or data analytics in a data science project. This section illustrates the procedures taken for data cleaning and feature transformation, including basic statistical analysis, handling of missing data, outlier detection, and feature categorization.
The restaurant dataset covers four aspects. The first is the basic information of the restaurants, such as the number of reviews, ratings, and category. The second is the internal attributes of the restaurants, such as WIFI option, Ambience, and Alcohol Availability. The third is information about surrounding facilities, such as the number of schools, shopping malls, or bus stops near each restaurant. The last consists of demographic information about each restaurant's neighborhood, including the proportion of white residents, unemployment rate, education levels, and so on. In summary, there are 5133 rows of data, extracted from 7 major cities and metropolitan areas, and 79 features representing different aspects of the restaurants, of which 56 are numerical and 23 are categorical. Table 1-1 shows the physical meaning and type of each column in the data.
Table 1-2 shows a snapshot of basic statistics of the numerical columns, including distinct count, mean, min, max, and quantiles. Table 1-3 shows the mode of some categorical features in the dataset. As shown in these tables, there are missing values in columns such as 'bank' and 'bar'. Therefore, handling missing values is the first step of data cleaning.
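As a concrete illustration, summaries in the style of Tables 1-2 and 1-3 can be produced with a short pandas sketch; the file name restaurants.csv is an assumption, and df denotes the loaded dataset throughout the remaining sketches:

```python
import pandas as pd

# Load the restaurant dataset (file name is an assumption).
df = pd.read_csv("restaurants.csv")

# Basic statistics of numerical columns: count, mean, min, max, quantiles.
numeric_summary = df.describe().T
numeric_summary["distinct"] = df.select_dtypes("number").nunique()

# Mode of the categorical columns.
categorical_mode = df.select_dtypes("object").mode().iloc[0]
```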
As shown in Table 1-4, the missing rate varies significantly across features, so different missing-value handling techniques need to be applied. The missing values mainly stem from information missing on the Yelp website and Google Maps, as not all restaurants provide all of the features above. In general, three techniques have been applied to deal with NA values in this dataset.
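A minimal sketch of computing the per-column missing rate summarized in Table 1-4, assuming the dataset is held in the DataFrame df from the previous sketch:

```python
# Fraction of missing values per column, sorted from most to least missing.
missing_rate = df.isna().mean().sort_values(ascending=False)
print(missing_rate.head(10))
```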
First, all features with a missing rate above 40% are dropped, since no imputation method could fairly fill in such large gaps. Second, for the majority of the remaining features, the blanks are filled by median (numerical) or mode (categorical) imputation; because the variance of some numerical columns is large, imputing with the median introduces less variance into the dataset than imputing with the mean. Third, for missing data in columns such as 'LowPrice', the values can be inferred from the price range column in the dataset.
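The three techniques could be implemented roughly as follows; the column names 'LowPrice' and 'PriceRange', and the "low-high" string format of the price range, are assumptions rather than the exact columns used:

```python
# 1. Drop features whose missing rate exceeds 40%.
df = df.loc[:, df.isna().mean() <= 0.40]

# 2. Infer 'LowPrice' from the price-range column before generic imputation
#    ('PriceRange' and its "low-high" string format are assumptions).
if {"LowPrice", "PriceRange"}.issubset(df.columns):
    inferred_low = df["PriceRange"].str.split("-").str[0].astype(float)
    df["LowPrice"] = df["LowPrice"].fillna(inferred_low)

# 3. Median imputation for remaining numerical columns, mode for categorical ones.
for col in df.columns[df.isna().any()]:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
```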
Although all dimensions in this dataset have physical meanings and extreme values occur only occasionally, outlier detection and handling are still needed, as outliers hinder the training of high-accuracy machine learning models. Three outlier detection methods are applied to detect extreme values in this dataset.
The most common outlier detection method is to use a box plot and z-score to flag any value far from the population mean in a univariate manner. Figure 1-1 illustrates the box plots for 5 numerical features in the dataset, where several high spikes are visible. A z-score threshold of 3 is applied to flag any point far from its mean.
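A sketch of the univariate z-score flagging is given below; the list of five feature names is only a placeholder for the columns actually plotted in Figure 1-1:

```python
import numpy as np

# Placeholder names for the five numerical features shown in Figure 1-1.
cols = ["ReviewCount", "Rating", "LowPrice", "HighPrice", "SchoolCount"]

# Flag rows where any of the five features lies more than 3 standard
# deviations away from its column mean.
z_scores = (df[cols] - df[cols].mean()) / df[cols].std()
zscore_flag = (np.abs(z_scores) > 3).any(axis=1)
```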
The local outlier factor (LOF) detects anomalous data points by measuring the local deviation of a given point with respect to its neighborhood. Figure 1-2 illustrates the results of applying the LOF method to the 5 features above in a multi-dimensional manner. The plot is produced after a PCA decomposition, and the radius of each circle denotes the outlier score: the larger the radius, the more likely the point is an outlier.
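A possible implementation uses scikit-learn's LocalOutlierFactor on the same (assumed) feature list; the number of neighbors is a default choice, not necessarily the setting used for Figure 1-2:

```python
from sklearn.neighbors import LocalOutlierFactor

# LOF over the same five features; fit_predict returns -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=20)
lof_flag = lof.fit_predict(df[cols]) == -1

# The (negated) LOF score corresponds to the circle radius in Figure 1-2.
lof_score = -lof.negative_outlier_factor_
```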
Isolation forest, a decision-tree-based technique, detects outliers by assuming that they are rare and different from the main population and are therefore easier to isolate with shallow decision trees. Figure 1-3 illustrates the results of the isolation forest using the same features as above; the dots in red are treated as outliers.
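A comparable sketch with scikit-learn's IsolationForest; the contamination rate is an illustrative assumption:

```python
from sklearn.ensemble import IsolationForest

# Isolation forest on the same features; -1 again marks an outlier.
iso = IsolationForest(contamination=0.01, random_state=0)
iso_flag = iso.fit_predict(df[cols]) == -1
```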
A data point is treated as an outlier only if all three methods report it as one. In this case, one data point is flagged by all three methods, and it is removed from the analysis.
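Combining the three flags could look like this, removing only the rows flagged by all three detectors:

```python
# Keep only rows that are NOT flagged by all three detectors simultaneously.
flagged_by_all = zscore_flag.to_numpy() & lof_flag & iso_flag
df = df.loc[~flagged_by_all]
```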
As shown in Table 1-2, some features have a large standard deviation and a highly skewed distribution. Data binning, or categorization, is a useful method for dealing with this situation. In this dataset, one crucial issue is that the review counts are highly variable, which adds difficulty to the classification task. Figure 1-4 shows an effective binning of the high-variance review count column. The binning strategy is to keep the frequency in each category comparable, in order to facilitate later data analytics.
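One way to realise such equal-frequency binning is pandas' qcut; the column name 'ReviewCount' and the choice of five bins are assumptions:

```python
# Equal-frequency binning so each category holds a comparable number of restaurants.
df["ReviewCountBin"] = pd.qcut(df["ReviewCount"], q=5, duplicates="drop")
print(df["ReviewCountBin"].value_counts())
```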
In addition to the data cleaning and feature generation methods above, several other steps are taken to transform the features into a more usable format. For example, any Boolean column with values 'Yes' or 'No' is mapped to '1' and '0' so that it can be used directly in numerical analysis. Also, instead of using absolute population numbers, the ratio of each race to the total population within a neighborhood is calculated.
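These transformations might be expressed as follows; the specific column names ('WIFI', 'Alcohol', 'WhitePop', 'TotalPop') are assumed for illustration only:

```python
# Map Yes/No columns to 1/0 so they can enter numerical analysis directly.
for col in ["WIFI", "Alcohol"]:
    if col in df.columns:
        df[col] = df[col].map({"Yes": 1, "No": 0})

# Express each race as a ratio of the neighborhood's total population.
if {"WhitePop", "TotalPop"}.issubset(df.columns):
    df["WhiteRatio"] = df["WhitePop"] / df["TotalPop"]
```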