1. Scaling:
In most cases, the numerical features of a dataset do not share a common range and differ from each other in magnitude. To put them on a comparable scale, feature scaling is required.
1.1 Normalization:
Normalization (or min-max normalization) rescales all values into a fixed range between 0 and 1. This transformation does not change the shape of the feature's distribution, but because the standard deviation decreases, the effect of outliers increases. Therefore, it is recommended to handle outliers before normalization.
Normalization is a good technique to use when you do not know the distribution of your data or when you know the distribution is not Gaussian (a bell curve). It is useful when your data has varying scales and the algorithm you are using does not make assumptions about the distribution of your data, such as k-nearest neighbors and artificial neural networks.
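A minimal sketch of min-max normalization, assuming scikit-learn and a pandas DataFrame with hypothetical numeric columns (the column names and values are made up):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two numeric features with very different ranges
df = pd.DataFrame({"age": [22, 35, 58, 41], "income": [28000, 52000, 91000, 67000]})

# Min-max normalization rescales each column to [0, 1]:
# x_scaled = (x - min) / (max - min)
scaler = MinMaxScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])
print(df)
```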
1.2 Standardization:
Standardization (or z-score normalization) scales the values while taking the standard deviation into account: z = (x - mean) / std. If the standard deviations of the features differ, their scaled ranges will also differ from each other, since the values are not forced into a fixed interval. This reduces the effect of outliers in the features.
Standardization is mostly used where distances or gradient descent are involved (linear regression, KNN, etc.) or in ANNs for faster convergence, while normalization is mostly used for classification or CNNs (for scaling down pixel values).
Standardization assumes that your data has a Gaussian (bell curve) distribution. This does not strictly have to be true, but the technique is more effective if your attribute distribution is Gaussian. Standardization is useful when your data has varying scales and the algorithm you are using does make assumptions about your data having a Gaussian distribution, such as linear regression, logistic regression, and linear discriminant analysis.
Standardization is more robust to outliers, and in many cases, it is preferable to min-max normalization.
I applied standardization to the numerical columns only, not to the One-Hot Encoded features. Standardizing One-Hot encoded features would mean assigning a distribution to categorical features. You don't want to do that!
But why did I not do the same while normalizing the data? Because One-Hot encoded features are already in the range between 0 and 1, so normalization would not affect their values.
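A minimal sketch of this idea, assuming scikit-learn and a hypothetical DataFrame with one numeric column and already One-Hot encoded category columns:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: 'salary' is numeric, the 'dept_*' columns are one-hot encoded
df = pd.DataFrame({
    "salary": [40000, 55000, 72000, 61000],
    "dept_hr": [1, 0, 0, 1],
    "dept_it": [0, 1, 1, 0],
})

# Standardize only the numeric column: z = (x - mean) / std
numeric_cols = ["salary"]
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# The one-hot columns are left untouched so they keep their 0/1 meaning
print(df)
```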
Note: If an algorithm is not distance-based, feature scaling is unimportant; this includes Naive Bayes, Linear Discriminant Analysis, and tree-based models (gradient boosting, random forests, etc.).
2. Missing Values / Imputation
In the case of clustering, see: https://www.displayr.com/5-ways-deal-missing-data-cluster-analysis/
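As a minimal sketch of common imputation strategies (assuming scikit-learn's SimpleImputer; the columns and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values
df = pd.DataFrame({"age": [25, np.nan, 47, 33], "city": ["NY", "LA", np.nan, "NY"]})

# Numeric column: impute with the median; categorical column: impute with the mode
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
print(df)
```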
3. Handling Outliers
3.1 Finding the outliers
http://r-statistics.co/Outlier-Treatment-With-R.html
- Univariate approach: For a given continuous variable, outliers are observations that lie more than 1.5 * IQR below the 25th percentile or above the 75th percentile, where IQR, the 'Inter Quartile Range', is the difference between the 75th and 25th percentiles. These are the points outside the whiskers of a box plot (see the sketch after this list).
- Bivariate approach: For a categorical X, visualize Y with a box plot for each level of X and look for points outside the whiskers.
- Multivariate model approach (Cook's distance): Cook's distance is a measure computed with respect to a given regression model and is therefore affected only by the X variables included in the model. It quantifies the influence each data point (row) exerts on the predicted outcome. As a common rule of thumb, observations with a Cook's distance greater than 4 times the mean Cook's distance may be classified as influential; this is not a hard boundary.
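A minimal sketch of the univariate 1.5 * IQR rule referenced above (assuming pandas; the series values are made up):

```python
import pandas as pd

# Hypothetical numeric series with one obvious outlier
x = pd.Series([12, 14, 15, 13, 14, 16, 15, 95])

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the box-plot whiskers
outliers = x[(x < lower) | (x > upper)]
print(outliers)  # 95 is flagged as an outlier
```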
3.2 Treating the outliers
- Imputation: Impute with the mean / median / mode. This method is covered in detail in the discussion of treating missing values.
- Capping: For values that lie outside the 1.5 * IQR limits, we can cap them by replacing observations below the lower limit with the 5th percentile value and those above the upper limit with the 95th percentile value (see the sketch below).
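A minimal capping sketch under the same assumptions (pandas, a hypothetical numeric series): low outliers are replaced with the 5th percentile and high outliers with the 95th percentile.

```python
import pandas as pd

x = pd.Series([12, 14, 15, 13, 14, 16, 15, 95])

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
p5, p95 = x.quantile(0.05), x.quantile(0.95)

# Cap: values below the lower limit -> 5th percentile,
#      values above the upper limit -> 95th percentile
capped = x.copy()
capped[x < lower] = p5
capped[x > upper] = p95
print(capped)
```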