Multivariate analysis is a more complex form of statistical analysis, used when more than two variables have to be analyzed simultaneously.
Unlike in univariate or bivariate analysis, it is hard for the human brain to visualize a relationship among more than two variables in a graph. For three variables, we can create a 3-D model to study the relationship among the variables. However, anything beyond the third dimension is difficult to visualize and interpret. Multivariate analysis is therefore used to study the more complex patterns present in a data set.
Let’s look at an example of multivariate analysis. A doctor has collected a data set of diabetic patients. The data set has attributes such as blood pressure, glucose, age, and body mass index (BMI). If the doctor wants to investigate the relationship between all of these health attributes and diabetes level, he or she has to use multivariate analysis.
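As a first look at how several attributes relate to one another, a pairwise correlation matrix is a common starting point. The sketch below is illustrative only: the six patient records are made-up values, and the names `pearson`, `data`, and `corr` are assumptions introduced for this example rather than anything from the doctor's data set.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Synthetic, made-up values for six hypothetical patients (not real data).
data = {
    "blood_pressure": [70, 82, 64, 90, 76, 88],
    "glucose": [95, 140, 89, 168, 110, 155],
    "age": [25, 47, 31, 58, 36, 52],
    "bmi": [22.1, 30.5, 24.0, 33.8, 26.3, 31.2],
}

names = list(data)
# corr[a][b] holds the correlation between attribute a and attribute b.
corr = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

# Print the matrix row by row.
for a in names:
    print(a.ljust(15), " ".join(f"{corr[a][b]:6.2f}" for b in names))
```

A correlation matrix only summarizes pairs of variables, so it is a starting point for multivariate analysis rather than a substitute for it.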
Commonly used multivariate analysis techniques are described below.
Cluster Analysis
The purpose of cluster analysis is to reduce a large data set to meaningful subgroups on the basis of similarity.
Cluster analysis partitions a data set into clusters in such a way that:
- the similarity between two objects from the same group is maximal, and
- the similarity between two objects from different groups is minimal.
Two caveats apply:
- As clustering is a distance-based technique, outliers in the data set can distort the resulting clusters.
- The variables used in clustering should be uncorrelated, because strongly correlated variables effectively count the same information more than once in the distance calculation.
There are three main clustering methods: hierarchical, which is a treelike process appropriate for smaller data sets; nonhierarchical, which requires the number of clusters to be specified a priori; and a combination of both. There are four main rules for developing clusters: the clusters should be different, they should be reachable, they should be measurable, and they should be profitable (big enough to matter). Cluster analysis is a useful tool for market segmentation.
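The treelike, hierarchical approach can be sketched as a bottom-up (agglomerative) procedure: start with every point in its own cluster and repeatedly merge the two closest clusters. The version below is a minimal sketch that assumes single linkage (cluster distance is the distance between the closest pair of members) and Euclidean distance. The function name `single_linkage` and the sample points are hypothetical, chosen only for illustration.

```python
import math

def single_linkage(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    # Start with one cluster per point; clusters hold indices into `points`.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of cluster i, index of cluster j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]

# Made-up 2-D points: two tight groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0)]
groups = single_linkage(points, 3)
print(groups)
```

Every merge step compares all pairs of clusters, which is one reason hierarchical methods are usually reserved for smaller data sets.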
In typical business scenarios, the data belongs to different types of entities, and using a single model for all of them might not be the best approach. For example, in a retail data set, the customers might belong to different states, age groups, and income groups, which leads to different spending behaviors. If we build a single model on a data set containing all these customers, we would be comparing apples to oranges. Clustering provides a good way to first segment the data into relevant clusters and so avoid this problem. Clustering also enables us to visualize, understand, and compare the attributes of the segments formed.
Popular clustering algorithms include K-means clustering, DBSCAN, and Partitioning Around Medoids (PAM).
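To make the idea of a nonhierarchical method concrete, here is a minimal sketch of K-means using only the Python standard library. It assumes Euclidean distance, random initialization from the data points, and a fixed seed. The function name `kmeans`, its parameters, and the sample points are assumptions for this example, not a reference implementation.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct data points as starting centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Made-up 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that the number of clusters `k` has to be supplied up front, and that the result can depend on the initial centroids when the groups are less clearly separated than in this toy data.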
Principal Component Analysis
Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set that has a large number of interrelated features. Although machine learning algorithms can benefit from more features when predicting the output, there are times when the number of features in the data set is too large. Such a data set is difficult to analyze and to train models on, and the trained models are more susceptible to overfitting. We therefore try to reduce the number of variables. PCA does this by transforming the original features into a smaller set of uncorrelated variables, the principal components, that retain as much of the variation in the data as possible.
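The dimensionality reduction described above can be sketched without any external libraries: center the data, compute the covariance matrix, and find its dominant eigenvector, which is the first principal component. The sketch below uses power iteration for the eigenvector, which is one simple choice among several (libraries typically use an eigendecomposition or SVD instead). The helper names and the small three-feature data set are made up for illustration.

```python
import math

def center(rows):
    """Subtract the column mean from every value."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, n_iter=500):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the corresponding eigenvalue.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, eigenvalue

# Made-up data: the first two features are strongly related, the third barely varies.
rows = [
    (2.5, 2.4, 0.5), (0.5, 0.7, 0.6), (2.2, 2.9, 0.4), (1.9, 2.2, 0.5),
    (3.1, 3.0, 0.6), (2.3, 2.7, 0.5), (2.0, 1.6, 0.4), (1.0, 1.1, 0.5),
]

centered = center(rows)
cov = covariance_matrix(centered)
component, eigenvalue = first_principal_component(cov)

# Share of total variance captured by the first component.
explained_ratio = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Project each 3-feature row onto the component to get a single score per row.
scores = [sum(x * c for x, c in zip(row, component)) for row in centered]

print("component:", [round(c, 3) for c in component])
print("explained variance ratio:", round(explained_ratio, 3))
```

Here three interrelated features are replaced by one score per row. In practice one would keep as many components as are needed to retain an acceptable share of the variance.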
Other Popular Multivariate Analysis Techniques
Other popular multivariate analysis techniques include:
- Correspondence Analysis
- Factor Analysis
- Variance Analysis
- Discriminant Analysis
- Multidimensional Scaling
- Redundancy Analysis