In R or Python, dummy variables (also called binary or indicator variables) are used to convert categorical variables (factors) into numbers that can be used in machine learning models such as linear regression or k-nearest neighbors (KNN), in statistical analyses, or in descriptive statistics.
When a column holds several categorical values, we can split it into multiple columns, one per category, each holding a 1 in the rows where that category is present. These new columns are called dummy variables. In short, a dummy column is one whose value is 1 (one) when a categorical event occurs and 0 (zero) when it doesn’t.
In most cases, the column records a feature of the event, person, or object being described. For example, the data might have a dummy variable for whether the animal is a cat. When the answer is yes, the new column gets a value of 1; when it is no, the column gets a value of 0.
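To make the cat example concrete, here is a minimal base R sketch. The `animal` vector and the `is_cat` column name are made up for illustration and do not come from any real data set.

```r
# Hypothetical example data: one categorical value per row
animal <- c("cat", "dog", "cat", "bird")

# The comparison gives TRUE/FALSE; as.integer() turns that into 1/0
is_cat <- as.integer(animal == "cat")

# Show the original values next to the dummy column
print(data.frame(animal, is_cat))
```

Rows where the animal is a cat get a 1 in `is_cat`, and every other row gets a 0.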
Let’s look at an example with R code in action. We will use the dummies library to create the dummy columns.
We have a data set about fruits. One of the columns in the data records whether the fruit is a banana or an apple.
We want to convert this data into dummy-column format. To do so, we need to produce two new columns: one indicating whether the fruit is an apple, and the other indicating whether it is a banana. Each row gets a value of 1 in the column that matches its fruit and a 0 in the other column.
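Before turning to a package, it can help to see the target format built by hand. The sketch below assumes the four-row fruit data used in this tutorial. The data frame name `fruits` and the column names `fruit_apple` and `fruit_banana` are illustrative choices, not names produced by any library.

```r
# The four-row fruit data used in this tutorial
id <- 1:4
fruit <- c("banana", "apple", "banana", "apple")
fruits <- data.frame(id, fruit)

# One indicator column per category: 1 where the row matches, 0 otherwise
fruits$fruit_apple  <- as.integer(fruits$fruit == "apple")
fruits$fruit_banana <- as.integer(fruits$fruit == "banana")

print(fruits)
```

Rows 1 and 3 (the bananas) get 0 in `fruit_apple` and 1 in `fruit_banana`, and rows 2 and 4 (the apples) get the opposite.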
R Code to Create Dummy Variables
Let’s convert the categorical variable column to dummy variables.
# install the dummies library
install.packages("dummies")

# load the dummies library
library(dummies)

# Creating the data frame
id <- c(1:4)
fruit <- c('banana','apple','banana','apple')
data <- data.frame(id, fruit)

# create the dummy columns and bind them to the data frame
data <- cbind(data, dummy(data$fruit, sep="_"))

We use the dummy() function to create one dummy column for each level of the variable passed to it, here data$fruit. We use cbind() to join the dummy columns to the original data frame. The sep argument specifies the character that separates the parts of each new column name; for example, we have used “_” (underscore), as in the column “data_banana”.
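If the dummies package is not available in your setup, base R can produce the same kind of columns. The sketch below swaps in model.matrix() from the stats package, which is a different technique from the dummy() call above. The variable name `indicators` is an illustrative choice, and the resulting column names (`fruitapple`, `fruitbanana`) follow model.matrix()'s own naming, not the “_” separator used earlier.

```r
# Same fruit data as in the tutorial; fruit is stored as a factor
id <- 1:4
fruit <- c("banana", "apple", "banana", "apple")
data <- data.frame(id, fruit = factor(fruit))

# model.matrix() expands a factor into indicator columns;
# "- 1" drops the intercept so every level gets its own column
indicators <- model.matrix(~ fruit - 1, data = data)

# Attach the indicator columns to the original data frame
data <- cbind(data, indicators)
print(data)
```

Without the `- 1`, model.matrix() would keep an intercept column and drop the first level (apple), which is the usual encoding for regression but not the one-column-per-category layout shown in this tutorial.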
In this tutorial, we have learned how to create dummy variables in R or RStudio.