R | Dummy Variables


In R or Python, dummy variables (also called binary or indicator variables) are used when we have categorical variables (factors in R) and need to convert them into numbers that can be used in machine learning models such as linear regression or k-nearest neighbors (KNN), or in statistical analyses and descriptive statistics.

When a column holds several categorical values, we can split it into multiple columns, one per category, each containing 1 in the rows where that category is present. These new columns are called dummy variables. In short, a dummy column is one whose value is 1 (one) when a categorical event occurs and 0 (zero) when it doesn’t.

In most cases, the column describes a feature of the event, person, or object in question. For example, the data might have a dummy variable for whether an animal is a cat. When the answer is yes, the new column gets a value of 1; when it is no, it gets a value of 0.
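The cat example can be sketched in a few lines of base R. The animal vector and the name is_cat below are made up for illustration and are not part of the tutorial's data.

```r
# Hypothetical data: one categorical value per animal
animal <- c("cat", "dog", "cat", "bird")

# The comparison yields TRUE/FALSE; as.integer() maps TRUE to 1 and FALSE to 0
is_cat <- as.integer(animal == "cat")

print(is_cat)
```

Each element of is_cat is 1 where the animal is a cat and 0 everywhere else, which is exactly what a single dummy column stores.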

Let’s look at an example in R. We will use the dummies package to create the dummy columns.

We have a data set about fruits. One of its columns records whether each fruit is a banana or an apple.

id  fruit
1   banana
2   apple
3   banana
4   apple

We want to convert the above data into the format shown below. To make dummy columns from this data, we need to produce two new columns: one indicating whether the fruit is an apple, and the other indicating whether it is a banana. Each row gets a value of 1 in the column matching its fruit and 0 in the other.

id  fruit   data_apple  data_banana
1   banana  0           1
2   apple   1           0
3   banana  0           1
4   apple   1           0
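Before turning to a package, it may help to see how these two columns could be built by hand. The following is a minimal base R sketch of the idea, not the method this tutorial uses. The fruit_ column prefix is an illustrative choice and differs from the data_ prefix in the table above.

```r
# Same fruit data as in the tables above
data <- data.frame(id = 1:4,
                   fruit = c("banana", "apple", "banana", "apple"))

# Distinct categories, as plain strings, in alphabetical order
fruit_levels <- sort(unique(as.character(data$fruit)))

# Add one 0/1 indicator column per category, named fruit_<category>
for (level in fruit_levels) {
  data[[paste0("fruit_", level)]] <- as.integer(data$fruit == level)
}

print(data)
```

The loop adds one column per distinct fruit, so a column with more categories would gain correspondingly more dummy columns.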

R Code to Create Dummy Variables

Let’s convert the categorical column into dummy variables using the dummies package.

R Code:

# Install the dummies package (only needed once)
install.packages("dummies")

# Load the dummies package
library(dummies)

# Create the example data frame
id <- 1:4
fruit <- c("banana", "apple", "banana", "apple")
data <- data.frame(id, fruit)

# Build one 0/1 column per fruit and append them to the data frame
data <- cbind(data, dummy(data$fruit, sep = "_"))

# Display the result
print(data)

Output:

id  fruit   data_apple  data_banana
1   banana  0           1
2   apple   1           0
3   banana  0           1
4   apple   1           0

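The dummy() function creates one dummy column for each distinct value of the variable passed to it, here data$fruit. (To convert every factor column of a data frame at once, the dummies package also provides dummy.data.frame().) We use cbind() to join the dummy columns to the original data frame. The sep argument specifies the character that separates the prefix from the category value in the new column names; e.g. we have used “_” (underscore), which gives the column “data_banana”.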
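If the dummies package is not available in your R installation, a similar result can be sketched with base R's model.matrix(). This is an alternative technique, not what the code above uses. The names indicator_matrix and result, and the fruit_ prefix, are illustrative assumptions.

```r
# Same fruit data, with fruit stored explicitly as a factor
data <- data.frame(id = 1:4,
                   fruit = factor(c("banana", "apple", "banana", "apple")))

# "~ fruit - 1" drops the intercept so that every level gets its own column
indicator_matrix <- model.matrix(~ fruit - 1, data = data)

# model.matrix() names columns <variable><level>, e.g. "fruitapple";
# insert an underscore to mimic the sep = "_" naming style
colnames(indicator_matrix) <- sub("^fruit", "fruit_", colnames(indicator_matrix))

# Append the indicator columns to the original data frame
result <- cbind(data, indicator_matrix)
print(result)
```

Without the "- 1" in the formula, model.matrix() would keep an intercept column and drop the first level, which is the usual encoding for regression but not the one-column-per-category layout shown in this tutorial.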

Conclusion

In this tutorial, we learned how to create dummy variables in R or RStudio.

© 2022 KeyToDataScience