In R or Python, dummy variables (also called binary or indicator variables) are used to convert categorical variables (factors) into numbers that can be used in machine learning models such as linear regression or k-nearest neighbors (KNN), as well as in statistical analyses and descriptive statistics.
When a column holds several categorical values, we can split it into multiple columns, one per category, each holding 1 where that category is present. These new columns are called dummy variables. In short, a dummy column holds 1 (one) when a categorical event occurs and 0 (zero) when it doesn’t.
In most cases, the column describes a feature of the event, person, or object in question. For example, the data might have a dummy variable for whether an animal is a cat. When the answer is yes, the column gets a value of 1, and when it is no, the column gets a value of 0.
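The cat example can be sketched in a few lines of base R. The `animal` vector and the `is_cat` name are made up for illustration and do not come from a real data set.

```r
# Hypothetical example data, invented for illustration
animal <- c("cat", "dog", "cat", "bird")

# The comparison yields TRUE/FALSE for each element;
# as.integer() turns TRUE into 1 and FALSE into 0
is_cat <- as.integer(animal == "cat")

print(is_cat)
```

Each element of `is_cat` is 1 where the animal is a cat and 0 otherwise, which is exactly what a dummy column stores.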
Let’s look at an example with R code in action. We will use the dummies library to create the dummy columns.
We have a data set about fruits. One of the columns records whether the fruit is a banana or an apple.
id | fruit |
---|---|
1 | banana |
2 | apple |
3 | banana |
4 | apple |
We want to convert the above data into the format below. To make dummy columns from this data, we need to produce two new columns. One indicates whether the fruit is an apple, and the other indicates whether the fruit is a banana. Each row gets a value of 1 in the column that matches its fruit and a 0 in the other column.
id | fruit | data_apple | data_banana |
---|---|---|---|
1 | banana | 0 | 1 |
2 | apple | 1 | 0 |
3 | banana | 0 | 1 |
4 | apple | 1 | 0 |
R Code to Create Dummy Variables
Let’s convert the categorical column into dummy variables using the dummies library.
R Code:
# install the dummies library
install.packages("dummies")
# load the dummies library
library(dummies)
# Creating the data frame
id <- c(1:4)
fruit <- c('banana','apple','banana','apple')
data <- data.frame(id, fruit)
# Create the dummy columns and bind them to the original data frame
data <- cbind(data, dummy(data$fruit, sep = "_"))
Output:
id | fruit | data_apple | data_banana |
---|---|---|---|
1 | banana | 0 | 1 |
2 | apple | 1 | 0 |
3 | banana | 0 | 1 |
4 | apple | 1 | 0 |
The dummy() function creates one dummy column for each distinct value of the variable passed to it (here, data$fruit). We use cbind() to join the dummy columns to the original data frame. The sep argument specifies the character that separates the prefix from the category value in each new column name, e.g. we have used “_” (underscore), which gives the column name “data_banana”.
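In case the dummies package is not available for your R version, the same table can also be built with base R alone. The sketch below swaps in `model.matrix()` from the built-in stats package in place of `dummy()`. The `dummy_cols` name and the `data_` prefix are choices made here to mirror the output above, not something `model.matrix()` produces on its own.

```r
# Recreate the fruit data frame from the tutorial
id <- 1:4
fruit <- c("banana", "apple", "banana", "apple")
data <- data.frame(id, fruit)

# `~ fruit - 1` removes the intercept so that every fruit
# gets its own 0/1 column instead of one being dropped
dummy_cols <- model.matrix(~ fruit - 1, data = data)

# Rename the columns to match the "data_<value>" pattern used above
# (model.matrix orders the columns by sorted category value)
colnames(dummy_cols) <- paste0("data_", sort(unique(data$fruit)))

# Join the dummy columns to the original data frame
data <- cbind(data, dummy_cols)

print(data)
```

Removing the `- 1` keeps an intercept and drops the first category, which is usually what a regression model needs, so the choice depends on how the columns will be used.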
Conclusion
In this tutorial, we have learned how to create dummy variables in R or RStudio.