Kaggle's Titanic: Machine Learning from Disaster competition is widely considered the first step into the realm of Data Science. In this article we will cover an easy beginner's solution to the Kaggle Titanic problem in Python.
This article is written for beginners who want to start their journey into Data Science and assumes no previous knowledge of machine learning. If you have a laptop or computer and 20-odd minutes, you are ready to build your first machine learning model.
We will work through an interesting example of a classification problem (explained here), which will give you an overall idea of the steps involved in creating a machine learning model.
Titanic Disaster Problem: the aim is to build a machine learning model on the Titanic dataset that predicts whether a passenger would have survived, using the passenger data. If you want more details, click on the link.
The data contains information about the passengers on the Titanic, such as name, sex, age, survival, economic status (class), etc.
Prerequisites:
- Install Anaconda from here: https://www.anaconda.com/distribution/
- Go to start, search and open Jupyter Notebook.
- To get some hands-on experience with Python, you can follow the basic lessons here. These will give you an intro to Python and its libraries (e.g. Pandas).
Let’s Create Your First Machine Learning Model
Steps involved in a machine learning model:
- Gathering Data
- Data Pre-processing (Cleaning and Preparing Data)
- Cleaning of data, e.g. data type conversion and missing value imputation
- EDA and Feature Engineering (In next tutorial)
- Train/Test split
- Choosing and training a model
- Evaluating the model
- Hyperparameter tuning (In next tutorial)
- Prediction
1. Gathering Data
import numpy as np
import pandas as pd
# The machine learning algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
We start by importing the required libraries. Pandas and NumPy are data manipulation libraries. For machine learning we will use a classification algorithm: Random Forest or Logistic Regression.
# Test train split
from sklearn.model_selection import train_test_split
We use the train_test_split function to split the data into train and test sets so we can detect and avoid overfitting. Overfitting is when the model learns the training data so well that it fails to generalize to test or unseen data. As a result, we get very good accuracy on the train data but very poor accuracy on the test data.
import os
#Setting the Directory/Folder
#Enter path where the downloaded files are present
os.chdir(r"D:\Titanic")  # raw string so the backslash is not treated as an escape character
This os command sets the default path to the folder into which you downloaded the files. We will now read the CSV file with pandas.
train = pd.read_csv("train.csv")
train.head()
train.head() shows the first 5 rows of the data. There are a few NaN values in the data which we will have to impute, but let's leave that for the next, advanced tutorial (Missing Value Imputation).
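Side note: if the notebook cannot find train.csv, you can also skip os.chdir and pass the full path straight to read_csv (a small sketch assuming the same D:\Titanic folder as above):
# Equivalent alternative: read the file by its full Windows path (raw string)
train = pd.read_csv(r"D:\Titanic\train.csv")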
2. Data Pre-processing
train.describe()
describe() is a good command for getting to know the data in a summarized way: for every numeric column it reports the count, mean, standard deviation, min, quartiles and max.
This is a case of supervised learning, in which the model needs both inputs and the output to learn. In this case the 'Survived' column is the output column and all the others are input columns.
So there are 11 input columns. However, not all columns are equally important for the model to learn from. If you remember the Titanic movie, you will know that the rich were more likely to survive, and that preference was given to women, children and the elderly.
So according to our hypothesis, older rich women and children were the most likely to survive and poor middle-aged men were the least likely to survive.
Age and Sex are directly provided in the data. We can presume whether a person is rich or poor by looking at Passenger class (Pclass).
So these are the 3 inputs to our machine learning algorithm: Passenger class, age and sex.
Point to note: the algorithms in sklearn (the library we are using) do not work with missing values, so let's first check the data for missing values. Also, they only work with numbers.
#Checking the missing values
train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
We can see that Age has 177 missing values out of 891 rows. We could impute these missing values, but let's leave that for the next, advanced tutorial. For now, let's not use the Age column. So for model input we will have only Passenger class and Sex; the output is the Survived field.
2.1 Data Preparation
# Selecting only 2 columns for ease
train_x = train[["Pclass", "Sex"]]
train_x.head()
   Pclass     Sex
0       3    male
1       1  female
2       3  female
3       1  female
4       3    male
We extract the 2 selected input columns into a new dataframe, train_x.
# Selecting the output/ target variable
train_y = train[["Survived"]]
train_y.head()
   Survived
0         0
1         1
2         1
3         1
4         0
Similarly, extract the output to train_y.
As mentioned above, the algorithms in sklearn only work with numbers. That means we cannot pass sex as "male" or "female".
2.2 Cleaning of data
# Converting Male/Female to integer numbers, called Label Encoding
train_x["Sex"].replace("male", 1, inplace = True)
train_x["Sex"].replace("female", 0, inplace = True)
train_x.head()
   Pclass  Sex
0       3    1
1       1    0
2       3    0
3       1    0
4       3    1
We have updated the sex column.
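By the way, an equivalent way to do this label encoding is pandas' map (a minimal alternative sketch, not the original code; run it instead of the two replace calls above). It also sidesteps the SettingWithCopyWarning discussed in the troubleshooting note at the end of this article:
# Alternative: build the inputs with an explicit copy, then encode Sex with map()
train_x = train[["Pclass", "Sex"]].copy()
train_x["Sex"] = train_x["Sex"].map({"male": 1, "female": 0})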
2.3 Train/Test Split to prevent overfitting
We split our data into a train set and a cross-validation set. The training set is used to train the machine learning algorithm, while the cross-validation set is used to measure the model's accuracy (since we have the actual output for the cross-validation set). Accuracy is calculated by comparing the actual output with the predicted output; refer to this link for how accuracy is calculated in a classification problem. The test set is the data for which we do not have the output variable (Survived in this problem). You can find the test CSV file in the downloaded folder.
We will use the train_test_split function to create the train/cross-validation split. We will use 70% of the data to train the model and 30% of the data to check accuracy.
# Making dataset for validation
tr_x, cv_x, tr_y, cv_y = train_test_split(train_x, train_y, test_size = 0.30)
print(tr_x.head())
print(tr_y.head())
     Pclass  Sex
452       1    1
168       1    1
742       1    0
413       2    1
392       3    1

     Survived
452         0
168         0
742         1
413         0
392         0
tr_x and tr_y are the training input and output, and cv_x and cv_y are the cross-validation input and output.
3. Choosing and training a model
3.1 Random Forest Model
# Call the Machine Learning Algorithm
rf = RandomForestClassifier()
This will create a Random Forest machine learning algorithm instance rf.
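Note that both the train/test split and the forest itself involve randomness, so your accuracy will differ slightly from run to run. If you want repeatable numbers, you can fix the seed (an optional sketch, not part of the original code):
# Optional: fix the random seed for reproducible results
rf = RandomForestClassifier(n_estimators=100, random_state=42)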
# Fitting and training the above called algorithm
rf.fit(tr_x, tr_y)
This simple fit() function trains our algorithm: it takes our input dataframe (tr_x) and learns to reproduce the expected output (tr_y). That's why we narrowed down the input columns, so that the algorithm is not confused by noise.
There is a popular saying in the analytics community: "garbage in, garbage out". It describes the idea that flawed or nonsensical input data produces nonsensical output, or "garbage". To avoid this we use feature engineering and feature selection, which we will cover in the next tutorial.
4. Evaluating the model
Now that the model is fitted, the rf instance has “learned” how to predict Titanic survivors. Let's check how accurate our algorithm is on the cross-validation data:
Accuracy_RandomForest = rf.score(cv_x, cv_y)
print("Accuracy = {}%".format(Accuracy_RandomForest * 100))
Accuracy = 76.11940298507463%
The score() function takes the cross-validation input, predicts the output and computes the accuracy by comparing the predictions with the known cross-validation outputs.
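If you want to see what score() is doing under the hood, you can compute the same number yourself with predict() and sklearn's accuracy_score (a small illustrative sketch):
from sklearn.metrics import accuracy_score

# Predict labels for the cross-validation set and compare with the known answers
pred = rf.predict(cv_x)
print(accuracy_score(cv_y.values.ravel(), pred))  # fraction of predictions matching cv_y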
Additional: Logistic Regression Model (Training & Evaluation)
This is just to show how easy it is to implement other machine learning classification models using the sklearn library in Python.
lgr = LogisticRegression()
lgr.fit(tr_x, tr_y)
The LogisticRegression model is fitted, and we can check its accuracy on the cross-validation data.
Accuracy_LogisticRegression = lgr.score(cv_x, cv_y)
print("Accuracy = {}%".format(Accuracy_LogisticRegression * 100))
Accuracy = 82.08955223880598%
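As a small extra (not covered in the original tutorial), logistic regression can also output class probabilities rather than just 0/1 labels, via predict_proba:
# Probability of each class for the first 5 cross-validation passengers;
# columns follow lgr.classes_ (here: column 0 = P(not survived), column 1 = P(survived))
print(lgr.predict_proba(cv_x)[:5])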
5. Prediction
DIY: We have a test file in the downloaded folder. Using our trained model, we will predict survival for this test file. Upload the file with your predictions to Kaggle (here) to find out your accuracy and rank on the leaderboard. I hope you will be able to complete this part; in case of any doubt, feel free to leave a comment, or just find the code for this part here. Just load the test file, convert the Sex column to integers and predict using the rf.predict() function, as sketched below.
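Here is a minimal sketch of that DIY part, assuming the test file is named test.csv and that the submission format is a CSV with PassengerId and Survived columns (check the competition page to confirm):
# Load the Kaggle test file and prepare the same two input columns
test = pd.read_csv("test.csv")
test_x = test[["Pclass", "Sex"]].copy()
test_x.loc[test_x["Sex"] == "male", "Sex"] = 1
test_x.loc[test_x["Sex"] == "female", "Sex"] = 0
test_x["Sex"] = test_x["Sex"].astype(int)

# Predict with the trained Random Forest and write a Kaggle submission file
predictions = rf.predict(test_x)
submission = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": predictions})
submission.to_csv("submission.csv", index=False)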
If you have followed this article till here, congratulations on building your first machine learning model in Python.
Conclusion
As this is a beginner's tutorial, I tried to keep it as simple as possible. Next, I will write a follow-up advanced tutorial solving the Kaggle Titanic disaster problem in Python.
Below is a snippet of the code in a Jupyter notebook. To download the Part 1 notebook, click here.
Troubleshooting note (from reader questions): when label-encoding the Sex column, pandas may show the warning
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
This is just a warning, not an error, introduced in pandas version 0.21.0 (link). You can either ignore the warning by using:
import warnings
warnings.filterwarnings("ignore")
or use the suggested .loc method given below (here applied to the test file's Sex column):
test_x.loc[(test_x['Sex']=="male"),"Sex"] = 1
test_x.loc[(test_x['Sex']=="female"),"Sex"] = 0