Start with Machine Learning – Kaggle Titanic Solution Python (Easy)

The Kaggle competition “Titanic: Machine Learning from Disaster” is widely considered a first step into data science. This article walks through an easy Python solution for beginners.

This article is written for beginners who want to start their journey into data science, and it assumes no previous knowledge of machine learning. If you have a laptop or desktop computer and 20-odd minutes, you are ready to build your first machine learning model.

We will go through an interesting example of a classification problem (explained here), which will give an overall idea of the steps involved in creating a machine learning model.

Titanic Disaster Problem: The aim is to build a machine learning model on the Titanic dataset that uses passenger data to predict whether a passenger on the Titanic would have survived. For more details, follow the link.

The data contains information about the passengers on the Titanic, such as name, sex, age, survival, and economic status (class).

Prerequisites:

  1. Install Anaconda from here: https://www.anaconda.com/distribution/
  2. Open the Start menu, search for Jupyter Notebook and open it.
  3. To get some hands-on practice with Python, you can follow these basic lessons here. They will give you an introduction to Python and its libraries (e.g. Pandas).

Let’s Create Your First Machine Learning Model

Steps involved in a machine learning model:

  1. Gathering Data
  2. Data Pre-processing (Cleaning and Preparing Data)
    • Cleaning of data, e.g. data conversion and missing value imputation
    • EDA and Feature Engineering (In next tutorial)
    • Train/Test split
  3. Choosing and training a model
  4. Evaluating the model
  5. Hyperparameter tuning (In next tutorial)
  6. Prediction

1. Gathering Data

We start by importing the libraries we need. Pandas and NumPy are data manipulation libraries. For machine learning we will use two classification algorithms, Random Forest and Logistic Regression.

We use the train_test_split function to split the data into training and test sets so that we can check for overfitting. Overfitting happens when the model learns the training data so well that it fails to generalize to the test data or other unseen data. The result is very good accuracy on the training data but very poor accuracy on the test data.

An os command sets the working directory to the folder into which you downloaded the files. We then read the CSV file with Pandas.
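The loading step can be sketched roughly as follows. The helper name load_titanic and its folder argument are made up for this illustration, and the sketch assumes the Kaggle file is called train.csv. It builds the full path rather than changing the working directory, which is a choice made here, not necessarily what the original notebook does.

```python
import os

import pandas as pd


def load_titanic(folder, filename="train.csv"):
    """Read one of the Kaggle Titanic CSV files from `folder` into a DataFrame.

    `load_titanic` and `folder` are illustrative names, not a library API.
    """
    # Build the full path instead of relying on the current working directory.
    # (Calling os.chdir(folder) and then pd.read_csv(filename) also works.)
    path = os.path.join(folder, filename)
    return pd.read_csv(path)
```

With the files downloaded, something like train = load_titanic("path/to/your/titanic/folder") followed by train.head() would display the first rows. The folder path is a placeholder for wherever you saved the data.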

[Figure: output of train.head()]

train.head() shows the first 5 rows of the data. There are a few NaN values in the data that we would need to impute, but let’s leave that for the next, more advanced tutorial (Missing Value Imputation).

2. Data Pre-processing

[Figure: output of train.describe()]

describe() is a handy command for getting a statistical summary of the numeric columns.

This is a case of supervised learning, in which the model needs both inputs and outputs to learn. In this case the ‘Survived’ column is the output column and all the rest are input columns.

So there are 11 input columns. However, not all columns are equally important for the model to learn from. If you remember the Titanic movie, you will know that the rich were more likely to survive. Preference was also given to children, women and the elderly.

So according to our hypothesis, older rich women and children were the most likely to survive and poor middle-aged men were the least likely to survive.

Age and Sex are directly provided in the data. We can presume whether a person is rich or poor by looking at Passenger class (Pclass).

So these are the 3 inputs to our machine learning algorithm: Passenger class, age and sex.

Point to note: most of the algorithms in Sklearn (the library we are using) do not work with missing values, so let’s first check the data for missing values. They also work only with numbers.
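A minimal sketch of such a check is shown here on a tiny made-up sample, so that it runs without the CSV file. The sample values are invented for illustration. On the real data the same call would be train.isnull().sum(), assuming train is the DataFrame read from train.csv.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample with some of the Titanic column names.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, np.nan],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_counts = sample.isnull().sum()
print(missing_counts)
```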

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We can see that Age has 177 missing values out of 891. We could impute these missing values, but let’s leave that for the next, more advanced tutorial. For now, we will leave out the Age column, so the model input will be only Passenger class and Sex. The output is the Survived field.

2.1 Data Preparation

   Pclass     Sex
0       3    male
1       1  female
2       3  female
3       1  female
4       3    male

Let’s extract the 2 selected input columns into a new dataframe, train_x.

   Survived
0         0
1         1
2         1
3         1
4         0

Similarly, extract the output to train_y.
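The two extraction steps could look roughly like this. The small train frame below holds only the first five rows shown in the tables above, standing in for the full DataFrame read from train.csv. The .copy() calls are an addition in this sketch: they make train_x and train_y independent of train, so that changing them later does not raise pandas’ “copy of a slice” warning.

```python
import pandas as pd

# First five rows of the relevant columns, standing in for the full data.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Double brackets select a list of columns and return a DataFrame.
train_x = train[["Pclass", "Sex"]].copy()   # model inputs
train_y = train[["Survived"]].copy()        # model output

print(train_x.head())
print(train_y.head())
```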

As mentioned above, the algorithms in Sklearn only work with numbers. That means we cannot pass Sex as ‘male’ or ‘female’.

2.2 Cleaning of data

   Pclass  Sex
0       3    1
1       1    0
2       3    0
3       1    0
4       3    1

We have updated the Sex column: it is now numeric, with 1 for male and 0 for female.
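One way to do this conversion is sketched below on the same five rows. Using map() with a dictionary is a choice made for this sketch, and the original notebook may do it differently (for example with .loc assignments); the male = 1 and female = 0 coding matches the table above.

```python
import pandas as pd

# First five rows of the input columns, as in the earlier table.
train_x = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# map() replaces each label with its number: male -> 1, female -> 0.
train_x["Sex"] = train_x["Sex"].map({"male": 1, "female": 0})
print(train_x)
```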

2.3 Train/Test Split to prevent overfitting

We split our data into a training set and a cross-validation set. The training set is used to train the machine learning algorithm, while the cross-validation set is used to measure the model’s accuracy (we can do this because we have the actual output for the cross-validation set). Accuracy is calculated by comparing the actual output with the predicted output. Refer to this link for how accuracy is calculated in a classification problem. The test set is the data for which we do not have the output variable (Survived in this problem). You can find the test CSV file in the downloaded folder.

We will use the train_test_split function to create the train/cross-validation split. We will use 70% of the data to train the model and 30% of the data to check accuracy.
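A minimal sketch of the 70/30 split follows. The ten rows are made up for illustration, and random_state=0 is an arbitrary seed chosen here so that the split is repeatable; neither comes from the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the prepared data (10 rows), for illustration only.
train_x = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 2, 1, 3, 2],
    "Sex":    [1, 0, 0, 0, 1, 1, 0, 1, 1, 0],
})
train_y = pd.DataFrame({"Survived": [0, 1, 1, 1, 0, 0, 1, 0, 0, 1]})

# test_size=0.3 holds out 30% of the rows for cross-validation.
tr_x, cv_x, tr_y, cv_y = train_test_split(
    train_x, train_y, test_size=0.3, random_state=0
)
print(len(tr_x), len(cv_x))
```

The rows are shuffled before splitting, which is why the index values in the split data are no longer in order.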

     Pclass  Sex  Survived
452       1    1         0
168       1    1         0
742       1    0         1
413       2    1         0
392       3    1         0

tr_x & tr_y are the training input and output and cv_x & cv_y are cross-validation input and output.

3. Choosing and training a model

3.1 Random Forest Model

We first create rf, an instance of the Random Forest machine learning algorithm.

The fit() function trains our algorithm. It takes the input dataframe (tr_x) and learns to produce the expected output (tr_y). This is why we narrowed down the input columns: so that the algorithm is not confused by noise.
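Creating and training the model might look like the sketch below. The training rows are made up, and the n_estimators and random_state values are assumptions chosen for this example (the original notebook may simply use the defaults).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training data with the two input columns and the output column.
tr_x = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 2, 1],
    "Sex":    [1, 0, 0, 0, 1, 1, 0, 1],
})
tr_y = pd.DataFrame({"Survived": [0, 1, 1, 1, 0, 0, 1, 0]})

# Create the Random Forest instance.
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# fit() expects a 1-D target, so pass the Survived column itself
# rather than the one-column DataFrame.
rf.fit(tr_x, tr_y["Survived"])
```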

There is a popular saying in the analytics community: “Garbage in, garbage out”. It expresses the idea that flawed or nonsensical input data produces nonsensical output, or “garbage”. To avoid this we use feature engineering and feature selection, which we will cover in the next tutorial.

4. Evaluating the model

The instance has now “learned” how to predict Titanic survivors as the model is fitted. Now we can check how accurate our algorithm is on cross-validation data:

Accuracy = 76.11940298507463%

The score() function takes the cross-validation input, predicts an output for each row and computes the accuracy by comparing these predicted outputs with the known cross-validation outputs.
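To make the calculation concrete, the sketch below computes the accuracy twice: once with score() and once by hand, as the fraction of cross-validation rows where the prediction matches the actual value. All data here is made up, so the number it prints will not match the figure above.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training and cross-validation data, for illustration only.
tr_x = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3, 2, 2, 1],
                     "Sex":    [1, 0, 0, 0, 1, 1, 0, 1]})
tr_y = pd.Series([0, 1, 1, 1, 0, 0, 1, 0], name="Survived")
cv_x = pd.DataFrame({"Pclass": [1, 3, 2, 3],
                     "Sex":    [0, 1, 0, 1]})
cv_y = pd.Series([1, 0, 1, 1], name="Survived")

rf = RandomForestClassifier(random_state=0).fit(tr_x, tr_y)

# score() returns the fraction of correct predictions (between 0 and 1).
accuracy = rf.score(cv_x, cv_y)

# The same number by hand: compare predictions with the actual values.
predicted = rf.predict(cv_x)
manual_accuracy = (predicted == cv_y.to_numpy()).mean()

print("Accuracy = {}%".format(accuracy * 100))
```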

Additional: Logistic Regression Model (Training & Evaluation)

This is just to show how easy it is to implement other machine learning classification models using the sklearn library in Python.

The LogisticRegression model is fitted in the same way, and we can then check its accuracy on the cross-validation data.
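A sketch of the same train-and-score steps with Logistic Regression is given below. It uses made-up data and the default settings of LogisticRegression, which is an assumption; only the class name changes compared with the Random Forest version.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up training and cross-validation data, for illustration only.
tr_x = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3, 2, 2, 1],
                     "Sex":    [1, 0, 0, 0, 1, 1, 0, 1]})
tr_y = pd.Series([0, 1, 1, 1, 0, 0, 1, 0], name="Survived")
cv_x = pd.DataFrame({"Pclass": [1, 3, 2, 3],
                     "Sex":    [0, 1, 0, 1]})
cv_y = pd.Series([1, 0, 1, 1], name="Survived")

# Same two calls as before: fit() to train, score() to evaluate.
lr = LogisticRegression()
lr.fit(tr_x, tr_y)
lr_accuracy = lr.score(cv_x, cv_y)

print("Accuracy = {}%".format(lr_accuracy * 100))
```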

Accuracy = 82.08955223880598%

6. Prediction

DIY: There is a test file in the downloaded folder. Use the trained model to predict survival for this file, then upload the file with your predictions to Kaggle (here) to find out your accuracy and rank on the leaderboard. In short: load the test file, convert the Sex column to integers and predict using the rf.predict() function. I hope you will be able to complete this part; if you have any doubts, feel free to leave a comment or find the code for this part here.

If you have followed this article this far, congratulations on completing your first machine learning tutorial using Python.
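If you get stuck, the steps could be sketched as below. This is not the linked solution code. The training rows and the three test rows are made up so that the example runs on its own; with the real files, test would come from reading test.csv. The submission layout (a PassengerId column and a Survived column) is the one Kaggle asks for in this competition.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training data; in the tutorial the model is already trained.
tr_x = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3, 2, 2, 1],
                     "Sex":    [1, 0, 0, 0, 1, 1, 0, 1]})
tr_y = pd.Series([0, 1, 1, 1, 0, 0, 1, 0], name="Survived")
rf = RandomForestClassifier(random_state=0).fit(tr_x, tr_y)

# Made-up stand-in for the DataFrame read from test.csv.
test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 3, 2],
    "Sex": ["male", "female", "male"],
})

# Same preparation as for the training data.
test_x = test[["Pclass", "Sex"]].copy()
test_x["Sex"] = test_x["Sex"].map({"male": 1, "female": 0})

# One prediction per passenger, paired with the passenger's ID.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": rf.predict(test_x),
})
print(submission)
```

Calling submission.to_csv("submission.csv", index=False) would then write the file to upload; the file name is just an example.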

Conclusion

As this is a beginner’s model, I have tried to keep this tutorial as simple as possible. I will follow it up with a more advanced tutorial on solving the Kaggle Titanic disaster problem in Python.

Below is the snippet of the code in Jupyter notebook. To download the Part1 notebook click here.

5 thoughts on “Start with Machine Learning – Kaggle Titanic Solution Python (Easy)”

  1. hello in
    train_y = data[["Survived"]]
    train_y.head()

    there’s an error , said name ‘data’ is not defined, how can i define it ?

      • thank you, btw i wanna ask more ,
        A value is trying to be set on a copy of a slice from a DataFrame.
        Try using .loc[row_indexer,col_indexer] = value instead
        its the error i got in input[26], know any ways to fix this problem ? thanks before

        • This is just a warning (pandas’ SettingWithCopyWarning), not an error. Link.
          Either you can ignore the warning by using:
          import warnings
          warnings.filterwarnings("ignore")

          or,
          Use suggested .loc method given below:
          test_x.loc[(test_x['Sex']=="male"),"Sex"] = 1
          test_x.loc[(test_x['Sex']=="female"),"Sex"] = 0
