Import the required libraries and load the dataset.
import numpy as np
import pandas as pd
# The machine learning algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Train/test split
from sklearn.model_selection import train_test_split
Set the working directory to the folder where the files were downloaded.
import os
os.chdir(r"D:\Personal\Site\Articles\Titanic")  # Raw string keeps the backslashes literal; enter the path where the files are located
train = pd.read_csv("train.csv")
train.head()
Data cleaning, e.g. type conversion and missing value imputation
EDA and feature engineering (in the next tutorial)
Train/Test split
train.describe()
train.isnull().sum()
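The two columns used in this tutorial (Pclass and Sex) are modelled without any imputation, so the missing value step from the list above is never shown. As a minimal sketch of what it could look like, the toy DataFrame below fills a numeric column with its median and a categorical column with its most frequent value. The values are made up and the column names are chosen only for illustration; this is not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; this is NOT the Titanic dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column, the same check as train.isnull().sum().
missing_before = toy.isnull().sum()

# Numeric column: fill gaps with the median of the observed values.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill gaps with the most frequent value (the mode).
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

missing_after = toy.isnull().sum()
print(missing_before)
print(missing_after)
```

Median and mode are only one common choice; whether they suit a given column depends on the data.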
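train_x = train[["Pclass", "Sex"]].copy()  # copy so the edit below does not act on a view of train
train_x.head()
train_y = train["Survived"]  # 1-D Series, the label shape scikit-learn expects
train_y.head()
# Encode Sex as a number: male = 1, female = 0
train_x["Sex"] = train_x["Sex"].map({"male": 1, "female": 0})
train_x.head()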
tr_x, cv_x, tr_y, cv_y = train_test_split(train_x, train_y, test_size = 0.30)
print(tr_x.head())
print(tr_y.head())
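train_test_split shuffles the rows randomly, so the accuracy figures in the following sections change from run to run. A sketch of a reproducible variant, assuming a fixed random_state (the value 42 is arbitrary) and stratification on the label so that both parts keep a similar survival ratio. The toy data and the toy_ names are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for train_x and train_y (made-up values).
toy_x = pd.DataFrame({
    "Pclass": [1, 2, 3, 1, 2, 3, 1, 2, 3, 3],
    "Sex":    [0, 1, 1, 0, 0, 1, 1, 0, 1, 1],
})
toy_y = pd.Series([1, 0, 0, 1, 1, 0, 0, 1, 0, 0], name="Survived")

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
toy_tr_x, toy_cv_x, toy_tr_y, toy_cv_y = train_test_split(
    toy_x, toy_y, test_size=0.30, random_state=42, stratify=toy_y
)

# Running the split again with the same random_state selects the same rows.
again_tr_x, again_cv_x, again_tr_y, again_cv_y = train_test_split(
    toy_x, toy_y, test_size=0.30, random_state=42, stratify=toy_y
)
same_rows = list(toy_cv_x.index) == list(again_cv_x.index)
print(len(toy_tr_x), len(toy_cv_x), same_rows)
```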
Random Forest Model
rf = RandomForestClassifier()
rf.fit(tr_x, tr_y)
accuracy = rf.score(cv_x, cv_y)
print("Accuracy = {}%".format(accuracy * 100))
Logistic Regression Model
lgr = LogisticRegression()
lgr.fit(tr_x, tr_y)
accuracy = lgr.score(cv_x, cv_y)
print("Accuracy = {}%".format(accuracy * 100))
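A single 70/30 split gives a fairly noisy accuracy estimate, so small differences between the two models may not mean much. One possible way to compare them more steadily is k-fold cross-validation with cross_val_score. The sketch below uses synthetic data generated on the spot (not the Titanic file), and the label rule and noise level are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; values are generated, not taken from train.csv.
rng = np.random.default_rng(0)
n = 200
toy_x = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),  # classes 1, 2, 3
    "Sex": rng.integers(0, 2, size=n),     # male = 1, female = 0
})

# Made-up label: Sex == 0 survives, except for roughly 20% of rows where the label is flipped.
flipped = rng.random(n) < 0.2
toy_y = pd.Series(np.where(flipped, toy_x["Sex"], 1 - toy_x["Sex"]), name="Survived")

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(),
}

# Mean accuracy over 5 folds for each model.
scores = {}
for name, model in models.items():
    fold_accuracy = cross_val_score(model, toy_x, toy_y, cv=5, scoring="accuracy")
    scores[name] = fold_accuracy.mean()
    print("{}: mean accuracy = {:.1f}%".format(name, scores[name] * 100))
```

With the real data, train_x and train_y would take the place of toy_x and toy_y.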
Note that the test set has no Survived column, so accuracy cannot be computed locally. Upload the submission file created at the end of this section to Kaggle to get the accuracy.
test = pd.read_csv("test.csv")
test.head()
test_x = test[["Pclass", "Sex"]].copy()  # copy so the edit below does not act on a view of test
test_x.head()
# Apply the same encoding used for the training data: male = 1, female = 0
test_x["Sex"] = test_x["Sex"].map({"male": 1, "female": 0})
test_x.head()
prd = rf.predict(test_x)
prd
Check the format of the submission file: it has two columns, PassengerId and Survived. So let's convert our predicted output to the same format.
op = test[['PassengerId']].copy()  # copy so that adding a column does not act on a view of test
op['Survived'] = prd
op.head()
op.to_csv("Submission.csv",index=False)
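Before uploading, a quick read-back of the written file can catch format problems early. The sketch below assumes the expected layout is exactly the two columns named above, with 0/1 labels and no missing values. It uses a made-up toy frame and writes to a temporary folder so that the real Submission.csv is left untouched.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the op DataFrame (made-up ids and predictions).
toy_op = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 0]})

# Write to a temporary folder instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "Submission.csv")
toy_op.to_csv(path, index=False)

# Read the file back and check its layout.
check = pd.read_csv(path)
columns_ok = list(check.columns) == ["PassengerId", "Survived"]
labels_ok = bool(check["Survived"].isin([0, 1]).all())
no_missing = not check.isnull().values.any()
print(columns_ok, labels_ok, no_missing)
```

Writing with index=False matters here: without it, to_csv adds the row index as an extra unnamed column.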