AnalyticsDojo

Introduction to R - Titanic Baseline

rpi.analyticsdojo.com

Running Code using Kaggle Notebooks

  • Kaggle uses Docker to provide a fully functional environment for hosting data science competitions.
  • You could download the kaggle/python Docker image from GitHub and run it as an alternative to the standard Jupyter Stack for Data Science we have been using.
  • Kaggle has created an incredible resource for learning analytics. You can explore a number of toy examples for learning data science and also compete on real problems faced by top companies.
# Read the Kaggle files, keeping text columns as character vectors rather than factors
train <- read.csv('../../input/train.csv', stringsAsFactors = FALSE)
test  <- read.csv('../../input/test.csv', stringsAsFactors = FALSE)

Train and Test Sets on Kaggle

  • The train file contains a wide variety of information about each passenger that might help explain whether that passenger survived. It also records whether each passenger actually survived.
  • The test file contains all of the columns of the train file except the Survived column. Our goal is to predict whether the passengers in the test file survived.
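One quick way to confirm how the two files differ is to compare their column names. The following is a minimal sketch in base R; `train_demo` and `test_demo` are stand-in data frames holding a few columns from the first rows of each file, introduced here only so the example runs without the CSV files.

```r
# Stand-in data frames with a subset of the Titanic columns (illustrative only;
# in the notebook, `train` and `test` come from read.csv).
train_demo <- data.frame(
  PassengerId = c(1, 2, 3),
  Survived    = c(0, 1, 1),
  Pclass      = c(3, 1, 3),
  Sex         = c("male", "female", "female"),
  stringsAsFactors = FALSE
)
test_demo <- data.frame(
  PassengerId = c(892, 893, 894),
  Pclass      = c(3, 3, 2),
  Sex         = c("male", "female", "male"),
  stringsAsFactors = FALSE
)

# Columns present in train but absent from test: the target we must predict.
missing_in_test <- setdiff(names(train_demo), names(test_demo))
print(missing_in_test)
```

With the real files, `setdiff(names(train), names(test))` performs the same check.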
head(train)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |
head(test)

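| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | | Q |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | | S |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | | Q |
| 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | | S |
| 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | | S |
| 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.2250 | | S |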

Baseline Model: No Survivors

  • The Titanic problem is one of classification, and for such problems the simplest appropriate baseline is often to predict the same class (all 0 or all 1) for every row.
  • Even if you aren’t familiar with the history of the tragedy, the Wikipedia page quickly shows that the majority of people on board (68%) died.
  • As a result, our baseline model will predict no survivors.
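The choice of an all-zero baseline can also be checked against the training labels rather than taken from Wikipedia. The following is a minimal sketch, assuming the labels are coded 0/1 as in the Survived column; `class_shares` is a helper name introduced here, and the six labels are those of the rows shown by head(train), so the resulting shares are not representative of the full file.

```r
# Share of each class in a 0/1 label vector.
class_shares <- function(y) {
  prop.table(table(y))
}

# Survived values of the first six rows of train.csv (illustrative sample only).
survived_demo <- c(0, 1, 1, 1, 0, 0)

shares <- class_shares(survived_demo)
print(shares)

# The accuracy of the all-zero baseline equals the share of 0 labels.
baseline_accuracy <- mean(survived_demo == 0)
print(baseline_accuracy)
```

On the full training set, `mean(train$Survived == 0)` gives the accuracy that the no-survivors prediction would achieve on the training data.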
test["Survived"] <- 0

submission <- test[,c("PassengerId", "Survived")]

head(submission)

| PassengerId | Survived |
|---|---|
| 892 | 0 |
| 893 | 0 |
| 894 | 0 |
| 895 | 0 |
| 896 | 0 |
| 897 | 0 |
# Write the solution to file (row.names = FALSE keeps R's row numbers out of the CSV)
write.csv(submission, file = 'nosurvivors.csv', row.names = FALSE)
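Before uploading, it can help to read the file back and confirm its layout. The sketch below is one way to do that, assuming the submission should contain exactly the two columns PassengerId and Survived; `submission_demo` is a small stand-in data frame, and the file is written to a temporary path rather than to nosurvivors.csv.

```r
# Stand-in for the submission data frame built in the notebook.
submission_demo <- data.frame(
  PassengerId = c(892, 893, 894),
  Survived    = c(0, 0, 0)
)

# Write to a temporary file so the sketch leaves nothing behind.
path <- tempfile(fileext = ".csv")
write.csv(submission_demo, file = path, row.names = FALSE)

# Read the file back and inspect its header and row count.
check <- read.csv(path)
print(names(check))
print(nrow(check))

# Remove the temporary file.
unlink(path)
```

Without `row.names = FALSE`, write.csv would add a leading column of row numbers, so the file would no longer have just the two expected columns.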

The First Rule of Shipwrecks

  • You may have seen it in a movie or read it in a novel, but “women and children first” has at its roots something that could provide our first model.
  • Now let’s recode the prediction based on whether the passenger was a man or a woman.
  • We use conditionals to select the rows of interest (for example, where test$Sex == "male") and recode the appropriate column.
# We could store this in Survived, but that would overwrite our earlier prediction.
# Instead, let's store it in a new column, PredGender.

test[test$Sex == "male", "PredGender"] <- 0
test[test$Sex == "female", "PredGender"] <- 1

submission <- test[,c("PassengerId", "PredGender")]
# Rename the PredGender column to Survived, the name used in the submission file
names(submission)[2] <- "Survived"
head(submission)

| PassengerId | Survived |
|---|---|
| 892 | 0 |
| 893 | 1 |
| 894 | 0 |
| 895 | 0 |
| 896 | 1 |
| 897 | 0 |
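The row-selection recode used for PredGender can also be written in a single step with ifelse(). The sketch below compares the two forms on a stand-in data frame, `test_demo`, built from the Sex values of the first six test rows; the column name `PredGender2` is introduced here only for the comparison.

```r
# Stand-in for a few rows of test (Sex values from the first six test rows).
test_demo <- data.frame(
  PassengerId = 892:897,
  Sex = c("male", "female", "male", "male", "female", "male"),
  stringsAsFactors = FALSE
)

# Two-step recode with logical row selection, as in the notebook.
test_demo[test_demo$Sex == "male", "PredGender"] <- 0
test_demo[test_demo$Sex == "female", "PredGender"] <- 1

# One-step equivalent: ifelse() evaluates the condition for every row.
test_demo$PredGender2 <- ifelse(test_demo$Sex == "female", 1, 0)

print(test_demo)
print(all(test_demo$PredGender == test_demo$PredGender2))
```

The two forms differ in one respect: in the two-step version, any non-missing Sex value other than "male" or "female" leaves PredGender as NA, whereas ifelse() as written maps it to 0.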
# names() can rename any column by position; "new" here is only a demonstration
names(submission)[2] <- "new"
submission

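| PassengerId | new |
|---|---|
| 892 | 0 |
| 893 | 1 |
| 894 | 0 |
| 895 | 0 |
| 896 | 1 |
| 897 | 0 |
| 898 | 1 |
| 899 | 0 |
| 900 | 1 |
| 901 | 0 |
| 902 | 0 |
| 903 | 0 |
| 904 | 1 |
| 905 | 0 |
| 906 | 1 |
| 907 | 1 |
| 908 | 0 |
| 909 | 0 |
| 910 | 1 |
| 911 | 1 |
| 912 | 0 |
| 913 | 0 |
| 914 | 1 |
| 915 | 0 |
| 916 | 1 |
| 917 | 0 |
| 918 | 1 |
| 919 | 0 |
| 920 | 0 |
| 921 | 0 |
| ... | ... |
| 1280 | 0 |
| 1281 | 0 |
| 1282 | 0 |
| 1283 | 1 |
| 1284 | 0 |
| 1285 | 0 |
| 1286 | 0 |
| 1287 | 1 |
| 1288 | 0 |
| 1289 | 1 |
| 1290 | 0 |
| 1291 | 0 |
| 1292 | 1 |
| 1293 | 0 |
| 1294 | 1 |
| 1295 | 0 |
| 1296 | 0 |
| 1297 | 0 |
| 1298 | 0 |
| 1299 | 0 |
| 1300 | 1 |
| 1301 | 1 |
| 1302 | 1 |
| 1303 | 1 |
| 1304 | 1 |
| 1305 | 0 |
| 1306 | 1 |
| 1307 | 0 |
| 1308 | 0 |
| 1309 | 0 |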
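# Make sure the prediction column is named Survived before writing the file
names(submission)[2] <- "Survived"
write.csv(submission, file = 'womensurvive.csv', row.names = FALSE)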
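Because the test file has no Survived column, neither rule can be scored locally on test, but both can be scored on labeled rows from train. The following is a minimal sketch; `train_demo` holds only the Sex and Survived values of the six rows shown by head(train), and `accuracy` is a helper name introduced here, so the printed numbers say little about the full data set.

```r
# First six rows of train (Sex and Survived as shown by head(train)).
train_demo <- data.frame(
  Sex = c("male", "female", "female", "female", "male", "male"),
  Survived = c(0, 1, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Fraction of rows where the prediction equals the true label.
accuracy <- function(predicted, actual) {
  mean(predicted == actual)
}

pred_none   <- rep(0, nrow(train_demo))                  # rule 1: no survivors
pred_gender <- ifelse(train_demo$Sex == "female", 1, 0)  # rule 2: women survive

print(accuracy(pred_none, train_demo$Survived))
print(accuracy(pred_gender, train_demo$Survived))
```

Replacing `train_demo` with the full `train` data frame gives a more meaningful comparison of the two rules before submitting either file.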