Question
Predicting Survivors of the Titanic
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15,
1912, during its maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 out
of 2,224 passengers and crew. This sensational tragedy shocked the international community and
led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss
of life was that there were not enough lifeboats for the passengers and crew. Although there was
some element of luck involved in surviving the sinking, some groups of people were more likely to
survive than others, such as women, children, and the upper class.
In this problem, we develop a logistic regression model to predict which passengers survived
the tragedy. The dataset titanic.csv consists of 10 variables, described in Table 1.
Variable   Description
Survival   Survival, 0 = No, 1 = Yes
Pclass     Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
Name       Name
Sex        male or female
Age        Age in years
Sibsp      Number of siblings and spouses aboard the Titanic
ParCh      Number of parents and children aboard the Titanic
Ticket     Ticket number
Fare       Passenger fare
Embarked   Port of Embarkation, C = Cherbourg, Q = Queenstown, and S = Southampton
Table 1: Variables in the dataset Titanic.csv.
(a) Which variable should be modeled as a dependent variable for the logistic regression model,
and why?
(b) In the dataset, you may see that some Age values are missing. In general, one can fill the
missing values with the average (or the most common value, for categorical variables) of the
non-missing values. In this problem, we will simply remove the observations with missing Age
values and build the regression model on the refined dataset. In R, this can be done by using a
function that identifies rows with missing values. Implement this and define a new dataset
without missing values.
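For part (b), one possible approach is sketched below. It is only an illustration under stated assumptions: the small data frame is made up to stand in for titanic.csv, and the name titanic_clean is hypothetical. The base R function is.na() flags the missing entries, and negating it keeps the complete rows.

```r
# Toy data standing in for titanic.csv (values are made up for illustration).
titanic <- data.frame(
  Survival = c(1, 0, 1, 0, 1),
  Age      = c(22, NA, 38, NA, 4),
  Sex      = c("female", "male", "female", "male", "male")
)

# is.na() returns TRUE where Age is missing; keep only the rows where it is FALSE.
titanic_clean <- titanic[!is.na(titanic$Age), ]

nrow(titanic)        # number of rows before filtering
nrow(titanic_clean)  # number of rows after filtering
```

With the real file, the toy data frame would be replaced by something like read.csv("titanic.csv"). The function complete.cases() is an alternative when rows with a missing value in any column should be dropped.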
(c) There are some variables that should not be included as independent variables in your logistic
regression model. Identify these variables and explain your reasoning.
(d) The variable Pclass has three outcomes: 1, 2, and 3. Do you think it should be modeled as a
continuous variable or as a categorical variable? Explain your reasoning.
(e) Based on steps (a) to (d), develop a logistic regression model and interpret your results.
Using the regression coefficients, write the expression for the probabilities P(Y = 1|X), where Y
is your dependent variable and X is the set of independent variables (Age, Sex, Fare, etc.). Does
it match your intuition? Interpret your results.
Note: If you want to treat Pclass as a categorical variable, then you need to run the following
line before calling the glm() function: titanic$Pclass = as.factor(titanic$Pclass)
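For part (e), the sketch below shows how a fitted logistic model maps to P(Y = 1|X) = 1 / (1 + exp(-(b0 + b1*x1 + ... + bk*xk))). It is an illustration, not the solution: the data frame toy is simulated, the coefficients used to simulate it are arbitrary, and the choice of Age, Sex, and Pclass as predictors is an assumption made only for this example.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the cleaned Titanic data (not the real dataset).
toy <- data.frame(
  Age    = runif(n, 1, 70),
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = sample(1:3, n, replace = TRUE)
)

# Arbitrary "true" coefficients, used only to generate a 0/1 outcome.
lin <- 1.5 - 0.02 * toy$Age - 2 * (toy$Sex == "male") - 0.8 * (toy$Pclass - 1)
toy$Survival <- rbinom(n, 1, plogis(lin))

# Treat Pclass as categorical, as in the note above.
toy$Pclass <- as.factor(toy$Pclass)

fit  <- glm(Survival ~ Age + Sex + Pclass, data = toy, family = binomial)
beta <- coef(fit)

# Evaluate the logistic expression by hand from the estimated coefficients.
X        <- model.matrix(fit)
p_manual <- as.vector(1 / (1 + exp(-(X %*% beta))))

# Compare with the probabilities that predict() reports.
p_glm <- as.vector(predict(fit, type = "response"))
all.equal(p_manual, p_glm)
```

Calling summary(fit) lists each coefficient with its sign and significance. A negative coefficient lowers the predicted survival probability as that variable increases, and a positive one raises it.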
(f) Apply your logistic regression model to the test dataset Titanic test.csv and compute the
probability that each individual survived. Using the five threshold values t = 0, 0.25, 0.5,
0.75, 1, draw the test set ROC curve. Draw m
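For part (f), a five-point ROC curve can be sketched as follows. The labels y and probabilities p below are made up for illustration. In practice p would come from predict() on the test set with type = "response". Predicting "survived" when p >= t is an assumed convention, and using p > t instead would change the points at the boundary thresholds.

```r
# Made-up true labels and predicted probabilities (not from the Titanic data).
y <- c(1, 0, 1, 1, 0, 0, 1, 0)
p <- c(0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1)

thresholds <- c(0, 0.25, 0.5, 0.75, 1)

# For each threshold, classify and compute the false and true positive rates.
roc <- t(sapply(thresholds, function(thr) {
  pred <- as.numeric(p >= thr)  # assumed convention: predict 1 when p >= thr
  c(threshold = thr,
    FPR = sum(pred == 1 & y == 0) / sum(y == 0),
    TPR = sum(pred == 1 & y == 1) / sum(y == 1))
}))

# Order the points from (0, 0) to (1, 1) so the line is drawn left to right.
roc <- roc[order(roc[, "FPR"], roc[, "TPR"]), ]

plot(roc[, "FPR"], roc[, "TPR"], type = "b",
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier
```

Each threshold contributes one (FPR, TPR) point. The threshold t = 0 classifies everyone as a survivor and lands at (1, 1), while t = 1 lands at (0, 0) as long as no predicted probability is exactly 1.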