Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Predicting Survivors of the Titanic The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during

Predicting Survivors of the Titanic

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15,

1912, during its maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 out

of 2,224 passengers and crew. This sensational tragedy shocked the international community and

led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss

of life was that there were not enough lifeboats for the passengers and crew. Although there was

some element of luck involved in surviving the sinking, some groups of people were more likely to

survive than others, such as women, children, and the upper-class.

In this problem, we develop a logistic regression model to predict which passengers survived

from the tragedy. The dataset titanic.csv consists of 10 variables, described in Table 1.

Variable Description

Survival

Survival, 0 = No, 1 = Yes

Pclass

Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd

Name

Name

Sex

male or female

Age

Age in years

Sibsp

Number of siblings and spouses aboard the Titanic

ParCh

Number of parents and children aboard the Titanic

Ticket

Ticket number

Fare

Passenger fare

Embarked

Port of Embarkation, C = Cherbourg, Q = Queenstown, and S = Southampton

Table 1: Variables in the dataset Titanic.csv.

(a) Which variable should be modeled as a dependent variable for the logistic regression model,

and why?

(b) In the dataset, you may see that Age values are missing. In general, one can fifill the missing

values with the average (or most common value for the case of categorical variables) of non

missing values. In this problem, we will just remove the observations with missing Age values

and build regression model on the refifined dataset. In R, this can be done by using a function

that identififies rows with missing values. Implement this and defifine a new dataset without

missing values.

1(c) There are some variables that should not be included as independent variables in your logistic

regression model. Identify these variables and explain your reasoning.

(d) The variable Pclass has three outcomes, 1, 2, and 3. Do you think it should be modeled as

continuous variable, or categorical variable? Explain your reasoning.

(e) Based on steps from (a) to (d), develop a logistic regression model and interpret your results.

Using the regression coeffiffifficients, write the expression or probabilities P(Y = 1|X) where Y is

your dependent variable and X is a set of independent variables (Age, Sex, Fare, etc.). Does

it match with your intuition? Interpret your results.

Note: If you want to treat Pclass as a categorical variable, then you need to run the following

line before using glm() function: titanic$Pclass = as.factor(titanic$Pclass)

(f) Use your logistic regression model to the test dataset Titanic test.csv and compute prob

abilities that an individual survived for each observation. Using fifive threshold values t =

0, 0.25, 0.5, 0.75, 1, draw the test set ROC curve. Draw m

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Beginning Databases With PostgreSQL From Novice To Professional

Authors: Richard Stones, Neil Matthew

2nd Edition

1590594789, 978-1590594780

More Books

Students also viewed these Databases questions

Question

How many Tables Will Base HCMSs typically have? Why?

Answered: 1 week ago

Question

What is the process of normalization?

Answered: 1 week ago