Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Predicting Survivors of the Titanic The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during

Predicting Survivors of the Titanic The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during its maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 out of 2,224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this problem, we develop a logistic regression model to predict which passengers survived from the tragedy. The dataset titanic.csv consists of 10 variables, described in Table 1. Variable Description Survival Survival, 0 = No, 1 = Yes Pclass Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd Name Name Sex male or female Age Age in years Sibsp Number of siblings and spouses aboard the Titanic ParCh Number of parents and children aboard the Titanic Ticket Ticket number Fare Passenger fare Embarked Port of Embarkation, C = Cherbourg, Q = Queenstown, and S = Southampton Table 1: Variables in the dataset Titanic.csv. (a) Which variable should be modeled as a dependent variable for the logistic regression model, and why? (b) In the dataset, you may see that Age values are missing. In general, one can fill the missing values with the average (or most common value for the case of categorical variables) of nonmissing values. In this problem, we will just remove the observations with missing Age values and build regression model on the refined dataset. In R, this can be done by using a function that identifies rows with missing values. Implement this and define a new dataset without missing values. 1 (c) There are some variables that should not be included as independent variables in your logistic regression model. Identify these variables and explain your reasoning. (d) The variable Pclass has three outcomes, 1, 2, and 3. Do you think it should be modeled as continuous variable, or categorical variable? Explain your reasoning. (e) Based on steps from (a) to (d), develop a logistic regression model and interpret your results. Using the regression coefficients, write the expression or probabilities P(Y = 1|X) where Y is your dependent variable and X is a set of independent variables (Age, Sex, Fare, etc.). Does it match with your intuition? Interpret your results. Note: If you want to treat Pclass as a categorical variable, then you need to run the following line before using glm() function: titanic$Pclass = as.factor(titanic$Pclass) (f) Use your logistic regression model to the test dataset Titanic test.csv and compute probabilities that an individual survived for each observation. Using five threshold values t = 0, 0.25, 0.5, 0.75, 1, draw the test set ROC curve. Draw m

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Joe Celkos Data And Databases Concepts In Practice

Authors: Joe Celko

1st Edition

1558604324, 978-1558604322

More Books

Students also viewed these Databases questions

Question

What do Dimensions represent in OLAP Cubes?

Answered: 1 week ago