Question
Background
Cirrhosis results from prolonged liver damage, leading to extensive scarring, often due to conditions like hepatitis or chronic alcohol consumption. The data provided is a subset sourced from a Mayo Clinic study on primary biliary cirrhosis (PBC) of the liver carried out from to
This dataset is intended for developing and validating machine learning algorithms that predict the survival status of the patients. There are patients in the data set for train and for test and each patient has collected features. The aim of this task is to use clinical features to predict the survival state of patients with liver cirrhosis. The survival states are D (death), C (censored), and CL (censored due to liver transplantation).
Specifically, the problem you are going to solve is: Can you
Accurately predict the survival status given the labelled data?
Clearly explain your predictions and the associated findings? For example, identify the key factors that are strongly associated with the response variable, i.e., survival status.
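One common way to identify key factors is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below is only an illustration under stated assumptions. It uses synthetic data from `make_classification` in place of the cirrhosis data, a random forest as an arbitrary model choice, and hypothetical feature names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): with shuffle=False the
# first three columns are informative and the last three are pure noise.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Hold out a stratified validation split so importance is measured on unseen rows.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the mean drop in accuracy.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

# Rank features from most to least important.
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Permutation importance reflects what the fitted model relies on, so it indicates association with the predicted status rather than a causal effect.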
Data set
The training data contains rows and the test data contains rows, each of which has columns excluding the ID column: the NDays attribute is the number of days between registration and the earlier of death, transplantation, or study analysis time in July; the status attribute is the target variable that we will predict; and the remaining columns can be used as input features. The details of the original data set can be found in, and downloaded from, the original UCI repository. The values of the status column in the test set are left empty to simulate real-world predictions.
Evidence of Learning:
Run your code in a Jupyter notebook (.ipynb file) and keep the output, write a report (PDF file) answering the following questions, and submit your code and report to OnTrack.
Load and explore the training and test datasets, and do the necessary preprocessing.
a. Show the size of both the training and test datasets.
b. Based on the training and test data, show the feature types and indicate which features have missing values.
c. Use an appropriate method to deal with the missing values in both the training and test sets.
d. Apply the necessary encoding to the categorical features.
e. Show the label distribution of the training data. Is it a balanced training set?
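The preprocessing steps above can be sketched with pandas as follows. This is a minimal sketch, not the expected solution: the toy rows are made up for illustration, the `Sex` and `Bilirubin` columns are assumed to exist, and median/mode imputation with one-hot encoding is just one reasonable choice. With the real files, the two frames would come from `pd.read_csv` on whatever paths the assignment provides.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and missing-value count of every column."""
    return pd.DataFrame({"dtype": df.dtypes.astype(str), "n_missing": df.isna().sum()})


def preprocess(train, test, target="Status", id_col="ID"):
    """Impute and one-hot encode, using statistics from the training set only."""
    X_train = train.drop(columns=[target, id_col], errors="ignore")
    X_test = test.drop(columns=[target, id_col], errors="ignore")
    y_train = train[target]

    num_cols = X_train.select_dtypes(include="number").columns
    cat_cols = X_train.columns.difference(num_cols)

    # Median for numeric columns, most frequent value for categorical ones.
    medians = X_train[num_cols].median()
    X_train, X_test = X_train.fillna(medians), X_test.fillna(medians)
    if len(cat_cols) > 0:
        modes = X_train[cat_cols].mode().iloc[0]
        X_train, X_test = X_train.fillna(modes), X_test.fillna(modes)

    # Encode the combined frame so both sets end up with identical columns.
    combined = pd.concat([X_train, X_test], keys=["train", "test"])
    encoded = pd.get_dummies(combined, columns=list(cat_cols), dtype=int)
    return encoded.loc["train"], encoded.loc["test"], y_train


# Toy rows invented for illustration; they are NOT taken from the real data.
train = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "NDays": [400, 4500, 1012, 1925, 1504, 2503],
    "Sex": ["F", "F", "M", None, "F", "F"],
    "Bilirubin": [14.5, 1.1, np.nan, 1.8, 3.4, 0.8],
    "Status": ["D", "C", "D", "D", "CL", "C"],
})
test = pd.DataFrame({
    "ID": [7, 8],
    "NDays": [3000, 1200],
    "Sex": ["M", "F"],
    "Bilirubin": [np.nan, 2.0],
    "Status": [np.nan, np.nan],  # left empty, as in the assignment
})

print(train.shape, test.shape)              # a: dataset sizes
print(summarize(train))                     # b: feature types and missing values
X_train, X_test, y_train = preprocess(train, test)  # c and d
label_share = y_train.value_counts(normalize=True)  # e: label distribution
print(label_share)
```

Fitting the imputation statistics on the training set and reusing them on the test set avoids leaking test information into preprocessing.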
Based on the preprocessed training data from question create three supervised machine learning (ML) models for predicting Status.
a. Use an appropriate validation method and report the performance score using a suitable metric. Could the reported result be underfitted or overfitted? Justify your answer.
b. Justify the design decisions made for each ML model used to answer this question.
c. Have you optimised any hyperparameters for each ML model? Which ones, and why? Explain.
d. What can you do about the label imbalance issue?
e. Finally, recommend a model based on the reported results and justify your recommendation.
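A possible starting point for these parts is sketched below with scikit-learn. Everything specific here is an assumption made for illustration: the synthetic imbalanced data stands in for the preprocessed training set, the three model families and their small grids are arbitrary, and macro-averaged F1 is only one metric that suits imbalanced classes (the metric used by the Kaggle platform is not stated in the brief).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class, imbalanced stand-in for the real training data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=0,
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each candidate is (estimator, hyperparameter grid). class_weight="balanced"
# is one way to counter label imbalance for the models that support it.
candidates = {
    "logreg": (
        make_pipeline(
            StandardScaler(),
            LogisticRegression(max_iter=1000, class_weight="balanced"),
        ),
        {"logisticregression__C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(class_weight="balanced", random_state=0),
        {"max_depth": [3, None], "n_estimators": [100]},
    ),
    "grad_boost": (
        GradientBoostingClassifier(random_state=0),
        {"learning_rate": [0.05, 0.1], "n_estimators": [50]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="f1_macro", cv=cv, return_train_score=True)
    search.fit(X, y)
    best = search.best_index_
    results[name] = {
        "best_params": search.best_params_,
        "cv_score": search.best_score_,
        "train_score": search.cv_results_["mean_train_score"][best],
    }

for name, row in results.items():
    print(name, row)
```

Comparing `train_score` with `cv_score` gives a rough overfitting check: a large gap suggests overfitting, while two low scores suggest underfitting.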
Use the best model that you get from question to make predictions on the preprocessed test set. Save your predictions (the file should contain two columns only: testID and Status), submit the file to the specified Kaggle in-class platform, and include a screenshot of your model performance in the report.
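Writing the two-column prediction file could look like the sketch below. The helper name `make_submission` and the example IDs and labels are hypothetical; only the column names testID and Status come from the task description.

```python
import io

import pandas as pd


def make_submission(test_ids, predictions, path_or_buffer):
    """Write a CSV with exactly two columns: testID and Status."""
    submission = pd.DataFrame({"testID": list(test_ids), "Status": list(predictions)})
    submission.to_csv(path_or_buffer, index=False)
    return submission


# Demonstration with made-up IDs and labels, written to an in-memory buffer.
buffer = io.StringIO()
make_submission([101, 102], ["C", "D"], buffer)
print(buffer.getvalue())
```

Passing a file path such as `"submission.csv"` instead of the buffer writes the file to disk; `index=False` keeps the pandas row index out of the file so that only the two required columns appear.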
Please answer all of question I'll then use it for question and