
Question

Background
Cirrhosis results from prolonged liver damage, leading to extensive scarring, often due to conditions like hepatitis or chronic alcohol consumption. The data provided is a subset sourced from a Mayo Clinic study on primary biliary cirrhosis (PBC) of the liver carried out from 1974 to 1984.
This dataset is intended for developing and validating machine learning algorithms that predict the survival status of the patients. It contains 312 patients (224 for training and 88 for testing), each with 17 collected features. The aim of this task is to use these 17 clinical features to predict the survival state of patients with liver cirrhosis. The survival states are 0 = D (death), 1 = C (censored), and 2 = CL (censored due to liver transplantation).
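As a small illustration of the status coding above, the following sketch maps the integer codes to their labels and computes the share of each class. The helper name `class_shares` and the example codes are made up for illustration; they are not taken from the real data.

```python
from collections import Counter

# Status codes as listed in the task description.
STATUS_LABELS = {0: "D", 1: "C", 2: "CL"}


def class_shares(codes):
    """Return {label: share of samples} for a sequence of integer status codes."""
    counts = Counter(STATUS_LABELS[c] for c in codes)
    total = sum(counts.values())
    return {label: counts.get(label, 0) / total for label in STATUS_LABELS.values()}


# Made-up codes for illustration only; not the real label distribution.
example_codes = [0, 1, 1, 2, 1, 0, 1, 1]
shares = class_shares(example_codes)
print(shares)
```

Shares that differ widely between classes would suggest an imbalanced label set.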
Specifically, the problems you are going to solve are:
Can you accurately predict the survival status given the labelled data?
Can you explain your predictions and the associated findings well? For example, identify the key factors that are strongly associated with the response variable, i.e., survival status.
Data set
The training data contains 224 rows and the test data contains 88 rows, each with 19 columns (excluding the ID column): the N_Days attribute is the number of days between registration and the earliest of death, transplantation, or study analysis time in July 1986; the Status attribute is the target variable to predict; and the remaining 17 columns can be used as input features. Details of the original data set can be found in, and downloaded from, the original UCI repository. The values of the Status column in the test set are left empty to simulate real-world predictions.
Evidence of Learning:
Run your code in a Jupyter notebook (.ipynb file) and keep the output, write a report (.pdf file) answering the following questions, and submit your code and report to OnTrack.
1. Load and explore the training and test datasets, and do the necessary pre-processing.
   a. Show the size of both the training and the test dataset.
   b. Based on the training and test data, show the feature types and indicate which features have missing values.
   c. Use an appropriate method to deal with the missing values in both the training and the test set.
   d. Do the necessary encoding for the categorical features.
   e. Show the label distribution of the training data. Is it a balanced training set?
2. Based on the pre-processed training data from question 1, create three supervised machine learning (ML) models for predicting Status.
   a. Use an appropriate validation method and report the performance score using a suitable metric. Could the presented result be underfitted or overfitted? Justify.
   b. Justify the design decisions for each ML model used to answer this question.
   c. Have you optimised any hyper-parameters for each ML model? What are they? Why have you done that? Explain.
   d. What can you do about the label imbalance issue?
   e. Finally, make a model recommendation based on the reported results and justify it.
3. Use the best model from question 2 to make predictions on the pre-processed test set. Save your predictions (the file should contain two columns only: testID and Status), submit them to the specified Kaggle in-class platform, take a screenshot of your model performance, and report it.
Please answer all of question 1; I will then use it for questions 2 and 3.
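One possible way to approach the steps of question 1 with pandas is sketched below. It is a minimal sketch under stated assumptions, not a verified solution: the two tiny frames are made-up stand-ins for the real files (which would normally be loaded with `pd.read_csv`), the feature names `Age` and `Sex` are illustrative, and median/most-frequent imputation and one-hot encoding are just one reasonable choice among several.

```python
import pandas as pd

# Made-up stand-ins for the real training and test files (assumption).
train = pd.DataFrame({
    "Status": [0, 1, 1, 2, 1, 0],
    "Age": [50.0, 61.0, None, 45.0, 58.0, 70.0],
    "Sex": ["F", "F", "M", None, "F", "F"],
})
test = pd.DataFrame({
    "Status": [None, None],  # left empty in the test set
    "Age": [None, 66.0],
    "Sex": ["M", None],
})

y_train = train["Status"]
X_train = train.drop(columns=["Status"])
X_test = test.drop(columns=["Status"])


def missing_report(X_train, X_test):
    """Parts a and b: feature types and missing-value counts per column."""
    return pd.DataFrame({
        "dtype": X_train.dtypes.astype(str),
        "missing_train": X_train.isna().sum(),
        "missing_test": X_test.isna().sum(),
    })


def impute(X_train, X_test):
    """Part c: median for numeric columns, most frequent value otherwise.
    Fill values come from the training set only and are reused for test."""
    fill = {}
    for col in X_train.columns:
        if pd.api.types.is_numeric_dtype(X_train[col]):
            fill[col] = X_train[col].median()
        else:
            fill[col] = X_train[col].mode().iloc[0]
    return X_train.fillna(fill), X_test.fillna(fill)


def encode(X_train, X_test):
    """Part d: one-hot encode non-numeric columns, aligning test to train."""
    cat_cols = X_train.select_dtypes(exclude="number").columns.tolist()
    X_train_enc = pd.get_dummies(X_train, columns=cat_cols, dtype=int)
    X_test_enc = pd.get_dummies(X_test, columns=cat_cols, dtype=int)
    X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)
    return X_train_enc, X_test_enc


print("train shape:", train.shape, "| test shape:", test.shape)  # part a
report = missing_report(X_train, X_test)                         # part b
print(report)
X_train_imp, X_test_imp = impute(X_train, X_test)                # part c
X_train_enc, X_test_enc = encode(X_train_imp, X_test_imp)        # part d
dist = y_train.value_counts(normalize=True).sort_index()         # part e
print(dist)
```

Computing the fill values on the training set alone and reusing them for the test set keeps test information out of the pre-processing; reindexing the encoded test frame to the training columns guards against categories that appear in only one of the two sets.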
