Question

Posted on Sep 25, 2024

In this question, we work with a dataset from the great textbook "An Introduction to Statistical Learning."

(A) Read the dataset file Hearts_s.csv and assign it to a Pandas DataFrame.

(B) Check out the dataset. As you will see, the dataset contains a number of features, including both contextual and biological factors (e.g. age, gender, vital signs). The last column, AHD, is the label: Yes means that a human subject has heart disease, and No means that the subject does not.

(C) There are at least 3 categorical features in the dataset (Gender, ChestPain, Thal). Let's ignore these categorical features for now: keep only the numerical features and build your feature matrix and label vector.

(D) Split the dataset into testing and training sets with the following parameters: test_size=0.25, random_state=4.

(E) Use KNN (with k=3), Decision Tree (with random_state=5), and Logistic Regression classifiers to predict heart disease based on the training/testing datasets that you built in part (D). Then check, compare, and report the accuracy of these 3 classifiers. Which one is the best? Which one is the worst?

(F) Now we want to use the categorical features as well. To this end, we have to perform a feature engineering process called OneHotEncoding for the categorical features. Each categorical feature is replaced with dummy columns in the feature table (one column for each possible value of that feature) and encoded in a binary manner, such that in every row exactly one of the dummy columns takes the value 1 and the rest take 0. For example, Gender can take two values, m and f. Thus, we need to replace this feature (in the feature table) with 2 columns titled m and f. Wherever we have a male subject, we put 1 and 0 in the columns m and f, respectively. Wherever we have a female subject, we put 0 and 1 in the columns m and f, respectively. (Hint: you will need 4 columns to encode ChestPain and 3 columns to encode Thal.)
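The encoding described in part (F) can be sketched with `pandas.get_dummies`, which is one possible tool (scikit-learn's `OneHotEncoder` is another). The toy rows and category labels below are made up for illustration and are not taken from Hearts_s.csv; the column names Gender, ChestPain, Thal, and AHD follow the question, and the helper name `one_hot_features` is hypothetical. Note that `get_dummies` prefixes each dummy column with the feature name (for example `Gender_m` and `Gender_f`) rather than naming the columns plain `m` and `f`.

```python
import pandas as pd

CATEGORICAL = ["Gender", "ChestPain", "Thal"]  # categorical features named in part (C)


def one_hot_features(df, label="AHD"):
    """Return (X, y) with every categorical column replaced by 0/1 dummy columns."""
    features = df.drop(columns=[label])
    # dtype=int yields 0/1 integers instead of the boolean default in recent pandas
    X = pd.get_dummies(features, columns=CATEGORICAL, dtype=int)
    return X, df[label]


# Toy rows with made-up values, only to show the shape of the result.
toy = pd.DataFrame({
    "Age": [63, 41, 57, 52],
    "Gender": ["m", "f", "m", "f"],
    "ChestPain": ["typical", "nontypical", "nonanginal", "asymptomatic"],
    "Thal": ["fixed", "normal", "reversable", "normal"],
    "AHD": ["No", "No", "Yes", "Yes"],
})

X_toy, y_toy = one_hot_features(toy)
print(X_toy.columns.tolist())            # Age, then 2 + 4 + 3 dummy columns
print(X_toy[["Gender_f", "Gender_m"]])   # exactly one 1 per row
```

On the real file, the same helper would be called as `one_hot_features(pd.read_csv("Hearts_s.csv"))`, assuming the file sits in the working directory.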

(G) Repeat parts (D) and (E) with the new dataset that you built in part (F). How does the prediction accuracy change for each method?
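One way to set up the comparison between the numeric-only features and the one-hot encoded features is sketched below with scikit-learn. The helper names (`make_classifiers`, `holdout_accuracy`, `compare_feature_sets`) are hypothetical, and `max_iter=5000` is an added assumption to give Logistic Regression room to converge on unscaled features; the question does not specify it. The sketch also assumes the file has no missing values and that every feature other than Gender, ChestPain, and Thal is numeric.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

CATEGORICAL = ["Gender", "ChestPain", "Thal"]  # categorical features named in part (C)


def make_classifiers():
    """The three classifiers from part (E), with the parameters the question gives."""
    return {
        "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
        "Decision Tree": DecisionTreeClassifier(random_state=5),
        # max_iter is an assumption, not part of the question
        "Logistic Regression": LogisticRegression(max_iter=5000),
    }


def holdout_accuracy(X, y):
    """Part (D) split, then test-set accuracy of each classifier."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=4
    )
    return {
        name: clf.fit(X_train, y_train).score(X_test, y_test)
        for name, clf in make_classifiers().items()
    }


def compare_feature_sets(df, label="AHD"):
    """Accuracy table: rows are classifiers, columns are the two feature sets."""
    y = df[label]
    features = df.drop(columns=[label])
    X_numeric = features.drop(columns=CATEGORICAL)                        # part (C)
    X_encoded = pd.get_dummies(features, columns=CATEGORICAL, dtype=int)  # part (F)
    return pd.DataFrame({
        "numeric only": holdout_accuracy(X_numeric, y),
        "one-hot encoded": holdout_accuracy(X_encoded, y),
    })


# Usage (assumes Hearts_s.csv is in the working directory):
# df = pd.read_csv("Hearts_s.csv")
# print(compare_feature_sets(df))
```

Because both feature matrices have the same rows in the same order and the split uses the same random_state, the two columns of the table are computed on the same training and testing subjects, so any difference in accuracy comes from the added features.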

(H) Now, repeat part (E) with the new dataset that you built in part (F), but this time using cross-validation. That is, rather than splitting the dataset into testing and training sets, use 10-fold cross-validation (as we learned in Lab4) to evaluate the classification methods and report the final prediction accuracy.
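A minimal sketch of the 10-fold evaluation, assuming scikit-learn's `cross_val_score` is an acceptable tool (the question does not say what Lab4 used). The helper name `cv_accuracy` is hypothetical, `max_iter=5000` is an added assumption for Logistic Regression, and the feature matrix `X` is expected to be fully numeric already, for example the one-hot encoded table from part (F).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def cv_accuracy(X, y, folds=10):
    """Mean accuracy over `folds` cross-validation folds for each classifier."""
    classifiers = {
        "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
        "Decision Tree": DecisionTreeClassifier(random_state=5),
        # max_iter is an assumption, not part of the question
        "Logistic Regression": LogisticRegression(max_iter=5000),
    }
    return {
        name: cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()
        for name, clf in classifiers.items()
    }


# Usage (assumes Hearts_s.csv is in the working directory, with no missing values):
# df = pd.read_csv("Hearts_s.csv")
# X = pd.get_dummies(df.drop(columns=["AHD"]),
#                    columns=["Gender", "ChestPain", "Thal"], dtype=int)
# print(cv_accuracy(X, df["AHD"]))
```

With an integer `cv` and a classifier, `cross_val_score` uses stratified folds without shuffling, so each fold keeps roughly the same Yes/No ratio as the full dataset. The reported number is the average of the 10 per-fold accuracies.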

You can download Hearts_s.csv from this link: https://drive.google.com/open?id=1OWOR-qbyBhHc-Mr6tq5rbxGhINLYNhDm