Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

2. Practicum Problems It is suggested that a Jupyter/IPython notebook be used for the programmatic components. 2.1 Problem 1 Load the iris sample dataset from

2. Practicum Problems

It is suggested that a Jupyter/IPython notebook be used for the programmatic components.

2.1 Problem 1

Load the iris sample dataset from sklearn (load_iris()) into Python using a Pandas dataframe. Induce a set of binary Decision Trees with a minimum of 2 instances in the leaves, no splits of subsets below 5, and an maximal tree depth from 1 to 5 (you can leave the majority parameter to 95%). Which depth values result in the highest Recall? Why? Which value resulted in the lowest Precision? Why? Which value results in the best F1 score? Explain the difference between the micro/macro/weighted methods of score calculation.

2.2 Problem 2

Load the Breast Cancer Wisconsin (Diagnostic) sample dataset from the UCI Machine Learning Repository (The discrete version at: breast-cancerwisconsin.data) into Python using a Pandas dataframe. Induce a binary Decision Tree with a minimum of 2 instances in the leaves, no splits of subsets below 5, and a maximal tree depth of 2 (use the default Gini criterion). Calculate the Entropy, Gini, and Misclassification Error of the first split - what is the Information Gain? What is the feature selected for the first split, and what value determines the decision boundary?

2.3 Problem 3

Load the Breast Cancer Wisconsin (Diagnostic) sample dataset from the UCI Machine Learning Repository (The continuous version at: wdbc.data) into Python using a Pandas dataframe. Induce the same binary Decision Tree as above (now using the continuous data) but perform a PCA dimensionality reduction beforehand. Using only the first principal component of the data for a model fit, what is the F1, Precision, and Recall of the PCA-based single factor model compared to the original (continuous) data? Repeat using the first and second principal components. Using the Confusion Matrix, what are the values for FP and TP as well as FPR/TPR? Is using continuous data in this case beneficial within the model? How?

2.4 Problem 4

Simulate a binary classification dataset with a single feature using a mixture of normal distributions with NumPy (Hint: Generate two data frames with the random number and a class label, and combine them together). The normal distribution parameters (np.random.normal) should be (5,2) and (-5,2) for the pair of samples. Induce a binary Decision Tree of maximum depth 2, and obtain the threshold value for the feature in the first split. How does this value compare to the empirical distribution of the feature?

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image_2

Step: 3

blur-text-image_3

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Mastering Big Data Interview 751 Comprehensive Questions And Expert Answers

Authors: Mr Bhanu Pratap Mahato

1st Edition

B0CLNT3NVD, 979-8865047216

More Books

Students also viewed these Databases questions

Question

Compare the different types of employee separation actions.

Answered: 1 week ago

Question

Assess alternative dispute resolution methods.

Answered: 1 week ago

Question

Distinguish between intrinsic and extrinsic rewards.

Answered: 1 week ago