
Question

1. Imbalanced data refers to a classification problem where the number of observations per class is not equally distributed. In this question we subsample the IMDB data to create imbalanced data. To subsample the data, keep all the negative observations, but keep only the first 4000 (out of the 12500) positive observations. Do this separately for both train and test. We end up with 12500 + 4000 = 16500 observations each for training and testing. Use the p = 2500 most frequent words as predictors.

(a) Fit a LASSO logistic regression model to the training data, and use 10-fold cross-validation with the AUC as the measure of error to tune λ. For the optimal λ, answer the following questions.

i. What are the top 5 words associated with positive reviews? (2 points)
ii. What are the top 5 words associated with negative reviews? (2 points)
iii. In the training set, for each observation, using logistic regression, calculate Pr[y = 1 | X = x] for the model fitted using the λ found from 10-fold CV. For a sequence of thresholds θ = 0, 0.01, 0.02, 0.03, ..., 1, calculate the TPR and FPR, and using these plot the ROC curve and calculate the AUC. Note that to calculate the AUC you need the area under the ROC curve. Repeat the same for the test set. Plot the ROC for the train and the test on the same graph. Also report the train and test AUC in the graph. In other words, one figure should show the ROC of the train and test, and the values of the AUC. Use color coding and make sure to label the horizontal and vertical axes. (2 points)
iv. For θ = 0.5, what are the type I and type II errors? (2 points)
v. For what θ is the type I error equal (as much as possible) to the type II error? (2 points)

(b) Fit a ridge logistic regression model to the training data and use 10-fold cross-validation with the AUC as the measure of error to tune λ. For the optimal λ, answer the same questions (i., ii., ...) as asked for LASSO. (8 points)

(c) Fit an elastic net logistic regression model to the training data and use 10-fold cross-validation with the AUC as the measure of error to tune λ. For the optimal λ, answer the same questions (i., ii., ...) as asked for LASSO. (8 points)
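
The threshold-based parts of the question (the ROC curve, the AUC, and the type I and type II errors) can be sketched in plain Python as follows. This is a minimal illustration, not the expected solution. The function names, the toy labels and the toy scores are hypothetical. It assumes an observation is predicted positive when its fitted probability is strictly greater than the threshold. It also assumes that the type I error means the false positive rate and the type II error means the false negative rate (1 - TPR), a convention the question does not spell out. In practice the scores would be the fitted probabilities from the LASSO, ridge or elastic net model.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold.

    Assumption: predict class 1 when score > threshold.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s > t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s > t)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def error_rates(y_true, scores, threshold):
    """Return (type I, type II) = (FPR, FNR) at one threshold."""
    fpr, tpr = roc_points(y_true, scores, [threshold])[0]
    return fpr, 1.0 - tpr


def balanced_threshold(y_true, scores, thresholds):
    """Threshold on the grid where type I and type II errors are closest."""
    def gap(t):
        type1, type2 = error_rates(y_true, scores, t)
        return abs(type1 - type2)
    return min(thresholds, key=gap)


# Threshold grid 0, 0.01, ..., 1 as in the question.
grid = [i / 100 for i in range(101)]

# Hypothetical toy data standing in for labels and fitted probabilities.
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]

toy_auc = auc_trapezoid(roc_points(y_toy, p_toy, grid))
toy_type1, toy_type2 = error_rates(y_toy, p_toy, 0.5)
toy_best = balanced_threshold(y_toy, p_toy, grid)
```

Running the same functions on the train and test probabilities gives the two sets of ROC points to draw on one figure, along with the two AUC values to report in it. Because the grid has only 101 thresholds, the trapezoidal AUC is an approximation whenever several distinct scores fall between two neighboring thresholds.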
