Question
You are applying for a position on the data science team of the USDA and are given data for determining the appropriate parasite treatment of canines. The suggested treatment options are based on a logistic regression model that predicts whether the canine is infected with a parasite. The data is on this site: https://data.world/ehales/grls-parasite-study/workspace/file?filename=CBC_data.csv and, more specifically, in the CBC_data.csv file. Log in using your University Google account to access the data and the description, which includes a paper on the study (you don't need to read the paper to solve this problem). Your target variable column is titled parasite_status.
Question 1 - Feature Engineering (5 points) In this step you outline the following as potential features (this is a limited example - we can have many features, as in your programming exercise below). Write the posterior probability expressions for logistic regression for the problem you are given to solve.
$p(y=1 \mid \mathbf{x}, \mathbf{w}) =$
$p(y=0 \mid \mathbf{x}, \mathbf{w}) =$
Question 2 - Decision Boundary (5 points) Write the expression for the decision boundary assuming that $p(y=1) = p(y=0)$. The decision boundary is the line that separates the two classes.
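As a hedged sketch of how Questions 1 and 2 fit together: assume a feature vector $\mathbf{x}$ (with a constant 1 appended so the bias is absorbed), a parameter vector $\mathbf{w}$, and the logistic function $\sigma$. These symbols are assumptions made for illustration, since the original notation did not survive. Under them, the posteriors and the equal-prior decision boundary can be written as:

$$
\begin{aligned}
p(y=1 \mid \mathbf{x}, \mathbf{w}) &= \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}} \\
p(y=0 \mid \mathbf{x}, \mathbf{w}) &= 1 - \sigma(\mathbf{w}^\top \mathbf{x}) \\
p(y=1 \mid \mathbf{x}, \mathbf{w}) = p(y=0 \mid \mathbf{x}, \mathbf{w})
&\iff \sigma(\mathbf{w}^\top \mathbf{x}) = \tfrac{1}{2}
\iff \mathbf{w}^\top \mathbf{x} = 0
\end{aligned}
$$

With two features this is a straight line in the feature plane; with more features it is a hyperplane.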
Question 3 - Loss function (5 points) Write the expression of the loss as a function of $\mathbf{w}$ that makes sense for you to use in this problem. NOTE: The loss will be a function that includes this function: $\sigma(a) = \frac{1}{1 + e^{-a}}$
$L =$
Question 4 - Gradient (5 points) Write the expression of the gradient of the loss with respect to the parameters - show all your work.
$\nabla_{\mathbf{w}} L =$
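One common choice for Questions 3 and 4 is the binary cross-entropy loss; the derivation below is a sketch under that assumption, not necessarily the intended answer. The notation is also assumed: $N$ training examples $(\mathbf{x}_i, y_i)$ with $y_i \in \{0, 1\}$, predictions $\hat{y}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$, and the identity $\sigma'(a) = \sigma(a)\,(1 - \sigma(a))$.

$$
\begin{aligned}
L(\mathbf{w}) &= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln (1 - \hat{y}_i) \right] \\
\nabla_{\mathbf{w}} L
&= -\frac{1}{N} \sum_{i=1}^{N} \left[ \frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i} \right] \hat{y}_i (1 - \hat{y}_i)\, \mathbf{x}_i \\
&= -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i (1 - \hat{y}_i) - (1 - y_i)\, \hat{y}_i \right] \mathbf{x}_i \\
&= \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)\, \mathbf{x}_i
\end{aligned}
$$

The final form is the prediction error weighted by the input, which is what a gradient-descent update would use.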
Question 5 - Imbalanced dataset (10 points) You are now told that in the dataset $p(y=0) \gg p(y=1)$. Can you comment on whether the accuracy of logistic regression will be affected by such an imbalance?
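To make the imbalance concern concrete, the following minimal sketch scores a trivial classifier that always predicts the majority class. The 950/50 label counts are hypothetical numbers chosen for illustration and are not taken from CBC_data.csv.

```python
# Hypothetical label counts (assumption): 950 uninfected, 50 infected.
y_true = [0] * 950 + [1] * 50
y_pred = [0] * len(y_true)  # trivial classifier: always predict "not infected"

# Confusion-matrix counts.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, recall, f1)  # prints: 0.95 0.0 0.0
```

The accuracy looks high even though no infected canine is ever detected, which is why a metric built from precision and recall is more informative than accuracy when one class dominates.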
Question 6 - SGD (15 points) The interviewer was impressed with your answers and wants to test your programming skills. Use the dataset to train a logistic regressor that will predict the target variable $y$. Report the harmonic mean of precision ($p$) and recall ($r$), i.e. the metric called the $F_1$ score, calculated as shown below, using a test dataset that is 20% of each group. Plot the $F_1$ score vs. the iteration number.
$F_1 = \frac{2}{\frac{1}{p} + \frac{1}{r}}$
Your code should include hyperparameter optimization of the learning rate and mini-batch size.
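The workflow in Question 6 can be sketched as follows. This is a minimal illustration, not a reference solution: it uses NumPy only, trains on synthetic data standing in for CBC_data.csv (the real features and preprocessing are not shown here), and every helper name (`stratified_split`, `train_sgd`, `f1_score`) is hypothetical. For brevity the grid search is scored on the held-out set; a separate validation split would be the safer choice in practice.

```python
import numpy as np

def sigmoid(a):
    # Logistic function; clipping the argument avoids overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

def f1_score(y_true, y_pred):
    # F1 = 2 / (1/p + 1/r) = 2*TP / (2*TP + FP + FN); 0 if undefined.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def stratified_split(y, test_frac, rng):
    # Hold out test_frac of each class so both groups reach the test set.
    test_parts = []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        test_parts.append(idx[: int(round(test_frac * len(idx)))])
    test_idx = np.concatenate(test_parts)
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[test_idx] = False
    return np.flatnonzero(train_mask), test_idx

def train_sgd(X_tr, y_tr, X_te, y_te, lr, batch_size, n_iters, rng):
    # Mini-batch SGD on the mean cross-entropy; logs test F1 per iteration.
    w = np.zeros(X_tr.shape[1])
    history = []
    for _ in range(n_iters):
        batch = rng.choice(len(y_tr), size=batch_size, replace=False)
        err = sigmoid(X_tr[batch] @ w) - y_tr[batch]   # (y_hat - y)
        w -= lr * X_tr[batch].T @ err / batch_size     # gradient step
        y_hat = (sigmoid(X_te @ w) >= 0.5).astype(int)
        history.append(f1_score(y_te, y_hat))
    return w, history

# Synthetic stand-in data (assumption): 3 features plus a bias column.
rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.5, 1.0])
y = (rng.random(n) < sigmoid(X @ true_w - 1.0)).astype(int)
X = np.hstack([np.ones((n, 1)), X])

train_idx, test_idx = stratified_split(y, 0.2, rng)
X_tr, y_tr = X[train_idx], y[train_idx]
X_te, y_te = X[test_idx], y[test_idx]

# Small grid over learning rate and mini-batch size.
results = {}
for lr in (0.01, 0.1, 0.5):
    for batch_size in (16, 64):
        _, hist = train_sgd(X_tr, y_tr, X_te, y_te, lr, batch_size,
                            300, np.random.default_rng(1))
        results[(lr, batch_size)] = hist

best = max(results, key=lambda k: results[k][-1])
print("best (lr, batch_size):", best, "final F1:", round(results[best][-1], 3))
```

Each entry of `results` is a list of $F_1$ values indexed by iteration, so the requested plot can be produced by passing, for example, `results[best]` to matplotlib's `plt.plot`.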