Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Date Set Description: We'll be using the mammographic masses public dataset from the UCI repository. This data contains 961 instances of masses detected in mammograms,

Date Set Description:

We'll be using the "mammographic masses" public dataset from the UCI repository. This data contains 961 instances of masses detected in mammograms, and contains the following attributes:

1. BI-RADS assessment: 1 to 5 (ordinal)

2. Age: patient's age in years (integer)

3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)

4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)

5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)

6. Severity: benign=0 or malignant=1 (binominal) (The dependent variable)

BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

Note: You may use pandas, numpy, sklearn or any essential python package you need.

[1.5 points] Read the given mammographic_masses.data.txt data set into masses_data

The output should be like this:

[1.5 points] Provide a plot to show the counts of both classes in severity dependent variable.

The output should be like this:

Question 2: [2 points]

For data preprocessing:

[1 points] list all rows that contain missing values in the data set:

The output should be like this:

[1 points] Impute the missing values in the data set using the median value for each column containing missing value.

[.5 points] Separate the columns of messes_data into X and y, where X contains the columns BI-RADS, age, shape, margin, and density and y contain severity only.

[.5 points] Use train_test_split command to prepare X_train, X_test, y_train, and y_test.

Use logistic regression algorithm to model the prepared training data.

Use logistic regression algorithm to predict the test data.

Use the sklearn accuracy_score to find the accuracy of your developed model on testing data.

Print the obtained accuracy from previous question.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Students also viewed these Databases questions

Question

What is a group and how do groups form?

Answered: 1 week ago

Question

Compare the different types of employee separation actions.

Answered: 1 week ago

Question

Assess alternative dispute resolution methods.

Answered: 1 week ago

Question

Distinguish between intrinsic and extrinsic rewards.

Answered: 1 week ago