Load the data into a DataFrame
import pandas as pd
df = pd.read_csv('seeds.csv')
## Correlation Analysis
Calculate the correlation values
Reduce the Nf x Nf correlation DataFrame (where Nf is the number of features) to non-redundant, non-identical feature pairs with the corresponding correlation value
Examine and show the feature pairings with a correlation value greater than the specified cutoff
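A minimal sketch of one way to do this, not part of the original prompt: mask the correlation matrix to its upper triangle and stack it into pairs. The column name 'target' and the 0.8 cutoff are assumptions; substitute whatever the assignment specifies.

```python
import numpy as np

# Nf x Nf correlation matrix over the feature columns
# (assumes the label column is named 'target')
corr = df.drop(columns=['target']).corr()

# Keep only the upper triangle (k=1 also drops the diagonal),
# then stack into (feature_1, feature_2, correlation) rows
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().reset_index()
pairs.columns = ['feature_1', 'feature_2', 'correlation']

# Placeholder threshold -- replace with the cutoff given in the assignment
print(pairs[pairs['correlation'] > 0.8])
```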
## Partition Data
Extract the features into a new DataFrame named X
Extract the target labels into a DataFrame named y
Make sure the target labels are in a DataFrame, not an array or Series
Model predictions will be saved to this DataFrame
Use double brackets: df[['target']]
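A short sketch of the partition step, again assuming the label column is named 'target' as the double-bracket hint suggests:

```python
# Features: every column except the label column
X = df.drop(columns=['target'])

# Target labels as a DataFrame (double brackets), copied so the
# prediction columns added later do not trigger chained-assignment warnings
y = df[['target']].copy()
```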
## Identify best k with elbow method
Construct a function that produces the plot of SSE versus k. The function should take a feature set X as input and return only the plot.
Consider the following pseudo code:
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

def calculate_sse_vs_k(X):
    ## instantiate a list (or array) to hold the k and SSE values at each iteration
    sse_v_k = []
    ## iterate over values of k
    for k in np.arange(...):
        ## instantiate and fit KMeans with the value of k and a fixed random_state
        ## add the k value and SSE (inertia) to the list
    ## convert to an array so the columns can be sliced below
    sse_v_k = np.array(sse_v_k)
    ## plot the resulting SSE versus k pairs
    plt.figure(figsize=(...))
    plt.scatter(x=sse_v_k[:, 0], y=sse_v_k[:, 1])  ## show as points
    plt.plot(sse_v_k[:, 0], sse_v_k[:, 1])
    plt.xlabel('Cluster number $k$')
    plt.ylabel('SSE (Inertia)')
    plt.xticks(ticks=np.arange(...))
    plt.show()
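The pseudo code leaves the k range, random_state, and figure size unspecified. Purely as an illustration, a filled-in version might look like the sketch below, which assumes k from 1 to 10, random_state=42, and an 8x5 figure; adjust these to whatever the assignment expects.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def calculate_sse_vs_k(X, k_values=np.arange(1, 11), random_state=42):
    """Plot SSE (inertia) versus k for KMeans fit on the feature set X."""
    sse_v_k = []
    for k in k_values:
        # fit KMeans for this k with a fixed random_state
        km = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        km.fit(X)
        sse_v_k.append([k, km.inertia_])
    sse_v_k = np.array(sse_v_k)

    plt.figure(figsize=(8, 5))
    plt.scatter(x=sse_v_k[:, 0], y=sse_v_k[:, 1])   # show as points
    plt.plot(sse_v_k[:, 0], sse_v_k[:, 1])
    plt.xlabel('Cluster number $k$')
    plt.ylabel('SSE (Inertia)')
    plt.xticks(ticks=sse_v_k[:, 0])
    plt.show()
```

Calling calculate_sse_vs_k(X) then produces the elbow plot for the unscaled data.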
Show the SSE versus k plot for the unscaled data
What is the optimal value of k when clustering unscaled data?
Now, instantiate a StandardScaler, fit and transform X, and show the SSE versus k plot for the scaled data
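A sketch of the scaling step, reusing the calculate_sse_vs_k sketch above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_ss = scaler.fit_transform(X)   # standardized copy of the feature set
calculate_sse_vs_k(X_ss)         # SSE versus k for the scaled data
```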
What is the optimal value of k when clustering scaled data?
## With optimal k values found, extract the cluster labels from KMeans Clustering
#### Unscaled data
Instantiate KMeans with the optimal k value found from the unscaled data. Be sure to employ the same random_state that was used in the calculate_sse_vs_k function above.
Fit this KMeans with the unscaled data
Extract the labels from this KMeans and add these as a new column named 'km_label' to the target DataFrame
You may need to align the predicted labels to the actual labels
#### Scaled Data
Instantiate KMeans with the optimal k value found from the scaled data. Be sure to employ the same random_state that was used in the calculate_sse_vs_k function above
Fit this KMeans with the scaled data
Extract the labels from this KMeans and add these as a new column named 'km_ss_label' to the target DataFrame
You may need to align the predicted labels to the actual labels
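One possible sketch covering both KMeans fits. The optimal k values, random_state=42, and the mode-based align_labels helper are all illustrative assumptions; the helper simply renames each cluster to the most common actual label inside it, which is one way to align predicted and actual labels.

```python
import pandas as pd
from sklearn.cluster import KMeans

def align_labels(pred, actual):
    """Map each predicted cluster id to the most common actual label in that cluster."""
    pred = pd.Series(pred, index=actual.index)
    mapping = {c: actual[pred == c].mode()[0] for c in pred.unique()}
    return pred.map(mapping)

k_opt, k_opt_ss = 3, 3   # placeholders: use the values read from the elbow plots

# Unscaled data
km = KMeans(n_clusters=k_opt, random_state=42, n_init=10).fit(X)
y['km_label'] = align_labels(km.labels_, y['target'])

# Scaled data (X_ss from the scaling sketch above)
km_ss = KMeans(n_clusters=k_opt_ss, random_state=42, n_init=10).fit(X_ss)
y['km_ss_label'] = align_labels(km_ss.labels_, y['target'])
```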
## With optimal k values found, extract the cluster labels from Agglomerative Clustering
#### Unscaled data
Instantiate AgglomerativeClustering with the optimal k value found from the unscaled data, and linkage='complete'.
Fit this AgglomerativeClustering with the unscaled data
Extract the labels from this AgglomerativeClustering and add these as a new column named 'agg_label' to the target DataFrame
You may need to align the predicted labels to the actual labels
#### Scaled Data
Instantiate AgglomerativeClustering with the optimal k value found from the scaled data, and linkage='complete'.
Fit this AgglomerativeClustering with the scaled data
Extract the labels from this AgglomerativeClustering and add these as a new column named 'agg_ss_label' to the target DataFrame
You may need to align the predicted labels to the actual labels
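A parallel sketch for the agglomerative fits, reusing the placeholder k values and the align_labels helper from the KMeans sketch:

```python
from sklearn.cluster import AgglomerativeClustering

# Unscaled data
agg = AgglomerativeClustering(n_clusters=k_opt, linkage='complete')
y['agg_label'] = align_labels(agg.fit_predict(X), y['target'])

# Scaled data
agg_ss = AgglomerativeClustering(n_clusters=k_opt_ss, linkage='complete')
y['agg_ss_label'] = align_labels(agg_ss.fit_predict(X_ss), y['target'])
```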
## Compare the clustering results to a k-Nearest Neighbors Classifier
Instantiate KNeighborsClassifier from sklearn.neighbors with default settings, fit to the unscaled data
Tip: The target DataFrame y now has additional columns from the clustering predictions. Be sure to pass y['target'] (not all of y) when fitting
Add the predictions from the kNN classifier fit on the unscaled data to the target DataFrame as 'knn_label'
Repeat for the scaled data. Add those predictions as 'knn_ss_label'
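A sketch of the kNN step under the same column-name assumptions; note that the classifier is fit and then evaluated on the same data, exactly as the exercise describes:

```python
from sklearn.neighbors import KNeighborsClassifier

# Unscaled data
knn = KNeighborsClassifier()
knn.fit(X, y['target'])
y['knn_label'] = knn.predict(X)

# Scaled data
knn_ss = KNeighborsClassifier()
knn_ss.fit(X_ss, y['target'])
y['knn_ss_label'] = knn_ss.predict(X_ss)
```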
## Calculate the Accuracy Score and show the Confusion Matrix for all Predictors
There are six predictions to consider (see the scoring sketch after this list):
KMeans clustering with unscaled and scaled data
Agglomerative clustering with unscaled and scaled data
kNN Classifier with unscaled and scaled data
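Assuming the six prediction columns created above, a compact scoring loop might look like:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

pred_cols = ['km_label', 'km_ss_label',
             'agg_label', 'agg_ss_label',
             'knn_label', 'knn_ss_label']

for col in pred_cols:
    acc = accuracy_score(y['target'], y[col])
    print(f"{col}: accuracy = {acc:.3f}")
    print(confusion_matrix(y['target'], y[col]))
```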
Which predictor performed the best?
Was there any case when the predictor with unscaled data outperformed the same predictor with scaled data?