Question

Load the data into a DataFrame
df = pd.read_csv('seeds.csv')
Correlation Analysis
Calculate the correlation values
Reduce the Nf x Nf correlation DataFrame (Nf being the number of features) to non-redundant, non-identical feature pairs with the corresponding correlation value
Examine and show the feature pairings with a correlation value greater than 0.7
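One way to carry out the reduction described above is to keep only the strict upper triangle of the correlation matrix, which drops both the diagonal (identical pairs) and the mirrored lower half (redundant pairs). The following is a minimal sketch on synthetic stand-in data; the helper name `high_correlation_pairs` and the demo column names are hypothetical and not part of the assignment.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.7):
    """Return unique feature pairs whose correlation exceeds `threshold`."""
    corr = df.corr()
    cols = corr.columns
    # Indices of the strict upper triangle (k=1 skips the diagonal)
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "feature_1": cols[i],
        "feature_2": cols[j],
        "correlation": corr.to_numpy()[i, j],
    })
    # Keep pairs above the threshold, strongest first
    strong = pairs[pairs["correlation"] > threshold]
    return strong.sort_values("correlation", ascending=False).reset_index(drop=True)

# Synthetic stand-in for seeds.csv: 'b' is built to track 'a' closely
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
strong_pairs = high_correlation_pairs(demo)
print(strong_pairs)
```

With the real data, the call would be `high_correlation_pairs(df)`. Filtering on `pairs["correlation"].abs()` instead would also surface strong negative correlations, if those are of interest.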
Partition Data
Extract the features into a new DataFrame named X
Extract the target labels into a DataFrame named y
Make sure the target labels are in a DataFrame, not an array or Series
Model predictions will be saved to this DataFrame
Use double brackets: df[['target']]
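The difference between single and double brackets can be checked directly. This is a small sketch on a made-up three-row frame; the feature column names are hypothetical, and only `target` comes from the text above.

```python
import pandas as pd

# Tiny stand-in for the seeds data (feature names are hypothetical)
df_demo = pd.DataFrame({
    "area": [15.2, 14.8, 18.9],
    "perimeter": [14.8, 14.5, 16.2],
    "target": [0, 0, 1],
})

# Features: every column except the target
X = df_demo.drop(columns="target")

# Double brackets return a DataFrame; .copy() avoids chained-assignment
# warnings when prediction columns are added later
y = df_demo[["target"]].copy()

# Single brackets return a Series, which is not what the steps ask for
y_series = df_demo["target"]

print(type(y).__name__, type(y_series).__name__)
```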
Identify best k with elbow method
Construct a function that produces the plot of SSE versus k. The function should use a feature set X as input and return only the plot.
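Consider the following sketch, where the seed value 42 is an arbitrary choice:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

RANDOM_STATE = 42  ## arbitrary fixed seed (assumption); reuse it for the later KMeans models

def calculate_sse_vs_k(X):
    ## instantiate a list to hold the k and SSE values at each iteration
    sse_v_k = []
    ## iterate over values of k
    for k in np.arange(1, 21, 1):
        ## instantiate and fit KMeans with the value of k and fixed random_state
        km = KMeans(n_clusters=int(k), n_init=10, random_state=RANDOM_STATE).fit(X)
        ## add the k value and SSE (inertia_) to the list
        sse_v_k.append([k, km.inertia_])
    ## convert the list to an array so that its columns can be sliced
    sse_v_k = np.array(sse_v_k)
    ## plot the resulting sse versus k pairs
    plt.figure(figsize=(9, 5))
    plt.scatter(x=sse_v_k[:, 0], y=sse_v_k[:, 1])  ## show as points
    plt.plot(sse_v_k[:, 0], sse_v_k[:, 1])
    plt.xlabel('Cluster number $k$')
    plt.ylabel('SSE (Inertia)')
    plt.xticks(ticks=np.arange(1, 21, 1))
    plt.show()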
Show the SSE versus k plot for unscaled data
- What is the optimal value of k when clustering unscaled data?
Now, instantiate a StandardScaler, scale and transform X, and show the SSE versus k for scaled data
- What is the optimal value of k when clustering scaled data?
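The scaling step can be sketched as follows. The data here is synthetic, the names `X_demo` and `X_ss` are hypothetical, and the seed 42 and `n_init=10` are assumptions rather than values given in the assignment. The SSE values are collected in a list instead of being plotted, so the sketch runs without a display.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with two features on very different scales
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "f1": rng.normal(50, 10, 90),
    "f2": rng.normal(0.5, 0.1, 90),
})

# Fit the scaler and transform in one step; wrap the result back into
# a DataFrame so column names and the index are preserved
scaler = StandardScaler()
X_ss = pd.DataFrame(
    scaler.fit_transform(X_demo), columns=X_demo.columns, index=X_demo.index
)

# SSE (inertia_) for a few values of k on the scaled data
sse = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_ss).inertia_
    for k in range(1, 6)
]
print(sse)
```

After standardization each feature has zero mean and unit variance, so the SSE at k = 1 equals the number of samples times the number of features. That gives a quick sanity check on the scaled curve.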
With optimal k values found, extract the cluster labels from KMeans Clustering
#### Unscaled data
- Instantiate KMeans with the optimal k value found from the unscaled data. Be sure to employ the same random_state that was used in the calculate_sse_vs_k function above.
- Fit this KMeans with the unscaled data
- Extract the labels_ from this KMeans and add these as a new column named 'km_label' to the target DataFrame
- You may need to align the predicted labels to the actual labels
#### Scaled Data
- Instantiate KMeans with the optimal k value found from the scaled data. Be sure to employ the same random_state that was used in the calculate_sse_vs_k function above
- Fit this KMeans with the scaled data
- Extract the labels_ from this KMeans and add these as a new column named 'km_ss_label' to the target DataFrame
- You may need to align the predicted labels to the actual labels
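Cluster numbers from KMeans are arbitrary, so cluster 0 need not correspond to class 0. One possible way to align them, sketched below, is to cross-tabulate clusters against classes and pick the one-to-one matching with the largest total overlap using `scipy.optimize.linear_sum_assignment`. The helper `align_labels` is hypothetical, the blob data is synthetic, and the values k = 3 and seed 42 are assumptions for the demo only.

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def align_labels(y_true, y_pred):
    """Relabel clusters so they best match the true classes.

    Assumes there are no more clusters than classes.
    """
    y_pred = np.asarray(y_pred)
    # Rows: cluster ids, columns: true classes, cells: overlap counts
    table = pd.crosstab(y_pred, np.asarray(y_true))
    # Negate the counts so the minimum-cost matching maximizes overlap
    rows, cols = linear_sum_assignment(-table.to_numpy())
    mapping = {table.index[r]: table.columns[c] for r, c in zip(rows, cols)}
    return np.array([mapping[c] for c in y_pred])

# Well-separated synthetic blobs standing in for the seeds features
features, labels = make_blobs(
    n_samples=90, centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5, random_state=0,
)
X = pd.DataFrame(features, columns=["f1", "f2"])
y = pd.DataFrame({"target": labels})

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
y["km_label"] = align_labels(y["target"], km.labels_)
print(pd.crosstab(y["target"], y["km_label"]))
```

The same helper could be applied to the scaled fit to produce `km_ss_label`. A simpler alternative is to map each cluster to its most common true class, although that can assign two clusters to the same class.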
With optimal k values found, extract the cluster labels from Agglomerative Clustering
#### Unscaled data
- Instantiate AgglomerativeClustering with the optimal k value found from the unscaled data, and linkage= 'complete'.
- Fit this AgglomerativeClustering with the unscaled data
- Extract the labels_ from this AgglomerativeClustering and add these as a new column named 'agg_label' to the target DataFrame
- You may need to align the predicted labels to the actual labels
#### Scaled Data
- Instantiate AgglomerativeClustering with the optimal k value found from the scaled data, and linkage= 'complete'.
- Fit this AgglomerativeClustering with the scaled data
- Extract the labels_ from this AgglomerativeClustering and add these as a new column named 'agg_ss_label' to the target DataFrame
- You may need to align the predicted labels to the actual labels
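The agglomerative steps follow the same pattern as KMeans, except that `AgglomerativeClustering` is deterministic and takes no `random_state`. The sketch below uses synthetic blobs and an assumed k = 3; it leaves the raw cluster ids unaligned and uses a cross-tabulation to show how they line up with the true classes.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Well-separated synthetic blobs standing in for the seeds features
features, labels = make_blobs(
    n_samples=90, centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5, random_state=0,
)
X = pd.DataFrame(features, columns=["f1", "f2"])
X_ss = StandardScaler().fit_transform(X)
y = pd.DataFrame({"target": labels})

# Complete linkage merges clusters by their farthest pair of points
agg = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)
y["agg_label"] = agg.labels_

agg_ss = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X_ss)
y["agg_ss_label"] = agg_ss.labels_

# Each row (cluster) should be dominated by a single true class
table = pd.crosstab(y["agg_label"], y["target"])
print(table)
```

If the cluster ids do not match the class ids in the table, they need to be remapped before computing accuracy.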
## Compare the clustering results to k-Nearest Neighbors Classifier
- Instantiate KNeighborsClassifier from sklearn.neighbors with default settings, fit to the unscaled data
- Tip: The target dataset y has additional columns from clustering predictions. Be sure to call y.target
- Add the predictions from the k-NN classifier and the unscaled data to the target DataFrame as knn_label
- Repeat for scaled data. Add those predictions as knn_ss_label
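The k-NN steps can be sketched as below on synthetic data. Passing `y.target` (a Series) rather than the whole `y` DataFrame keeps the earlier prediction columns out of the fit. The placeholder `km_label` column only mimics a previously added prediction. As the steps are written, the classifier predicts on the same rows it was fit on, so the resulting scores are training-set scores.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Well-separated synthetic blobs standing in for the seeds features
features, labels = make_blobs(
    n_samples=90, centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5, random_state=0,
)
X = pd.DataFrame(features, columns=["f1", "f2"])
X_ss = StandardScaler().fit_transform(X)

# y already carries an extra column, as it would after the clustering steps
y = pd.DataFrame({"target": labels, "km_label": labels})

# Default settings (n_neighbors=5); fit on the true labels only
knn = KNeighborsClassifier().fit(X, y.target)
y["knn_label"] = knn.predict(X)

knn_ss = KNeighborsClassifier().fit(X_ss, y.target)
y["knn_ss_label"] = knn_ss.predict(X_ss)

print(y.head())
```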
## Calculate the Accuracy Score and show the Confusion Matrix for all Predictors
- There are six predictions to consider
- KMeans clustering with unscaled and scaled data
- Agglomerative clustering with unscaled and scaled data
- k-NN Classifier with unscaled and scaled data
- Which predictor performed the best?
- Was there any case when the predictor with unscaled data outperformed the same predictor with scaled data?
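Once all prediction columns sit in the target DataFrame, the scores can be collected in one loop. The sketch below uses a hand-made six-row frame with two of the six prediction columns, so the numbers are illustrative only and say nothing about the seeds data.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made stand-in for the target DataFrame with prediction columns
y = pd.DataFrame({
    "target":    [0, 0, 1, 1, 2, 2],
    "km_label":  [0, 0, 1, 1, 2, 2],
    "knn_label": [0, 1, 1, 1, 2, 2],
})

scores = {}
matrices = {}
# Compare every prediction column against the true labels
for col in y.columns.drop("target"):
    scores[col] = accuracy_score(y["target"], y[col])
    # Rows are true classes, columns are predicted classes
    matrices[col] = confusion_matrix(y["target"], y[col])
    print(col, round(scores[col], 3))
    print(matrices[col])

# Name of the column with the highest accuracy
best = max(scores, key=scores.get)
print("best:", best)
```

Cluster labels must be aligned to the true classes before this step, otherwise a perfectly good clustering can score close to zero.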
