Question
Here is the project:
[Overview and Rationale
Data mining is used to reveal hard to see and hidden patterns and relationships in Big Data datasets. Data mining helps to classify data for further examination or create models to predict outcomes for a different set of data. As data miners, you should be able to explain how the code used to mine the data is functioning and be able to analyze and interpret the results of the mining. This allows you to summarize and clarify the results for stakeholders.
Assignment Description
Many people forage for mushrooms and sell them to restaurants or use them for their own consumption. These are experts who know their mushrooms. However, as a novice, it is important to be able to spot a poisonous mushroom.
In this assignment, you will use the data set provided to mine the data using the methods presented in this module. You will document in a report the results of each step of the mining process, analyze and interpret the results. Suggest the characteristics to use when determining if a mushroom is safe to eat. Make recommendations for additional analysis and variables to examine to build other classifications such as use of the mushrooms that are not poisonous.
mushrooms.xlsx
Instructions
The report should include the following:
- Code walk-through: in this section provide a step-by-step explanation of how the code is interacting with and/or transforming the data. Provide examples from the output to support your explanations.
- Analysis: Based on the output, analyze the data and the relationships revealed about the variables of interest. Explain the insights provided by the output. Use visualizations to support your analysis.
- Interpretation and Recommendations: Interpret the results of your analysis and explain what the results mean for the data owner. Provide recommendations for actions to be taken based on your interpretation. Support those with the data. Explain why and what explicit variables you suggest incorporating. For example, median income by city and state from the census.gov website might be useful for examining home ownership.
]
Here is what I did:
Step 1: Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Step 2: Load the dataset into a Pandas dataframe
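# read the first row as column names so that the 'class' column can be referenced by name in Step 6
# (header=None would give integer column labels and treat the header row as data)
mushrooms = pd.read_excel('/content/mushrooms.xlsx')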
Step 3: Explore the data
# view the first few rows of the data
mushrooms.head()
# check the dimensions of the dataset
mushrooms.shape
# check the data types of each variable
mushrooms.dtypes
Step 4: Clean and preprocess the data
# check for missing values
mushrooms.isnull().sum()
# encode the categorical variables as numerical variables
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
for col in mushrooms.columns:
    mushrooms[col] = encoder.fit_transform(mushrooms[col])
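Because the loop above reuses one encoder for every column, the letter-to-number mapping of each column is lost after the loop finishes. A minimal sketch, assuming one LabelEncoder is kept per column, shows how the mappings could be preserved for the report. The toy DataFrame and the names `toy`, `encoders`, and `mappings` are made up for illustration and are not taken from mushrooms.xlsx.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in data (made up for illustration, not the real mushrooms.xlsx)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "n", "a", "f"],
})

# Keep a separate fitted encoder for each column
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Build a readable {original label: integer code} mapping per column
mappings = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}

print(mappings["class"])
print(mappings["odor"])
```

LabelEncoder assigns codes in sorted order of the labels, so the mapping for the target column tells you which integer stands for which class when reading later output.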
Step 5: Visualize the data
# visualize the distribution of each variable
mushrooms.hist(figsize=(20,20))
# visualize the correlation between variables
sns.heatmap(mushrooms.corr())
Step 6: Train and evaluate models
# split the data into training and testing sets
from sklearn.model_selection import train_test_split
X = mushrooms.drop(columns=['class'])
y = mushrooms['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# train a decision tree model
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
# evaluate the model on the testing set
from sklearn.metrics import accuracy_score
y_pred = tree.predict(X_test)
accuracy_score(y_test, y_pred)
But I can't create a decision tree diagram. I want something like this created:
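For reference, one possible way to draw a fitted tree is scikit-learn's `plot_tree` function. The sketch below is only an illustration under stated assumptions: it uses a small made-up DataFrame instead of mushrooms.xlsx, the column names `odor` and `cap_color` are placeholders, `max_depth=3` is an arbitrary choice to keep the picture readable, and the class names assume that 'e' means edible and 'p' means poisonous.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Toy stand-in data (made up for illustration, not the real mushrooms.xlsx)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p", "e", "p", "e", "e"],
    "odor": ["f", "n", "a", "f", "n", "p", "a", "n"],
    "cap_color": ["n", "y", "w", "w", "n", "n", "y", "w"],
})

# Encode every column as integers, as in Step 4
encoded = toy.copy()
for col in encoded.columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

X = encoded.drop(columns=["class"])
y = encoded["class"]

# A shallow tree keeps the diagram legible (max_depth=3 is an assumption)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Draw the tree; class_names follow the sorted encoder order ('e' -> 0, 'p' -> 1)
fig, ax = plt.subplots(figsize=(10, 6))
annotations = plot_tree(
    tree,
    feature_names=list(X.columns),
    class_names=["edible", "poisonous"],
    filled=True,
    ax=ax,
)
fig.savefig("toy_tree.png", dpi=150)
```

With the real data, the same call would presumably take the `tree` fitted in Step 6 and `feature_names=list(X.columns)`. In a notebook the figure is shown below the cell, and the saved PNG can be placed in the report.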