Question
PLEASE HELP ME COMPLETE THESE PYTHON PROGRAMMING ACTIVITIES
Activity 1: Create the Dummy Dataset
In this activity, you have to create a dummy dataset for multiclass classification.
The steps to be followed are as follows:
1. Create a dummy dataset having two columns representing two independent variables and a third column representing the target.
The records should be divided into 6 groups of arbitrary sizes, such as [500, 2270, 1900, 41, 2121, 272], so that the target column has 6 different labels [0, 1, 2, 3, 4, 5].
Recall:
To create the dummy dataset, use the make_blobs() function of the sklearn.datasets module, which returns two arrays: a feature array and a target array. The syntax for the make_blobs() function is as follows:
Syntax: make_blobs(n_samples, centers, n_features, random_state, cluster_std)
[ ]
# Instruction to remove warning messages
[ ]
# Create two arrays using the 'make_blobs()' function and store them in the 'features_array' and 'target_array' variables.
Hint:
In the make_blobs() function, use n_samples=[500, 2270, 1900, 41, 2121, 272] and centers=None to divide the target labels into six groups.
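A minimal sketch of this step, assuming scikit-learn is installed. The value random_state=42 is an assumption added here for reproducibility; the activity does not specify a seed for this step.

```python
import warnings

from sklearn.datasets import make_blobs

# Suppress warning messages, as the first notebook cell asks.
warnings.filterwarnings("ignore")

# One entry per label: label 0 gets 500 records, label 1 gets 2270, and so on.
group_sizes = [500, 2270, 1900, 41, 2121, 272]

# When n_samples is a list, centers must be None (or an explicit array of
# centers); the number of clusters is then taken from the length of the list.
features_array, target_array = make_blobs(
    n_samples=group_sizes,
    centers=None,
    n_features=2,
    random_state=42,  # assumed seed, not given in the activity
)
```

Passing a list to n_samples is what makes the groups unequal in size; a single integer would split the records evenly across the clusters.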
2. Print the object type of the arrays created by the make_blobs() function, and also print the number of rows and columns in them:
[ ]
# Find out the object type of the arrays created by the 'make_blobs()' function and the number of rows and columns in them.
# Print the type of 'features_array' and 'target_array'
# Print the number of rows and columns of 'features_array'
# Print the number of rows and columns of 'target_array'
Q: How many rows are created in the feature and target columns?
A:
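One possible way to inspect the arrays is sketched below. The arrays are regenerated so the snippet runs on its own, and the seed (random_state=42) is again an assumption.

```python
from sklearn.datasets import make_blobs

features_array, target_array = make_blobs(
    n_samples=[500, 2270, 1900, 41, 2121, 272],
    centers=None,
    n_features=2,
    random_state=42,  # assumed seed
)

# Both objects are NumPy arrays.
print(type(features_array))
print(type(target_array))

# 'shape' is (rows, columns) for the 2-D feature array and (rows,) for the
# 1-D target array.
print(features_array.shape)
print(target_array.shape)
```

The row count is the sum of the six group sizes, 500 + 2270 + 1900 + 41 + 2121 + 272 = 7104, so both arrays should report 7104 rows.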
3. Create a DataFrame from the two arrays using a Python dictionary.
Steps: (Learned in "Logistic Regression - Decision Boundary" lesson)
Create a dummy dictionary.
Add the feature columns under the keys col 1 and col 2, and the target column under the key target.
Add the values from the feature and target columns one by one respectively in the dictionary using List Comprehension.
Convert the dictionary into a DataFrame
Print the first five rows of the DataFrame.
[ ]
# Create a Pandas DataFrame containing the items from the 'features_array' and 'target_array' arrays.
# Import the module
# Create a dummy dictionary
# Convert the dictionary into DataFrame
# Print first five rows of the DataFrame
Hint:
Use the from_dict() function to convert a Python dictionary to a DataFrame.
Syntax: pd.DataFrame.from_dict(some_dictionary)
After this activity, the DataFrame should be created with two independent features columns and one dependent target column.
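The dictionary-based steps above could look roughly like the following. The variable names dummy_dict and dummy_df are illustrative choices, not names required by the activity, and the seed is assumed.

```python
import pandas as pd
from sklearn.datasets import make_blobs

features_array, target_array = make_blobs(
    n_samples=[500, 2270, 1900, 41, 2121, 272],
    centers=None,
    n_features=2,
    random_state=42,  # assumed seed
)

n_rows = features_array.shape[0]

# Build each column with a list comprehension, one value per record.
dummy_dict = {
    "col 1": [features_array[i][0] for i in range(n_rows)],
    "col 2": [features_array[i][1] for i in range(n_rows)],
    "target": [target_array[i] for i in range(n_rows)],
}

# Each dictionary key becomes a column name.
dummy_df = pd.DataFrame.from_dict(dummy_dict)
print(dummy_df.head())
```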
Activity 2: Dataset Inspection
In this activity, you have to look into the distribution of the labels in the target column of the DataFrame.
1. Print the number of occurrences of each label in the target column:
[ ]
# Display the number of occurrences of each label in the 'target' column.
2. Print the percentage of the samples for each label in the target column:
[ ]
# Get the percentage of samples for each label in the dataset.
Q: How many unique labels are present in the DataFrame? What are they?
A:
Q: Is the DataFrame balanced?
A:
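A sketch of the two counting steps, assuming the DataFrame has the columns col 1, col 2, and target. It is rebuilt here in a compact way so the snippet is self-contained; the names dummy_df, label_counts, and label_percent are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_blobs

features_array, target_array = make_blobs(
    n_samples=[500, 2270, 1900, 41, 2121, 272],
    centers=None,
    n_features=2,
    random_state=42,  # assumed seed
)
dummy_df = pd.DataFrame(features_array, columns=["col 1", "col 2"])
dummy_df["target"] = target_array

# Number of occurrences of each label.
label_counts = dummy_df["target"].value_counts()
print(label_counts)

# Share of each label as a percentage of all records.
label_percent = dummy_df["target"].value_counts(normalize=True) * 100
print(label_percent)
```

Because the group sizes range from 41 to 2270 records, the counts differ widely between labels, which is what the balance question is asking about.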
3. Create a scatter plot between the columns col 1 and col 2 for all the labels to visualize the clusters of every label (or points):
[ ]
# Create a scatter plot between 'col 1' and 'col 2' columns separately for all the labels in the same plot.
# Import the module
# Define the size of the graph
# Create a for loop executing for every unique label in the 'target' column.
# Plot the scatter plot for 'col 1' and 'col 2' where 'target == i'
# Plot the x and y labels
# Display the legends and the graph
Hint: Revise the lesson "Logistic Regression - Decision Boundary".
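One way to draw the per-label scatter plot is sketched below, assuming matplotlib is available. The Agg backend line is an assumption that lets the snippet run as a plain script; in a notebook it can be dropped and plt.show() called at the end. The figure size is also an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for script use
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_blobs

features_array, target_array = make_blobs(
    n_samples=[500, 2270, 1900, 41, 2121, 272],
    centers=None,
    n_features=2,
    random_state=42,  # assumed seed
)
dummy_df = pd.DataFrame(features_array, columns=["col 1", "col 2"])
dummy_df["target"] = target_array

fig, ax = plt.subplots(figsize=(10, 6))  # arbitrary figure size

# One scatter call per label, so each label gets its own color and legend entry.
for label in sorted(dummy_df["target"].unique()):
    subset = dummy_df[dummy_df["target"] == label]
    ax.scatter(subset["col 1"], subset["col 2"], label=f"target = {label}")

ax.set_xlabel("col 1")
ax.set_ylabel("col 2")
ax.legend()
```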
After this activity, the labels to be predicted (that is, the values of the target variable) and their distribution should be known.
Activity 3: Train-Test Split
We need to predict the value of the target variable, using other variables. Thus, the target is the dependent variable and other columns are the independent variables.
1. Split the dataset into the training set and test set such that the training set contains 70% of the instances and the remaining instances will become the test set.
2. Set random_state = 42.
[ ]
# Import the 'train_test_split' function
# Create the features data frame holding all the columns except the last column
# and print first five rows of this dataframe
# Create the target series that holds last column 'target'
# and print first five rows of this series
# Split the train and test sets using the 'train_test_split()' function.
3. Print the number of rows and columns in the training and testing set:
[ ]
# Print the shape of all the four variables i.e. 'X_train', 'X_test', 'y_train' and 'y_test'
After this activity, the features and target data should be split into training and testing data.
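A sketch of the 70/30 split with random_state=42, as the activity specifies. The DataFrame is rebuilt compactly with an assumed seed for make_blobs so the snippet runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

features_array, target_array = make_blobs(
    n_samples=[500, 2270, 1900, 41, 2121, 272],
    centers=None,
    n_features=2,
    random_state=42,  # assumed seed for the dummy data
)
dummy_df = pd.DataFrame(features_array, columns=["col 1", "col 2"])
dummy_df["target"] = target_array

# Features: every column except the last one.
X = dummy_df.iloc[:, :-1]
print(X.head())

# Target: the last column.
y = dummy_df["target"]
print(y.head())

# test_size=0.3 leaves 70% of the records for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```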
Activity 4: Apply SMOTE
In this activity, if the data is imbalanced, oversample the data for the minority classes in the following way:
Create an object for the SMOTE using SMOTE() function.
Synthesize the data for the minority classes using the fit_resample() function (named fit_sample() in older imbalanced-learn releases) by passing the feature and target training variables.
Save the output of the above step, artificial data, in the new feature and target training variables.
[ ]
# Write the code to oversample the data.
# Import the 'SMOTE' class from the 'imblearn.over_sampling' module.
# Call the 'SMOTE()' function and store the result in a variable.
# Call the 'fit_resample()' function (named 'fit_sample()' in older imbalanced-learn releases).
Print the number of rows and columns in the original and artificial feature and target data:
[ ]
# Print the number of rows and columns in the original and resampled data.
Q: How many rows and columns are there in the original features data?
A:
Q: How many rows and columns are there in the artificially generated features data?
A:
Print the number of occurrences of labels in the artificially generated target data:
[ ]
# Display the number of occurrences of each label in the artificially generated target data.
Q: Is the number of occurrences equal for all the labels?
A:
After this activity, the training feature and target data should include the synthetic data, so that all the labels have equal occurrences and the data is balanced.