Question
Task
You are to import and clean the same HealthCareData_2024.csv that was used in the
previous assignment. Then run, tune and evaluate two supervised ML algorithms, each
with two types of training data, to identify the most accurate way of classifying
malicious events.
Part 1: General data preparation and cleaning
a) Import HealthCareData_2024.csv into R Studio. This version is the same as the one
used in the previous assignment.
b) Write the appropriate code in R Studio to prepare and clean the
HealthCareData dataset as follows:
i. Clean the whole dataset based on the feedback received for the previous assignment.
ii. For the feature NetworkInteractionType, merge the Regular and
Unknown categories together to form the category Others. Hint: use the
forcats::fct_collapse() function.
iii. Select only the complete cases using the na.omit() function, and name the
dataset dat.cleaned.
Briefly outline the preparation and cleaning process in your report and why you
believe the above steps were necessary.
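Steps ii and iii can be sketched on a small toy data frame. This is only an illustration under stated assumptions: the rows below are made up, the level "Critical" is a hypothetical placeholder for whatever other categories the real feature contains, and the forcats package is assumed to be installed. The Assignment-specific cleaning in step i is not shown.

```r
library(forcats)

# Toy stand-in for the imported data (hypothetical rows). The real file would
# be read with something like:
#   read.csv("HealthCareData_2024.csv", stringsAsFactors = TRUE)
toy <- data.frame(
  NetworkInteractionType = factor(c("Regular", "Unknown", "Critical", "Regular", NA)),
  Classification = factor(c("Normal", "Normal", "Malicious", "Normal", "Malicious"))
)

# Step ii: merge the Regular and Unknown levels into one level called Others.
toy$NetworkInteractionType <- fct_collapse(
  toy$NetworkInteractionType,
  Others = c("Regular", "Unknown")
)

# Step iii: keep only the complete cases (rows with no missing values).
toy.cleaned <- na.omit(toy)

levels(toy.cleaned$NetworkInteractionType)
nrow(toy.cleaned)
```

On the real data the same two calls would be applied to the imported data frame, with the result stored as dat.cleaned.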
c) Use the code below to generate two training datasets (one unbalanced,
mydata.ubtrain, and one balanced, mydata.btrain) along with the testing set
(mydata.test). Make sure you enter your student ID into the command
set.seed(.).
# Separate samples of normal and malicious events
dat.class0 <- dat.cleaned %>% filter(Classification == "Normal") # normal
dat.class1 <- dat.cleaned %>% filter(Classification == "Malicious") # malicious
# Randomly select non-malicious and malicious samples using your student
# ID, then combine them to form a working data set
set.seed(Enter your Student ID)
rows.train0 <- sample(1:nrow(dat.class0), size = [...], replace = FALSE)
rows.train1 <- sample(1:nrow(dat.class1), size = [...], replace = FALSE)
# Your unbalanced training samples
train.class0 <- dat.class0[rows.train0, ] # Non-malicious samples
train.class1 <- dat.class1[rows.train1, ] # Malicious samples
mydata.ubtrain <- rbind(train.class0, train.class1)
# Your balanced training samples, i.e. an equal number of normal and
# malicious samples
set.seed(Enter your Student ID)
train.class1_2 <- train.class1[sample(1:nrow(train.class1), size = [...],
                                      replace = TRUE), ]
mydata.btrain <- rbind(train.class0, train.class1_2)
# Your testing samples
test.class0 <- dat.class0[-rows.train0, ]
test.class1 <- dat.class1[-rows.train1, ]
mydata.test <- rbind(test.class0, test.class1)
Note that in the master data set, the percentage of malicious events is
approximately [...]. This distribution is roughly represented by the unbalanced
data. The balanced data is generated by upsampling the minority class
using bootstrapping. The idea here is to ensure the trained model is not biased
towards the majority class, i.e. normal events.
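The upsampling idea can be illustrated with a minimal base R sketch on made-up data. The class sizes (8 normal, 2 malicious), the seed and all toy.* names are assumptions chosen for the example, and resampling the minority class up to the size of the majority class is one common choice rather than necessarily the sizes used in the assignment.

```r
set.seed(123)  # arbitrary seed for the toy example, not a student ID

# Toy training samples: 8 normal rows and 2 malicious rows (hypothetical sizes).
toy.class0 <- data.frame(x = rnorm(8), Classification = "Normal")
toy.class1 <- data.frame(x = rnorm(2), Classification = "Malicious")

# Unbalanced set: stack the two classes as they are.
toy.ubtrain <- rbind(toy.class0, toy.class1)

# Balanced set: draw minority-class rows with replacement (bootstrap)
# until there are as many of them as majority-class rows.
boot.rows <- sample(1:nrow(toy.class1), size = nrow(toy.class0), replace = TRUE)
toy.btrain <- rbind(toy.class0, toy.class1[boot.rows, ])

table(toy.ubtrain$Classification)
table(toy.btrain$Classification)
```

Because the bootstrap only repeats existing minority rows, the balanced set contains no new malicious observations; it changes the class proportions the model sees during training.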
Part 2: Compare the performances of different ML algorithms
a) Randomly select two supervised learning modelling algorithms to test against
one another by running the following code. Make sure you enter your student ID
into the command set.seed(.). Your ML approaches are given by myModels.
set.seed(Enter your student ID)
models.list1 <- c("Logistic Ridge Regression",
                  "Logistic LASSO Regression",
                  "Logistic ElasticNet Regression")
models.list2 <- c("Classification Tree",
                  "Bagging Tree",
                  "Random Forest")
myModels <- c(sample(models.list1, size = 1),
              sample(models.list2, size = 1))
myModels %>% data.frame