Question

Posted on Jul 30, 2024

Task

You are to import and clean the same HealthCareData_2024.csv that was used in the previous assignment. Then run, tune and evaluate two supervised ML algorithms (each with two types of training data) to identify the most accurate way of classifying malicious events.

Part 1: General data preparation and cleaning

a) Import the HealthCareData_2024.csv into R Studio. This version is the same as Assignment 1.

b) Write the appropriate code in R Studio to prepare and clean the HealthCareData_2024 dataset as follows:

i. Clean the whole dataset based on the feedback received for Assignment 1.

ii. For the feature NetworkInteractionType, merge the Regular and Unknown categories together to form the category Others. Hint: use the forcats::fct_collapse(.) function.

iii. Select only the complete cases using the na.omit(.) function, and name the dataset dat.cleaned.

Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
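Steps ii and iii can be sketched on a small made-up data frame. This is only an illustration, not the assignment solution: the data frame toy, the column ResponseTime and the level "Elevated" are hypothetical, and only the column name NetworkInteractionType and the categories Regular, Unknown and Others come from the task. It assumes the forcats package is installed.

```r
library(forcats)

# Hypothetical toy data standing in for the real dataset;
# "Elevated" and ResponseTime are invented for illustration
toy <- data.frame(
  NetworkInteractionType = factor(c("Regular", "Unknown", "Elevated", "Regular", NA)),
  ResponseTime = c(1.2, NA, 3.4, 0.8, 2.1)
)

# Step ii: merge "Regular" and "Unknown" into a single level "Others"
toy$NetworkInteractionType <- fct_collapse(
  toy$NetworkInteractionType,
  Others = c("Regular", "Unknown")
)

# Step iii: keep only rows with no missing value in any column
toy.cleaned <- na.omit(toy)

levels(toy.cleaned$NetworkInteractionType)  # levels after merging
nrow(toy.cleaned)                           # complete cases that remain
```

Collapsing the levels before calling na.omit(.) does not change which rows are dropped, since na.omit(.) removes every row that has a missing value in any column.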

c) Use the code below to generate two training datasets (one unbalanced, mydata.ub.train, and one balanced, mydata.b.train) along with the testing set (mydata.test). Make sure you enter your student ID into the command set.seed(.).

# Separate samples of normal and malicious events
dat.class0 <- dat.cleaned %>% filter(Classification == "Normal")     # normal
dat.class1 <- dat.cleaned %>% filter(Classification == "Malicious")  # malicious

# Randomly select 9600 non-malicious and 400 malicious samples using your student ID,
# then combine them to form a working data set
set.seed(Enter your Student ID)
rows.train0 <- sample(1:nrow(dat.class0), size = 9600, replace = FALSE)
rows.train1 <- sample(1:nrow(dat.class1), size = 400, replace = FALSE)

# Your 10000 unbalanced training samples
train.class0 <- dat.class0[rows.train0,]  # Non-malicious samples
train.class1 <- dat.class1[rows.train1,]  # Malicious samples
mydata.ub.train <- rbind(train.class0, train.class1)

# Your 19200 balanced training samples, i.e. 9600 normal and malicious samples each.
set.seed(Enter your Student ID)
train.class1_2 <- train.class1[sample(1:nrow(train.class1), size = 9600, replace = TRUE),]
mydata.b.train <- rbind(train.class0, train.class1_2)

# Your testing samples
test.class0 <- dat.class0[-rows.train0,]
test.class1 <- dat.class1[-rows.train1,]
mydata.test <- rbind(test.class0, test.class1)

Note that in the master data set, the percentage of malicious events is approximately 4%. This distribution is roughly represented by the unbalanced data. The balanced data is generated based on up-sampling of the minority class using bootstrapping. The idea here is to ensure the trained model is not biased towards the majority class, i.e. normal events.
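The effect of up-sampling by bootstrapping can be seen on a small made-up example. The sketch below is an illustration under stated assumptions, not part of the assignment code: the data frame toy, its id column, the 96/4 split and the seed value are all hypothetical. It uses base R only.

```r
set.seed(1)  # arbitrary seed, used for this illustration only

# Hypothetical toy data: 96 normal and 4 malicious events (4% malicious)
toy <- data.frame(
  id = 1:100,
  Classification = c(rep("Normal", 96), rep("Malicious", 4))
)

normal    <- toy[toy$Classification == "Normal", ]
malicious <- toy[toy$Classification == "Malicious", ]

# Bootstrap the minority class: draw its rows with replacement
# until it is the same size as the majority class
boot.rows    <- sample(1:nrow(malicious), size = nrow(normal), replace = TRUE)
malicious.up <- malicious[boot.rows, ]

balanced <- rbind(normal, malicious.up)

prop.table(table(toy$Classification))       # class shares before up-sampling
prop.table(table(balanced$Classification))  # class shares after up-sampling
```

The balanced set holds repeated copies of the same few malicious rows, so up-sampling changes the class weights seen during training without adding new information about malicious events.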

Part 2: Compare the performances of different ML algorithms

a) Randomly select two supervised learning modelling algorithms to test against one another by running the following code. Make sure you enter your student ID into the command set.seed(.). Your 2 ML approaches are given by myModels.

set.seed(Enter your student ID)
models.list1 <- c("Logistic Ridge Regression",
                  "Logistic LASSO Regression",
                  "Logistic Elastic-Net Regression")
models.list2 <- c("Classification Tree",
                  "Bagging Tree",
                  "Random Forest")
myModels <- c(sample(models.list1, size = 1),
              sample(models.list2, size = 1))
myModels %>% data.frame
