
Question

Posted on Aug 08, 2024

Advanced machine learning models are beginning to revolutionise the medical sciences, where they are finding use in the detection and diagnosis of disease. The use of algorithms to diagnose and inform on conditions including dementia, diabetes and various forms of cancer has shown considerable promise in recent years, offering hope for the millions of sufferers of these conditions.

The Wisconsin Breast Cancer dataset is a widely used collection of data which contains information relating to the nuclei of cells extracted from breast cancer tumours. The data were collected by Dr. William H. Wolberg at the University of Wisconsin Hospitals in the United States, in collaboration with W. Nick Street and Olvi L. Mangasarian.

You are supplied with a modified version of this dataset with the filename MS416_Dataset.csv. You must use this version of the dataset and not one from elsewhere on the internet!

The dataset is composed of thirty-two different columns. The target variable, diagnosis, indicates whether the tumour is malignant ( ) or benign ( ). Another column, id, gives a unique identifier for each sample.

The remaining thirty columns are the features, each pertaining to a particular physical attribute of the cell nuclei visualised in the medical image from each observation. There are ten unique measurements, and for each of these the mean, standard error, and worst (the average of the three largest values in each observation) are given, resulting in the thirty feature columns. The ten measurements that can be found in the dataset are summarised below:

Measurement Name     Measurement Description
radius               The mean of distances from the centre of the cell nucleus to the edge
texture              The apparent texture of the cell nucleus, expressed as a standard deviation of greyscale values
perimeter            The length of the edge of the cell nucleus
area                 The area of the cell nucleus
smoothness           A measure of the local variation in the length of radii at different sections of the cell nucleus
compactness          Defined as (perimeter^2 / (area - 1))
concavity            A measure of the severity of concave portions of the contour of the cell nucleus
concave points       The number of concave portions
symmetry             A measure of the symmetry of the cell nucleus
fractal_dimension    Defined as the coastline approximation - 1
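The thirty feature columns follow from pairing each of the ten measurements with each of the three summary statistics. A minimal sketch of that enumeration is given here; the `<measurement>_<statistic>` naming pattern and the suffixes `mean`, `se` and `worst` are assumptions, not taken from the brief, so the headers in the supplied file should be checked before relying on them.

```python
from itertools import product

# The ten measurements listed in the table above.
MEASUREMENTS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Assumed suffixes for mean, standard error and worst (hypothetical naming).
STATISTICS = ["mean", "se", "worst"]


def expected_feature_columns():
    """Return the 30 assumed feature names, grouped by statistic."""
    return [f"{m}_{s}" for s, m in product(STATISTICS, MEASUREMENTS)]


columns = expected_feature_columns()
print(len(columns))   # 30 features; id and diagnosis bring the total to 32
print(columns[:3])
```

Comparing this list against the actual column headers is a quick way to spot renamed or missing features in the modified dataset.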

PUBLIC / CYHOEDDUS

For more information on the dataset and its features, go to: Breast Cancer Wisconsin (Diagnostic) - UCI Machine Learning Repository. Remember, however, that you must use the version of the dataset supplied on Blackboard.

It is hoped that one day, with enough high-quality data, machine learning will be able to identify individuals who display symptoms of disease as accurately as (or even more accurately than) clinicians, which may aid in the treatment and prognosis of life-threatening conditions like cancer.

However, until that data and biological understanding are available, it is imperative that healthcare professionals and machine learning scientists/engineers work together to unlock the incredible potential of AI in healthcare, whilst being mindful of the myriad ethical implications and potentially damaging limitations and side effects of using historical medical and genetic data.

Aims of the Coursework

To complete this coursework, you may use any of the techniques that we have learned in the first half of the module, as well as other concepts and ideas you find through independent study.

You are, of course, welcome to look for help and inspiration online, but refrain from copying code verbatim from other sources and always remember to reference your sources: failure to do so is plagiarism!

The coursework constitutes 50% of the mark for this module.

The four sections are summarised below:

1. Pre-process the dataset and perform an Exploratory Data Analysis (EDA) of the data. (35%)

This should include:

Splitting the dataset into a Training Set and a Test Set;

Taking care of any missing, duplicated or outlier values;

Transforming data, where appropriate to do so;

Encoding categorical features;

Performing feature engineering techniques such as feature extraction and selection;

Producing appropriate and informative plots and tables for an exploratory analysis;

Assessing statistical assumptions and inferences.
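Several of the steps above can be sketched together. The example below is one possible ordering, not a prescribed solution: it assumes pandas and scikit-learn, uses a small made-up table in place of the supplied CSV, and the feature names and label strings in it are placeholders. It covers splitting, duplicates, missing values, target encoding and scaling only; outliers, transformations and feature selection are left out.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(df, target="diagnosis", id_col="id", test_size=0.2, seed=42):
    """Split first, then fit imputation and scaling on the training set only."""
    # Drop exact duplicate rows, then the identifier, which carries no signal.
    df = df.drop_duplicates().drop(columns=[id_col])

    # Encode the target as integer codes (categories are sorted alphabetically).
    y = df[target].astype("category").cat.codes
    X = df.drop(columns=[target])

    # Stratify so both sets keep the same class balance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )

    # Impute missing values with training-set medians to avoid leakage.
    medians = X_train.median()
    X_train, X_test = X_train.fillna(medians), X_test.fillna(medians)

    # Standardise features using statistics from the training set only.
    scaler = StandardScaler().fit(X_train)
    X_train = pd.DataFrame(
        scaler.transform(X_train), columns=X.columns, index=X_train.index
    )
    X_test = pd.DataFrame(
        scaler.transform(X_test), columns=X.columns, index=X_test.index
    )
    return X_train, X_test, y_train, y_test


# Tiny synthetic stand-in for the supplied CSV (all values are made up).
demo = pd.DataFrame({
    "id": range(20),
    "diagnosis": ["malignant", "benign"] * 10,  # placeholder label strings
    "radius_mean": [float(i) for i in range(20)],
    "texture_mean": [None if i == 3 else float(i % 5) for i in range(20)],
})
X_train, X_test, y_train, y_test = preprocess(demo)
print(X_train.shape, X_test.shape)
```

Fitting the medians and the scaler on the training set alone keeps information from the Test Set out of the pre-processing, which is the main reason for splitting before anything else.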

2. Utilising features and attributes derived from the pre-processing and EDA stage, conduct an unsupervised machine learning analysis with the aim of gaining further insights into the data via either clustering or dimensionality reduction. (25%)

To do this, you may consider:

Clustering using different appropriate algorithms, e.g. k-means, hierarchical, DBSCAN;

Performing a dimensionality reduction to see if this improves your clustering, or to see if a smaller number of features can adequately explain the observations;

Evaluating the utility of the different
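As a rough illustration of comparing clustering before and after dimensionality reduction, the sketch below runs k-means on all features and on the first two principal components, then compares silhouette scores. It assumes scikit-learn and uses synthetic blobs rather than the coursework data; the choice of two clusters, two components and the silhouette score as the comparison metric are all illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training features (not the real data).
X, _ = make_blobs(n_samples=200, n_features=10, centers=2, random_state=0)
X = StandardScaler().fit_transform(X)


def kmeans_silhouette(data, k=2, seed=0):
    """Fit k-means and return the silhouette score of the resulting labels."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
    return silhouette_score(data, labels)


# Cluster on all features, then on the first two principal components.
pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)
score_full = kmeans_silhouette(X)
score_pca = kmeans_silhouette(X_reduced)

print(f"explained variance (2 PCs): {pca.explained_variance_ratio_.sum():.2f}")
print(f"silhouette, all features:   {score_full:.2f}")
print(f"silhouette, 2 PCs:          {score_pca:.2f}")
```

The same helper can be reused with other values of `k`, or swapped for hierarchical clustering or DBSCAN, to compare algorithms on equal terms.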
