Question

Advanced machine learning models are beginning to revolutionise the medical sciences, where they are
finding use in the detection and diagnosis of disease. The use of algorithms to diagnose and inform on
conditions including dementia, diabetes and various forms of cancer has shown considerable promise in
recent years, offering hope for the millions of sufferers of these conditions.
The Wisconsin Breast Cancer dataset is a widely used collection of data which contains information relating
to the nuclei of cells extracted from breast cancer tumours. The data were collected by Dr. William H.
Wolberg at the University of Wisconsin Hospitals in the United States, in collaboration with W. Nick Street
and Olvi L. Mangasarian.
You are supplied with a modified version of this dataset with the filename MS4S16_Dataset.csv. You
must use this version of the dataset and not one from elsewhere on the internet!
The dataset is composed of thirty-two different columns. The target variable, diagnosis, indicates whether
the tumour is malignant (M) or benign (B). Another column, id, gives a unique identifier for each sample.
The remaining thirty columns are the features, each pertaining to a particular physical attribute of the cell
nuclei visualised in the medical image from each observation. There are ten unique measurements, and for
each of these the mean, standard error, and worst (the average of the three largest values in each
observation) are given, resulting in the thirty feature columns. The ten measurements that can be found in
the dataset are summarised below:
Measurement Name     Measurement Description
radius               The mean of distances from the centre of the cell nucleus to the edge
texture              The apparent texture of the cell nucleus, expressed as a standard deviation of greyscale values
perimeter            The length of the edge of the cell nucleus
area                 The area of the cell nucleus
smoothness           A measure of the local variation in the length of radii at different sections of the cell nucleus
compactness          Defined as (perimeter^2 / area) - 1
concavity            A measure of the severity of concave portions of the contour of the cell nucleus
concave points       The number of concave portions
symmetry             A measure of the symmetry of the cell nucleus
fractal_dimension    Defined as the "coastline approximation" - 1
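The structure described above (ten measurements, each summarised three ways, plus id and diagnosis) can be sketched in a few lines of Python. This is only an illustration: the column-name scheme (measurement_suffix with the suffixes mean, se and worst) is an assumption and should be checked against the header row of the supplied CSV, the standard error is assumed to be the sample standard deviation divided by the square root of n, and the compactness formula assumes the UCI repository's definition, perimeter^2 / area - 1.

```python
import math
import statistics

MEASUREMENTS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
# Hypothetical suffixes; the supplied file may name its columns differently.
SUFFIXES = ["mean", "se", "worst"]


def feature_columns():
    """Return the 30 feature names under the assumed naming scheme."""
    return [f"{m}_{s}" for s in SUFFIXES for m in MEASUREMENTS]


def summarise(values):
    """Collapse per-nucleus values into (mean, standard error, worst)."""
    mean = statistics.fmean(values)
    # Assumption: standard error = sample standard deviation / sqrt(n).
    se = statistics.stdev(values) / math.sqrt(len(values))
    # "Worst" is the average of the three largest values.
    worst = statistics.fmean(sorted(values)[-3:])
    return mean, se, worst


def compactness(perimeter, area):
    """Compute perimeter^2 / area - 1 (assumed UCI definition)."""
    return perimeter ** 2 / area - 1.0


columns = ["id", "diagnosis"] + feature_columns()
print(len(columns))  # 32 columns in total

mean, se, worst = summarise([1.0, 2.0, 3.0, 4.0, 5.0])
print(mean, worst)  # 3.0 4.0

# A unit circle has perimeter 2*pi and area pi, so this equals 4*pi - 1.
print(compactness(2 * math.pi, math.pi))
```

Under this definition a circle gives the smallest possible compactness, so larger values indicate a more irregular nucleus outline.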
PUBLIC / CYHOEDDUS
For more information on the dataset and its features, go to: Breast Cancer Wisconsin (Diagnostic) - UCI
Machine Learning Repository. Remember, however, that you must use the version of the dataset supplied
on Blackboard.
It is hoped that one day, with enough high-quality data, machine learning will be able to successfully
identify individuals that display symptoms of disease as accurately as (or even more accurately than)
clinicians, which may aid in the treatment and prognosis of life-threatening conditions like cancer.
However, until that data and biological understanding are available, it is imperative that healthcare
professionals and machine learning scientists/engineers work together to unlock the incredible potential of
AI in healthcare, whilst being mindful of the myriad ethical implications and potentially damaging
limitations and side effects of using historical medical and genetic data.
Aims of the Coursework
To solve this coursework, you may wish to use any of the techniques that we have learned in the first half
of the module, as well as using other concepts and ideas you find through independent study.
You are, of course, welcome to look for help and inspiration online, but refrain from copying code verbatim
from other sources and always remember to reference your sources; failure to do so is plagiarism!
The marks for this coursework constitute 50% of the mark for this module.
The four sections are summarised below:
1. Pre-process the dataset and perform an Exploratory Data Analysis (EDA) of the data. (35%)
This should include:
- Splitting the dataset into a Training Set and a Test Set;
- Taking care of any missing, duplicated or outlier values;
- Transforming data, where appropriate to do so;
- Encoding categorical features;
- Performing feature engineering techniques such as feature extraction and selection;
- Producing appropriate and informative plots and tables for an exploratory analysis;
- Assessing statistical assumptions and inferences.
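As a rough illustration of how several of these pre-processing steps fit together, the sketch below runs them on a small synthetic table rather than the real file. The column names radius_mean and texture_mean are hypothetical, the median imputation, 1.5 x IQR outlier rule and standard scaling are just one possible set of choices, and the real data would be loaded with pd.read_csv("MS4S16_Dataset.csv") instead. It assumes pandas and scikit-learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the coursework data (hypothetical column names).
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "id": np.arange(n),
    "diagnosis": np.where(np.arange(n) % 3 == 0, "M", "B"),
    "radius_mean": rng.normal(14.0, 3.0, n),
    "texture_mean": rng.normal(19.0, 4.0, n),
})
df.loc[5, "radius_mean"] = np.nan                       # inject a missing value
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)   # inject a duplicate row

# Remove exact duplicate rows, then encode the target (B -> 0, M -> 1).
df = df.drop_duplicates().reset_index(drop=True)
y = df["diagnosis"].map({"B": 0, "M": 1})
X = df.drop(columns=["id", "diagnosis"])

# Split first, so the test set cannot influence imputation or scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Flag (not drop) training rows outside 1.5 * IQR in any feature.
q1, q3 = X_train.quantile(0.25), X_train.quantile(0.75)
iqr = q3 - q1
outlier_mask = ((X_train < q1 - 1.5 * iqr) | (X_train > q3 + 1.5 * iqr)).any(axis=1)

# Fit the imputer and scaler on the training set only, then reuse them.
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_train_prep = prep.fit_transform(X_train)
X_test_prep = prep.transform(X_test)

print(X_train_prep.shape, X_test_prep.shape, int(outlier_mask.sum()))
```

Fitting the pipeline on the training set alone and only calling transform on the test set keeps information from the test set out of the fitted statistics.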
2. Utilising features and attributes derived from the pre-processing and EDA stage, conduct an
unsupervised machine learning analysis with the aim of gaining further insights into the data via
either clustering or dimensionality reduction. (25%)
To do this, you may consider:
- Clustering using different appropriate algorithms, e.g. K-means, hierarchical, DBSCAN;
- Performing a dimensionality reduction to see if this improves your clustering, or to see if a smaller
  number of features can adequately explain the observations;
- Evaluating the utility of the different
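One way the clustering and dimensionality-reduction ideas in this section might be combined is sketched below. It is a minimal example on synthetic data (two well-separated Gaussian groups in six dimensions), not on the coursework dataset; the choice of PCA with two components, K-means with two clusters, and the silhouette and adjusted Rand scores is an assumption made for illustration. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two groups of 50 points in 6 dimensions.
rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(50, 6))
group_b = rng.normal(5.0, 1.0, size=(50, 6))
X = np.vstack([group_a, group_b])
true_groups = np.array([0] * 50 + [1] * 50)  # used only for evaluation

# Standardise so that no feature dominates through its scale.
X_scaled = StandardScaler().fit_transform(X)

# Reduce to two principal components and record the variance retained.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

# Cluster in the reduced space.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)

# Internal quality (silhouette) and agreement with the known groups (ARI).
sil = silhouette_score(X_reduced, labels)
ari = adjusted_rand_score(true_groups, labels)

print(f"variance explained by 2 components: {explained:.2f}")
print(f"silhouette: {sil:.2f}, adjusted Rand index: {ari:.2f}")
```

The same pattern could be repeated with sklearn.cluster.AgglomerativeClustering or sklearn.cluster.DBSCAN in place of KMeans, comparing silhouette scores with and without the PCA step. Known labels such as diagnosis would only be used afterwards, to judge how well the clusters line up with them.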
