Question
Advanced machine learning models are beginning to revolutionise the medical sciences, where they are
finding use in the detection and diagnosis of disease. The use of algorithms to diagnose and inform on
conditions including dementia, diabetes and various forms of cancer has shown considerable promise in
recent years, offering hope for the millions of sufferers of these conditions.
The Wisconsin Breast Cancer dataset is a widely used collection of data which contains information relating
to the nuclei of cells extracted from breast cancer tumours. The data were collected by Dr William H
Wolberg at the University of Wisconsin Hospitals in the United States, in collaboration with W Nick Street
and Olvi L Mangasarian.
You are supplied with a modified version of this dataset with the filename MSSDataset.csv. You
must use this version of the dataset and not one from elsewhere on the internet!
The dataset is composed of thirty-two columns. The target variable, diagnosis, indicates whether
the tumour is malignant (M) or benign (B). Another column, id, gives a unique identifier for each sample.
The remaining thirty columns are the features, each pertaining to a particular physical attribute of the cell
nuclei visualised in the medical image from each observation. There are ten unique measurements, and for
each of these the mean, the standard error, and the worst value (the average of the three largest values in
each observation) are given, resulting in the thirty feature columns. The ten measurements that can be found in
the dataset are summarised below:
Measurement Name     Measurement Description
radius               The mean of distances from the centre of the cell nucleus to the edge
texture              The apparent texture of the cell nucleus, expressed as a standard deviation of greyscale values
perimeter            The length of the edge of the cell nucleus
area                 The area of the cell nucleus
smoothness           A measure of the local variation in the length of radii at different sections of the cell nucleus
compactness          Defined as perimeter^2 / area - 1.0
concavity            A measure of the severity of concave portions of the contour of the cell nucleus
concave points       The number of concave portions
symmetry             A measure of the symmetry of the cell nucleus
fractal_dimension    Defined as the "coastline approximation" - 1
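As a quick sanity check on the structure described above, the thirty feature names can be generated from the ten measurements and three statistics and compared with the file header. This is a minimal sketch: the "<measurement>_<statistic>" naming scheme (for example "radius_mean", with "se" for the standard error) is an assumption borrowed from the public version of the dataset, and the helper names are hypothetical, so check them against the header of the supplied file.

```python
from itertools import product

# Assumed naming scheme: "<measurement>_<statistic>", e.g. "radius_mean".
# Verify this against the header of the supplied CSV before relying on it.
MEASUREMENTS = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
STATISTICS = ["mean", "se", "worst"]


def expected_feature_columns():
    """Return the 30 feature names: every measurement paired with every statistic."""
    return [f"{m}_{s}" for s, m in product(STATISTICS, MEASUREMENTS)]


def check_header(header):
    """Return (missing, unexpected) column names for a CSV header row."""
    expected = {"id", "diagnosis", *expected_feature_columns()}
    found = set(header)
    return sorted(expected - found), sorted(found - expected)


# Example usage with the supplied file (not executed here):
#   import csv
#   with open("MSSDataset.csv", newline="") as fh:
#       missing, unexpected = check_header(next(csv.reader(fh)))
```

Any names reported as missing or unexpected point either to a different naming scheme or to changes made in the modified dataset, both of which are worth noting in the EDA.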
PUBLIC CYHOEDDUS
For more information on the dataset and its features, go to: Breast Cancer Wisconsin (Diagnostic), UCI
Machine Learning Repository. Remember, however, that you must use the version of the dataset supplied
on Blackboard.
It is hoped that one day, with enough high-quality data, machine learning will be able to identify
individuals who display symptoms of disease as accurately as, or even more accurately than,
clinicians, which may aid in the treatment and prognosis of life-threatening conditions like cancer.
However, until that data and biological understanding are available, it is imperative that healthcare
professionals and machine learning scientists/engineers work together to unlock the incredible potential of
AI in healthcare, whilst being mindful of the myriad ethical implications and potentially damaging
limitations and side effects of using historical medical and genetic data.
Aims of the Coursework
To solve this coursework, you may wish to use any of the techniques that we have learned in the first half
of the module, as well as using other concepts and ideas you find through independent study.
You are, of course, welcome to look for help and inspiration online, but refrain from copying code verbatim
from other sources, and always remember to reference your sources; failure to do so is plagiarism!
The marks of the coursework constitute of the mark of this module:
The four sections are summarised below:
Preprocess the dataset and perform an Exploratory Data Analysis (EDA) of the data.
This should include:
- Splitting the dataset into a Training Set and a Test Set;
- Taking care of any missing, duplicated or outlier values;
- Transforming data, where appropriate to do so;
- Encoding categorical features;
- Performing feature engineering techniques such as feature extraction and selection;
- Producing appropriate and informative plots and tables for an exploratory analysis;
- Assessing statistical assumptions and inferences.
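Several of the preprocessing steps listed above can be sketched as a single function. This is one possible ordering under stated assumptions, not a prescribed solution: it assumes pandas and scikit-learn are available, that the columns are named id and diagnosis as described in the brief, and that all remaining columns are numeric. The function name preprocess, the 80/20 split, median imputation and standardisation are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(df, test_size=0.2, seed=42):
    """Clean the raw table and return scaled train/test features and labels."""
    # Drop exact duplicate rows, then the identifier, which carries no signal.
    df = df.drop_duplicates().drop(columns=["id"])
    # Drop rows with no diagnosis; the target should not be imputed.
    df = df.dropna(subset=["diagnosis"])

    # Encode the target: malignant (M) -> 1, benign (B) -> 0.
    y = df["diagnosis"].map({"M": 1, "B": 0})
    X = df.drop(columns=["diagnosis"])

    # Split before fitting anything, stratifying to keep the class balance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )

    # Impute missing values with medians learned from the training set only.
    medians = X_train.median()
    X_train, X_test = X_train.fillna(medians), X_test.fillna(medians)

    # Standardise with training-set statistics to avoid leaking test data.
    scaler = StandardScaler().fit(X_train)
    X_train = pd.DataFrame(
        scaler.transform(X_train), columns=X.columns, index=X_train.index
    )
    X_test = pd.DataFrame(
        scaler.transform(X_test), columns=X.columns, index=X_test.index
    )
    return X_train, X_test, y_train, y_test
```

With the supplied file this might be called as preprocess(pd.read_csv("MSSDataset.csv")). Outlier handling, feature selection and plotting are deliberately left out, since the right choices depend on what the exploratory analysis shows.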
Utilising features and attributes derived from the preprocessing and EDA stage, conduct an
unsupervised machine learning analysis with the aim of gaining further insights into the data via
either clustering or dimensionality reduction.
To do this, you may consider:
- Clustering using different appropriate algorithms, e.g. K-means, hierarchical, DBSCAN;
- Performing a dimensionality reduction to see if this improves your clustering, or to see if a smaller
  number of features can adequately explain the observations;
- Evaluating the utility of the different
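One way to compare clustering with and without dimensionality reduction, as suggested in the list above, is sketched here. It is an illustrative example rather than the expected answer: it assumes scikit-learn is available and that X is an already standardised feature matrix, and it uses K-means, PCA and the silhouette score as stand-ins for whichever algorithms and metrics are finally chosen. The function name and the 95% variance threshold are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score


def cluster_with_and_without_pca(X, n_clusters=2, variance=0.95, seed=42):
    """Fit K-means on X and on its PCA projection; return both silhouette scores."""
    # A fractional n_components keeps the fewest components whose cumulative
    # explained variance reaches `variance`; this requires the "full" solver.
    pca = PCA(n_components=variance, svd_solver="full").fit(X)
    X_reduced = pca.transform(X)

    scores = {}
    for name, data in [("original", X), ("pca", X_reduced)]:
        labels = KMeans(
            n_clusters=n_clusters, n_init=10, random_state=seed
        ).fit_predict(data)
        # Silhouette lies in [-1, 1]; higher means tighter, better-separated clusters.
        scores[name] = silhouette_score(data, labels)
    return scores, pca.n_components_
```

The two silhouette scores are computed in different feature spaces, so they are best read as a rough indication rather than a strict like-for-like comparison. The same loop could be extended to hierarchical clustering or DBSCAN by swapping the estimator.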