Question

Posted on Feb 23, 2024

In this first task, you will create a deep learning model to classify images of skin lesions into one of seven classes:

1. "MEL" = Melanoma
2. "NV" = Melanocytic nevus
3. "BCC" = Basal cell carcinoma
4. "AKIEC" = Actinic keratosis
5. "BKL" = Benign keratosis
6. "DF" = Dermatofibroma
7. "VASC" = Vascular lesion

The data for this task is a subset of: https://challenge2018.isic-archive.com/task3/

The data for this task is inside the `/content/data/img` folder. It contains ~3,800 images named like `ISIC_000000.jpg` and the following label files:

* `/content/data/img/train.csv`
* `/content/data/img/val.csv`
* `/content/data/img/train_small.csv`
* `/content/data/img/val_small.csv`

The `small` versions are the first 200 lines of each partition and are included for debugging purposes. To save time, ensure your code runs on the `small` versions first.

## Task 1a. Explore the training set

**INSTRUCTIONS**: Check for data issues, as we have done in the labs. Check the class distribution and at least 1 other potential data issue. Hint: Look in `explore.py` for a function that can plot the class distribution.
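As a starting point for the "at least 1 other potential data issue" part, the sketch below runs a few generic checks on a label table: missing values, repeated image ids, and rows that are not exactly one-hot. It is not taken from the labs. It assumes the CSV has an id column named `image` followed by one column per entry in `IMG_CLASS_NAMES`; the toy rows and the helper name `basic_label_checks` are made up for illustration.

```python
import pandas as pd

IMG_CLASS_NAMES = ["MEL", "NV", "BCC", "AKIEC", "BKL", "DF", "VASC"]

# Toy stand-in for train.csv. The column layout ("image" + one-hot columns)
# is an assumption, and the ids are invented.
toy_df = pd.DataFrame(
    [
        ["ISIC_A", 1, 0, 0, 0, 0, 0, 0],
        ["ISIC_B", 0, 1, 0, 0, 0, 0, 0],
        ["ISIC_B", 0, 1, 0, 0, 0, 0, 0],  # repeated image id
        ["ISIC_C", 0, 0, 0, 0, 0, 0, 0],  # no class set
    ],
    columns=["image"] + IMG_CLASS_NAMES,
)


def basic_label_checks(df, id_col="image", class_cols=IMG_CLASS_NAMES):
    """Count a few common label-file problems (hypothetical helper)."""
    one_hot_sums = df[class_cols].sum(axis=1)
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "rows_not_one_hot": int((one_hot_sums != 1).sum()),
    }


report = basic_label_checks(toy_df)
print(report)
```

On the real data, the same function could be called on `train_df` and `val_df`, provided the column names match the assumption above.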

(Please answer!) **REPORT**: What did you check for? What data issues are present in this dataset?

Complete the following code based on the task above:

import pandas as pd

IMG_CLASS_NAMES = ["MEL", "NV", "BCC", "AKIEC", "BKL", "DF", "VASC"]

train_df = pd.read_csv('/content/data/img/train.csv')
val_df = pd.read_csv('/content/data/img/val.csv')
train_df.head()

from PIL import Image
# Change the filename to view other examples from the dataset
display(Image.open('/content/data/img/ISIC_0024306.jpg'))

import explore

# TODO - Check for data issues
# Hint: You can convert from one-hot to integers with argmax
# This way you can convert 1, 0, 0, 0, 0, 0, 0 to class 0
# 0, 1, 0, 0, 0, 0, 0 to class 1
# 0, 0, 1, 0, 0, 0, 0 to class 2
# so it should be something like the following:
# train_labels = train_df.values[....].argmax(....)
# val_labels = val_df.values[....].argmax(....)
# - you need to fill in the ... parts with the correct values.
# You should then print the contents of train_labels to check that
# they match the contents of train.csv
#
# Next you can plot the class distributions like the following:
# explore.plot_label_distribution(....)
# - do the above for both the train and val labels.
#
# After this, look for any other potential problems with the data.
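The one-hot to integer conversion described in the hint can be illustrated on a toy table. This is a sketch only: it selects the label columns by name, which assumes the CSV columns are called the same as the entries in `IMG_CLASS_NAMES`, and it counts classes with `np.bincount` rather than the `explore.py` plotting function, whose signature is not shown here.

```python
import numpy as np
import pandas as pd

IMG_CLASS_NAMES = ["MEL", "NV", "BCC", "AKIEC", "BKL", "DF", "VASC"]

# Toy one-hot labels for five images: NV, NV, NV, MEL, VASC (invented data).
one_hot = np.eye(len(IMG_CLASS_NAMES), dtype=int)[[1, 1, 1, 0, 6]]
toy_df = pd.DataFrame(one_hot, columns=IMG_CLASS_NAMES)
toy_df.insert(0, "image", [f"ISIC_TOY_{i}" for i in range(len(toy_df))])

# argmax along each row turns a one-hot vector into its class index.
labels = toy_df[IMG_CLASS_NAMES].to_numpy().argmax(axis=1)

# Count how many samples fall into each class, including empty classes.
counts = np.bincount(labels, minlength=len(IMG_CLASS_NAMES))
for name, count in zip(IMG_CLASS_NAMES, counts):
    print(f"{name:>5}: {count} ({count / len(labels):.0%})")
```

Printing the counts next to a plot makes it easier to quote exact proportions in the report.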

## Task 1b. Implement Training loop

**INSTRUCTIONS**:

* Implement LesionDataset in `datasets.py`. Use the cell below to test your implementation.
* Implement the incomplete functions in `train.py` marked as "Task 1b"
* Go to the [Model Training Cell](#task-1-model-training) at the end of Task 1 and fill in the required code for "Task 1b".

(Please answer!) **REPORT**: Why should you *not use* `random_split` in your code here?

Complete the following code based on the task above:

import datasets

ds = datasets.LesionDataset('/content/data/img',
                            '/content/data/img/train.csv')
input, label = ds[0]
print(input)
print(label)
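For orientation, here is one possible shape for such a dataset class. It is a sketch under several assumptions, not the reference implementation: the class name `LesionDatasetSketch` is hypothetical, the CSV is assumed to have an `image` id column plus one-hot class columns, each image is assumed to live at `<img_dir>/<image id>.jpg`, and the label is returned as an integer class index. The check at the end generates its own tiny image so the block runs without the real data.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

IMG_CLASS_NAMES = ["MEL", "NV", "BCC", "AKIEC", "BKL", "DF", "VASC"]


class LesionDatasetSketch(Dataset):
    """Hypothetical stand-in for LesionDataset (layout assumptions above)."""

    def __init__(self, img_dir, labels_fname):
        self.img_dir = img_dir
        df = pd.read_csv(labels_fname)
        self.image_ids = df["image"].tolist()  # assumed id column name
        # One-hot columns -> integer class indices.
        self.labels = df[IMG_CLASS_NAMES].to_numpy().argmax(axis=1)

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        path = os.path.join(self.img_dir, self.image_ids[idx] + ".jpg")
        with Image.open(path) as img:
            pixels = np.asarray(img.convert("RGB"), dtype=np.float32) / 255.0
        image = torch.from_numpy(pixels).permute(2, 0, 1)  # HWC -> CHW
        label = torch.tensor(int(self.labels[idx]), dtype=torch.long)
        return image, label


# Self-contained check: one generated 8x6 image and a one-row label file.
with tempfile.TemporaryDirectory() as tmp:
    Image.new("RGB", (8, 6), color=(255, 0, 0)).save(os.path.join(tmp, "ISIC_TOY.jpg"))
    csv_path = os.path.join(tmp, "toy.csv")
    pd.DataFrame(
        [["ISIC_TOY", 0, 1, 0, 0, 0, 0, 0]], columns=["image"] + IMG_CLASS_NAMES
    ).to_csv(csv_path, index=False)
    ds = LesionDatasetSketch(tmp, csv_path)
    image, label = ds[0]
    print(image.shape, label)
```

Returning an integer index rather than the one-hot vector is a design choice that suits `nn.CrossEntropyLoss`, which accepts class indices as targets.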

## Task 1c. Implement a baseline convolutional neural network

You will implement a baseline convolutional neural network which you can compare results to. This allows you to evaluate any improvements made by hyperparameter tuning or transfer learning.

**INSTRUCTIONS**:

* Implement a `SimpleBNConv` in `models.py` with:
  * 5 `nn.Conv2d` layers, with 8, 16, 32, 64, 128 output channels respectively, with the following between each convolution layer:
    * `nn.ReLU()` for the activation function, and
    * `nn.BatchNorm2d`, and
    * finally a `nn.MaxPool2d` to downsample by a factor of 2.
* Use a normalised confusion matrix on the model's validation predictions in `train.py`.
* Go to the [Model Training Cell](#task-1-model-training) at the end of Task 1 and fill in the required code to train the model.

Training should take about 1 minute/epoch. Validation accuracy should be 60-70%, but UAR should be around 20-40%.
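The layer list above could be laid out roughly as follows. This is a sketch, not the expected solution: the instructions do not give the kernel size, padding, or classifier head, so the 3x3 kernels with padding 1, the ReLU/BatchNorm/MaxPool group after the fifth convolution, and the global-average-pool plus linear head are all assumptions. The class name `SimpleBNConvSketch` is hypothetical.

```python
import torch
import torch.nn as nn


class SimpleBNConvSketch(nn.Module):
    """Five conv blocks (8..128 channels) plus an assumed classifier head."""

    def __init__(self, num_classes=7):
        super().__init__()
        channels = [3, 8, 16, 32, 64, 128]
        layers = []
        for in_ch, out_ch in zip(channels[:-1], channels[1:]):
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # assumed 3x3
                nn.ReLU(),
                nn.BatchNorm2d(out_ch),
                nn.MaxPool2d(2),  # halves height and width
            ]
        self.features = nn.Sequential(*layers)
        # Assumed head: pool to 1x1 so any input size works, then classify.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels[-1], num_classes),
        )

    def forward(self, x):
        return self.head(self.features(x))


model = SimpleBNConvSketch()
logits = model(torch.randn(2, 3, 64, 64))  # batch of two random 64x64 RGB images
print(logits.shape)
```

The output has one logit per class, which is the form `nn.CrossEntropyLoss` expects.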

(Please answer!)

**REPORT**: As training sets get larger, the length of time per epoch also gets larger. Some datasets take over an hour per epoch. This makes it impractical to debug typos in your code since it can take hours after starting for the program to reach new code. Name two ways to significantly reduce how long **each** epoch takes - for debugging purposes - while still using real data and using the real training code.

**REPORT**: Show the confusion matrix and plots of the validation accuracy and UAR in your report, and explain what is going wrong.

Complete the following code based on the task above:
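To make the two metrics concrete: UAR (unweighted average recall) is the mean of the per-class recalls, which are the diagonal of a row-normalised confusion matrix. The NumPy sketch below computes both metrics on invented labels for a three-class problem; the helper names are hypothetical and this is not the code used in `train.py`. In this sketch a class with no true samples contributes a recall of 0, and libraries may treat that case differently.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


def normalise_rows(cm):
    """Divide each row by its total; rows with no samples stay at zero."""
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(
        cm, row_sums, out=np.zeros(cm.shape, dtype=float), where=row_sums > 0
    )


# Invented example: 8 samples of class 0, one each of classes 1 and 2,
# and a model that always predicts class 0.
y_true = np.array([0] * 8 + [1] + [2])
y_pred = np.zeros(10, dtype=int)

cm = confusion_matrix(y_true, y_pred, num_classes=3)
cm_norm = normalise_rows(cm)
accuracy = np.trace(cm) / cm.sum()
uar = np.mean(np.diag(cm_norm))
print(f"accuracy={accuracy:.2f}, UAR={uar:.2f}")
```

In this toy case the accuracy is high while the UAR is low, because only one of the three classes is ever predicted correctly.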

## Task 1d. Account for data issues

**MARKS**: 12 (Code 8, Reports 4)

**INSTRUCTIONS**: Account for the data issues in Task 1a and retrain your model.

(Please answer!) **REPORT**: How did you account for the data issues? Was it effective? How can you tell? Show another confusion matrix.

**IMPORTANT NOTE**: One of the techniques from the lab will cause a warning in the metric calculation on `train_small.csv`, but will work fine on `train.csv`.
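One common way to account for class imbalance is to oversample rare classes with `WeightedRandomSampler`, which the training cell already imports. Whether this is the technique the lab intends is not stated here, so treat the following as a sketch on invented labels: each sample is weighted by the inverse of its class frequency, so that every class has the same total probability of being drawn.

```python
import numpy as np
import torch
from torch.utils.data.sampler import WeightedRandomSampler

torch.manual_seed(0)

# Invented, heavily imbalanced labels: 90 samples of class 0, 10 of class 1.
labels = np.array([0] * 90 + [1] * 10)

# Weight each sample by 1 / (number of samples in its class).
class_counts = np.bincount(labels, minlength=2)
sample_weights = 1.0 / class_counts[labels]

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(labels),
    replacement=True,  # rare samples must be drawn more than once
)

# Draw one epoch's worth of indices and look at the resulting class mix.
drawn = labels[list(sampler)]
drawn_counts = np.bincount(drawn, minlength=2)
print(drawn_counts)
```

The sampler would be passed to `DataLoader(..., sampler=sampler)` in place of `shuffle=True`. A class-weighted loss via the `weight` argument of `nn.CrossEntropyLoss` is another option to consider.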

Complete the following code based on the task above:

## Model Training Cell

Based on what you have implemented in the above sections, you can try to complete the whole training process here.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.utils.data.sampler import WeightedRandomSampler

import datasets
import models
import train

torch.manual_seed(42)

NUM_EPOCHS = 5
BATCH_SIZE = 64

# Create datasets/loaders
# TODO Task 1b - Create the data loaders from LesionDatasets
# TODO Task 1d - Account for data issues, if applicable
# train_dataset = ...
# val_dataset = ...
# train_loader = ...
# val_loader = ...

# Instantiate model, optimizer and criterion
# TODO Task 1c - Make an instance of your model
# TODO Task 1d - Account for data issues, if applicable
# model = ...

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Train model
# TODO Task 1c: Set ident_str to a string that identifies this particular
# training run. Note this line in the training code
# exp_name = f"{model.__class__.__name__}_{ident_str}"
# This means the model class name is already included in the
# exp_name string. You can consider adding other information
# particular to this training run, e.g. learning rate (lr) used,
# augmentation (aug) used or not, etc.

train.train_model(model, train_loader, val_loader, optimizer, criterion,
                  IMG_CLASS_NAMES, NUM_EPOCHS, project_name="CSE3001 Assignment Task 1",
                  ident_str="fill me in here")
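As suggested in the comments above, `ident_str` can encode the settings of a run so that experiments are easy to tell apart. A minimal sketch follows; the naming scheme and the `USE_SAMPLER` flag are hypothetical, not part of the assignment code.

```python
LEARNING_RATE = 0.001
BATCH_SIZE = 64
NUM_EPOCHS = 5
USE_SAMPLER = False  # hypothetical flag: whether a weighted sampler is used

# Build a short tag from the hyperparameters of this run.
sampler_tag = "sampler" if USE_SAMPLER else "nosampler"
ident_str = f"lr{LEARNING_RATE:g}_bs{BATCH_SIZE}_ep{NUM_EPOCHS}_{sampler_tag}"
print(ident_str)  # lr0.001_bs64_ep5_nosampler
```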