---

title: "STAT 2450 Assignment 7 (32 points)"

author: "Your name here"

date: 'Banner:B00??????'

output:
  html_document: default
  pdf_document: default
  word_document: default

---

# Problem : Surviving the Titanic (32 points)

Load the libraries

```{r}

library("rpart")

library("tree")

library(ggplot2)

library(randomForest)

```

Load the data

```{r}

mytrain = read.csv("https://mathstat.dal.ca/~fullsack/DATA/titanictrain.csv")

mytest = read.csv("https://mathstat.dal.ca/~fullsack/DATA/titanictest.csv")

mytitanic = rbind(mytest,mytrain)

nrec=nrow(mytitanic)

```

You will use the column 'Survived' as the outcome in your models. It should be treated as a factor.

All other columns are admissible as predictors of this outcome.

HINT-1: You can use the following template to split the data into folds, e.g. for cross-validation.

# Randomly shuffle your data

yourData<-yourData[sample(nrow(yourData)),]

# Create 10 pre-folds of equal size

myfolds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)

# use these pre-folds for cross-validation

for(i in 1:10){ # loop over each of 10 folds

# recover the indexes of fold i and define the indexes of the test set

testIndexes <- which(myfolds==i,arr.ind=TRUE)

# define your test set for this fold

testData <- yourData[testIndexes, ]

# define your training set for this fold as the complement

trainData <- yourData[-testIndexes, ]

#....

}

HINT-2: Use the following template to split data into a train and a test set of roughly the same size

set.seed(44182) # or use the recommended seed

trainindex=sample(1:nrec,nrec/2,replace=F)

mytrain=mydata[trainindex,] # training set

mytest=mydata[-trainindex,] # testing set = complementary subset of mydata

1. Define 5 pre-folds of equal size for 'mytitanic' in a variable called 'myfolds'.

(2 points)

```{r}

set.seed(2255)

# shuffle

#

# Create 5 folds of equal size

# myfolds ...

```
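A minimal sketch of one possible answer, following the HINT-1 template (it overwrites 'mytitanic' with its shuffled copy; 'myfolds' is the name requested above):

```{r, eval=FALSE}
set.seed(2255)
# shuffle the rows of the combined data set
mytitanic <- mytitanic[sample(nrow(mytitanic)), ]
# create 5 pre-folds of equal size
myfolds <- cut(seq(1, nrow(mytitanic)), breaks = 5, labels = FALSE)
```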

2. Use pre-fold number 3 to define a testing and a training set named 'mytest' and 'mytrain'.

(2 points)

```{r}

i=3 # fold number to use

```
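One possible completion, reusing the fold indices as in HINT-1 (it assumes 'myfolds' was built in question 1):

```{r, eval=FALSE}
i <- 3                                 # fold number to use
testIndexes <- which(myfolds == i)     # rows belonging to fold 3
mytest  <- mytitanic[testIndexes, ]    # testing set = fold 3
mytrain <- mytitanic[-testIndexes, ]   # training set = the other four folds
```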

3. Fit a Random Forest model to the 'mytrain' dataset. Use the column 'Survived' as a factor outcome. Require 'importance' to be TRUE

and set the random seed to 523. (This is the 'trained model'). (2 points)

```{r}

# Fitting Random Forest Classification to the training set 'mytrain'

```
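A hedged sketch with the 'randomForest' package; it assumes the remaining columns of 'mytrain' are already usable as predictors and stores the fit in an object named 'rf_model' (an illustrative name reused in later sketches):

```{r, eval=FALSE}
set.seed(523)
mytrain$Survived <- as.factor(mytrain$Survived)   # treat the outcome as a factor
rf_model <- randomForest(Survived ~ ., data = mytrain, importance = TRUE)
rf_model
```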

4. Plot the trained model results.

Has the OOB error rate roughly equilibrated with 50 trees?

Has the OOB error rate roughly equilibrated with 500 trees?

What is the stationary value of the OOB error rate?

Which of death or survival has the smallest prediction error?

(4 points)

```{r}

#

```
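A sketch of the diagnostics, assuming the fitted object from question 3 is called 'rf_model'. Calling plot() on a randomForest object draws the OOB error (black) and the per-class errors against the number of trees, and the last rows of 'err.rate' show the values the error settles at:

```{r, eval=FALSE}
plot(rf_model, main = "OOB error vs. number of trees")
legend("topright", colnames(rf_model$err.rate),
       col = 1:ncol(rf_model$err.rate), lty = 1:ncol(rf_model$err.rate))
tail(rf_model$err.rate)   # error rates once the forest has grown
```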

5. Calculate the predictions on 'mytest', the misclassification error and the prediction accuracy. (2 points)

```{r}

# Predicting survival on mytest

```
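One way to do this, sketched under the assumption that the model and split from the previous questions are still in memory:

```{r, eval=FALSE}
rf_pred <- predict(rf_model, newdata = mytest)
conf <- table(observed = mytest$Survived, predicted = rf_pred)  # confusion matrix
conf
misclass_error <- 1 - sum(diag(conf)) / sum(conf)   # misclassification error
accuracy <- 1 - misclass_error                      # prediction accuracy
c(error = misclass_error, accuracy = accuracy)
```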

6. Print and plot the importance of predictors in the trained model. (2 points)

```{r}

```

```{r}

```
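A minimal sketch using the two standard 'randomForest' helpers (again assuming the fit is named 'rf_model'):

```{r, eval=FALSE}
importance(rf_model)    # numeric importance measures (per-class columns plus MeanDecreaseAccuracy/Gini)
varImpPlot(rf_model)    # dot plot of variable importance
```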

Now you are going to have a more direct look at predictors for the records in 'mytest'.

Tabulate the chances of survival by the column 'Title'. What do you conclude? (2 points)

Which other predictor would have given you the same information? (1 point)

Are the predictors independent? (1 point)

```{r}

```

```{r}

```
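An illustrative sketch of the tabulations; 'Sex' appears here only as an example of another predictor one might compare against, since it is a standard column of the Titanic data:

```{r, eval=FALSE}
# survival counts and proportions by title
table(mytest$Title, mytest$Survived)
prop.table(table(mytest$Title, mytest$Survived), margin = 1)
# the same kind of tabulation with another predictor, e.g. 'Sex'
table(mytest$Sex, mytest$Survived)
# how the two predictors relate to each other
table(mytest$Title, mytest$Sex)
```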

What is the median fare of passengers? (1 point)

Hint: use the column 'Fare'

```{r}

```
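A one-line sketch; 'na.rm = TRUE' is only a precaution in case some fares are missing:

```{r, eval=FALSE}
median(mytest$Fare, na.rm = TRUE)
```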

Tabulate the survival according to the binary variable mytest$Fare < 15. (2 points)

```{r}

table(mytest$Fare < 15,mytest$Survived)

```

```{r}

rm(mytrain,mytest)

#mytrain

```

7. Complete the code of the following function, which returns a vector of classification accuracies for $nrep$ random splits

into a training and a testing set, each containing half the records of the dataset 'mytitanic'. (4 points)

```{r}

dotitan <- function(nrep,ntree,mtry){

set.seed(495)

acc = NULL

for(i in 1:nrep){

rm(mytrain,mytest)

nrec=nrow(mytitanic)

# define a train-test split as recommended in the hints

# Fit a Random Forest Classification to the training set, using ntree trees and mtry predictors

# Predict the response on the testing set

# tabulate the prediction results (confusion matrix)

# compute the misclassification error

# compute the classification accuracy

}

# return the classification accuracy

}

```
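One possible completion of the skeleton, sketched under the assumptions that the split follows HINT-2 and that 'Survived' is converted to a factor before each fit:

```{r, eval=FALSE}
dotitan <- function(nrep, ntree, mtry){
  set.seed(495)
  acc <- NULL
  for(i in 1:nrep){
    nrec <- nrow(mytitanic)
    # train-test split of size nrec/2, as recommended in the hints
    trainindex <- sample(1:nrec, nrec/2, replace = FALSE)
    mytrain <- mytitanic[trainindex, ]
    mytest  <- mytitanic[-trainindex, ]
    mytrain$Survived <- as.factor(mytrain$Survived)
    # random forest with ntree trees and mtry candidate predictors per split
    fit  <- randomForest(Survived ~ ., data = mytrain, ntree = ntree, mtry = mtry)
    pred <- predict(fit, newdata = mytest)
    # confusion matrix, misclassification error and classification accuracy
    conf <- table(mytest$Survived, pred)
    err  <- 1 - sum(diag(conf)) / sum(conf)
    acc  <- c(acc, 1 - err)
  }
  acc   # return the vector of classification accuracies
}
```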

Run the function with 100 replicates, 500 trees per fit and 4 variables. (1 point)

Compute the mean accuracy and plot the histogram. Is the prediction performance of random forest highly variable? (2 points)

```{r}

```
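A sketch of the call and the summary, assuming 'dotitan' has been completed as above ('acc100' is an illustrative name):

```{r, eval=FALSE}
acc100 <- dotitan(nrep = 100, ntree = 500, mtry = 4)
mean(acc100)   # mean classification accuracy over the 100 splits
hist(acc100, main = "Accuracy over 100 random splits", xlab = "accuracy")
```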

8. Once again, define a train-test split ('mytrain' and 'mytest') of 'mytitanic' of size ntrain=nrec/2, as recommended in the hints.

Use 332 as the random seed.

```{r}

```
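A minimal sketch following the HINT-2 template with the requested seed:

```{r, eval=FALSE}
set.seed(332)
nrec <- nrow(mytitanic)
trainindex <- sample(1:nrec, nrec/2, replace = FALSE)
mytrain <- mytitanic[trainindex, ]   # training set
mytest  <- mytitanic[-trainindex, ]  # testing set = complementary subset
```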

Run 50 independent fits of the random forest model, all using the SAME dataset mytrain.

Accumulate the accuracy of each fit in an array of size 50. Plot the histogram of this array.

Do the different fits produce similar accuracies? (4 points)

```{r}

```
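A possible sketch, assuming 'mytrain' and 'mytest' come from the chunk above; every fit uses the same training data, so the spread of 'acc50' (an illustrative name) reflects only the internal randomness of the forest:

```{r, eval=FALSE}
mytrain$Survived <- as.factor(mytrain$Survived)
acc50 <- numeric(50)
for(j in 1:50){
  fit      <- randomForest(Survived ~ ., data = mytrain)
  pred     <- predict(fit, newdata = mytest)
  conf     <- table(mytest$Survived, pred)
  acc50[j] <- sum(diag(conf)) / sum(conf)   # accuracy of fit j on mytest
}
hist(acc50, main = "Accuracy of 50 fits on the same training set", xlab = "accuracy")
```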
