Question
The following code uses cross-validation in order to estimate predictive accuracy for a linear model of days-to-remission as a function of gene expression in ALL
The following code uses cross-validation in order to estimate predictive accuracy for a linear model of days-to-remission as a function of gene expression in ALL dataset. It runs to completion without errors but produces a number of warnings (shown below) about "differing numbers of rows" and "mismatches in object lengths."
Please explain the source of those warnings and how they can be cleaned up. Please also explain how whatever caused those warnings affects the output (if at all), and how and why (and if) the output changes upon fixing the code. (Hint: in order to observe these warnings you do not have to go through all 12K genes at each step of cross-validation - one percent of that amount is plenty - and it will save you a lot of time you would otherwise waste watching it run; remember also that R is an interpreter, so you can run commands one at a time if you need to and examine their outputs).
library(ALL) data(ALL)
set.seed(1234)
# calculate days-to-remission:
ALL.pdat <- pData(ALL) date.cr.chr <- as.character(ALL.pdat$date.cr) diag.chr <- as.character(ALL.pdat$diagnosis) date.cr.t <- strptime(date.cr.chr,"%m/%d/%Y") diag.t <- strptime(diag.chr,"%m/%d/%Y") ALL.pdat$D2R <- as.numeric(date.cr.t - diag.t)
# prepare the data structures:
ALL.exprs <- exprs(ALL)[,!is.na(ALL.pdat$D2R)] ALL.pdat <- ALL.pdat[!is.na(ALL.pdat$D2R),] n.xval <- 5 s2.xval <- numeric()
xval.grps <- sample(1:dim(ALL.pdat)[1]%%n.xval+1)
# run over each cross-validation:
for ( i.xval in 1:n.xval ) { min.pval <- 1.0
min.id <- NA train.exprs <- ALL.exprs[,xval.grps!=i.xval] train.d2r <- ALL.pdat[xval.grps!=i.xval,"D2R"] # evaluate each gene in the training dataset to find the one # most associated with the outcome: for( i in 1:dim(train.exprs)[1]) { ###for( i in 1:100 ) {
p.val <- anova(lm(train.d2r~train.exprs[i,],))[1,"Pr(>F)"] if ( p.val < min.pval ) {
min.pval <- p.val
min.id <- i }
}
# print the gene found:
cat(rownames(train.exprs)[min.id],min.pval,fill=T)
# refit the model for best gene found on training dataset:
best.lm.xval <- lm(train.d2r~train.exprs[min.id,])
# calculate predictions on test dataset:
test.exprs <- ALL.exprs[,xval.grps==i.xval] test.d2r <- ALL.pdat[xval.grps==i.xval,"D2R"] test.pred <- predict(
best.lm.xval,data.frame(t(test.exprs),test.d2r) )
# accumulate squared errors of prediction:
s2.xval <- c(s2.xval,(test.pred-test.d2r)^2) }
40176_at 1.433363e-05 35296_at 8.721938e-07 1213_at 3.760985e-06 34852_g_at 2.161217e-06 33901_at 1.399374e-06 Warning messages:
1: 'newdata' had 19 rows but variables found have 77 rows 2: In test.pred - test.d2r :
longer object length is not a multiple of shorter object length ...
# print average squared error in cross-validation:
mean(s2.xval)
[1] 332.7707
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started