5 Solutions to Chapter 4 - Linear regression and logistic regression

Solutions to exercises of chapter 3.

Exercise 1.1. Before we begin, we first need to visualise the data as a whole. Heatmaps are one way of looking at large datasets. Since we're looking for differences I will make a heatmap of the difference between control and infected at each time point and subcluster by pattern:

library(pheatmap)
DeltaVals <- t(D[25:48,3:164] - D[1:24,3:164])
pheatmap(DeltaVals, cluster_cols = FALSE, cluster_rows = TRUE)

we can see a number of rows in which there appears to be large scale changes as the time series progresses. Pick one where this is particularly strong.

Exercise 1.1. We can systematically fit a model with increasing degree and evaluate/plot the RMSE on the held out data.

library(pheatmap)
RMSE <- rep(NULL, 10)
lrfit1 <- train(y~poly(x,degree=1), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[1] <- lrfit1$results$RMSE
lrfit2 <- train(y~poly(x,degree=2), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[2] <- lrfit2$results$RMSE
lrfit3 <- train(y~poly(x,degree=3), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[3] <- lrfit3$results$RMSE
lrfit4 <- train(y~poly(x,degree=4), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[4] <- lrfit4$results$RMSE
lrfit5 <- train(y~poly(x,degree=5), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[5] <- lrfit5$results$RMSE
lrfit6 <- train(y~poly(x,degree=6), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[6] <- lrfit6$results$RMSE
lrfit7 <- train(y~poly(x,degree=7), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[7] <- lrfit7$results$RMSE
lrfit8 <- train(y~poly(x,degree=8), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[8] <- lrfit8$results$RMSE
lrfit9 <- train(y~poly(x,degree=9), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[9] <- lrfit9$results$RMSE
lrfit10 <- train(y~poly(x,degree=10), data=data.frame(x=D[1:24,1],y=D[1:24,geneindex]), method = "lm")
RMSE[10] <- lrfit10$results$RMSE

plot(RMSE)
plot(RMSE[1:5])
#barplot(c(lrfit2$results$RMSE,lrfit3$results$RMSE,lrfit4$results$RMSE))

From these plots it looks like the best model is one with degree \(d=2\) or \(d=4\), suggesting there is a lot more complexity to this gene.