Question

1 Approved Answer

Posted on Oct 10, 2024

Question 4: Germination Data Revisited In this question we go back to germination data, like in the lab. Lots of what you'll need was covered

Question 4: Germination Data Revisited

In this question we go back to germination data, like in the lab. Lots of what you'll need was covered in the lab examples, but there are also some fairly challenging parts to this question. If you are struggling and short of time, you may decide to skip some of the more challenging parts towards the end, and focus on doing all the other questions well. The last few percent may not be worth compromising your mental health!

The data set comes from an experiment on germinating white sapote seeds (Casimiroa edulis, also known as Aztec fruit or cochitzapotl). Sapote growers would like seeds to germinate faster, so researchers have trialled a new treatment they hope will speed up germination. They had 100 pots, each with a white sapote seed in it. 50 randomly selected pots were treated and the remainder acted as controls. The researchers checked every day and recorded the day that each seed germinated. The data is recorded in a file called 'white sapote time to germination.csv'.

1) Get the data into R and call the data frame that is read in 'ws'. Have a look at 'ws'. The two columns correspond to the times to germination for the 50 seeds for the control/treatment as indicated.

2) You'll need to get the data into standard format for the next bit ie one variable for all the times and another indicating which treatment. You can do this in R or Excel as you prefer (quicker in R if you can work it out - use the 'c' function to make a variable with all the times and factor(rep(c('c','t'),each=50)) will make a variable with the treatments.) Make a boxplot showing time to germination for the two treatments (ie treatment/control). Does it look like the treatment has an effect? What kind of effect?

3) Fit a linear model predicting time to germination by treatment. Plot the residuals by treatment (and/or fitted value). Does it look like we have homogeneity of variance? Plot a histogram of the residuals. Does it look like we have normal residuals/errors? Apply a Bartlett test and a Shapiro test to check these... you should find that they clearly indicate that there is a problem with both heterogeneity of variance and non-normal residuals/errors. In what way do the residuals/errors appear non-normal?

4) These data are actually 'survival data', in that they represent the time until something happened for a sample of individuals. A linear model is often likely to be inappropriate for such data, as we see above. A Poisson model is a possible option for such data in this case, because the days are all whole numbers, like counts. Fit a Poisson glm predicting time to germination by treatment, with default link function, and look at the results. Is over-dispersion a problem? How do you know? Fit a quasipoisson model to deal with this. Does this quasipoisson model indicate that the treatment has a significant impact?

5) A glm with a Gamma distribution for errors is often used to model 'survival data', at least in simple cases. Fit a Gamma glm predicting time to germination by treatment, with default link function, and look at the results. Does this model indicate that the treatment has a significant impact? Is the level of significance indicated by the Gamma glm very different to that indicated by the quasipoisson model? (remember anything between 0.01 and 0.05 gets one star * indicating significant, whereas less than 0.01 gets ** indicating highly significant, so 0.015 vs 0.0015 would be very different; 0.015 vs 0.025 would not be very different). Record the AIC of all models fitted so far... which one appears to be the best?

6) Now we want to go back to do something similar to the lab - using a binomial model. For that we need to modify the data to get the total number of seeds germinated at each time, for each treatment. You could probably do that in Excel, if you had nothing better to do on a Saturday afternoon. But let's do it in R instead - much faster. perfom a histogram of the times to germination for the control only, with 'bins' or 'breaks' of size 1. This code should work: hist(ws$control,breaks=0:100). This makes a plot but also creates an R object in the process. Give this object created a name, say 'histo' and then look at it. Note that it has a sub-object within it called 'counts', which the number in each bin of the histogram, or in this case, the number of seeds germinating on each day. This sub-object can be pulled out using histo$counts, just like pulling a variable out of a data frame. Applying the 'cumsum' function to these counts then provides the cumulative sum of seeds that have germinated by each day, which is what we want. It should now be pretty easy to calculate this for each treatment. You'll then need to stick these two variables together to get one long list of the total number of seeds germinated at each time, for each treatment. Remember that the 'c' function sticks numbers together. And then you'll need to create a second variable for the times and a third variable for the treatment. (All of this could be done in Excel if you prefer of course.) When you have the data sorted, you should be able to plot cumulative germination by time and get a plot like the one on the next page.

7) We can now try some binomial glms. Fit binomial glms predicting germination proportion by time and treatment, with the following link functions: logistic (logit), Cauchy cdf (cauchit), Gaussian cdf (probit) and complementary log-log (cloglog). Include an interaction. Don't worry if you get warning messages. Record the AIC of each model. Which looks best?

8) Plot the predictions of each model for the control treatment only onto the data if you can. You'll need to get 'response' predictions, but adjust them to make them for the number out of 50 instead of a proportion out of 1. Add a vertical line to the plot at time=27. Which of the four models gives the highest prediction for germination proportion for the control treatment at time=27? Which gives the second? Third? Last?

9) Now fit another four binomial glms predicting germination proportion by time and treatment, with the same four link functions, but no interaction. Record the AIC of each model. For which model(s) (ie which link function(s)) does the AIC indicate that the interaction should be included in the model? (Don't plot the predictions for these models.)

10) Now fit a binomial gam predicting germination proportion by time, with a standard logistic link function. You'll need to load the 'mgcv' library, would is usually already installed with R, and then use the 'gam' function. This works just like 'glm' except you can apply a 'smoother' to the explanatory variable, using the 's' function. In this case, this would mean the explanatory variable looks something like s(time) instead of just time (if time is what you called your time variable). Great if you can get that to work without an error! But now we want to predict germination proportion by time and treatment! For that, the explanatory variable part of the model formula should look like this: s(times,by=trt). If you can get that work, record the AIC of the fitted gam. How does it compare to the AIC of the other fitted models? Plot the predictions of the fitted gam for the control treatment only onto the data. How does the gam prediction for germination proportion at time=27 for the control treatment compare to the other plotted predictions? There is something a bit unrealistic about the gam prediction - what is it?

11) The shapes of the predictions of the four glm models are also not great, are they? One doesn't match the data well as germination approaches 100% - which is that? The other three don't match the data well in the very early stages - which is worst?

12) Let's try one last binomial glm link function, the log link function. Try to fit a binomial glm with a 'log' link function. Include the interaction. You should get an error message saying that no valid set of coefficients has been found. But if we now change the model so that we are predicting the proportion of seeds that HAVENT germinated, instead of the proportion that have, then we should get it to work. How does the AIC of this model compare to the AIC of the other four binomial glms with interactions considered so far (ie compare this one plus the other four with interactions - the gam doesn't count)? Plot the predictions of this binomial glm with a 'log' link function for the control treatment only onto the data. You will need to adjust the predictions to make them predictions for the number germinated. How does the prediction for germination proportion at time=27 for the control treatment compare to the five other plotted predictions? Does the shape of this model's predictions look better in comparison to the data than the other models?

13) Based on this last model, would you conclude that the treatment has a significant effect on germination of white sapote seeds?

please show all questions 1-13 in R and explain each step

treated	control
1	1
1	1
1	1
1	1
1	1
1	2
1	2
2	2
2	3
2	3
2	3
3	3
3	4
4	5
4	5
4	5
5	5
5	5
7	6
7	6
8	7
8	7
8	8
8	8
8	9
8	10
9	11
9	12
9	12
9	13
9	15
10	16
11	16
12	16
12	18
12	19
13	19
13	21
13	21
15	22
15	25
16	30
16	33
17	36
18	39
18	40
18	42
22	49
29	51
32	74