Exercises

Chapter 2

2.1 Marginal and conditional probability: The social mobility data from Section 2.5 gives a joint probability distribution on $(Y_1, Y_2)$ = (father's occupation, son's occupation). Using this joint distribution, calculate the following distributions:
a) the marginal probability distribution of a father's occupation;
b) the marginal probability distribution of a son's occupation;
c) the conditional distribution of a son's occupation, given that the father is a farmer;
d) the conditional distribution of a father's occupation, given that the son is a farmer.

2.2 Expectations and variances: Let $Y_1$ and $Y_2$ be two independent random variables, such that $E[Y_i] = \mu_i$ and $\mathrm{Var}[Y_i] = \sigma_i^2$. Using the definition of expectation and variance, compute the following quantities, where $a_1$ and $a_2$ are given constants:
a) $E[a_1 Y_1 + a_2 Y_2]$, $\mathrm{Var}[a_1 Y_1 + a_2 Y_2]$;
b) $E[a_1 Y_1 - a_2 Y_2]$, $\mathrm{Var}[a_1 Y_1 - a_2 Y_2]$.

2.3 Full conditionals: Let $X, Y, Z$ be random variables with joint density (discrete or continuous) $p(x, y, z) \propto f(x, z)\,g(y, z)\,h(z)$. Show that
a) $p(x|y, z) \propto f(x, z)$, i.e. $p(x|y, z)$ is a function of $x$ and $z$;
b) $p(y|x, z) \propto g(y, z)$, i.e. $p(y|x, z)$ is a function of $y$ and $z$;
c) $X$ and $Y$ are conditionally independent given $Z$.

2.4 Symbolic manipulation: Prove the following form of Bayes' rule:
$$\Pr(H_j|E) = \frac{\Pr(E|H_j)\Pr(H_j)}{\sum_{k=1}^K \Pr(E|H_k)\Pr(H_k)},$$
where $E$ is any event and $\{H_1, \dots, H_K\}$ form a partition. Prove this using only axioms P1-P3 from this chapter, by following steps a)-d) below:
a) Show that $\Pr(H_j|E)\Pr(E) = \Pr(E|H_j)\Pr(H_j)$.
b) Show that $\Pr(E) = \Pr(E \cap H_1) + \Pr(E \cap \{\cup_{k=2}^K H_k\})$.
c) Show that $\Pr(E) = \sum_{k=1}^K \Pr(E \cap H_k)$.
d) Put it all together to show Bayes' rule, as described above.

2.5 Urns: Suppose urn H is filled with 40% green balls and 60% red balls, and urn T is filled with 60% green balls and 40% red balls. Someone will flip a coin and then select a ball from urn H or urn T depending on whether the coin lands heads or tails, respectively. Let $X$ be 1 or 0 if the coin lands heads or tails, and let $Y$ be 1 or 0 if the ball is green or red.
a) Write out the joint distribution of $X$ and $Y$ in a table.
b) Find $E[Y]$. What is the probability that the ball is green?
c) Find $\mathrm{Var}[Y|X=0]$, $\mathrm{Var}[Y|X=1]$ and $\mathrm{Var}[Y]$. Thinking of variance as measuring uncertainty, explain intuitively why one of these variances is larger than the others.
d) Suppose you see that the ball is green. What is the probability that the coin turned up tails?

2.6 Conditional independence: Suppose events $A$ and $B$ are conditionally independent given $C$, which is written $A \perp B \mid C$. Show that this implies that $A^c \perp B \mid C$, $A \perp B^c \mid C$, and $A^c \perp B^c \mid C$, where $A^c$ means "not $A$." Find an example where $A \perp B \mid C$ holds but $A \perp B \mid C^c$ does not hold.

2.7 Coherence of bets: de Finetti thought of subjective probability as follows: Your probability $p(E)$ for event $E$ is the amount you would be willing to pay or charge in exchange for a dollar on the occurrence of $E$. In other words, you must be willing to give $p(E)$ to someone, provided they give you $1 if $E$ occurs; take $p(E)$ from someone, and give them $1 if $E$ occurs. Your probability for the event $E^c$ = "not $E$" is defined similarly.
a) Show that it is a good idea to have $p(E) \le 1$.
b) Show that it is a good idea to have $p(E) + p(E^c) = 1$.
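As an illustrative aside (not part of the exercise text), the calculation in Exercise 2.5 above can be checked numerically; a minimal R sketch, tabulating $\Pr(X=x, Y=y) = \Pr(X=x)\Pr(Y=y|X=x)$:

```r
# Exercise 2.5: rows index X (0 = tails/urn T, 1 = heads/urn H),
# columns index Y (0 = red, 1 = green).
joint <- rbind("X=0" = 0.5 * c(0.4, 0.6),   # urn T: 40% red, 60% green
               "X=1" = 0.5 * c(0.6, 0.4))   # urn H: 60% red, 40% green
colnames(joint) <- c("Y=0", "Y=1")

EY     <- sum(joint[, "Y=1"])                 # E[Y] = Pr(ball is green)
pY1.X  <- joint[, "Y=1"] / rowSums(joint)     # Pr(Y=1 | X=x)
varY.X <- pY1.X * (1 - pY1.X)                 # Var[Y | X=x], Y binary
varY   <- EY * (1 - EY)                       # Var[Y]
pX0.Y1 <- joint["X=0", "Y=1"] / EY            # part d): Pr(tails | green)
```

Since $Y$ is binary, all the variances reduce to $p(1-p)$ for the relevant conditional probability $p$, which is why no simulation is needed.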
2.8 Interpretations of probability: One abstract way to define probability is via measure theory, in that $\Pr(\cdot)$ is simply a "measure" that assigns mass to various events. For example, we can "measure" the number of times a particular event occurs in a potentially infinite sequence, or we can "measure" our information about the outcome of an unknown event. The above two types of measures are combined in de Finetti's theorem, which tells us that an exchangeable model for an infinite binary sequence $Y_1, Y_2, \dots$ is equivalent to modeling the sequence as conditionally i.i.d. given a parameter $\theta$, where $\Pr(\theta < c)$ represents our information that the long-run frequency of 1's is less than $c$. With this in mind, discuss the different ways in which probability could be interpreted in each of the following scenarios. Avoid using the words "probable" or "likely" when describing probability. Also discuss the different ways in which the events can be thought of as random.
a) The distribution of religions in Sri Lanka is 70% Buddhist, 15% Hindu, 8% Christian, and 7% Muslim. Suppose each person can be identified by a number from 1 to $K$ on a census roll. A number $x$ is to be sampled from $\{1, \dots, K\}$ using a pseudo-random number generator on a computer. Interpret the meaning of the following probabilities:
i. Pr(person $x$ is Hindu);
ii. Pr($x = 6452859$);
iii. Pr(person $x$ is Hindu $\mid x = 6452859$).
b) A quarter which you got as change is to be flipped many times. Interpret the meaning of the following probabilities:
i. Pr($\theta$, the long-run relative frequency of heads, equals 1/3);
ii. Pr(the first coin flip will result in a heads);
iii. Pr(the first coin flip will result in a heads $\mid \theta = 1/3$).
c) The quarter above has been flipped, but you have not seen the outcome. Interpret Pr(the flip has resulted in a heads).

Chapter 3

3.1 Sample survey: Suppose we are going to sample 100 individuals from a county (of size much larger than 100) and ask each sampled person whether they support policy Z or not. Let $Y_i = 1$ if person $i$ in the sample supports the policy, and $Y_i = 0$ otherwise.
a) Assume $Y_1, \dots, Y_{100}$ are, conditional on $\theta$, i.i.d. binary random variables with expectation $\theta$. Write down the joint distribution $\Pr(Y_1 = y_1, \dots, Y_{100} = y_{100} \mid \theta)$ in a compact form. Also write down the form of $\Pr(\sum Y_i = y \mid \theta)$.
b) For the moment, suppose you believed that $\theta \in \{0.0, 0.1, \dots, 0.9, 1.0\}$. Given that the results of the survey were $\sum_{i=1}^{100} Y_i = 57$, compute $\Pr(\sum Y_i = 57 \mid \theta)$ for each of these 11 values of $\theta$ and plot these probabilities as a function of $\theta$.
c) Now suppose you originally had no prior information to believe one of these $\theta$-values over another, and so $\Pr(\theta = 0.0) = \Pr(\theta = 0.1) = \dots = \Pr(\theta = 0.9) = \Pr(\theta = 1.0)$. Use Bayes' rule to compute $p(\theta \mid \sum_{i=1}^{100} Y_i = 57)$ for each $\theta$-value. Make a plot of this posterior distribution as a function of $\theta$.
d) Now suppose you allow $\theta$ to be any value in the interval $[0, 1]$. Using the uniform prior density for $\theta$, so that $p(\theta) = 1$, plot the posterior density $p(\theta) \times \Pr(\sum_{i=1}^{100} Y_i = 57 \mid \theta)$ as a function of $\theta$.
e) As discussed in this chapter, the posterior distribution of $\theta$ is beta($1 + 57$, $1 + 100 - 57$). Plot the posterior density as a function of $\theta$. Discuss the relationships among all of the plots you have made for this exercise.
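A minimal R sketch of the computations in Exercise 3.1 b)-e) (illustrative, not part of the exercise text):

```r
theta <- seq(0, 1, by = 0.1)              # part b): the 11 candidate values
like  <- dbinom(57, 100, theta)           # Pr(sum Yi = 57 | theta)
post  <- like / sum(like)                 # part c): uniform prior over the grid
plot(theta, post, type = "h")

theta.c <- seq(0, 1, length = 200)        # parts d)-e): continuous theta
plot(theta.c, dbeta(theta.c, 1 + 57, 1 + 100 - 57), type = "l")
```

Comparing the grid posterior in part c) with the beta(58, 44) density in part e) shows the discrete posterior tracking the continuous one, which is the relationship the exercise asks you to discuss.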
3.2 Sensitivity analysis: It is sometimes useful to express the parameters $a$ and $b$ in a beta distribution in terms of $\theta_0 = a/(a+b)$ and $n_0 = a + b$, so that $a = \theta_0 n_0$ and $b = (1-\theta_0) n_0$. Reconsidering the sample survey data in Exercise 3.1, for each combination of $\theta_0 \in \{0.1, 0.2, \dots, 0.9\}$ and $n_0 \in \{1, 2, 8, 16, 32\}$ find the corresponding $a, b$ values and compute $\Pr(\theta > 0.5 \mid \sum Y_i = 57)$ using a beta($a, b$) prior distribution for $\theta$. Display the results with a contour plot, and discuss how the plot could be used to explain to someone whether or not they should believe that $\theta > 0.5$, based on the data that $\sum_{i=1}^{100} Y_i = 57$.

3.3 Tumor counts: A cancer laboratory is estimating the rate of tumorigenesis in two strains of mice, A and B. They have tumor count data for 10 mice in strain A and 13 mice in strain B. Type A mice have been well studied, and information from other laboratories suggests that type A mice have tumor counts that are approximately Poisson-distributed with a mean of 12. Tumor count rates for type B mice are unknown, but type B mice are related to type A mice. The observed tumor counts for the two populations are
$y_A = (12, 9, 12, 14, 13, 13, 15, 8, 15, 6)$;
$y_B = (11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)$.
a) Find the posterior distributions, means, variances and 95% quantile-based confidence intervals for $\theta_A$ and $\theta_B$, assuming a Poisson sampling distribution for each group and the following prior distribution: $\theta_A \sim$ gamma(120, 10), $\theta_B \sim$ gamma(12, 1), $p(\theta_A, \theta_B) = p(\theta_A) \times p(\theta_B)$.
b) Compute and plot the posterior expectation of $\theta_B$ under the prior distribution $\theta_B \sim$ gamma($12 n_0, n_0$) for each value of $n_0 \in \{1, 2, \dots, 50\}$. Describe what sort of prior beliefs about $\theta_B$ would be necessary in order for the posterior expectation of $\theta_B$ to be close to that of $\theta_A$.
c) Should knowledge about population A tell us anything about population B? Discuss whether or not it makes sense to have $p(\theta_A, \theta_B) = p(\theta_A) \times p(\theta_B)$.

3.4 Mixtures of beta priors: Estimate the probability $\theta$ of teen recidivism based on a study in which there were $n = 43$ individuals released from incarceration and $y = 15$ re-offenders within 36 months.
a) Using a beta(2,8) prior for $\theta$, plot $p(\theta)$, $p(y|\theta)$ and $p(\theta|y)$ as functions of $\theta$. Find the posterior mean, mode, and standard deviation of $\theta$. Find a 95% quantile-based confidence interval.
b) Repeat a), but using a beta(8,2) prior for $\theta$.
c) Consider the following prior distribution for $\theta$:
$$p(\theta) = \frac{1}{4} \frac{\Gamma(10)}{\Gamma(2)\Gamma(8)} \left[ 3\theta(1-\theta)^7 + \theta^7(1-\theta) \right],$$
which is a 75-25% mixture of a beta(2,8) and a beta(8,2) prior distribution. Plot this prior distribution and compare it to the priors in a) and b). Describe what sort of prior opinion this may represent.
d) For the prior in c):
i. Write out mathematically $p(\theta) \times p(y|\theta)$ and simplify as much as possible.
ii. The posterior distribution is a mixture of two distributions you know. Identify these distributions.
iii. On a computer, calculate and plot $p(\theta) \times p(y|\theta)$ for a variety of $\theta$ values. Also find (approximately) the posterior mode, and discuss its relation to the modes in a) and b).
e) Find a general formula for the weights of the mixture distribution in d)ii, and provide an interpretation for their values.

3.5 Mixtures of conjugate priors: Let $p(y|\phi) = c(\phi) h(y) \exp\{\phi\, t(y)\}$ be an exponential family model and let $p_1(\phi), \dots, p_K(\phi)$ be $K$ different members of the conjugate class of prior densities given in Section 3.3. A mixture of conjugate priors is given by $\tilde p(\phi) = \sum_{k=1}^K w_k p_k(\phi)$, where the $w_k$'s are all greater than zero and $\sum w_k = 1$ (see also Diaconis and Ylvisaker (1985)).
a) Identify the general form of the posterior distribution of $\phi$, based on $n$ i.i.d. samples from $p(y|\phi)$ and the prior distribution given by $\tilde p$.
b) Repeat a) but in the special case that $p(y|\theta) = \text{dpois}(y, \theta)$ and $p_1, \dots, p_K$ are gamma densities.
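For Exercise 3.3 a), the Poisson-gamma conjugacy gives $\theta_A \mid y_A \sim$ gamma$(120 + \sum y_A,\ 10 + n_A)$ and similarly for group B; a minimal R sketch of the summaries:

```r
yA <- c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
yB <- c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
aA <- 120 + sum(yA); bA <- 10 + length(yA)   # posterior gamma(aA, bA) for thetaA
aB <- 12  + sum(yB); bB <- 1  + length(yB)   # posterior gamma(aB, bB) for thetaB
c(mean = aA/bA, var = aA/bA^2)               # posterior mean and variance, group A
qgamma(c(.025, .975), aA, bA)                # 95% quantile-based interval, group A
qgamma(c(.025, .975), aB, bB)                # 95% quantile-based interval, group B
```

The same two lines with prior gamma($12 n_0, n_0$) inside a loop over $n_0$ give part b).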
3.6 Exponential family expectations: Let $p(y|\phi) = c(\phi) h(y) \exp\{\phi\, t(y)\}$ be an exponential family model.
a) Take derivatives with respect to $\phi$ of both sides of the equation $\int p(y|\phi)\, dy = 1$ to show that $E[t(Y)|\phi] = -c'(\phi)/c(\phi)$.
b) Let $p(\phi) \propto c(\phi)^{n_0} e^{n_0 t_0 \phi}$ be the prior distribution for $\phi$. Calculate $dp(\phi)/d\phi$ and, using the fundamental theorem of calculus, discuss what must be true so that $E[-c'(\phi)/c(\phi)] = t_0$.

3.7 Posterior prediction: Consider a pilot study in which $n_1 = 15$ children enrolled in special education classes were randomly selected and tested for a certain type of learning disability. In the pilot study, $y_1 = 2$ children tested positive for the disability.
a) Using a uniform prior distribution, find the posterior distribution of $\theta$, the fraction of students in special education classes who have the disability. Find the posterior mean, mode and standard deviation of $\theta$, and plot the posterior density.
Researchers would like to recruit students with the disability to participate in a long-term study, but first they need to make sure they can recruit enough students. Let $n_2 = 278$ be the number of children in special education classes in this particular school district, and let $Y_2$ be the number of students with the disability.
b) Find $\Pr(Y_2 = y_2 \mid Y_1 = 2)$, the posterior predictive distribution of $Y_2$, as follows:
i. Discuss what assumptions are needed about the joint distribution of $(Y_1, Y_2)$ such that the following is true:
$$\Pr(Y_2 = y_2 \mid Y_1 = 2) = \int_0^1 \Pr(Y_2 = y_2 \mid \theta)\, p(\theta \mid Y_1 = 2)\, d\theta.$$
ii. Now plug in the forms for $\Pr(Y_2 = y_2 \mid \theta)$ and $p(\theta \mid Y_1 = 2)$ in the above integral.
iii. Figure out what the above integral must be by using the calculus result discussed in Section 3.1.
c) Plot the function $\Pr(Y_2 = y_2 \mid Y_1 = 2)$ as a function of $y_2$. Obtain the mean and standard deviation of $Y_2$, given $Y_1 = 2$.
d) The posterior mode and the MLE (maximum likelihood estimate; see Exercise 3.14) of $\theta$, based on data from the pilot study, are both $\hat\theta = 2/15$. Plot the distribution $\Pr(Y_2 = y_2 \mid \theta = \hat\theta)$, and find the mean and standard deviation of $Y_2$ given $\theta = \hat\theta$. Compare these results to the plots and calculations in c) and discuss any differences. Which distribution for $Y_2$ would you use to make predictions, and why?

3.8 Coins: Diaconis and Ylvisaker (1985) suggest that coins spun on a flat surface display long-run frequencies of heads that vary from coin to coin. About 20% of the coins behave symmetrically, whereas the remaining coins tend to give frequencies of 1/3 or 2/3.
a) Based on the observations of Diaconis and Ylvisaker, use an appropriate mixture of beta distributions as a prior distribution for $\theta$, the long-run frequency of heads for a particular coin. Plot your prior.
b) Choose a single coin and spin it at least 50 times. Record the number of heads obtained. Report the year and denomination of the coin.
c) Compute your posterior for $\theta$, based on the information obtained in b).
d) Repeat b) and c) for a different coin, but possibly using a prior for $\theta$ that includes some information from the first coin. Your choice of a new prior may be informal, but needs to be justified. How the results from the first experiment influence your prior for the $\theta$ of the second coin may depend on whether or not the two coins have the same denomination, have a similar year, etc. Report the year and denomination of this coin.
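For Exercise 3.7 b)-c), the integral collapses to a beta-binomial pmf: with the uniform prior, $p(\theta \mid Y_1 = 2)$ is beta(3, 14), so $\Pr(Y_2 = y_2 \mid Y_1 = 2) = \binom{n_2}{y_2} B(y_2 + 3,\ n_2 - y_2 + 14)/B(3, 14)$. A minimal R sketch, computed on the log scale to avoid overflow:

```r
n2 <- 278
y2 <- 0:n2
log.pred <- lchoose(n2, y2) + lbeta(y2 + 3, n2 - y2 + 14) - lbeta(3, 14)
pred <- exp(log.pred)                        # posterior predictive pmf
plot(y2, pred, type = "h")
m <- sum(y2 * pred)                          # predictive mean
s <- sqrt(sum(y2^2 * pred) - m^2)            # predictive standard deviation
```

Replacing `pred` with `dbinom(y2, n2, 2/15)` gives the plug-in distribution of part d), whose smaller spread is the point of the comparison.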
3.9 Galenshore distribution: An unknown quantity $Y$ has a Galenshore($a, \theta$) distribution if its density is given by
$$p(y) = \frac{2}{\Gamma(a)}\, \theta^{2a} y^{2a-1} e^{-\theta^2 y^2}$$
for $y > 0$, $\theta > 0$ and $a > 0$. Assume for now that $a$ is known. For this density,
$$E[Y] = \frac{\Gamma(a + 1/2)}{\theta\,\Gamma(a)}, \qquad E[Y^2] = \frac{a}{\theta^2}.$$
a) Identify a class of conjugate prior densities for $\theta$. Plot a few members of this class of densities.
b) Let $Y_1, \dots, Y_n \sim$ i.i.d. Galenshore($a, \theta$). Find the posterior distribution of $\theta$ given $Y_1, \dots, Y_n$, using a prior from your conjugate class.
c) Write down $p(\theta_a \mid Y_1, \dots, Y_n)/p(\theta_b \mid Y_1, \dots, Y_n)$ and simplify. Identify a sufficient statistic.
d) Determine $E[\theta \mid y_1, \dots, y_n]$.
e) Determine the form of the posterior predictive density $p(\tilde y \mid y_1, \dots, y_n)$.

3.10 Change of variables: Let $\psi = g(\theta)$, where $g$ is a monotone function of $\theta$, and let $h$ be the inverse of $g$ so that $\theta = h(\psi)$. If $p_\theta(\theta)$ is the probability density of $\theta$, then the probability density of $\psi$ induced by $p_\theta$ is given by $p_\psi(\psi) = p_\theta(h(\psi)) \times \left| \frac{dh}{d\psi} \right|$.
a) Let $\theta \sim$ beta($a, b$) and let $\psi = \log[\theta/(1-\theta)]$. Obtain the form of $p_\psi$ and plot it for the case that $a = b = 1$.
b) Let $\theta \sim$ gamma($a, b$) and let $\psi = \log \theta$. Obtain the form of $p_\psi$ and plot it for the case that $a = b = 1$.

3.12 Jeffreys' prior: Jeffreys (1961) suggested a default rule for generating a prior distribution of a parameter $\theta$ in a sampling model $p(y|\theta)$. Jeffreys' prior is given by $p_J(\theta) \propto \sqrt{I(\theta)}$, where $I(\theta) = -E[\partial^2 \log p(Y|\theta)/\partial\theta^2 \mid \theta]$ is the Fisher information.
a) Let $Y \sim$ binomial($n, \theta$). Obtain Jeffreys' prior distribution $p_J(\theta)$ for this model.
b) Reparameterize the binomial sampling model with $\psi = \log[\theta/(1-\theta)]$, so that $p(y|\psi) = \binom{n}{y} e^{\psi y} (1 + e^\psi)^{-n}$. Obtain Jeffreys' prior distribution $p_J(\psi)$ for this model.
c) Take the prior distribution from a) and apply the change of variables formula from Exercise 3.10 to obtain the induced prior density on $\psi$. This density should be the same as the one derived in part b) of this exercise. This consistency under reparameterization is the defining characteristic of Jeffreys' prior.

3.13 Improper Jeffreys' prior: Let $Y \sim$ Poisson($\theta$).
a) Apply Jeffreys' procedure to this model, and compare the result to the family of gamma densities. Does Jeffreys' procedure produce an actual probability density for $\theta$? In other words, can $\sqrt{I(\theta)}$ be proportional to an actual probability density for $\theta \in (0, \infty)$?
b) Obtain the form of the function $f(\theta, y) = \sqrt{I(\theta)} \times p(y|\theta)$. What probability density for $\theta$ is $f(\theta, y)$ proportional to? Can we think of $f(\theta, y)/\int f(\theta, y)\, d\theta$ as a posterior density of $\theta$ given $Y = y$?
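A minimal R sketch of the change of variables in Exercise 3.10 a): with $\theta = h(\psi) = e^\psi/(1+e^\psi)$, the Jacobian is $|dh/d\psi| = \theta(1-\theta)$, so the induced density can be plotted directly.

```r
a <- 1; b <- 1
psi   <- seq(-6, 6, length = 200)         # psi = log[theta/(1-theta)]
theta <- exp(psi) / (1 + exp(psi))        # h(psi), the inverse transformation
p.psi <- dbeta(theta, a, b) * theta * (1 - theta)   # p_theta(h(psi)) * |dh/dpsi|
plot(psi, p.psi, type = "l")
```

The same template with `dgamma` and `theta <- exp(psi)` (Jacobian $e^\psi$) handles part b), and applying it to the beta(1/2, 1/2) density from Exercise 3.12 a) gives the reparameterization check in 3.12 c).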
3.14 Unit information prior: Let $Y_1, \dots, Y_n \sim$ i.i.d. $p(y|\theta)$. Having observed the values $Y_1 = y_1, \dots, Y_n = y_n$, the log likelihood is given by $l(\theta|y) = \sum \log p(y_i|\theta)$, and the value $\hat\theta$ of $\theta$ that maximizes $l(\theta|y)$ is called the maximum likelihood estimator. The negative of the curvature of the log-likelihood, $J(\theta) = -\partial^2 l(\theta|y)/\partial\theta^2$, describes the precision of the MLE $\hat\theta$ and is called the observed Fisher information. For situations in which it is difficult to quantify prior information in terms of a probability distribution, some have suggested that the "prior" distribution be based on the likelihood, for example, by centering the prior distribution around the MLE $\hat\theta$. To deal with the fact that the MLE is not really prior information, the curvature of the prior is chosen so that it has only "one $n$th" as much information as the likelihood, so that $-\partial^2 \log p(\theta)/\partial\theta^2 = J(\theta)/n$. Such a prior is called a unit information prior (Kass and Wasserman, 1995; Kass and Raftery, 1995), as it has as much information as the average amount of information from a single observation. The unit information prior is not really a prior distribution, as it is computed from the observed data. However, it can be roughly viewed as the prior information of someone with weak but accurate prior information.
a) Let $Y_1, \dots, Y_n \sim$ i.i.d. binary($\theta$). Obtain the MLE $\hat\theta$ and $J(\hat\theta)/n$.
b) Find a probability density $p_U(\theta)$ such that $\log p_U(\theta) = l(\theta|y)/n + c$, where $c$ is a constant that does not depend on $\theta$. Compute the information $-\partial^2 \log p_U(\theta)/\partial\theta^2$ of this density.
c) Obtain a probability density for $\theta$ that is proportional to $p_U(\theta) \times p(y_1, \dots, y_n|\theta)$. Can this be considered a posterior distribution for $\theta$?
d) Repeat a), b) and c) but with $p(y|\theta)$ being the Poisson distribution.

Chapter 4

4.1 Posterior comparisons: Reconsider the sample survey in Exercise 3.1. Suppose you are interested in comparing the rate of support in that county to the rate in another county. Suppose that a survey of sample size 50 was done in the second county, and the total number of people in the sample who supported the policy was 30. Identify the posterior distribution of $\theta_2$ assuming a uniform prior. Sample 5,000 values of each of $\theta_1$ and $\theta_2$ from their posterior distributions and estimate $\Pr(\theta_1 < \theta_2 \mid \text{the data and prior})$.

4.2 Tumor count comparisons: Reconsider the tumor count data in Exercise 3.3:
a) For the prior distribution given in part a) of that exercise, obtain $\Pr(\theta_B < \theta_A \mid y_A, y_B)$ via Monte Carlo sampling.
b) For a range of values of $n_0$, obtain $\Pr(\theta_B < \theta_A \mid y_A, y_B)$ for $\theta_A \sim$ gamma(120, 10) and $\theta_B \sim$ gamma($12 n_0, n_0$). Describe how sensitive the conclusions about the event $\{\theta_B < \theta_A\}$ are to the prior distribution on $\theta_B$.
c) Repeat parts a) and b), replacing the event $\{\theta_B < \theta_A\}$ with the event $\{\tilde Y_B < \tilde Y_A\}$, where $\tilde Y_A$ and $\tilde Y_B$ are samples from the posterior predictive distribution.

4.3 Posterior predictive checks: Let's investigate the adequacy of the Poisson model for the tumor count data. Following the example in Section 4.4, generate posterior predictive datasets $y_A^{(1)}, \dots, y_A^{(1000)}$. Each $y_A^{(s)}$ is a sample of size $n_A = 10$ from the Poisson distribution with parameter $\theta_A^{(s)}$, $\theta_A^{(s)}$ is itself a sample from the posterior distribution $p(\theta_A \mid y_A)$, and $y_A$ is the observed data.
a) For each $s$, let $t^{(s)}$ be the sample average of the 10 values of $y_A^{(s)}$, divided by the sample standard deviation of $y_A^{(s)}$. Make a histogram of $t^{(s)}$ and compare to the observed value of this statistic. Based on this statistic, assess the fit of the Poisson model for these data.
b) Repeat the above goodness of fit evaluation for the data in population B.

4.4 Mixtures of conjugate priors: For the posterior density from Exercise 3.4:
a) Make a plot of $p(\theta|y)$ or $p(y|\theta)p(\theta)$ using the mixture prior distribution and a dense sequence of $\theta$-values. Can you think of a way to obtain a 95% quantile-based posterior confidence interval for $\theta$? You might want to try some sort of discrete approximation.
b) To sample a random variable $z$ from the mixture distribution $w p_1(z) + (1-w) p_0(z)$, first toss a $w$-coin and let $x$ be the outcome (this can be done in R with x<-rbinom(1,1,w)); then sample $z$ from $p_1$ if $x = 1$ and from $p_0$ if $x = 0$. Using this technique, obtain a Monte Carlo approximation of the posterior distribution $p(\theta|y)$ and a 95% quantile-based confidence interval, and compare it to the interval in a).
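Exercise 4.2 a) and c) reduce to comparisons of independent gamma and Poisson draws; a minimal R sketch (illustrative only):

```r
yA <- c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
yB <- c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
S  <- 5000
thA <- rgamma(S, 120 + sum(yA), 10 + length(yA))   # posterior draws, group A
thB <- rgamma(S, 12  + sum(yB), 1  + length(yB))   # posterior draws, group B
mean(thB < thA)                        # Pr(thetaB < thetaA | yA, yB), part a)
mean(rpois(S, thB) < rpois(S, thA))    # same event for predictive draws, part c)
```

Wrapping the `thB` line in a loop over $n_0$ (prior gamma($12 n_0, n_0$)) gives the sensitivity analysis in part b).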
4.5 Cancer rates: … obtain $\Pr(\theta_2 > \theta_1 \mid \text{data})$. Also plot the posterior densities (try to put $p(\theta_1|\text{data})$ and $p(\theta_2|\text{data})$ on the same plot). Comment on the differences across posterior opinions.
i. Opinion 1: ($a_1 = a_2 = 2.2 \times 100$, $b_1 = b_2 = 100$). Cancer rates for both types of counties are similar to the average rates across all counties from previous years.
ii. Opinion 2: ($a_1 = 2.2 \times 100$, $b_1 = 100$, $a_2 = 2.2$, $b_2 = 1$). Cancer rates in this year for nonreactor counties are similar to rates in previous years in nonreactor counties. We don't have much information on reactor counties, but perhaps the rates are close to those observed previously in nonreactor counties.
iii. Opinion 3: ($a_1 = a_2 = 2.2$, $b_1 = b_2 = 1$). Cancer rates in this year could be different from rates in previous years, for both reactor and nonreactor counties.
d) In the above analysis we assumed that population size gives no information about fatality rate. Is this reasonable? How would the analysis have to change if this is not reasonable?
e) We encoded our beliefs about $\theta_1$ and $\theta_2$ such that they gave no information about each other (they were a priori independent). Think about why and how you might encode beliefs such that they were a priori dependent.

4.6 Non-informative prior distributions: Suppose for a binary sampling problem we plan on using a uniform, or beta(1,1), prior for the population proportion $\theta$. Perhaps our reasoning is that this represents "no prior information about $\theta$." However, some people like to look at proportions on the log-odds scale, that is, they are interested in $\gamma = \log \frac{\theta}{1-\theta}$. Via Monte Carlo sampling or otherwise, find the prior distribution for $\gamma$ that is induced by the uniform prior for $\theta$. Is the prior informative about $\gamma$?

4.7 Mixture models: After a posterior analysis on data from a population of squash plants, it was determined that the total vegetable weight of a given plant could be modeled with the following distribution:
$$p(y|\theta, \sigma^2) = .31\,\text{dnorm}(y, \theta, \sigma) + .46\,\text{dnorm}(y, 2\theta, 2\sigma) + .23\,\text{dnorm}(y, 3\theta, 3\sigma),$$
where the posterior distributions of the parameters have been calculated as $1/\sigma^2 \sim$ gamma(10, 2.5) and $\theta|\sigma^2 \sim$ normal(4.1, $\sigma^2/20$).
a) Sample at least 5,000 $y$ values from the posterior predictive distribution.
b) Form a 75% quantile-based confidence interval for a new value of $Y$.
c) Form a 75% HPD region for a new $Y$ as follows:
i. Compute estimates of the posterior density of $Y$ using the density command in R, and then normalize the density values so they sum to 1.
ii. Sort these discrete probabilities in decreasing order.
iii. Find the first probability value such that the cumulative sum of the sorted values exceeds 0.75. Your HPD region includes all values of $y$ which have a discretized probability greater than this cutoff.
Describe your HPD region, and compare it to your quantile-based region.
d) Can you think of a physical justification for the mixture sampling distribution of $Y$?
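A minimal R sketch of the posterior predictive sampling in Exercise 4.7 a)-b): draw the parameters, then the mixture component, then $y$.

```r
S <- 5000
s2    <- 1 / rgamma(S, 10, 2.5)              # 1/sigma^2 ~ gamma(10, 2.5)
theta <- rnorm(S, 4.1, sqrt(s2 / 20))        # theta | sigma^2 ~ normal(4.1, sigma^2/20)
comp  <- sample(1:3, S, replace = TRUE, prob = c(.31, .46, .23))
y     <- rnorm(S, comp * theta, comp * sqrt(s2))  # component k: mean k*theta, sd k*sigma
quantile(y, c(.125, .875))                   # 75% quantile-based interval, part b)
```

The trick is that the three component means and standard deviations are integer multiples of $\theta$ and $\sigma$, so the sampled component index can multiply them directly.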
You should now have two sequences of length 5,000 each, one sequence counting the number of people having zero children for each of the 5,000 posterior predictive datasets, the other counting the number of people with one child. Plot the two sequences against one another (one on the x-axis, one on the y-axis). Add to the plot a point marking how many people in the observed dataset had zero children and one child. Using this plot, describe the adequacy of the Poisson model.

Chapter 5

5.1 Studying: The files school1.dat, school2.dat and school3.dat contain data on the amount of time students from three high schools spent on studying or homework during an exam period. Analyze data from each of these schools separately, using the normal model with a conjugate prior distribution, in which $\{\mu_0 = 5, \sigma_0^2 = 4, \kappa_0 = 1, \nu_0 = 2\}$, and compute or approximate the following:
a) posterior means and 95% confidence intervals for the mean $\theta$ and standard deviation $\sigma$ from each school;
b) the posterior probability that $\theta_i < \theta_j < \theta_k$ for all six permutations $\{i, j, k\}$ of $\{1, 2, 3\}$;
c) the posterior probability that $\tilde Y_i < \tilde Y_j < \tilde Y_k$ for all six permutations $\{i, j, k\}$ of $\{1, 2, 3\}$, where $\tilde Y_i$ is a sample from the posterior predictive distribution of school $i$;
d) the posterior probability that $\theta_1$ is bigger than both $\theta_2$ and $\theta_3$, and the posterior probability that $\tilde Y_1$ is bigger than both $\tilde Y_2$ and $\tilde Y_3$.

5.2 Sensitivity analysis: Thirty-two students in a science classroom were randomly assigned to one of two study methods, A and B, so that $n_A = n_B = 16$ students were assigned to each method. After several weeks of study, students were examined on the course material with an exam designed to give an average score of 75 with a standard deviation of 10. The scores for the two groups are summarized by $\{\bar y_A = 75.2, s_A = 7.3\}$ and $\{\bar y_B = 77.5, s_B = 8.1\}$. Consider independent, conjugate normal prior distributions for each of $\theta_A$ and $\theta_B$, with $\mu_0 = 75$ and $\sigma_0^2 = 100$ for both groups. For each $(\kappa_0, \nu_0) \in \{(1,1), (2,2), (4,4), (8,8), (16,16), (32,32)\}$ (or more values), obtain $\Pr(\theta_A < \theta_B \mid y_A, y_B)$ via Monte Carlo sampling. Plot this probability as a function of $(\kappa_0 = \nu_0)$. Describe how you might use this plot to convey the evidence that $\theta_A < \theta_B$ to people of a variety of prior opinions.

5.3 Marginal distributions: Given observations $Y_1, \dots, Y_n \sim$ i.i.d. normal($\theta, \sigma^2$) and using the conjugate prior distribution for $\theta$ and $\sigma^2$, derive the formula for $p(\theta \mid y_1, \dots, y_n)$, the marginal posterior distribution of $\theta$, conditional on the data but marginal over $\sigma^2$. Check your work by comparing your formula to a Monte Carlo estimate of the marginal distribution, using some values of $Y_1, \dots, Y_n$, $\mu_0$, $\sigma_0^2$, $\kappa_0$ and $\nu_0$ that you choose. Also derive $p(\tilde\sigma^2 \mid y_1, \dots, y_n)$, where $\tilde\sigma^2 = 1/\sigma^2$ is the precision.
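The Monte Carlo check requested in Exercise 5.3 (and the computations in 5.1) can use the standard conjugate-normal updating from this chapter; a minimal R sketch with stand-in data (replace `y` with a school's data vector):

```r
y <- rnorm(20, 9, 2)                          # stand-in data for illustration
n <- length(y); ybar <- mean(y); s2 <- var(y)
mu0 <- 5; s20 <- 4; k0 <- 1; nu0 <- 2         # prior parameters from 5.1
kn  <- k0 + n; nun <- nu0 + n
mun <- (k0 * mu0 + n * ybar) / kn
s2n <- (nu0 * s20 + (n - 1) * s2 + k0 * n * (ybar - mu0)^2 / kn) / nun
S <- 10000
sig2  <- 1 / rgamma(S, nun / 2, nun * s2n / 2)  # sigma^2 | y
theta <- rnorm(S, mun, sqrt(sig2 / kn))         # theta | sigma^2, y
quantile(theta, c(.025, .975))
```

A histogram of `theta` can then be compared against the derived marginal (a scaled, shifted $t$ density) to check the formula.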
5.4 Jeffreys' prior: For sampling models expressed in terms of a $p$-dimensional vector $\psi$, Jeffreys' prior (Exercise 3.11) is defined as $p_J(\psi) \propto \sqrt{|I(\psi)|}$, where $|I(\psi)|$ is the determinant of the $p \times p$ matrix $I(\psi)$ having entries $I(\psi)_{k,l} = -E[\partial^2 \log p(Y|\psi)/\partial\psi_k \partial\psi_l]$.
a) Show that Jeffreys' prior for the normal model is $p_J(\theta, \sigma^2) \propto (\sigma^2)^{-3/2}$.
b) Let $y = (y_1, \dots, y_n)$ be the observed values of an i.i.d. sample from a normal($\theta, \sigma^2$) population. Find a probability density $p_J(\theta, \sigma^2 \mid y)$ such that $p_J(\theta, \sigma^2 \mid y) \propto p_J(\theta, \sigma^2)\, p(y \mid \theta, \sigma^2)$. It may be convenient to write this joint density as $p_J(\theta \mid \sigma^2, y) \times p_J(\sigma^2 \mid y)$. Can this joint density be considered a posterior density?

5.5 Unit information prior: Obtain a unit information prior for the normal model as follows:
a) Reparameterize the normal model as $p(y|\theta, \psi)$, where $\psi = 1/\sigma^2$. Write out the log likelihood $l(\theta, \psi|y) = \sum \log p(y_i|\theta, \psi)$ in terms of $\theta$ and $\psi$.
b) Find a probability density $p_U(\theta, \psi)$ so that $\log p_U(\theta, \psi) = l(\theta, \psi|y)/n + c$, where $c$ is a constant that does not depend on $\theta$ or $\psi$. Hint: Write $\sum (y_i - \theta)^2$ as $\sum (y_i - \bar y + \bar y - \theta)^2 = \sum (y_i - \bar y)^2 + n(\bar y - \theta)^2$, and recall that $\log p_U(\theta, \psi) = \log p_U(\theta|\psi) + \log p_U(\psi)$.
c) Find a probability density $p_U(\theta, \psi|y)$ that is proportional to $p_U(\theta, \psi) \times p(y_1, \dots, y_n|\theta, \psi)$. It may be convenient to write this joint density as $p_U(\theta|\psi, y) \times p_U(\psi|y)$. Can this joint density be considered a posterior density?

Chapter 6

6.1 Poisson population comparisons: Let's reconsider the number of children data of Exercise 4.8. We'll assume Poisson sampling models for the two groups as before, but now we'll parameterize $\theta_A$ and $\theta_B$ as $\theta_A = \theta$, $\theta_B = \theta \times \gamma$. In this parameterization, $\gamma$ represents the relative rate $\theta_B/\theta_A$. Let $\theta \sim$ gamma($a_\theta, b_\theta$) and let $\gamma \sim$ gamma($a_\gamma, b_\gamma$).
a) Are $\theta_A$ and $\theta_B$ independent or dependent under this prior distribution? In what situations is such a joint prior distribution justified?
b) Obtain the form of the full conditional distribution of $\theta$ given $y_A$, $y_B$ and $\gamma$.
c) Obtain the form of the full conditional distribution of $\gamma$ given $y_A$, $y_B$ and $\theta$.
d) Set $a_\theta = 2$ and $b_\theta = 1$. Let $a_\gamma = b_\gamma \in \{8, 16, 32, 64, 128\}$. For each of these five values, run a Gibbs sampler of at least 5,000 iterations and obtain $E[\theta_B - \theta_A \mid y_A, y_B]$. Describe the effects of the prior distribution for $\gamma$ on the results.
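If the full conditionals in Exercise 6.1 b)-c) work out (as the conjugate algebra suggests) to $\theta \mid \cdot \sim$ gamma$(a_\theta + \sum y_A + \sum y_B,\ b_\theta + n_A + n_B\gamma)$ and $\gamma \mid \cdot \sim$ gamma$(a_\gamma + \sum y_B,\ b_\gamma + n_B\theta)$, the Gibbs sampler in d) is a few lines of R. A sketch, assuming the data files read as plain vectors of counts:

```r
yA <- scan("menchild30bach.dat"); yB <- scan("menchild30nobach.dat")
a.t <- 2; b.t <- 1; a.g <- b.g <- 16          # one of the five a.g = b.g settings
S <- 5000; theta <- mean(yA); gam <- 1
TH <- GAM <- numeric(S)
for (s in 1:S) {
  theta <- rgamma(1, a.t + sum(yA) + sum(yB), b.t + length(yA) + length(yB) * gam)
  gam   <- rgamma(1, a.g + sum(yB),           b.g + length(yB) * theta)
  TH[s] <- theta; GAM[s] <- gam
}
mean(TH * GAM - TH)                            # E[thetaB - thetaA | yA, yB]
```

Rerunning with each value of `a.g = b.g` traces out the prior sensitivity the exercise asks about.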
6.2 Mixture model: The file glucose.dat contains the plasma glucose concentration of 532 females from a study on diabetes (see Exercise 7.6).
a) Make a histogram or kernel density estimate of the data. Describe how this empirical distribution deviates from the shape of a normal distribution.
b) Consider the following mixture model for these data: For each study participant there is an unobserved group membership variable $X_i$ which is equal to 1 or 2 with probability $p$ and $1-p$. If $X_i = 1$ then $Y_i \sim$ normal($\theta_1, \sigma_1^2$), and if $X_i = 2$ then $Y_i \sim$ normal($\theta_2, \sigma_2^2$). Let $p \sim$ beta($a, b$), $\theta_j \sim$ normal($\mu_0, \tau_0^2$) and $1/\sigma_j^2 \sim$ gamma($\nu_0/2, \nu_0 \sigma_0^2/2$) for both $j = 1$ and $j = 2$. Obtain the full conditional distributions of $(X_1, \dots, X_n)$, $p$, $\theta_1$, $\theta_2$, $\sigma_1^2$ and $\sigma_2^2$.
c) Setting $a = b = 1$, $\mu_0 = 120$, $\tau_0^2 = 200$, $\sigma_0^2 = 1000$ and $\nu_0 = 10$, implement the Gibbs sampler for at least 10,000 iterations. Let $\theta_{(1)}^{(s)} = \min\{\theta_1^{(s)}, \theta_2^{(s)}\}$ and $\theta_{(2)}^{(s)} = \max\{\theta_1^{(s)}, \theta_2^{(s)}\}$. Compute and plot the autocorrelation functions of $\theta_{(1)}^{(s)}$ and $\theta_{(2)}^{(s)}$, as well as their effective sample sizes.
d) For each iteration $s$ of the Gibbs sampler, sample a value $x^{(s)} \sim$ binary($p^{(s)}$), then sample $\tilde Y^{(s)} \sim$ normal($\theta_{x^{(s)}}^{(s)}, \sigma_{x^{(s)}}^{2(s)}$). Plot a histogram or kernel density estimate for the empirical distribution of $\tilde Y^{(1)}, \dots, \tilde Y^{(S)}$, and compare to the distribution in part a). Discuss the adequacy of this two-component mixture model for the glucose data.

6.3 Probit regression: A panel study followed 25 married couples over a period of five years. One item of interest is the relationship between divorce rates and the various characteristics of the couples. For example, the researchers would like to model the probability of divorce as a function of age differential, recorded as the man's age minus the woman's age. The data can be found in the file divorce.dat. We will model these data with probit regression, in which a binary variable $Y_i$ is described in terms of an explanatory variable $x_i$ via the following latent variable model:
$$Z_i = \beta x_i + \epsilon_i, \qquad Y_i = \delta_{(c,\infty)}(Z_i),$$
where $\beta$ and $c$ are unknown coefficients, $\epsilon_1, \dots, \epsilon_n \sim$ i.i.d. normal(0, 1), and $\delta_{(c,\infty)}(z) = 1$ if $z > c$ and equals zero otherwise.
a) Assuming $\beta \sim$ normal(0, $\tau_\beta^2$), obtain the full conditional distribution $p(\beta | y, x, z, c)$.
b) Assuming $c \sim$ normal(0, $\tau_c^2$), show that $p(c | y, x, z, \beta)$ is a constrained normal density, i.e. proportional to a normal density but constrained to lie in an interval. Similarly, show that $p(z_i | y, x, z_{-i}, \beta, c)$ is proportional to a normal density but constrained to be either above $c$ or below $c$, depending on $y_i$.
c) Letting $\tau_\beta^2 = \tau_c^2 = 16$, implement a Gibbs sampling scheme that approximates the joint posterior distribution of $Z$, $\beta$ and $c$ (a method for sampling from constrained normal distributions is outlined in Section 12.1.1). Run the Gibbs sampler long enough so that the effective sample sizes of all unknown parameters are greater than 1,000 (including the $Z_i$'s). Compute the autocorrelation function of the parameters and discuss the mixing of the Markov chain.
d) Obtain a 95% posterior confidence interval for $\beta$, as well as $\Pr(\beta > 0 \mid y, x)$.
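If the derivation in Exercise 6.3 a) comes out as below (normal prior precision plus data precision, since $z_i \sim$ normal($\beta x_i$, 1)), the $\beta$-update in the Gibbs sampler is one line of R; a sketch with $\tau_\beta^2 = 16$:

```r
# beta | z, x ~ normal(m, v), with v = 1/(sum(x^2) + 1/t2b), m = v * sum(x*z);
# this assumes the standard conjugate calculation for part a).
update.beta <- function(z, x, t2b = 16) {
  v <- 1 / (sum(x^2) + 1 / t2b)
  rnorm(1, v * sum(x * z), sqrt(v))
}
```

The updates for $c$ and the $z_i$'s are then constrained normal draws (part b)), which can reuse the inverse-cdf sampler sketched after Exercise 12.3 below.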
Chapter 7

7.1 Jeffreys' prior: For the multivariate normal model, Jeffreys' rule for generating a prior distribution on $(\theta, \Sigma)$ gives $p_J(\theta, \Sigma) \propto |\Sigma|^{-(p+2)/2}$.
a) Explain why the function $p_J$ cannot actually be a probability density for $(\theta, \Sigma)$.
b) Let $p_J(\theta, \Sigma \mid y_1, \dots, y_n)$ be the probability density that is proportional to $p_J(\theta, \Sigma) \times p(y_1, \dots, y_n \mid \theta, \Sigma)$. Obtain the form of $p_J(\theta, \Sigma \mid y_1, \dots, y_n)$, $p_J(\theta \mid \Sigma, y_1, \dots, y_n)$ and $p_J(\Sigma \mid y_1, \dots, y_n)$.

7.2 Unit information prior: Letting $\Psi = \Sigma^{-1}$, show that a unit information prior for $(\theta, \Sigma)$ is given by $\theta \mid \Sigma \sim$ multivariate normal($\bar y, \Sigma$) and $\Sigma^{-1} \sim$ Wishart($p + 1, S^{-1}$), where $S = \sum (y_i - \bar y)(y_i - \bar y)^T/n$. This can be done by mimicking the procedure outlined in Exercise 5.6 as follows:
a) Reparameterize the multivariate normal model in terms of the precision matrix $\Psi = \Sigma^{-1}$. Write out the resulting log likelihood, and find a probability density $p_U(\theta, \Psi) = p_U(\theta \mid \Psi)\, p_U(\Psi)$ such that $\log p_U(\theta, \Psi) = l(\theta, \Psi \mid Y)/n + c$, where $c$ does not depend on $\theta$ or $\Psi$. Hint: Write $\sum (y_i - \theta)$ as $\sum (y_i - \bar y + \bar y - \theta)$, and note that $\sum a_i^T B a_i$ can be written as tr($AB$), where $A = \sum a_i a_i^T$.
b) Let $p_U(\Sigma)$ be the inverse-Wishart density induced by $p_U(\Psi)$. Obtain a density $p_U(\theta, \Sigma \mid y_1, \dots, y_n) \propto p_U(\theta \mid \Sigma)\, p_U(\Sigma)\, p(y_1, \dots, y_n \mid \theta, \Sigma)$. Can this be interpreted as a posterior distribution for $\theta$ and $\Sigma$?

7.3 Australian crab data: The files bluecrab.dat and orangecrab.dat contain measurements of body depth ($Y_1$) and rear width ($Y_2$), in millimeters, made on 50 male crabs from each of two species, blue and orange. We will model these data using a bivariate normal distribution.
a) For each of the two species, obtain posterior distributions of the population mean $\theta$ and covariance matrix $\Sigma$ as follows: Using the semiconjugate prior distributions for $\theta$ and $\Sigma$, set $\mu_0$ equal to the sample mean of the data, $\Lambda_0$ and $S_0$ equal to the sample covariance matrix and $\nu_0 = 4$. Obtain 10,000 posterior samples of $\theta$ and $\Sigma$. Note that this "prior" distribution loosely centers the parameters around empirical estimates based on the observed data (and is very similar to the unit information prior described in the previous exercise). It cannot be considered as our true prior distribution, as it was derived from the observed data. However, it can be roughly considered as the prior distribution of someone with weak but unbiased information.
b) Plot values of $\theta = (\theta_1, \theta_2)^T$ for each group and compare. Describe any size differences between the two groups.
c) From each covariance matrix obtained from the Gibbs sampler, obtain the corresponding correlation coefficient. From these values, plot posterior densities of the correlations $\rho_{\text{blue}}$ and $\rho_{\text{orange}}$ for the two groups. Evaluate differences between the two species by comparing these posterior distributions. In particular, obtain an approximation to $\Pr(\rho_{\text{blue}} < \rho_{\text{orange}} \mid y_{\text{blue}}, y_{\text{orange}})$. What do the results suggest about differences between the two populations?

7.4 Marriage data: The file agehw.dat contains data on the ages of 100 married couples sampled from the U.S. population.
a) Before you look at the data, use your own knowledge to formulate a semiconjugate prior distribution for $\theta = (\theta_h, \theta_w)^T$ and $\Sigma$, where $\theta_h$, $\theta_w$ are mean husband and wife ages, and $\Sigma$ is the covariance matrix.
b) Generate a prior predictive dataset of size $n = 100$, by sampling $(\theta, \Sigma)$ from your prior distribution and then simulating $Y_1, \dots, Y_n \sim$ i.i.d. multivariate normal($\theta, \Sigma$). Generate several such datasets, make bivariate scatterplots for each dataset, and make sure they roughly represent your prior beliefs about what such a dataset would actually look like. If your prior predictive datasets do not conform to your beliefs, go back to part a) and formulate a new prior. Report the prior that you eventually decide upon, and provide scatterplots for at least three prior predictive datasets.
c) Using your prior distribution and the 100 values in the dataset, obtain an MCMC approximation to $p(\theta, \Sigma \mid y_1, \dots, y_{100})$. Plot the joint posterior distribution of $\theta_h$ and $\theta_w$, and also the marginal posterior density of the correlation between $Y_h$ and $Y_w$, the ages of a husband and wife. Obtain 95% posterior confidence intervals for $\theta_h$, $\theta_w$ and the correlation coefficient.
d) Obtain 95% posterior confidence intervals for $\theta_h$, $\theta_w$ and the correlation coefficient using the following prior distributions:
i. Jeffreys' prior, described in Exercise 7.1;
ii. the unit information prior, described in Exercise 7.2;
iii. a "diffuse prior" with $\mu_0 = 0$, $\Lambda_0 = 10^5 \times I$, $S_0 = 1000 \times I$ and $\nu_0 = 3$.
e) Compare the confidence intervals from d) to those obtained in c). Discuss whether or not you think that your prior information is helpful in estimating $\theta$ and $\Sigma$, or if you think one of the alternatives in d) is preferable. What about if the sample size were much smaller, say $n = 25$?

7.5 Imputation: The file interexp.dat contains data from an experiment that was interrupted before all the data could be gathered. Of interest was the difference in reaction times of experimental subjects when they were given stimulus A versus stimulus B. Each subject is tested under one of the two stimuli on their first day of participation in the study, and is tested under the other stimulus at some later date. Unfortunately the experiment was interrupted before it was finished, leaving the researchers with 26 subjects with both A and B responses, 15 subjects with only A responses and 17 subjects with only B responses.
a) Calculate empirical estimates of $\theta_A, \theta_B, \rho, \sigma_A^2, \sigma_B^2$ from the data using the commands mean, cor and var. Use all the A responses to get $\hat\theta_A$ and $\hat\sigma_A^2$, and use all the B responses to get $\hat\theta_B$ and $\hat\sigma_B^2$. Use only the complete data cases to get $\hat\rho$.
b) For each person $i$ with only an A response, impute a B response as
$$\hat y_{i,B} = \hat\theta_B + (y_{i,A} - \hat\theta_A)\,\hat\rho\,\sqrt{\hat\sigma_B^2/\hat\sigma_A^2}.$$
For each person $i$ with only a B response, impute an A response as
$$\hat y_{i,A} = \hat\theta_A + (y_{i,B} - \hat\theta_B)\,\hat\rho\,\sqrt{\hat\sigma_A^2/\hat\sigma_B^2}.$$
You now have two "observations" for each individual. Do a paired sample t-test and obtain a 95% confidence interval for $\theta_A - \theta_B$.
c) Using either Jeffreys' prior or a unit information prior distribution for the parameters, implement a Gibbs sampler that approximates the joint distribution of the parameters and the missing data. Compute a posterior mean for $\theta_A - \theta_B$ as well as a 95% posterior confidence interval for $\theta_A - \theta_B$. Compare these results with the results from b) and discuss.

7.6 Diabetes data: A population of 532 women living near Phoenix, Arizona were tested for diabetes. Other information was gathered from these women at the time of testing, including number of pregnancies, glucose level, blood pressure, skin fold thickness, body mass index, diabetes pedigree and age. This information appears in the file azdiabetes.dat. Model the joint distribution of these variables for the diabetics and non-diabetics separately, using a multivariate normal distribution:
a) For both groups separately, use the following type of unit information prior, where $\hat\Sigma$ is the sample covariance matrix:
i. $\mu_0 = \bar y$, $\Lambda_0 = \hat\Sigma$;
ii. $S_0 = \hat\Sigma$, $\nu_0 = p + 2 = 9$.
Generate at least 10,000 Monte Carlo samples for $\{\theta_d, \Sigma_d\}$ and $\{\theta_n, \Sigma_n\}$, the model parameters for diabetics and non-diabetics respectively. For each of the seven variables $j \in \{1, \dots, 7\}$, compare the marginal posterior distributions of $\theta_{d,j}$ and $\theta_{n,j}$. Which variables seem to differ between the two groups? Also obtain $\Pr(\theta_{d,j} > \theta_{n,j} \mid Y)$ for each $j \in \{1, \dots, 7\}$.
b) Obtain the posterior means of $\Sigma_d$ and $\Sigma_n$, and plot the entries versus each other. What are the main differences, if any?

Chapter 8

8.1 Components of variance: Consider the hierarchical model where
$$\theta_1, \dots, \theta_m \mid \mu, \tau^2 \sim \text{i.i.d. normal}(\mu, \tau^2);$$
$$y_{1,j}, \dots, y_{n_j,j} \mid \theta_j, \sigma^2 \sim \text{i.i.d. normal}(\theta_j, \sigma^2).$$
For this problem, we will eventually compute the following:
$$\mathrm{Var}[y_{i,j} \mid \theta_j, \sigma^2],\quad \mathrm{Var}[\bar y_{\cdot,j} \mid \theta_j, \sigma^2],\quad \mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \theta_j, \sigma^2];$$
$$\mathrm{Var}[y_{i,j} \mid \mu, \tau^2],\quad \mathrm{Var}[\bar y_{\cdot,j} \mid \mu, \tau^2],\quad \mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \mu, \tau^2].$$
First, let's use our intuition to guess at the answers:
a) Which do you think is bigger, $\mathrm{Var}[y_{i,j} \mid \theta_j, \sigma^2]$ or $\mathrm{Var}[y_{i,j} \mid \mu, \tau^2]$? To guide your intuition, you can interpret the first as the variability of the $Y$'s when sampling from a fixed group, and the second as the variability in first sampling a group, then sampling a unit from within the group.
b) Do you think $\mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \theta_j, \sigma^2]$ is negative, positive, or zero? Answer the same for $\mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \mu, \tau^2]$. You may want to think about what $y_{i_2,j}$ tells you about $y_{i_1,j}$ if $\theta_j$ is known, and what it tells you when $\theta_j$ is unknown.
c) Now compute each of the six quantities above and compare to your answers in a) and b).
d) Now assume we have a prior $p(\mu)$ for $\mu$. Using Bayes' rule, show that
$$p(\mu \mid \theta_1, \dots, \theta_m, \sigma^2, \tau^2, y_1, \dots, y_m) = p(\mu \mid \theta_1, \dots, \theta_m, \tau^2).$$
Interpret in words what this means.
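One route to the second row of quantities in Exercise 8.1 c) is the law of total variance and covariance; as a sketch (not part of the exercise text), for the marginal variance:

$$\mathrm{Var}[y_{i,j} \mid \mu, \tau^2] = E\big[\mathrm{Var}[y_{i,j} \mid \theta_j, \sigma^2]\big] + \mathrm{Var}\big[E[y_{i,j} \mid \theta_j, \sigma^2]\big] = \sigma^2 + \tau^2,$$
$$\mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \mu, \tau^2] = E\big[\mathrm{Cov}[y_{i_1,j}, y_{i_2,j} \mid \theta_j, \sigma^2]\big] + \mathrm{Var}\big[E[y_{i,j} \mid \theta_j, \sigma^2]\big] = 0 + \tau^2.$$

The first row of quantities follows directly from the sampling model, since conditional on $\theta_j$ the observations are independent with variance $\sigma^2$.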
8.2 Sensitivity analysis: In this exercise we will revisit the study from Exercise 5.2, in which 32 students in a science classroom were randomly assigned to one of two study methods, A and B, with $n_A = n_B = 16$. After several weeks of study, students were examined on the course material, and the scores are summarized by $\{\bar y_A = 75.2, s_A = 7.3\}$, $\{\bar y_B = 77.5, s_B = 8.1\}$. We will estimate $\theta_A = \mu + \delta$ and $\theta_B = \mu - \delta$ using the two-sample model and prior distributions of Section 8.1.
a) Let $\mu \sim$ normal(75, 100), $1/\sigma^2 \sim$ gamma(1, 100) and $\delta \sim$ normal($\delta_0, \tau_0^2$). For each combination of $\delta_0 \in \{-4, -2, 0, 2, 4\}$ and $\tau_0^2 \in \{10, 50, 100, 500\}$, obtain the posterior distribution of $\mu$, $\delta$ and $\sigma^2$ and compute
i. $\Pr(\delta < 0 \mid Y)$;
ii. a 95% posterior confidence interval for $\delta$;
iii. the prior and posterior correlation of $\theta_A$ and $\theta_B$.
b) Describe how you might use these results to convey evidence that $\theta_A < \theta_B$ to people of a variety of prior opinions.

8.3 Hierarchical modeling: The files school1.dat through school8.dat give weekly hours spent on homework for students sampled from eight different schools. Obtain posterior distributions for the true means for the eight different schools using a hierarchical normal model with the following prior parameters:
$$\mu_0 = 7,\ \gamma_0^2 = 5,\ \tau_0^2 = 10,\ \eta_0 = 2,\ \sigma_0^2 = 15,\ \nu_0 = 2.$$
a) Run a Gibbs sampling algorithm to approximate the posterior distribution of $\{\theta, \sigma^2, \mu, \tau^2\}$. Assess the convergence of the Markov chain, and find the effective sample size for $\{\sigma^2, \mu, \tau^2\}$. Run the chain long enough so that the effective sample sizes are all above 1,000.
b) Compute posterior means and 95% confidence regions for $\{\sigma^2, \mu, \tau^2\}$. Also, compare the posterior densities to the prior densities, and discuss what was learned from the data.
c) Plot the posterior density of $R = \frac{\tau^2}{\sigma^2 + \tau^2}$ and compare it to a plot of the prior density of $R$. Describe the evidence for between-school variation.
d) Obtain the posterior probability that $\theta_7$ is smaller than $\theta_6$, as well as the posterior probability that $\theta_7$ is the smallest of all the $\theta$'s.
e) Plot the sample averages $\bar y_1, \dots, \bar y_8$ against the posterior expectations of $\theta_1, \dots, \theta_8$, and describe the relationship. Also compute the sample mean of all observations and compare it to the posterior mean of $\mu$.

Chapter 9

9.1 Extrapolation: The file swim.dat contains data on the amount of time, in seconds, it takes each of four high school swimmers to swim 50 yards. Each swimmer has six times, taken on a biweekly basis.
a) Perform the following data analysis for each swimmer separately:
i. Fit a linear regression model of swimming time as the response and week as the explanatory variable. To formulate your prior, use the information that competitive times for this age group generally range from 22 to 24 seconds.
ii. For each swimmer $j$, obtain a posterior predictive distribution for $Y_j^*$, their time if they were to swim two weeks from the last recorded time.
b) The coach of the team has to decide which of the four swimmers will compete in a swimming meet in two weeks. Using your predictive distributions, compute $\Pr(Y_j^* = \max\{Y_1^*, \dots, Y_4^*\} \mid Y)$ for each swimmer $j$, and based on this make a recommendation to the coach.

9.2 Model selection: As described in Example 6 of Chapter 7, the file azdiabetes.dat contains data on health-related variables of a population of 532 women. In this exercise we will be modeling the conditional distribution of glucose level (glu) as a linear combination of the other variables, excluding the variable diabetes.
a) Fit a regression model using the g-prior with $g = n$, $\nu_0 = 2$ and $\sigma_0^2 = 1$. Obtain posterior confidence intervals for all of the parameters.
b) Perform the model selection and averaging procedure described in Section 9.3. Obtain $\Pr(\beta_j \ne 0 \mid y)$, as well as posterior confidence intervals for all of the parameters. Compare to the results in part a).
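Posterior simulation under the g-prior (Exercise 9.2 a), and reused in 9.3) has a closed form; a minimal R sketch assuming `y` and a design matrix `X` have already been built from the data:

```r
n <- length(y); p <- ncol(X); g <- n; nu0 <- 2; s20 <- 1
XtX  <- t(X) %*% X
Hg   <- (g / (g + 1)) * X %*% solve(XtX) %*% t(X)
SSRg <- as.numeric(t(y) %*% (diag(n) - Hg) %*% y)
S <- 5000
sig2 <- 1 / rgamma(S, (nu0 + n) / 2, (nu0 * s20 + SSRg) / 2)
Vb <- (g / (g + 1)) * solve(XtX)      # Cov[beta | sig2, y] = sig2 * Vb
m  <- Vb %*% t(X) %*% y               # E[beta | y]
# each row s: m + sqrt(sig2[s]) * (standard normal row) %*% chol(Vb)
beta <- matrix(rnorm(S * p), S, p) %*% chol(Vb) * sqrt(sig2) + rep(m, each = S)
apply(beta, 2, quantile, probs = c(.025, .975))
```

The elementwise `* sqrt(sig2)` and `rep(m, each = S)` rely on R's column-major recycling to scale and shift each posterior draw; `m` is also the Bayes point estimate used for prediction in Exercise 9.3 b)ii.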
9.3 Crime: The file crime.dat contains crime rates and data on 15 explanatory variables for 47 U.S. states, in which both the crime rates and the explanatory variables have been centered and scaled to have variance 1. A description of the variables can be obtained by typing library(MASS);?UScrime in R.
a) Fit a regression model $y = X\beta + \epsilon$ using the g-prior with $g = n$, $\nu_0 = 2$ and $\sigma_0^2 = 1$. Obtain marginal posterior means and 95% confidence intervals for $\beta$, and compare to the least squares estimates. Describe the relationships between crime and the explanatory variables. Which variables seem strongly predictive of crime rates?
b) Let's see how well regression models can predict crime rates based on the X-variables. Randomly divide the crime data roughly in half, into a training set $\{y_{tr}, X_{tr}\}$ and a test set $\{y_{te}, X_{te}\}$.
i. Using only the training set, obtain least squares regression coefficients $\hat\beta_{ols}$. Obtain predicted values for the test data by computing $\hat y_{ols} = X_{te} \hat\beta_{ols}$. Plot $\hat y_{ols}$ versus $y_{te}$ and compute the prediction error $\frac{1}{n_{te}} \sum (y_{i,te} - \hat y_{i,ols})^2$.
ii. Now obtain the posterior mean $\hat\beta_{Bayes} = E[\beta \mid y_{tr}]$ using the g-prior described above and the training data only. Obtain predictions for the test set $\hat y_{Bayes} = X_{te} \hat\beta_{Bayes}$. Plot versus the test data, compute the prediction error, and compare to the OLS prediction error. Explain the results.
c) Repeat the procedures in b) many times with different randomly generated test and training sets. Compute the average prediction error for both the OLS and Bayesian methods.

Chapter 10

10.1 Reflecting random walks: It is often useful in MCMC to have a proposal distribution which is both symmetric and has support only on a certain region. For example, if we know $\theta > 0$, we would like our proposal distribution $J(\theta_1 \mid \theta_0)$ to have support on positive values of $\theta$. Consider the following proposal algorithm: sample $\tilde\theta \sim$ uniform($\theta_0 - \delta, \theta_0 + \delta$); if $\tilde\theta < 0$, set $\theta_1 = -\tilde\theta$; if $\tilde\theta \ge 0$, set $\theta_1 = \tilde\theta$. In other words, $\theta_1 = |\tilde\theta|$. Show that the above algorithm draws samples from a symmetric proposal distribution which has support on positive values of $\theta$. It may be helpful to write out the associated proposal density $J(\theta_1 \mid \theta_0)$ under the two conditions $\theta_0 \le \delta$ and $\theta_0 > \delta$ separately.

10.2 Nesting success: Younger male sparrows may or may not nest during a mating season, perhaps depending on their physical characteristics. Researchers have recorded the nesting success of 43 young male sparrows of the same age, as well as their wingspan, and the data appear in the file msparrownest.dat. Let $Y_i$ be the binary indicator that sparrow $i$ successfully nests, and let $x_i$ denote their wingspan. Our model for $Y_i$ is logit $\Pr(Y_i = 1 \mid \alpha, \beta, x_i) = \alpha + \beta x_i$, where the logit function is given by logit $\theta = \log[\theta/(1-\theta)]$.
a) Write out the joint sampling distribution $\prod_{i=1}^n p(y_i \mid \alpha, \beta, x_i)$ and simplify as much as possible.
b) Formulate a prior probability distribution over $\alpha$ and $\beta$ by considering the range of $\Pr(Y = 1 \mid \alpha, \beta, x)$ as $x$ ranges over 10 to 15, the approximate range of the observed wingspans.
c) Implement a Metropolis algorithm that approximates $p(\alpha, \beta \mid y, x)$. Adjust the proposal distribution to achieve a reasonable acceptance rate, and run the algorithm long enough so that the effective sample size is at least 1,000 for each parameter.
d) Compare the posterior densities of $\alpha$ and $\beta$ to their prior densities.
e) Using output from the Metropolis algorithm, come up with a way to make a confidence band for the following function $f_{\alpha\beta}(x)$ of wingspan:
$$f_{\alpha\beta}(x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}},$$
where $\alpha$ and $\beta$ are the parameters in your sampling model. Make a plot of such a band.
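A random-walk Metropolis sketch for Exercise 10.2 c), assuming `y` (0-1 indicators) and `x` (wingspans) are loaded from msparrownest.dat; the normal(0, 10^2) priors stand in for whatever prior part b) settles on:

```r
lpost <- function(ab, y, x) {
  eta <- ab[1] + ab[2] * x
  sum(y * eta - log(1 + exp(eta))) +            # logistic log likelihood
    dnorm(ab[1], 0, 10, log = TRUE) +           # assumed prior on alpha
    dnorm(ab[2], 0, 10, log = TRUE)             # assumed prior on beta
}
ab <- c(0, 0); AB <- matrix(NA, 10000, 2); acc <- 0
for (s in 1:nrow(AB)) {
  ab.p <- ab + rnorm(2, 0, c(1, 0.1))           # tune these sds for mixing
  if (log(runif(1)) < lpost(ab.p, y, x) - lpost(ab, y, x)) {
    ab <- ab.p; acc <- acc + 1
  }
  AB[s, ] <- ab
}
acc / nrow(AB)                                   # acceptance rate, roughly 20-50%
```

The saved matrix `AB` then supplies the draws of $(\alpha, \beta)$ needed for the confidence band $f_{\alpha\beta}(x)$ in part e).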
10.3 Tomato plants: The file tplant.dat contains data on the heights of ten tomato plants, grown under a variety of soil pH conditions. Each plant was measured twice. During the first measurement, each plant's height was recorded and a reading of soil pH was taken. During the second measurement only plant height was measured, although it is assumed that pH levels did not vary much from measurement to measurement.
a) Using ordinary least squares, fit a linear regression to the data, modeling plant height as a function of time (measurement period) and pH level. Interpret your model parameters.
b) Perform model diagnostics. In particular, carefully analyze the residuals and comment on possible violations of assumptions. In particular, assess (graphically or otherwise) whether or not the residuals within a plant are independent. What parts of your ordinary linear regression model do you think are sensitive to any violations of assumptions you may have detected?
c) Hypothesize a new model for your data which allows for observations within a plant to be correlated. Fit the model using a MCMC approximation to the posterior distribution, and present diagnostics for your approximation.
d) Discuss the results of your data analysis. In particular, discuss similarities and differences between the ordinary linear regression and the model fit with correlated responses. Are the conclusions different?

10.4 Gibbs sampling: Consider the general Gibbs sampler for a vector of parameters $\phi$. Suppose $\phi^{(s)}$ is sampled from the target distribution $p(\phi)$ and then $\phi^{(s+1)}$ is generated using the Gibbs sampler by iteratively updating each component of the parameter vector. Show that the marginal probability $\Pr(\phi^{(s+1)} \in A)$ equals the target distribution $\int_A p(\phi)\, d\phi$.

10.5 Logistic regression variable selection: Consider a logistic regression model for predicting diabetes as a function of $x_1$ = number of pregnancies, $x_2$ = blood pressure, $x_3$ = body mass index, $x_4$ = diabetes pedigree and $x_5$ = age. Using the data in azdiabetes.dat, center and scale each of the $x$-variables by subtracting the sample average and dividing by the sample standard deviation for each variable. Consider a logistic regression model of the form $\Pr(Y_i = 1 \mid x_i, \beta, z) = e^{\theta_i}/(1 + e^{\theta_i})$, where
$$\theta_i = \beta_0 + \beta_1 \gamma_1 x_{i,1} + \beta_2 \gamma_2 x_{i,2} + \beta_3 \gamma_3 x_{i,3} + \beta_4 \gamma_4 x_{i,4} + \beta_5 \gamma_5 x_{i,5}.$$
In this model, each $\gamma_j$ is either 0 or 1, indicating whether or not variable $j$ is a predictor of diabetes. For example, if it were the case that $\gamma = (1, 1, 0, 0, 0)$, then $\theta_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2}$. Obtain posterior distributions for $\beta$ and $\gamma$, using independent prior distributions for the parameters, such that $\gamma_j \sim$ binary(1/2), $\beta_0 \sim$ normal(0, 16) and $\beta_j \sim$ normal(0, 4) for each $j > 0$.
a) Implement a Metropolis-Hastings algorithm for approximating the posterior distribution of $\beta$ and $\gamma$. Examine the sequences $\beta_j^{(s)}$ and $\beta_j^{(s)} \times \gamma_j^{(s)}$ for each $j$ and discuss the mixing of the chain.
b) Approximate the posterior probability of the top five most frequently occurring values of $\gamma$. How good do you think the MCMC estimates of these posterior probabilities are?
c) For each $j$, plot posterior densities and obtain posterior means for $\beta_j \gamma_j$. Also obtain $\Pr(\gamma_j = 1 \mid x, y)$.

Chapter 11

11.1 Full conditionals: Derive formally the full conditional distributions of $\theta$, $\Sigma$, $\sigma^2$ and the $\beta_j$'s as given in Section 11.2.
11.2 Randomized block design: Researchers interested in identifying the optimal planting density for a type of perennial grass performed the following randomized experiment: Ten different plots of land were each divided into eight subplots, and planting densities of 2, 4, 6 and 8 plants per square meter were randomly assigned to the subplots, so that there are two subplots at each density in each plot. At the end of the growing season the amount of plant matter yield was recorded in metric tons per hectare. These data appear in the file pdensity.dat. The researchers want to fit a model like $y = \beta_1 + \beta_2 x + \beta_3 x^2 + \epsilon$, where $y$ is yield and $x$ is planting density, but worry that since soil conditions vary across plots they should allow for some across-plot heterogeneity in this relationship. To accommodate this possibility we will analyze these data using the hierarchical linear model described in Section 11.1.
a) Before we do a Bayesian analysis we will get some ad hoc estimates of these parameters via least squares regression. Fit the model $y = \beta_1 + \beta_2 x + \beta_3 x^2 + \epsilon$ using OLS for each group, and make a plot showing the heterogeneity of the least squares regression lines. From the least squares coefficients find ad hoc estimates of $\theta$ and $\Sigma$. Also obtain an estimate of $\sigma^2$ by combining the information from the residuals across the groups.
b) Now we will perform an analysis of the data using the following distributions as prior distributions:
$$\theta \sim \text{multivariate normal}(\hat\theta, \hat\Sigma), \qquad \Sigma^{-1} \sim \text{Wishart}(4, \hat\Sigma^{-1}), \qquad \sigma^2 \sim \text{inverse-gamma}(1, \hat\sigma^2),$$
where $\hat\theta, \hat\Sigma, \hat\sigma^2$ are the estimates you obtained in a). Note that this analysis is not combining prior information with information from the data, as the "prior" distribution is based on the observed data. However, such an analysis can be roughly interpreted as the Bayesian analysis of an individual who has weak but unbiased prior information.
c) Use a Gibbs sampler to approximate posterior expectations of $\beta_j$ for each group $j$, and plot the resulting regression lines. Compare to the regression lines in a) above and describe why you see any differences between the two sets of regression lines.
d) From your posterior samples, plot marginal posterior and prior densities of $\theta$ and the elements of $\Sigma$. Discuss the evidence that the slopes or intercepts vary across groups.
e) Suppose we want to identify the planting density that maximizes average yield over a random sample of plots. Find the value $x_{max}$ of $x$ that maximizes expected yield, and provide a 95% posterior predictive interval for the yield of a randomly sampled plot having planting density $x_{max}$.

11.3 Hierarchical variances: The researchers in Exercise 11.2 are worried that the plots are not just heterogeneous in their regression lines, but also in their variances. In this exercise we will consider the same hierarchical model as above except that the sampling variability within a group is given by $y_{i,j} \sim$ normal($\beta_{1,j} + \beta_{2,j} x_{i,j} + \beta_{3,j} x_{i,j}^2,\ \sigma_j^2$), that is, the variances are allowed to differ across groups. As in Section 8.5, we will model $\sigma_1^2, \dots, \sigma_m^2 \sim$ i.i.d. inverse gamma($\nu_0/2, \nu_0 \sigma_0^2/2$), with $\sigma_0^2 \sim$ gamma(2, 2) and $p(\nu_0)$ uniform on the integers $\{1, 2, \dots, 100\}$.
a) Obtain the full conditional distribution of $\sigma_0^2$.
b) Obtain the full conditional distribution of $\sigma_j^2$.
c) Obtain the full conditional distribution of $\beta_j$.
d) For two values $\nu_0^{(s)}$ and $\nu_0^*$ of $\nu_0$, obtain the ratio $p(\nu_0^* \mid \sigma_0^2, \sigma_1^2, \dots, \sigma_m^2)$ divided by $p(\nu_0^{(s)} \mid \sigma_0^2, \sigma_1^2, \dots, \sigma_m^2)$, and simplify as much as possible.
e) Implement a Metropolis-Hastings algorithm for obtaining the joint posterior distribution of all of the unknown parameters. Plot values of $\sigma_0^2$ and $\nu_0$ versus iteration number and describe the mixing of the Markov chain in terms of these parameters.
f) Compare the prior and posterior distributions of $\nu_0$. Comment on any evidence there is that the variances differ across the groups.
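The ad hoc estimates in Exercise 11.2 a) can be assembled from per-plot OLS fits; a minimal R sketch, assuming pdensity.dat has columns named plot, density and yield (the column names are guesses about the file layout):

```r
d <- read.table("pdensity.dat", header = TRUE)
fits <- lapply(split(d, d$plot), function(dj) {
  lm(yield ~ density + I(density^2), data = dj)
})
BETA <- t(sapply(fits, coef))                 # one row of (b1, b2, b3) per plot
theta.hat <- colMeans(BETA)                   # ad hoc estimate of theta
Sigma.hat <- cov(BETA)                        # ad hoc estimate of Sigma
s2.hat <- sum(sapply(fits, deviance)) /       # pooled residual variance:
          sum(sapply(fits, df.residual))      # total RSS / total residual df
```

These three quantities then parameterize the data-based "prior" in part b).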
11.4 Hierarchical logistic regression: The Washington Assessment of Student Learning (WASL) is a standardized test given to students in the state of Washington. Letting $j$ index the counties within the state of Washington and $i$ index schools within counties, the file mathstandard.dat includes data on the following variables: $y_{i,j}$ = the indicator that more than half the 10th graders in school $i, j$ passed the WASL math exam; $x_{i,j}$ = the percentage of teachers in school $i, j$ who have a masters degree. In this exercise we will construct an algorithm to approximate the posterior distribution of the parameters in a generalized linear mixed-effects model for these data. The model is a mixed effects version of logistic regression:
$$y_{i,j} \sim \text{binary}(e^{\gamma_{i,j}}/[1 + e^{\gamma_{i,j}}]), \text{ where } \gamma_{i,j} = \beta_{0,j} + \beta_{1,j} x_{i,j};$$
$$\beta_1, \dots, \beta_J \sim \text{i.i.d. multivariate normal}(\theta, \Sigma), \text{ where } \beta_j = (\beta_{0,j}, \beta_{1,j}).$$
a) The unknown parameters in the model include population-level parameters $\{\theta, \Sigma\}$ and the group-level parameters $\{\beta_1, \dots, \beta_m\}$. Draw a diagram that describes the relationships between these parameters, the data $\{y_{i,j}, x_{i,j},\ i = 1, \dots, n_j,\ j = 1, \dots, m\}$, and prior distributions.
b) Before we do a Bayesian analysis, we will get some ad hoc estimates of these parameters via maximum likelihood: Fit a separate logistic regression model for each group, possibly using the glm command in R via beta.j <- glm(y.j ~ X.j, family=binomial)$coef. Explain any problems you have with obtaining estimates for each county. Plot $\exp\{\beta_{0,j} + \beta_{1,j} x\}/(1 + \exp\{\beta_{0,j} + \beta_{1,j} x\})$ as a function of $x$ for each county and describe what you see. Using maximum likelihood estimates only from those counties with 10 or more schools, obtain ad hoc estimates $\hat\theta$ and $\hat\Sigma$ of $\theta$ and $\Sigma$. Note that these estimates may not be representative of patterns from schools with small sample sizes.
c) Formulate a unit information prior distribution for $\theta$ and $\Sigma$ based on the observed data. Specifically, let $\theta \sim$ multivariate normal($\hat\theta, \hat\Sigma$) and let $\Sigma^{-1} \sim$ Wishart($4, \hat\Sigma^{-1}$). Use a Metropolis-Hastings algorithm to approximate the joint posterior distribution of all parameters.
d) Make plots of the samples of $\theta$ and $\Sigma$ (5 parameters) versus MCMC iteration number. Make sure you run the chain long enough so that your MCMC samples are likely to be a reasonable approximation to the posterior distribution.
e) Obtain posterior expectations of $\beta_j$ for each group $j$, plot $E[\beta_{0,j}|y] + E[\beta_{1,j}|y]\,x$ as a function of $x$ for each county, compare to the plot in b) and describe why you see any differences between the two sets of regression lines.
f) From your posterior samples, plot marginal posterior and prior densities of $\theta$ and the elements of $\Sigma$. Include your ad hoc estimates from b) in the plots. Discuss the evidence that the slopes or intercepts vary across groups.
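A sketch of the per-county maximum likelihood fits in Exercise 11.4 b), assuming mathstandard.dat has columns county, metstandard ($y$) and percentms ($x$); the column names are guesses about the file layout:

```r
d <- read.table("mathstandard.dat", header = TRUE)
fits <- lapply(split(d, d$county), function(dj) {
  if (nrow(dj) < 2) return(NULL)   # a single school cannot identify two coefficients
  # separation or tiny samples can also yield huge/infinite estimates here,
  # which is the "problems" part b) asks you to explain
  coef(glm(metstandard ~ percentms, family = binomial, data = dj))
})
BETA <- do.call(rbind, fits)        # one (intercept, slope) row per fitted county
theta.hat <- colMeans(BETA)         # ad hoc estimate of theta (counties w/ fits)
Sigma.hat <- cov(BETA)              # ad hoc estimate of Sigma
```

Restricting `BETA` to counties with 10 or more schools, as the exercise specifies, stabilizes `theta.hat` and `Sigma.hat` before they are reused in the unit information prior of part c).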
11.5 Disease rates: The number of occurrences of a rare, nongenetic birth defect in a five-year period for six neighboring counties is $y = (1, 3, 2, 12, 1, 1)$. The counties have populations of $x = (33, 14, 27, 90, 12, 17)$, given in thousands. The second county has higher rates of toxic chemicals (PCBs) present in soil samples, and it is of interest to know whether this county has a high disease rate as well. We will use the following hierarchical model to analyze these data:

$Y_i \mid \theta_i, x_i \sim$ Poisson$(\theta_i x_i)$;
$\theta_1, \ldots, \theta_6 \mid a, b \sim$ gamma$(a, b)$;
$a \sim$ gamma$(1, 1)$;  $b \sim$ gamma$(10, 1)$.

a) Describe in words what the various components of the hierarchical model represent in terms of observed and expected disease rates.
b) Identify the form of the conditional distribution $p(\theta_1, \ldots, \theta_6 \mid a, b, x, y)$, and from this identify the full conditional distribution of the rate for each county, $p(\theta_i \mid \theta_{-i}, a, b, x, y)$.
c) Write out the ratio of posterior densities comparing a set of proposal values $(a^*, b^*, \theta)$ to values $(a, b, \theta)$. Note that the value of $\theta$, the vector of county-specific rates, is unchanged.
d) Construct a Metropolis-Hastings algorithm which generates samples of $(a, b, \theta)$ from the posterior. Do this by iterating the following steps:
1. Given a current value $(a, b, \theta)$, generate a proposal $(a^*, b^*, \theta)$ by sampling $a^*$ and $b^*$ from a symmetric proposal distribution centered around $a$ and $b$, making sure all proposals are positive (see Exercise 10.1). Accept the proposal with the appropriate probability.
2. Sample new values of the $\theta_j$'s from their full conditional distributions.
Perform diagnostic tests on your chain and modify it if necessary.
e) Make posterior inference on the disease rates using the samples from the Markov chain. In particular,
i. Compute marginal posterior distributions of $\theta_1, \ldots, \theta_6$ and compare them to $y_1/x_1, \ldots, y_6/x_6$.
ii. Examine the posterior distribution of $a/b$, and compare it to the corresponding prior distribution as well as to the average of $y_i/x_i$ across the six counties.
iii. Plot samples of $\theta_2$ versus $\theta_j$ for each $j \neq 2$, and draw a 45-degree line on each plot as well. Also estimate $\Pr(\theta_2 > \theta_j \mid x, y)$ for each $j$ and $\Pr(\theta_2 = \max\{\theta_1, \ldots, \theta_6\} \mid x, y)$. Interpret the results of these calculations, and compare them to the conclusions one might obtain by just examining $y_j/x_j$ for each county $j$.

Chapter 12

12.1 Rank regression: The 1996 General Social Survey gathered a wide variety of information on the adult U.S. population, including each survey respondent's sex, their self-reported frequency of religious prayer (on a six-level ordinal scale), and the number of items correct out of 10 on a short vocabulary test. These data appear in the file prayer.dat. Using the rank regression procedure described in Section 12.1.2, estimate the parameters in a regression model for $Y_i$ = prayer as a function of $x_{i,1}$ = sex of respondent (a 0-1 indicator of being female) and $x_{i,2}$ = vocabulary score, as well as their interaction $x_{i,3} = x_{i,1} x_{i,2}$. Compare the marginal prior distributions of the three regression parameters to their posterior distributions, and comment on the evidence that the relationship between prayer and score differs across the sexes.

12.2 Copula modeling: The file azdiabetes_alldata.dat contains data on eight variables for 632 women in a study on diabetes (see Exercise 7.6 for a description of the variables). Data on subjects labeled 201-300 have missing values for some variables, mostly for the skin fold thickness measurement.
a) Using only the data from subjects 1-200, implement the Gaussian copula model for the eight variables in this dataset. Obtain posterior means and 95% posterior confidence intervals for all $\binom{8}{2} = 28$ parameters.
b) Now use the data from subjects 1-300, thus including data from subjects who are missing some variables. Implement the Gaussian copula model and obtain posterior means and 95% posterior confidence intervals for all parameters. How do the results differ from those in a)?

12.3 Constrained normal: Let $p(z) \propto \mathrm{dnorm}(z, \theta, \sigma) \times \delta_{(a,b)}(z)$, the normal density constrained to the interval $(a, b)$. Prove that the inverse-cdf method outlined in Section 12.1.1 generates a sample from this distribution.
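The sampler that Exercise 12.3 asks you to justify can be written in a few lines of R: draw $u$ uniformly between the normal cdf evaluated at $a$ and at $b$, then apply the inverse cdf. The function name below is ours, for illustration only.

```r
## Inverse-cdf sampling for the constrained normal of Exercise 12.3.
rcnorm <- function(n, theta, sigma, a, b) {
  u <- runif(n, pnorm(a, theta, sigma), pnorm(b, theta, sigma))
  qnorm(u, theta, sigma)  # every draw lies in (a, b)
}

## Quick check: draws stay in (-1, 2) and the histogram tracks the
## normal density renormalized to that interval.
z <- rcnorm(10000, theta = 0, sigma = 1, a = -1, b = 2)
range(z)
hist(z, prob = TRUE, breaks = 50)
```

The proof the exercise asks for amounts to showing that the cdf of the returned draw is the normal cdf restricted to $(a, b)$ and renormalized.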
12.4 Categorical data and the Dirichlet distribution: Consider again the data on the number of children of men in their 30s from Exercise 4.8. These data could be considered as categorical data, as each sample $Y$ lies in the discrete set $\{1, \ldots, 8\}$ (8 here actually denotes "8 or more" children). Let $\theta_A = (\theta_{A,1}, \ldots, \theta_{A,8})$ be the proportions in each of the eight categories for the population of men with bachelor's degrees, and let the vector $\theta_B$ be defined similarly for the population of men without bachelor's degrees.
a) Write in compact form the conditional probability, given $\theta_A$, of observing a particular sequence $\{y_{A,1}, \ldots, y_{A,n_1}\}$ for a random sample from the $A$ population. (One such form is sketched after this exercise.)
b) Identify the sufficient statistic …
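For part a) of Exercise 12.4, assuming the observations are i.i.d. given $\theta_A$, one compact form is

$$p(y_{A,1}, \ldots, y_{A,n_1} \mid \theta_A) \;=\; \prod_{i=1}^{n_1} \theta_{A,\,y_{A,i}} \;=\; \prod_{k=1}^{8} \theta_{A,k}^{\,n_{A,k}}, \qquad n_{A,k} = \#\{i : y_{A,i} = k\},$$

which depends on the data only through the vector of category counts $(n_{A,1}, \ldots, n_{A,8})$.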
