---
title: "A1Q3"
author: ""
date: "5/17/2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Assignment 1

### Question 3

```{r}
myDirectory <- "~/Desktop/STAT 444/Data"
datafile <- "Advertising.csv"
completePathname <- paste(myDirectory, datafile, sep = "/")
Advertising <- read.csv(completePathname, header = TRUE)
tv <- Advertising$TV
sales <- Advertising$Sales
```

#### part(a)

```{r}
# Ordinary least-squares fit of Sales on TV
fit1 <- lm(Sales ~ TV, data = Advertising)
plot(tv, sales, pch = 19, col = adjustcolor("red", alpha.f = 0.5))
abline(fit1, col = "blue", lwd = 1)
# 95% prediction intervals at the observed TV values, drawn in increasing order of tv
pred95.fit1 <- predict(fit1, interval = "prediction", level = 0.95)
lines(sort(tv), pred95.fit1[order(tv), 'lwr'], col = "black", lwd = 1, lty = 2)
lines(sort(tv), pred95.fit1[order(tv), 'upr'], col = "black", lwd = 1, lty = 2)
```

#### part(b)(iv)

```{r}
# Weighted least squares with weights w_i = x_i
fit2 <- lm(Sales ~ TV, data = Advertising, weights = TV)
plot(tv, sales, pch = 19, col = adjustcolor("red", alpha.f = 0.5))
abline(fit2, col = "blue", lwd = 1)
# The same weights are passed to predict(), as the assignment's note requires
pred95.fit2 <- predict(fit2, interval = "prediction", level = 0.95, weights = tv)
lines(sort(tv), pred95.fit2[order(tv), 'lwr'], col = "black", lwd = 1, lty = 2)
lines(sort(tv), pred95.fit2[order(tv), 'upr'], col = "black", lwd = 1, lty = 2)
```

```{r}
# Weighted least squares with weights w_i = 1/x_i
fit3 <- lm(Sales ~ TV, data = Advertising, weights = 1/TV)
plot(tv, sales, pch = 19, col = adjustcolor("red", alpha.f = 0.5))
abline(fit3, col = "blue", lwd = 1)
# Predict from fit3 (the original predicted from fit2 here by mistake)
pred95.fit3 <- predict(fit3, interval = "prediction", level = 0.95, weights = 1/tv)
lines(sort(tv), pred95.fit3[order(tv), 'lwr'], col = "black", lwd = 1, lty = 2)
lines(sort(tv), pred95.fit3[order(tv), 'upr'], col = "black", lwd = 1, lty = 2)
```
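The effect of the `weights` argument of `predict()` can be illustrated on simulated data. The following is a minimal sketch and not part of the submitted solution: the data frame `sim`, the grid `grid`, and every numeric value in it are made up for illustration, with residual variance assumed proportional to TV so that the weights `1/TV` are the appropriate ones.

```r
# Simulated stand-in for the Advertising data (hypothetical values, not the real file).
set.seed(444)
n <- 200
tv <- runif(n, min = 1, max = 300)
# Residual variance is proportional to tv, so sd is proportional to sqrt(tv).
sales <- 7 + 0.05 * tv + rnorm(n, sd = 0.2 * sqrt(tv))
sim <- data.frame(TV = tv, Sales = sales)

# Weighted least squares with weights inversely proportional to the variance.
fit_w <- lm(Sales ~ TV, data = sim, weights = 1 / TV)

# New x values at which to predict.
grid <- data.frame(TV = c(10, 150, 290))

# Prediction intervals with the variance weights supplied explicitly.
pred_w <- predict(fit_w, newdata = grid, interval = "prediction",
                  level = 0.95, weights = 1 / grid$TV)

# Same fit, but predict() is left to assume a constant prediction variance
# (R warns about this; the warning is suppressed here on purpose).
pred_c <- suppressWarnings(
  predict(fit_w, newdata = grid, interval = "prediction", level = 0.95)
)

width_w <- pred_w[, "upr"] - pred_w[, "lwr"]
width_c <- pred_c[, "upr"] - pred_c[, "lwr"]
print(cbind(grid, width_weighted = width_w, width_constant = width_c))
```

With the weights supplied, the interval widens as TV grows, which matches the fan shape of the simulated residuals. Without them, the interval has nearly the same width everywhere.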
Attached data: Advertising.csv (columns: TV, Radio, Newspaper, Sales)

Assignment 1
STAT 444/844, CM 764
Due Friday May 19 at 11am (through crowdmark)

1. (15 marks) Linear models: Consider the pairs $(x_i, y_i)$ for $i = 1, \ldots, 6$ for which we will construct a linear model $y_i = \mu_i + r_i$.
In vector terms, we will have $\mathbf{y} = \boldsymbol{\mu} + \mathbf{r}$, with $\mu_i$ being (typically) expressed as a function of $x_i$ and, in vector-matrix form, as $\boldsymbol{\mu} = X\boldsymbol{\beta}$. For each of the following cases, write down the contents of the matrix $X$ and the parameter vector $\boldsymbol{\beta}$.

   a. The first two observations have one mean, the last four another.
   b. The means of the first three observations lie along a line in $x$, while those of the last three lie on a parabola in $x$ that intersects the $y$ axis in the same place as does the line.
   c. The $\mu_i$s for odd values of $i$ lie on one line in $x$, but on a parallel line when $i$ is even.
   d. The values of $\mu_i$ lie on one line in $x$ for $i = 1, 2, 3$ and on another for $i = 4, 5, 6$, but the two lines intersect when $x = 3$.

2. (35 marks) On the course website, from time to time, there will be data sets stored in the Data directory. These will often (but not always) be in the form of a .csv file, say named someData.csv. You will need to download these onto your machine and save them there in some directory, identified by some pathname (likely separating directories with "/") such as "/Home/Me/Stats/Data/". With these two pieces, directory and filename, you can load the data into R as follows:

    # Depending on your style, you might end your directory with the separator,
    # here "/", or not.
    # If you do, then
    myDirectory <- "/Home/Me/Stats/Data/"
    datafile <- "someData.csv"
    completePathname <- paste0(myDirectory, datafile)  # paste0 has no separator
    # If you do not, then
    myDirectory <- "/Home/Me/Stats/Data"  # no ending /
    datafile <- "someData.csv"
    completePathname <- paste(myDirectory, datafile, sep = "/")  # default sep is a single space
    #
    # Either way the data is read into R with
    myData <- read.csv(completePathname)

The reason for having a directory location name and a separate filename is that you can then keep many different data sets in the same directory and not have to retype the full directory name each time you want to load a different data set.
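As an alternative to `paste()` and `paste0()`, base R's `file.path()` joins a directory and a filename with "/". The sketch below is only an illustration: it uses a temporary directory and a tiny made-up data set (both hypothetical) so that it runs on any machine, and it assumes the directory name has no trailing separator.

```r
# Hypothetical directory and file, created in a temporary location so the
# example runs anywhere; substitute your own data directory in practice.
myDirectory <- tempdir()
datafile <- "someData.csv"

# file.path() inserts the "/" separator between its arguments.
completePathname <- file.path(myDirectory, datafile)

# Write a tiny made-up data set, then read it back from the same pathname.
write.csv(data.frame(x = 1:3, y = c(2.5, 3.1, 4.8)), completePathname,
          row.names = FALSE)
myData <- read.csv(completePathname)
print(myData)
```

Loading a second file from the same directory then only requires changing `datafile`.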
In this question, you will conduct an analysis on several pairs of variates and compare the results. The data is artificial but interesting; it is the data set fakePairs.csv on the course website. Download it and read it into R, assigning it to the variable fakePairs. There are four pairs of variates with paired names (x1, y1), (x2, y2), (x3, y3), and (x4, y4). There are nrow(fakePairs) = 11000 observations on each pair.

   a. (20 marks) For each pair:
      i. Calculate the correlation between the x and the y. Comment on the strength of the correlation.
      ii. Fit a straight line model of the y variate to the corresponding x variate. Assign the fitted model to an R variable fit$i$ for the appropriate $i \in \{1, 2, 3, 4\}$. Show your code and print the summary of the fit.
      iii. From this output, write down the fitted intercept and slope parameters as: $\widehat{\alpha} = ????$ and $\widehat{\beta} = ????$
      iv. From this output, assess the evidence against the hypothesis $H: \alpha = 0$.
      v. From this output, assess the evidence against the hypothesis $H: \beta = 0$.
      vi. From this output, what is the value of $R^2$ and what do you conclude about the quality of the fit?
   b. (4 marks) From the four summaries of the fit, how do the straight line models of the four variates compare in their interpretation? What might you conclude from the estimated generative model for each?
   c. (5 marks) Draw a scatterplot of y versus x for each of the four pairs of variates (use adjustcolor to effect a transparency or alpha level of 0.01); use an appropriate title (e.g. "(x1, y1) pairs"). On each one, plot its fitted line. Arrange all four plots in a $2 \times 2$ grid for presentation of your solution.
   d. (6 marks) Comment on the appropriateness of the straight line model for each of the pairs. How does the information provided by the summary statistics compare with that provided by the plots in making this assessment?

3.
(32 marks) Recall the Advertising data from class (provided by the authors of An Introduction to Statistical Learning with Applications in R and available on the class website as Advertising.csv). We will use this data set to explore the effect of different weighting schemes on our fitted model and its predictions.

   a. (3 marks) Plot the data Sales versus TV.
      i. Fit a simple straight line model to this data, assigning the value of the fit to the variable fit1.
      ii. Add the fitted line to this scatterplot.
      iii. Using the predict function in R, add 95% prediction intervals to the plot.
      Your solution should contain your code and a single plot for the above.
   b. It should be clear from the plot that the residuals from the fitted line appear to vary more as the amount spent on TV advertising increases. Suppose that we model this as
      $$Y_i = \mu(x_i) + R_i$$
      where $\mu(x_i) = \alpha + \beta x_i$ and $R_i \sim N(0, \sigma^2(x_i))$ independently for $i = 1, \ldots, n$, and $\sigma(x)$ is a function of $x$. Suppose the function $\sigma(x)$ is known up to an unknown proportionality constant. That is, $\sigma(x) = \sigma \, k(x)$ where $\sigma > 0$ is the unknown proportionality constant and $k(x)$ is a known positive function of $x$.
      i. (3 marks) Show how the above model can be rewritten as
         $$Y_i^* = \mu^*(x_i) + R_i^*$$
         where the $R_i^* \sim N(0, \sigma^2)$ are now independent and identically distributed random variates, by simply writing down the value for each of the following (in simplest form):
         $$Y_i^* = ???? \qquad \mu^*(x_i) = ???? \qquad R_i^* = ????$$
      ii. (4 marks) We could get estimates $\widehat{\alpha}$ and $\widehat{\beta}$ in one of two equivalent ways. Either using the second model and least-squares, minimizing
         $$\sum_{i=1}^{n} (r_i^*)^2 = \sum_{i=1}^{n} \left(y_i^* - \mu^*(x_i)\right)^2$$
         or by using the original model and weighted least-squares, minimizing
         $$\sum_{i=1}^{n} w_i r_i^2 = \sum_{i=1}^{n} w_i \left(y_i - \mu(x_i)\right)^2.$$
         Show that these two are equivalent and hence determine the value of $w_i = ????$ Consequently, describe how the weight $w_i$ changes as $Var(R_i)$ changes. Which points will have the greatest weight: those with large residual variability or those with least residual variability?
      iii.
(4 marks) Consider the following two weight functions:
         $$w_i = x_i \qquad \text{and} \qquad w_i = 1/x_i$$
         Suppose we were to use each of these in separate weighted least-squares estimations of the original model. (No calculation needed yet. That happens in part iv.) For each of these weight functions, which points in the scatterplot of Sales versus TV would have the greatest influence on the fitted line? Which would have the least?
      iv. (5 marks) Use each of the weight functions given above to fit a straight line model of Sales versus TV. For each fit, draw a scatterplot of the data with the fitted line overlaid, and 95% prediction intervals (use lty=2 for the prediction intervals). Explain any differences you observe between the two fits.
         Note: When using the predict function to produce prediction intervals for models fitted by weighted least-squares, the weights must be specified as an argument as well as the fitted model.
         Your solution should include R code, the requested plots, and discussion as warranted.
   c. Often, the nature of the dependence of residual variation on $x_i$ is not known, but rather estimated from the data. Since variation often depends on $\mu(x)$, we might consider a plot of the absolute value of the estimated realized residuals, $|\widehat{r}_i|$, versus the fitted function values $\widehat{\mu}_i = \widehat{\mu}(x_i)$ for $i = 1, \ldots, n$. We can then fit a new model
      $$|\widehat{r}_i| = \sigma(\widehat{\mu}_i) + v_i$$
      where $v_i$ is a new residual term and the function $\sigma(\cdot)$ is to be estimated. The estimated $\widehat{\sigma}(\mu_i)$ (when expressed through $\mu_i = \mu(x_i)$) is then an estimate of $k(x_i)$ from part b.
      i. (1 mark) Fit again the original straight line model of Sales on TV, call the fit fit1, and assign the absolute value of the resulting estimated residuals to the variate ab_res1.
      ii. (2 marks) Plot ab_res1 versus the fitted values of fit1.
      iii. (2 marks) Fit a straight line model to the data of the above plot and add it to the plot. Call the resulting fit sfit.
      iv.
(6 marks) For any $x$, $k(x)$ from part b may now be estimated by the predicted values of sfit for the value $\widehat{\mu}(x)$, which in turn is predicted using fit1. Using this information, create a new fitted straight line model for Sales versus TV by weighted least squares based on our estimate of $k(x)$. Call this fit2. Plot the original Sales versus TV and overlay the plot with
         - the fitted line from fit2
         - 95% prediction intervals from fit2 (use lty=2)
         Show your code and your plot.
   d. (2 marks) Comment on the relative merits of the fits and prediction intervals of parts (a) and (c) above. Which fitted model is preferred? Why?

4. (10 marks) Graduate students only. Recall the usual definition of the sample correlation coefficient:
   $$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}.$$
   This is sometimes called the "Pearson's product moment correlation". An alternative way of measuring the "correlation" between variates is the so-called "Spearman's rho". Both are available in R via the cor(...) function.

   a. (4 marks) What is Spearman's rho? How would you calculate it given pairs $(x_1, y_1), \ldots, (x_N, y_N)$? What are some of the concerns in calculating it?
   b. (3 marks) Which is more "robust": Pearson's product-moment correlation or Spearman's rho? Explain your reasoning.
   c. (3 marks) What features of a functional relationship between continuous y and x are best captured by Pearson's correlation? What features by Spearman's? Explain.
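Both correlation measures mentioned in question 4 can be compared with `cor(...)` on a small simulated example. This is a minimal sketch under stated assumptions: the variates `x` and `y` are made up, and `y = exp(x)` is chosen only because it is monotone in `x` but far from linear.

```r
# Made-up data: y is a strictly increasing, strongly nonlinear function of x.
set.seed(764)
x <- runif(100, min = 0, max = 5)
y <- exp(x)

# Pearson's product-moment correlation and Spearman's rho via cor().
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman's rho computed by hand: the Pearson correlation of the ranks.
spearman_by_hand <- cor(rank(x), rank(y))

print(c(pearson = pearson, spearman = spearman, by_hand = spearman_by_hand))
```

Because the ranks of `x` and `y` coincide, Spearman's rho equals 1 here, while Pearson's correlation is below 1 because the relationship is not a straight line.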