LAB ACTIVITY #3

For each part of this lab, SUBMIT your solution to everything asked in BOLD RED in Problems 1-6 below. Include the respective R code and output.

Part 1 (Continuous Distributions in R)

In this part we will practice finding probabilities and percentiles for Uniform, Normal, and Exponential random variables using R.

Every continuous distribution in R has a root name. For example, the root name for the Uniform distribution is unif, for the Normal distribution it is norm, and for the Exponential distribution it is exp. The following letter prefixes of the root are used to generate the respective functions for the distributions:

    d   for "density", the density function (for continuous distributions this is the pdf)
    p   for "probability", the cumulative distribution function (this is the cdf)
    q   for "quantile", the inverse cdf (this allows us to find percentiles)
    r   for "random", generates random values that have the specified distribution

For a random variable that has Uniform distribution on the interval from a to b:

    dunif(x, a, b)   # f(x), the pdf at argument x of a Unif(a, b) random variable
    punif(x, a, b)   # F(x), the cdf at argument x of a Unif(a, b) random variable
    qunif(p, a, b)   # x_alpha, or the 100*(1-alpha)th percentile of a Unif(a, b) random variable, where p = 1-alpha
    runif(m, a, b)   # generates m random values from the Unif(a, b) distribution

Similarly, for a Normal random variable with parameters $\mu$ and $\sigma^2$:

    dnorm(x, mu, sigma)   # f(x), the pdf at argument x of a N(mu, sigma^2) random variable
    pnorm(x, mu, sigma)   # F(x), the cdf at argument x of a N(mu, sigma^2) random variable
    qnorm(p, mu, sigma)   # x_alpha, or the 100*(1-alpha)th percentile of N(mu, sigma^2), where p = 1-alpha
    rnorm(m, mu, sigma)   # generates m random values from the N(mu, sigma^2) distribution

NOTE: In R you do not need to standardize normal random variables! You can work with any normal variable directly as long as you specify its mean $\mu$ and standard deviation $\sigma$.
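The NOTE above can be checked numerically. The following is a minimal sketch with illustrative values (mu = 10, sigma = 2, x = 13 are assumptions chosen for this sketch, not values from the lab); it compares the direct call with the hand-standardized one and shows that the q-prefix function inverts the p-prefix function.

```r
# Illustrative values only (assumed for this sketch, not taken from the lab)
mu <- 10
sigma <- 2
x <- 13

# p-prefix: cdf evaluated directly, no standardization needed
p_direct <- pnorm(x, mu, sigma)

# Same probability after standardizing by hand, as one would for a table
z <- (x - mu) / sigma
p_standardized <- pnorm(z)            # defaults are mean = 0, sd = 1

# q-prefix is the inverse of the p-prefix: it recovers x from F(x)
x_back <- qnorm(p_direct, mu, sigma)

print(c(direct = p_direct, standardized = p_standardized, recovered_x = x_back))
```

Both probabilities should agree up to floating-point error, which is why the standardization step is only needed when working from a printed table.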
For an Exponential random variable with parameter $\lambda$:

    dexp(x, lambda)   # f(x), the pdf at argument x of an Exp(lambda) random variable
    pexp(x, lambda)   # F(x), the cdf at argument x of an Exp(lambda) random variable
    qexp(p, lambda)   # x_alpha, or the 100*(1-alpha)th percentile of Exp(lambda), where p = 1-alpha
    rexp(m, lambda)   # generates m random values from the Exp(lambda) distribution

Example:

a) Let X have Exponential distribution with parameter $\lambda = 3$. Find $P(0.1 < X < 0.4)$ and the quartiles of X.

Using the cdf of X, $P(0.1 < X < 0.4) = F(0.4) - F(0.1) = 0.439624$.

    > pexp(0.4,3) - pexp(0.1,3)
    [1] 0.439624

The lower quartile is the 25th percentile, or the value $x_{0.75}$ such that $F(x_{0.75}) = 0.25$. Since we need to find the argument of the cdf (not the cdf itself), we use the inverse cdf function, and $x_{0.75} = 0.09589402$.

    > qexp(0.25,3)
    [1] 0.09589402

The median is the 50th percentile, or $x_{0.5}$ such that $F(x_{0.5}) = 0.5$. So $x_{0.5} = 0.2310491$.

    > qexp(0.5,3)
    [1] 0.2310491

The upper quartile is the 75th percentile, or $x_{0.25}$ such that $F(x_{0.25}) = 0.75$. So $x_{0.25} = 0.4620981$.

    > qexp(0.75,3)
    [1] 0.4620981

b) Let X have Uniform distribution on [-10.3, 15.6]. Find $P(X < 12 \mid X \ge -3.5)$ and generate 5 values from this distribution.

Using the definition of conditional probability and the cdf,

$P(X < 12 \mid X \ge -3.5) = \frac{P(-3.5 \le X < 12)}{P(X \ge -3.5)} = \frac{F(12) - F(-3.5)}{1 - F(-3.5)} = 0.8115183$

    > (punif(12,-10.3,15.6) - punif(-3.5,-10.3,15.6)) / (1 - punif(-3.5,-10.3,15.6))
    [1] 0.8115183

The generated values are shown below. (The call shown requests only 4 values; use runif(5,-10.3,15.6) to obtain the 5 values asked for.)

    > runif(4,-10.3,15.6)
    [1] 1.840664 -3.343330 3.351622 1.127541

Note: The generated values are not going to be the same for different runs of this code.

c) Let $X \sim N(30.6, 13.8)$. Find $P(X > 28)$, $P(X > 28 \mid X > 25.1)$, the 92nd percentile, and $x_{0.85}$. Also generate 10 values from this distribution.

Using the cdf, $P(X > 28) = 1 - F(28) = 0.758004$.

    > 1 - pnorm(28, 30.6, sqrt(13.8))
    [1] 0.758004

NOTE: The second parameter is the standard deviation, not the variance! Also, as mentioned above, there is no need to standardize when using R! (However, you still have to standardize when using the table.)
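The standard-deviation-versus-variance NOTE is an easy one to trip over, so here is a small sketch built on the same N(30.6, 13.8) example. The variable names are hypothetical, and the lower.tail = FALSE form is an alternative to the handout's 1 - pnorm(...) call, not something the lab itself uses.

```r
mu <- 30.6
sigma2 <- 13.8                        # variance, as given in N(30.6, 13.8)

# Correct: pass the standard deviation, i.e. the square root of the variance
p_correct <- 1 - pnorm(28, mu, sqrt(sigma2))

# Common mistake: passing the variance where R expects the standard deviation
p_wrong <- 1 - pnorm(28, mu, sigma2)

# Equivalent to p_correct: ask pnorm for the upper tail directly
p_upper <- pnorm(28, mu, sqrt(sigma2), lower.tail = FALSE)

print(c(correct = p_correct, wrong = p_wrong, upper_tail = p_upper))
```

The first and third values should match the 0.758004 reported above, while the second differs noticeably because the distribution is treated as far more spread out than it is.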
Using the definition of conditional probability and the cdf,

$P(X > 28 \mid X > 25.1) = \frac{P(X > 28 \cap X > 25.1)}{P(X > 25.1)} = \frac{P(X > 28)}{P(X > 25.1)} = \frac{1 - F(28)}{1 - F(25.1)} = 0.8145004$

    > (1 - pnorm(28, 30.6, sqrt(13.8))) / (1 - pnorm(25.1, 30.6, sqrt(13.8)))
    [1] 0.8145004

The 92nd percentile, or $x_{0.08}$, is the value such that $F(x_{0.08}) = 0.92$. Then $x_{0.08} = 35.81961$.

    > qnorm(0.92, 30.6, sqrt(13.8))
    [1] 35.81961

Next, $x_{0.85}$ is the 15th percentile, that is, $F(x_{0.85}) = 0.15$. Then $x_{0.85} = 26.74982$.

    > qnorm(0.15, 30.6, sqrt(13.8))
    [1] 26.74982

And the 10 generated values are

    > rnorm(10, 30.6, sqrt(13.8))
    [1] 29.48343 27.72339 38.01511 30.71013 35.83176 27.70189 37.64109 31.98042
    [9] 35.58243 34.76870

Now solve the problem below using what we've discussed.

Problem 1:

a) The scores of a reference population on the Wechsler Intelligence Scale for Children (WISC) are normally distributed with mean 100 and variance 225. Find the following:

- the probability that a randomly selected child has a score of 110 or more
- the probability that a randomly selected child has a score of less than 88
- the probability that a randomly selected child has a score between 88 and 110
- the probability that a randomly selected child has a score of at least 105, given the score is below 135
- the probability that a child scores within 2.5 standard deviations from the mean
- the score that separates the top 5% of the population from the bottom 95%. What do we call this value?
- the 98th percentile of the scores
- the 34th percentile of the scores
- the median score

b) Suppose messages arrive at a computer server according to a Poisson process with a rate of 6 per hour.
Find the following:

- the probability that there are no messages in the next five minutes
- the probability that there are no messages in the next three minutes, if there were no messages in the previous 5 minutes
- the probability that it will take between 5 and 10 minutes for the 11th message to arrive after the arrival of the 10th message
- the average time till the next message arrives
- the median of the time till the next message arrives

c) Buses arrive at a certain stop at 15-minute intervals starting at 7:00 am. Suppose a passenger comes to the stop at a time uniformly distributed between 7:00 and 7:30 am. Find the following:

- the probability that he waits for a bus less than 5 minutes
- the probability that he waits for a bus more than 10 minutes

Part 2 (Normal Approximation to Binomial)

In this part let us consider how well the Normal approximation to the Binomial works. The R functions for Binomial distributions were discussed in Lab activity #2, and the R functions for Normal distributions are discussed in Part 1 of this lab.

Problem 2: Suppose that 42% of all drivers stop at an intersection having flashing red lights when no other cars are visible. Of 350 randomly selected drivers coming to an intersection under these conditions, let X be the number of those who stop. Suppose we want to find $P(115 \le X \le 155)$.

- what is the exact distribution of X?
- use the exact distribution of X to compute the exact probability $P(115 \le X \le 155)$ in R

Now let's see how well the Normal approximation to the Binomial works.

- check the condition(s) for the approximation.
- compute the mean and the standard deviation of the approximate distribution of X.
- now calculate the approximate probability $P(115 \le X \le 155)$ using R, and remember to use the continuity correction factor of 0.5!
- comment on how well/poorly the Normal approximation to the Binomial works here

Part 3 (Integration in R)

It is also useful to know how to integrate in R in order to find probabilities, expectations, etc.
for continuous distributions. Let us illustrate how we use R to find integrals with an example.

Example: Suppose the pdf of a continuous random variable X is $f(x) = \frac{|x| e^{x/5}}{25}$ if $x < 0$, and 0 otherwise.

a) Let's check that $f(x)$ is a valid pdf.

First, $f(x) \ge 0$ for any x, because $|x|$ and the exponential are non-negative for any x. We also need to check that $\int_{-\infty}^{\infty} f(x)\,dx = 1$. We only need to integrate over the support $S_X = \{x < 0\}$.

    > f = function(x)
    + {
    + abs(x)*exp(x/5)/25
    + }
    > integrate(f, -Inf, 0)
    1 with absolute error < 7.5e-06

Here the integral is equal to 1 (with a very small error of calculation, < 7.5e-06). Thus, f(x) satisfies both conditions and is a pdf.

b) Find the probability that $|X + 3| < 2$. Also find the probability that $|X + 3| < 20$.

First, we rewrite the probability as a probability about X so that we can use the pdf of X to find it by integration. Then

$P(|X + 3| < 2) = P(-2 < X + 3 < 2) = P(-5 < X < -1) = \int_{-5}^{-1} f(x)\,dx = \int_{-5}^{-1} \frac{|x| e^{x/5}}{25}\,dx = 0.246718$

    > integrate(f, -5, -1)
    0.246718 with absolute error < 2.7e-15

Now for

$P(|X + 3| < 20) = P(-20 < X + 3 < 20) = P(-23 < X < 17) = \int_{-23}^{17} f(x)\,dx = \int_{-23}^{0} f(x)\,dx = \int_{-23}^{0} \frac{|x| e^{x/5}}{25}\,dx = 0.9437097$

Note that we integrate only over the part of the interval that's inside the support (because the density f(x) is 0 outside the support). So the integration is from -23 to 0 (not 17).

    > integrate(f, -23, 0)
    0.9437097 with absolute error < 1e-14

c) Find the expectation and the variance of X.

For this random variable, the expectation is

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{-\infty}^{0} \frac{x |x| e^{x/5}}{25}\,dx = -10$

The function we need to integrate has changed (we integrate x f(x) here, not f(x)), so we need to specify it; let's call it g(x), for example.

    g = function(x)
    {
    x*abs(x)*exp(x/5)/25
    }

And now we integrate it:

    integrate(g, -Inf, 0)   # because we integrate function g(x) here, g is the first argument

The R output is:

    > g = function(x)
    + {
    + x*abs(x)*exp(x/5)/25
    + }
    > integrate(g, -Inf, 0)
    -10 with absolute error < 4.6e-07

Recall the variance definition:

$Var(X) = E(X^2) - (E(X))^2 = E(X^2) - (-10)^2 = E(X^2) - 100$

Since for $E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_{-\infty}^{0} \frac{x^2 |x| e^{x/5}}{25}\,dx$ we integrate $\frac{x^2 |x| e^{x/5}}{25}$, we need to specify it in R; let's call it h(x), for example.

    h = function(x)
    {
    x^2*abs(x)*exp(x/5)/25
    }

Now to compute $E(X^2)$ we use:

    integrate(h, -Inf, 0)

The output for $E(X^2)$ is:

    > integrate(h, -Inf, 0)
    150 with absolute error < 0.00019

Therefore $Var(X) = 150 - 100 = 50$.

Now solve the problem below.

Problem 3: (Based on #5 from section 3.3) The lifetime X, in months, of certain equipment is believed to have pdf $f(x) = \frac{1}{100} x e^{-x/10}$ if $x > 0$, and 0 otherwise. Use R commands for the needed integrations to find

- the probability that the lifetime of the equipment is more than 5 years
- the mean of X
- the variance of X

Part 4 (Scatterplots and correlations)

One of the things we've talked about is correlation. Recall that it measures the strength of the linear relationship between two random variables. Check out the nice and useful animated illustration of the relationship between scatterplots and the correlation coefficient that can be found at http://www.bolderstats.com/jmsl/doc/ (see the CorrelationPicture option). Play with different values for a better understanding.

Problem 4: Consider the following four graphs.

[GRAPH A: Scatterplot of Y vs X]
[GRAPH B: Scatterplot of Y vs X]
[GRAPH C: Scatterplot of Y vs X]
[GRAPH D: Scatterplot of Y vs X]

Here are four possible values of correlation: 0.90, 0.52, 0.003, -0.14. Fill out the table below by matching these four numbers with the corresponding graph that you think has that correlation, and provide a short explanation for each choice.

    Graph | Correlation value | Why this value?
    A     |                   |
    B     |                   |
    C     |                   |
    D     |                   |

Part 5 (Central Limit Theorem)

The Central Limit Theorem is one of the most important and useful results in statistics.
It is used as a theoretical basis for many statistical methods, in particular for confidence intervals and hypothesis testing, which we will discuss later in the semester.

The CLT states: if $X_1, \dots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, and n is large enough ($n \ge 30$), then the approximate distributions of their sum and of the sample mean are, respectively,

$X_1 + \dots + X_n \approx N(n\mu, n\sigma^2)$

$\bar{X} = \frac{X_1 + \dots + X_n}{n} \approx N\left(\mu, \frac{\sigma^2}{n}\right)$

Example: Let's verify how the CLT works by studying the distribution of the sample mean. Let n = 40 and let $X_1, \dots, X_{40}$ be iid Poisson(5). Study the distribution of $\bar{X} = \frac{X_1 + \dots + X_{40}}{40}$ based on 1000 simulated samples of size n.

For the Poisson(5) distribution, $\mu = E(X_i) = 5$ and $\sigma^2 = Var(X_i) = 5$. The CLT says that $\bar{X}$ is approximately normal with mean $\mu = 5$ and variance $\frac{\sigma^2}{n} = \frac{5}{40} = 0.125$. Let's verify this.

We will simulate m = 1000 samples; each sample consists of n = 40 values, for which we compute the sample mean. This way we will obtain 1000 sample means (one from each of these 1000 samples).
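Before running the simulation, the theoretical benchmark can be written down in a few lines. This is only a sketch: the variable names and the threshold x0 = 5.5 are assumptions made for illustration. It also uses the fact that a sum of n iid Poisson(lambda) variables is Poisson(n*lambda), so one probability for the sample mean can be computed exactly and set beside its CLT approximation.

```r
lambda <- 5                            # Poisson(5): mean and variance both equal lambda
n <- 40                                # sample size

mu_xbar  <- lambda                     # CLT mean of the sample mean
var_xbar <- lambda / n                 # CLT variance of the sample mean, sigma^2 / n
sd_xbar  <- sqrt(var_xbar)

# Illustrative threshold (assumed, not from the lab): P(sample mean <= x0)
x0 <- 5.5

# Exact: the sum of n iid Poisson(lambda) values is Poisson(n * lambda)
p_exact <- ppois(x0 * n, lambda * n)

# CLT approximation, passing the standard deviation (not the variance)
p_clt <- pnorm(x0, mu_xbar, sd_xbar)

print(c(mean = mu_xbar, variance = var_xbar, exact = p_exact, clt = p_clt))
```

The simulated mean and variance of the 1000 sample means are then compared against mu_xbar and var_xbar.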
    n = 40   # sample size n

    # Create a vector of length 1000 to record the 1000 sample means computed
    # below. The first argument says that for now this is a vector of all 0s;
    # the second argument specifies the length of the vector, named s_m
    # (for sample mean).
    s_m = array(0,1000)

    # Use a for-loop to fill in the entries of the vector above: on step i,
    # i = 1,..., 1000, we generate the ith sample of size 40, compute the ith
    # sample mean of the generated values, and record it as the ith entry of s_m.
    for(i in 1:1000)
    {
      x = rpois(n,5)        # generate n=40 values from Poisson(5) distribution
      s_m[i] = sum(x)/n     # compute sample average of the generated values and record in s_m
    }

To get an idea of the distribution of the 1000 sample means, we construct their histogram and draw the normal curve on top of it to see how far off the histogram is:

    hist(s_m, prob=TRUE, breaks=15, density=15)   # create histogram for the entries of s_m
    xnorm = seq(min(s_m),max(s_m),length=40)      # create the grid of points
    ynorm = dnorm(xnorm, mean(s_m), sd(s_m))      # compute the normal density at these points
    lines(xnorm, ynorm, col=2, lwd=2)             # add the normal curve to the histogram above

The resulting histogram and the plot are below (note that for different runs of the code, i.e. for different generated samples, the plot is going to be slightly different).

The mean and variance of the 1000 sample means we created are, respectively,

    > mean(s_m)
    [1] 4.99675
    > var(s_m)
    [1] 0.1221428

Compared to the theoretical values based on the CLT (see above), the values are close.

Problem 5: Verify how the CLT works for the sample mean when the sample size is n = 38 and $X_1, \dots, X_{38}$ are iid Exp(4). Study the distribution of $\bar{X} = \frac{X_1 + \dots + X_{38}}{38}$ based on 1000 simulated samples of size n = 38.
- find the (theoretical) mean $\mu$ and variance $\sigma^2$ of the $X_i$'s
- find the (theoretical) mean and variance of the sample mean (based on the CLT)
- include the histogram of the created 1000 sample mean values with the normal curve.
- report the mean and variance of the created 1000 sample mean values and compare them to the theoretical values

Problem 6: Penn State Fleet, which operates and manages car rentals for Penn State employees, found that the tire lifetime for their vehicles has a mean of 50,000 miles and a standard deviation of 3500 miles. They consider a random sample of 50 vehicles. Find the following:

- what is the approximate distribution, the mean, and the variance of the mean lifetime? (Hint: Use the CLT)
- find the probability that the sample mean lifetime for these 50 vehicles exceeds 52,000