Posted on Oct 07, 2024

First load the necessary packages:

```{r}
library(ggplot2)
library(dplyr)
library(forcats)
library(moderndive)
```

1. Ask two of your classmates what their estimate of $\hat{p}$ was. How do the $\hat{p}$ estimates from different samples compare?

Type your complete sentence answer here using inline R code and delete this comment.

2. **Why** did everyone get a different estimate?

Type your complete sentence answer here using inline R code and delete this comment.

***

## Estimating $\widehat{SE}$ from a Single Sample

Typically we only have the opportunity to collect **one sample** for our study. Consequently, we have to use the amount of variability in our **single sample** as an estimate of the amount of variability we might expect in our results if we had taken a different random sample of 50 people. The $\widehat{SE}_{\hat{p}}$ serves as an **ESTIMATE** of **sampling variability** if you only have a **single sample**. The formula for estimating the standard error of $\hat{p}$ is given in Equation \@ref(eq:se).

\begin{equation} \widehat{SE}_{\hat{p}} \approx \sqrt{\frac{\hat{p} \times (1-\hat{p})}{n}} (\#eq:se) \end{equation}

> Note that we use $n$ for the size of the sample, that p "wears a hat", like so: $\hat{p}$ because we are ESTIMATING a proportion based on only a sample, and that the SE "wears a hat" as well because we are ESTIMATING $SE$ based on only a sample.

The standard error of $\hat{p}$ can be estimated in R as follows:

```{r}
n50_1rep %>%
  summarize(divorce_count = sum(marital == "Divorced"),
            n = n()) %>%
  mutate(p_hat = divorce_count / n,
         se_hat = sqrt(p_hat * (1 - p_hat) / n))
```
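
As a quick check of the formula, here is a minimal base R sketch on made-up numbers. The vector `marital_sample` and its counts (8 divorced out of 50) are hypothetical and do not come from `gss_14`; the point is only to show each piece of the calculation.

```r
# Hypothetical single sample of 50 respondents: 8 divorced, 42 not.
# These counts are invented for illustration only.
marital_sample <- c(rep("Divorced", 8), rep("Married", 42))

n <- length(marital_sample)                    # sample size
p_hat <- mean(marital_sample == "Divorced")    # sample proportion
se_hat <- sqrt(p_hat * (1 - p_hat) / n)        # estimated standard error

p_hat
se_hat
```

With these invented counts, `p_hat` is 8/50 = 0.16 and `se_hat` is $\sqrt{0.16 \times 0.84 / 50}$, roughly 0.052.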

***

# Demo: Generating a Sampling Distribution of $\hat{p}$

If you ran the code chunk that takes a random sample of 50 cases a thousand more times and wrote down every $\hat{p}$ you got, you would have what is called a simulated "sampling distribution".

> A sampling distribution shows every [or nearly every!] possible value a sample statistic can take across every [or nearly every!] possible sample **of a given sample size** from a population.
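
The idea can be sketched in base R without any of the course data. The population below is hypothetical (10,000 people, 16% of them divorced), and `sample()` with `replicate()` stands in for `rep_sample_n()`; treat it as an illustration of the concept rather than the Problem Set's own workflow.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 10,000 people, 16% of them divorced (TRUE).
population <- c(rep(TRUE, 1600), rep(FALSE, 8400))

# Draw 1000 samples of size 50 without replacement and record each p-hat.
p_hats <- replicate(1000, mean(sample(population, size = 50)))

length(p_hats)   # one p-hat per sample
mean(p_hats)     # centre of the simulated sampling distribution
sd(p_hats)       # its spread, i.e. the simulated standard error
```

Each element of `p_hats` is the proportion from one sample, so a histogram of `p_hats` would be a picture of the simulated sampling distribution.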

## Simulated Sampling Distribution of $\hat{p}$ for $n = 50$

Instead of running the sampling code chunk for $n = 50$ over and over, we can "collect" 1000 samples of $n = 50$ easily with R. The following code chunk takes 1000 **different** samples of $n = 50$ and stores them in the data frame `n50_1000rep`:

```{r}
set.seed(19)
n50_1000rep <- gss_14 %>%
  rep_sample_n(size = 50, reps = 1000)
```

Be sure to look at `n50_1000rep` in the data viewer to get a sense of what these 1000 samples look like.

***

3. What is the name of the column that identifies which of the 1000 samples each row is from?

Type your complete sentence answer here using inline R code and delete this comment.

4. What is the sample size $n$ for each of the $1000$ samples we took (i.e., how many people are sampled in each replicate)?

Type your complete sentence answer here using inline R code and delete this comment.

5. Based on your histogram, what appeared to be a very common value of $\hat{p}$? What was a very uncommon value? Specifically, find the 1st and 99th percentiles, the mean, and the standard deviation of the values stored in `p_hat_n50` to help answer the question.

Type your complete sentence answer here using inline R code and delete this comment.

```{r}
# Your code here
```

6. How do these values compare to the estimates we got for $\hat{p}$ and $\widehat{SE}_{\hat{p}}$ for `Divorced` respondents based on your **single** sample of 50 people earlier in this Problem Set?

Type your complete sentence answer here using inline R code and delete this comment.

7. Use the `rep_sample_n` function to collect 1000 virtual samples of size $n = 15$. Store the 1000 virtual samples in an object named `n15_1000rep`. Use a seed of 910.

```{r}
# Type your code and comments inside the code chunk

```

8. Calculate the sample proportion $\hat{p}$ of people who reported they were `Divorced` for each replicate of your $n = 15$ sampling. Store the results in `ques8` and display the first six rows of `ques8`.

```{r}
# Type your code and comments inside the code chunk

```

9. Visualize the sampling distribution of $\hat{p}$ from your $n = 15$ sampling with a purple histogram.

```{r}
# Type your code and comments inside the code chunk

```

10. Calculate the mean of the $n = 15$ sampling distribution, and the standard error of the $n = 15$ sampling distribution.

```{r}
# Type your code and comments inside the code chunk

```
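
Questions 5 and 10 both ask for summaries of a column of $\hat{p}$ values. A minimal base R sketch of those summaries follows; the vector `p_hat_values` is hypothetical and far shorter than a real set of 1000 replicates, so the numbers it produces say nothing about `gss_14`.

```r
# Hypothetical vector of sample proportions, one per replicate.
p_hat_values <- c(0.10, 0.12, 0.14, 0.16, 0.16, 0.18, 0.20, 0.22)

p_hat_mean <- mean(p_hat_values)   # centre of the distribution
p_hat_sd <- sd(p_hat_values)       # spread (the simulated standard error)

# 1st and 99th percentiles: values that are rarely undershot or exceeded.
p_hat_tails <- quantile(p_hat_values, probs = c(0.01, 0.99))

p_hat_mean
p_hat_sd
p_hat_tails
```

The standard deviation of the replicate-level $\hat{p}$ values is what the Problem Set calls the standard error of the sampling distribution.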

***

11. How does the standard error of the $n = 15$ sampling distribution compare to the standard error of the $n = 50$ sampling distribution?

Type your complete sentence answer here using inline R code and delete this comment.

12. Explain any differences you observed in Question 11.

Type your complete sentence answer here using inline R code and delete this comment.

***

13. Use the `rep_sample_n` function to collect 1000 virtual samples of size $n = 600$. Store the 1000 virtual samples in an object named `n600_1000rep`. Use a seed of 84.

```{r}
# Type your code and comments inside the code chunk

```

14. Calculate the proportion $\hat{p}$ of people who reported they were `Divorced` for each replicate of your $n = 600$ sampling. Store the results in `ques14` and display the first six rows of `ques14`.

```{r}
# Type your code and comments inside the code chunk

```

15. Calculate the mean of the $n = 600$ sampling distribution, and the standard error of the $n = 600$ sampling distribution.

```{r}
# Type your code and comments inside the code chunk

```

16. Was there more **variability** from sample to sample when we took a sample size of 600 or a sample size of 50? **Explain what evidence you have for assessing this**.

Type your complete sentence answer here using inline R code and delete this comment.

***

17. Which sampling distribution looked more normally distributed (bell shaped and symmetrical): the one built on $n = 15$, $50$, or $600$? **Why?**

Type your complete sentence answer here using inline R code and delete this comment.

18. Imagine we collected only a single small sample of 15 respondents, as generated by the code below.

```{r}
set.seed(53)
n15_1rep <- gss_14 %>%
  rep_sample_n(size = 15, reps = 1)
# and
n50_1rep <- gss_14 %>%
  rep_sample_n(size = 50, reps = 1)
```

Following the example from the beginning of the Problem Set (roughly line 138), estimate the **sample proportion** $\hat{p}$ of people who identified as `Divorced` based on `n15_1rep`, AS WELL AS the **standard error of $\hat{p}$**.

19. Replace `x` with the standard error you obtained by taking the standard deviation of the $n = 15$ sampling distribution. Replace `a` with the standard error you obtained for a single sample of $n = 15$ using the mathematical formula.

20. Based on what you observed for 19, **IF** you collected a single sample from 600 respondents, do you think the standard error will be smaller or larger than the one you calculated for $n = 15$? **Explain your reasoning**.
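
One way to reason about the role of $n$ is to plug different sample sizes into the standard error formula while holding $\hat{p}$ fixed. The sketch below does this in base R; the value `p_hat <- 0.16` is a hypothetical proportion chosen for illustration, not a result from `gss_14`, and in practice $\hat{p}$ would itself vary from sample to sample.

```r
# Hypothetical sample proportion, held fixed so that only n changes.
p_hat <- 0.16
n <- c(15, 50, 600)

# Formula-based standard error for each sample size.
se_hat <- sqrt(p_hat * (1 - p_hat) / n)

data.frame(n = n, se_hat = round(se_hat, 4))
```

Because $n$ sits in the denominator under the square root, the formula-based standard error shrinks in proportion to $1/\sqrt{n}$ when $\hat{p}$ is held fixed.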