Question

1 Approved Answer

Posted on Nov 07, 2024

Problem 07

In the previous homework assignments, you have calculated the standard error on the mean as a function of the sample size. Rather than repeating that process again, let's use Pandas to help visualize what's happening when we generate the random samples. Pandas has built-in plotting methods that make it quite simple to generate useful statistical graphics to help us understand a data set. We will discuss visualization in more detail later in the course.

The matplotlib.pyplot module is imported for you below.

In[]:

import matplotlib.pyplot as plt

Over the last few weeks we have talked about the difference between the replications associated with a simulation and the sample size effect we wish to study. We estimated the standard error on the mean by generating 5000 replications of the sample average. As we saw in Week 02, we used 5000 replications because the simulated distribution of the sample average approaches its theoretical Gaussian (bell curve) shape as the number of replications grows. Replicating thousands of times allows our simulated results to match the theoretical results.

This week, you will work with a smaller number of replications: specifically, 100 replications for this problem. With so few replications, the simulated estimate of the standard error no longer matches the theoretical result closely. However, the random samples and their summary statistics will be easier to visualize.
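The effect of the replication count can be sketched as follows. This is only an illustration, not part of the assignment solution: it assumes the sample size of 5 and the Normal(100, 25) population used in 7a below, the helper name `simulated_se` is hypothetical, and the seed passed to `np.random.default_rng` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for this sketch

mu, sigma, n = 100, 25, 5  # population mean, standard deviation, sample size

# Theoretical standard error on the mean: sigma / sqrt(n)
theoretical_se = sigma / np.sqrt(n)


def simulated_se(n_reps):
    """Estimate the standard error from n_reps replications (hypothetical helper)."""
    # Samples down the rows, replications along the columns
    samples = rng.normal(loc=mu, scale=sigma, size=(n, n_reps))
    # One sample average per replication, then the spread of those averages
    return samples.mean(axis=0).std(ddof=1)


se_5000 = simulated_se(5000)
se_100 = simulated_se(100)

print("Theoretical SE:      ", theoretical_se)
print("Simulated SE (5000): ", se_5000)
print("Simulated SE (100):  ", se_100)
```

With 5000 replications the simulated value should typically land close to the theoretical one, while the 100-replication estimate tends to wander further from it between runs.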

7a)

You must use the same format as the last assignment, where we stored the samples down the rows and the replications along the columns of a NumPy 2D array. Use NumPy to generate 5 samples from a Normal (Gaussian or bell curve) distribution with mean 100 and standard deviation 25, and replicate that process 100 times. Do NOT calculate summary statistics associated with these samples.

Assign the result to the variable X005.

IMPORTANT: Do NOT forget to set the random seed!!!!

7a) - SOLUTION

In[1]:

import numpy as np

np.random.seed(0)  # set the random seed
X005 = np.random.normal(loc=100, scale=25, size=(5, 100))
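A quick sanity check on this step might look like the sketch below. It is not part of the original answer: the variable `X005_again` is introduced here purely for illustration, and the check simply confirms that the array has 5 rows (samples) and 100 columns (replications) and that resetting the same seed reproduces the same values.

```python
import numpy as np

np.random.seed(0)  # same seed as in the solution
X005 = np.random.normal(loc=100, scale=25, size=(5, 100))

# Reset the seed and draw again: the values should be identical
np.random.seed(0)
X005_again = np.random.normal(loc=100, scale=25, size=(5, 100))

print(X005.shape)                        # (5, 100)
print(np.array_equal(X005, X005_again))  # True
```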

In[]:

7b)

Convert the X005 NumPy array to a Pandas DataFrame and assign the result to the df005 object. You may use the default index and columns arguments when you create the DataFrame.

Use the appropriate attribute to display the number of rows and columns associated with df005 to the screen.

7b) - SOLUTION

In[2]:

import pandas as pd

df005 = pd.DataFrame(X005)  # default index and columns
print('Number of rows:', df005.shape[0])
print('Number of columns:', df005.shape[1])

Number of rows: 5
Number of columns: 100

In[]:

7c)

Let's visualize the summary statistics associated with the 5 random samples over the 100 replications with a boxplot. Again, you will learn about the boxplot in more detail later. For now, you will focus on the SPREAD or VARIATION through the HEIGHT of the box and whiskers (the vertical lines coming from the box) and on the CENTRAL behavior through the MEAN. Therefore, you must set the appropriate arguments to display the MEAN within the boxplot. The MEAN must be displayed as red triangles.

Use the appropriate method to summarize the replications of the 5 random samples as a boxplot.

7c) - SOLUTION

In[3]:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
# showmeans=True marks the mean of each column; meanprops styles the marker as a red triangle
df005.boxplot(showmeans=True, meanprops={"marker": "^", "markerfacecolor": "red", "markeredgecolor": "red"})
plt.show()

[Figure: boxplot with one box per replication (columns 0 to 99), each summarizing 5 samples; the vertical axis runs from roughly 40 to 160.]
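The quantities the boxplot displays can also be inspected numerically. The sketch below is an optional companion to the plot, not part of the original answer: it rebuilds the DataFrame with the same seed as 7a, and the names `rep_means` and `rep_ranges` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # same seed as in 7a
df005 = pd.DataFrame(np.random.normal(loc=100, scale=25, size=(5, 100)))

# Mean of each replication (column): the values marked by the red triangles
rep_means = df005.mean(axis=0)

# Range of each replication: a rough measure of the vertical extent of each box and whiskers
rep_ranges = df005.max(axis=0) - df005.min(axis=0)

print(rep_means.describe())
print("Std of the 100 sample means:", rep_means.std())
print("Average within-replication range:", rep_ranges.mean())
```

The standard deviation of the 100 sample means is the simulated estimate of the standard error for a sample size of 5, so it can be compared against the theoretical value of 25 / sqrt(5).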