Question
In Python AND Statistics!
please answer fully and provide detail and explanations to what you did. Answer ALL parts of a question as well.
------------------------------------------------------------------------------------------------------------------------------------------------------------
Instructions: The aim of this problem set is to work with and interpret hypothesis tests and t-tests. To do so appropriately, we will also need to be competent in data exploration, visualization, and transformations. In this problem set you will use a sample of AirBnB listings from Beijing and Seattle. The data is downloaded from AirBnB, http://insideairbnb.com/get-the-data.html. The sample, however, only contains two columns: city ("Beijing" or "Seattle") and price (in USD).
HERE IS THE DATASET TO WORK WITH: THERE IS NO MISSING INFORMATION, YOU WILL WORK WITH THIS DATA!: https://1drv.ms/x/s!AtfXPbdjkmO7oJoJA6QLYJSydgeCUQ?e=MyuQ9i
------------------------------------------------------------------------------------------------------------------------------------------------------------
PART 3. Brute-Force Approach:
In this section, we will attempt a 'brute-force' method to determine how likely we are to see a price difference as large as the one observed, presuming the null hypothesis H0 is correct. This is equivalent to asking how likely we are to see a difference in log-prices as large as the one observed. This key fact will guide us through our brute-force approach. To answer the question of how likely we are to see such a difference in log-prices due to chance, we will simulate new datasets of log-price data in which H0 is true. This simulated data will mimic the AirBnB dataset in number of observations (both total and by city). However, you will use a random number generator based on a normal distribution whose mean and standard deviation are set to the overall mean and standard deviation of the log-transformed price data. Hence, when we compare the simulated Beijing and Seattle log-prices, we know any difference is due to sampling variability alone. If this is done enough times, we should get a general idea of how likely it is that the observed difference in log-prices is due to chance alone.
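The repeated-simulation idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `simulate_null_differences`, the group sizes (300 and 200), the mean of 5.0, and the observed difference of 0.1 are all hypothetical placeholders, not values from the AirBnB sample. Only the standard deviation 0.642 comes from the hint in the problem text.

```python
import numpy as np

def simulate_null_differences(n_beijing, n_seattle, mu0, sigma0,
                              n_sims=10_000, seed=0):
    """Return n_sims simulated differences in mean log-price under H0."""
    rng = np.random.default_rng(seed)
    # Both cities are drawn from the SAME normal distribution,
    # so H0 (no difference in log-prices) holds by construction.
    sim_beijing = rng.normal(mu0, sigma0, size=(n_sims, n_beijing))
    sim_seattle = rng.normal(mu0, sigma0, size=(n_sims, n_seattle))
    # One difference in means per simulated dataset.
    return sim_beijing.mean(axis=1) - sim_seattle.mean(axis=1)

# Placeholder inputs (hypothetical, NOT computed from the real data).
diffs = simulate_null_differences(n_beijing=300, n_seattle=200,
                                  mu0=5.0, sigma0=0.642)

observed_diff = 0.1  # hypothetical observed difference in mean log-price

# Share of simulated datasets whose difference is at least as large
# (in absolute value) as the observed one: a brute-force p-value estimate.
p_estimate = np.mean(np.abs(diffs) >= abs(observed_diff))
print(f"Estimated two-sided p-value: {p_estimate:.4f}")
```

With the real data, the placeholders would be replaced by the actual per-city counts, the overall mean and standard deviation of the log-prices, and the observed difference in mean log-price between the two cities.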
1. Our null hypothesis H0 is that the difference in prices between the underlying populations of AirBnBs in Beijing and Seattle is 0. As previously discussed, this is equivalent to the statement that the difference in log-prices of the underlying populations is 0. To make a viable distribution where this is the case, let's use the overall mean (μ₀) and standard deviation (σ₀) of the combined Beijing and Seattle log-prices. Please output these values here. Hint: the standard deviation is approximately 0.642.
2. Now create two sets of random normals, "simulatedlogBeijing" and "simulatedlogSeattle" (you may call these something else), both using the OVERALL mean μ₀ and standard deviation σ₀ of the entire dataset. The number of observations for each city in your simulated data must be the same as in our original sample. What is the difference between the mean log-prices of these simulated cities? Hint: say the mean is 5 and the standard deviation is 0.5. You can create the corresponding normals like:
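A minimal sketch of steps 1 and 2, assuming NumPy's random generator and a pandas DataFrame named `df` with columns `city` and `price`. The stand-in data below is synthetic (60 "Beijing" and 40 "Seattle" rows drawn with the hint's mean 5 and standard deviation 0.5) and is not the AirBnB sample; with the real file loaded into `df`, the same steps should apply.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the real two-column dataset (illustration only).
df = pd.DataFrame({
    "city": ["Beijing"] * 60 + ["Seattle"] * 40,
    "price": np.exp(rng.normal(5, 0.5, size=100)),
})

# Work on the log scale, as the problem suggests.
df["log_price"] = np.log(df["price"])

# Step 1: overall mean and standard deviation of the combined log-prices.
mu0 = df["log_price"].mean()
sigma0 = df["log_price"].std()  # pandas uses the sample std (ddof=1) by default
print(f"Overall mean: {mu0:.3f}, overall std: {sigma0:.3f}")

# Step 2: simulate each city from the SAME normal(mu0, sigma0),
# keeping the per-city sample sizes of the original data.
n_beijing = int((df["city"] == "Beijing").sum())
n_seattle = int((df["city"] == "Seattle").sum())

simulated_log_beijing = rng.normal(mu0, sigma0, size=n_beijing)
simulated_log_seattle = rng.normal(mu0, sigma0, size=n_seattle)

# Difference between the simulated mean log-prices (pure sampling noise).
sim_diff = simulated_log_beijing.mean() - simulated_log_seattle.mean()
print(f"Simulated difference in mean log-price: {sim_diff:.4f}")
```

Because both simulated groups share one distribution, any nonzero `sim_diff` reflects sampling variability only. Whether the hint's 0.642 corresponds to `ddof=1` or `ddof=0` is not stated in the problem; with a large sample the two are nearly identical.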