Question
Data dredging and p-hacking are umbrella terms for the dangerous practice of automatically testing a large number of hypotheses on the entirety or subsets of a single dataset in order to find statistically significant results. In this exercise we will focus on the idea of testing hypotheses on subsets of a single dataset.
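To see why testing many hypotheses on one dataset is risky, it can help to compute the chance of at least one false positive. The sketch below assumes the tests are independent and each is run at the same significance level, which is a simplification (tests on overlapping subsets are not independent); the function name is made up for illustration.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among `num_tests`
    independent tests, each run at significance level `alpha`."""
    return 1.0 - (1.0 - alpha) ** num_tests


# Print the error rate for a few different numbers of tests.
for k in (1, 4, 10, 20):
    print(k, round(familywise_error_rate(k), 3))
```

Under that independence assumption, even four tests at the 0.05 level already carry roughly a 19% chance of producing at least one spurious "significant" result.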
Nefaria Octopain has landed her first data science internship at an aquarium. Her primary summer project has been to design and test a new feeding regimen for the aquarium's octopus population. To test her regimen, her supervisors have allowed her to deploy her new feeding regimen to 4 targeted octopus subpopulations of 40 octopuses each, every day, for a month.
The effectiveness of the new diet is measured by the rate at which the food is consumed, defined as the proportion of octopuses that eat the food (POOTEF). The aquarium's standard octopus diet has a POOTEF of 0.90. Nefaria is hoping to land a permanent position at the aquarium when she graduates, so she's really motivated to show her supervisors that the POOTEF of her new diet regimen is a (statistically) significant improvement over their previous diet.
The data from Nefaria's summer experiment can be found in pootef.csv. Load this dataset as a Pandas DataFrame.
   Group  Date        Fed  Ate
0  1      Oct 1 2018  40   37
1  1      NaN         40   37
2  1      NaN         40   35
3  1      NaN         40   35
4  1      Oct 5 2018  40   36
Part A: State the null and alternative hypotheses that Nefaria should test to see if her new feeding regimen is an improvement over the aquarium's standard feeding regimen with a POOTEF of 0.90.
Part B: Test the hypothesis from Part A at the α = 0.05 significance level using a p-value test. Is there sufficient evidence for Nefaria to conclude that her feeding regimen is an improvement?
Group     2.50000
Fed      40.00000
Ate      36.08871
dtype: float64
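One way to carry out Part B is a one-sided, one-sample z-test for a proportion, taking H0: p = 0.90 against H1: p > 0.90. The following is only a sketch under stated assumptions: it assumes the month is the 31 days of October 2018, so that there are 4 × 31 = 124 feedings of 40 octopuses each, and it recovers the total number eaten from the mean Ate of 36.08871 shown above rather than from pootef.csv itself. The function name is hypothetical.

```python
import math


def one_sided_proportion_z_test(successes, trials, p0):
    """Upper-tailed z-test of H0: p = p0 against H1: p > p0.

    Returns the sample proportion, the z statistic, and the p-value.
    """
    p_hat = successes / trials
    # Standard error of the sample proportion under the null hypothesis.
    standard_error = math.sqrt(p0 * (1 - p0) / trials)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of a standard normal, via the complementary
    # error function: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_hat, z, p_value


# Assumption: 4 groups x 31 days = 124 feedings (rows) in the dataset.
n_feedings = 124
total_fed = 40 * n_feedings
# Assumption: total eaten reconstructed from the reported mean of Ate.
total_ate = round(36.08871 * n_feedings)

p_hat, z, p_value = one_sided_proportion_z_test(total_ate, total_fed, p0=0.90)
print(f"p_hat = {p_hat:.4f}, z = {z:.3f}, p-value = {p_value:.3f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

Under these assumptions the sample proportion comes out at about 0.9022 and the p-value is roughly 0.3, well above 0.05, so the full dataset would not support rejecting the null hypothesis.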
Part C: Bummer, Nefaria thinks. This is the part where she decides to resort to some questionable science. Maybe there is a reasonable subset of the data for which her alternative hypothesis is supported? Can she find it? Can she come up with a reasonable justification for why this subset of the data should be considered while the rest should be discarded?
Here are the rules: Nefaria cannot modify the original data (e.g. by adding nonexistent feedings or bites to certain groups or days) because her boss will surely notice. Instead she needs to find a subset of the data for which her hypothesis is supported by a p-value test at the α = 0.05 significance level and be able to explain to her supervisors why her sub-selection of the data is reasonable.
In addition to your explanation of why your successful subset of the data is potentially reasonable, be sure to thoroughly explain the details of the tests that you perform and show all of your Python computation.
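The subset search in Part C could be organized roughly as follows: split the rows by some key, pool Fed and Ate within each subset, and run the same upper-tailed z-test on each. This is a sketch, not the intended solution. The helper names are hypothetical, the column names are taken from the preview of the DataFrame, and the only rows used are the five visible in that preview (all from Group 1), so the result says nothing about the full dataset.

```python
import math
from collections import defaultdict


def upper_tail_p_value(ate, fed, p0=0.90):
    """p-value for H0: p = p0 against H1: p > p0 (normal approximation)."""
    z = (ate / fed - p0) / math.sqrt(p0 * (1 - p0) / fed)
    return 0.5 * math.erfc(z / math.sqrt(2))


def scan_subsets(rows, key, p0=0.90, alpha=0.05):
    """Pool Ate and Fed within each subset defined by `key(row)` and test it.

    Returns {label: (sample proportion, p-value, significant at alpha?)}.
    """
    totals = defaultdict(lambda: [0, 0])  # label -> [ate, fed]
    for row in rows:
        totals[key(row)][0] += row["Ate"]
        totals[key(row)][1] += row["Fed"]
    results = {}
    for label, (ate, fed) in totals.items():
        p_value = upper_tail_p_value(ate, fed, p0)
        results[label] = (ate / fed, p_value, p_value < alpha)
    return results


# Only the five rows shown in the DataFrame preview; the rest of
# pootef.csv is not available here.
rows = [
    {"Group": 1, "Fed": 40, "Ate": 37},
    {"Group": 1, "Fed": 40, "Ate": 37},
    {"Group": 1, "Fed": 40, "Ate": 35},
    {"Group": 1, "Fed": 40, "Ate": 35},
    {"Group": 1, "Fed": 40, "Ate": 36},
]

results = scan_subsets(rows, key=lambda row: row["Group"])
for label, (p_hat, p_value, significant) in results.items():
    print(label, round(p_hat, 4), round(p_value, 3), significant)
```

With the full file, the same function could be keyed on other columns, such as a weekday derived from Date once its missing values are filled in. Every additional subset tested raises the chance that one of them passes the 0.05 threshold by luck alone, which is exactly the problem the exercise is illustrating.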
Note: I am unable to attach the CSV file, but the sample mean is 0.9022, while the population mean is 0.9. I believe that is all that's needed from the CSV file.