Question

Please provide solution and explanation to the following python exercise. Thank you!

4. [25 pts] Data is often missing from biomedical datasets (e.g. electronic health records, epidemiology, etc.). For time-series data with missing values, one solution is to use the existing data to fill in the missing time point values. In this problem, we will use a time-series dataset to explore different data imputation techniques.

a. Upload the data from the zipped dataset file. Plot the original data (line plot) and describe any patterns that are visible in the time series data. Hint: using Python, please convert the 'Month' column into pd.DatetimeIndex.

b. In this dataset, the x-axis will be the time axis (i.e. date) as the variable, and the y-axis will be the number of passengers traveling as the observation that is a function of the date variable. Please use Linear Regression to model the relationship between the date (the original 'Month' column) and '#Passengers'. Plot the original data and the line of best fit in the same figure (Hint: sklearn.linear_model). Calculate the R-squared using the sklearn package. What is the equation of the final linear model (including calculated values for model parameters)? Calculate the significance (p-value) of the model variables using the statsmodels package (Hint: statsmodels.org).

c. Plot the 12-month rolling variance across time (Hint: Rolling.var). What assumptions are made when calculating the significance of features during ordinary least squares (OLS) linear regression? Does this data fulfill these assumptions? How does this affect the interpretation of these results?

d. Generate a copy of the original dataset with 100 randomly missing timepoint values (Hint: set '#Passengers' values to np.nan without removing the entire row). Plot the new dataset with missing values alongside a plot of the original dataset (two line or scatter plots). Make the missing values obvious from your plot to obtain full credit.

e. Use three different interpolation techniques to fill in the missing values (Hint: pandas.DataFrame.interpolate). Use Linear interpolation, Nearest Neighbor interpolation, and Spline (order 3). Plot the original dataset, the missing dataset, and the three interpolated datasets (within a single figure). Compare and contrast the techniques. What features of the data described in 2.a, if any, are missing in the interpolated data?

f. Calculate the correlation of imputed datasets with the original dataset (Hint: ...). Which approach performed the best/worst? For each imputation method, please explain an example situation where it would be the most appropriate (1-2 sentences for each imputation method).
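Parts d through f can be sketched roughly as follows. This is a minimal illustration, not a graded solution: the zipped dataset is not available here, so the series below is synthetic (a trend plus a growing 12-month cycle), and only the '#Passengers' column name and the DatetimeIndex are taken from the exercise. Keeping the first and last rows observed is an added simplification so that every gap can be filled by all three methods. The 'nearest' and 'spline' options of pandas interpolate rely on SciPy being installed.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic monthly series standing in for the zipped dataset,
# which is not reproduced here. Only the '#Passengers' column name and the
# DatetimeIndex come from the exercise text.
rng = np.random.default_rng(0)
index = pd.date_range("1949-01-01", periods=144, freq="MS")
t = np.arange(len(index))
seasonal = (10 + 0.3 * t) * np.sin(2 * np.pi * t / 12)
values = 100 + 2.5 * t + seasonal + rng.normal(0, 3, len(index))
original = pd.DataFrame({"#Passengers": values}, index=index)

# Part d: set 100 randomly chosen values to NaN while keeping every row.
# ASSUMPTION: the first and last rows stay observed so that each gap is
# interior and all three methods can fill it.
interior = np.arange(1, len(original) - 1)
missing_pos = rng.choice(interior, size=100, replace=False)
missing = original.copy()
missing.iloc[missing_pos, 0] = np.nan

# Part e: three interpolation methods from pandas.
# 'nearest' and 'spline' are delegated to SciPy, which must be installed.
settings = {
    "linear": {"method": "linear"},
    "nearest": {"method": "nearest"},
    "spline3": {"method": "spline", "order": 3},
}
filled = {
    name: missing["#Passengers"].interpolate(**kwargs)
    for name, kwargs in settings.items()
}

# Part f: Pearson correlation of each imputed series with the original.
correlations = pd.Series(
    {name: original["#Passengers"].corr(series) for name, series in filled.items()}
)
print(correlations.sort_values(ascending=False))
```

With the real data, `original` would instead be read from the provided file and indexed by the converted 'Month' column; the ranking of the three methods can differ from what the synthetic series produces, so the correlations should be recomputed rather than assumed.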
