Question
Final Case Analysis
Data Mining and Statistical Modeling
Introduction
The Environmental Protection Agency (EPA for short) is responsible for regulating the amount of pollutant emissions from all automobiles that run on American roads. You are asked to analyze the data released by the EPA for more than a decade, specifically for three time periods: 2010-12, 2014-16, and 2018-20. There are several objectives to this case analysis, one of which is to test and learn about possible changes in the amount of pollution emitted by vehicles over time. You are also asked to analyze similarities between vehicles over the three time periods and empirically determine whether certain vehicles became more (or less) polluting over the period of study.
You will analyze various aspects of vehicle-induced pollution using R programming. You are expected to submit your findings in a report format. The report must be at least 20 pages long, with written descriptions and explanations of your findings for the questions asked below. Make sure to run all code using R Markdown, with your remarks, comments, or explanations embedded within the document.
Data Details
You are given nine years of individual EPA data in CSV format. The data files are not very large (each file is approx. 1 MB). Each yearly file contains thousands of vehicles along with their vital information and pollution testing records. Each file contains 42 columns, the details of which are given in the Data Dictionary document. Please note that the original data had more columns, and some of them were removed for consistency. The deleted columns still appear in the data dictionary, and you are asked to ignore them when referring to the dictionary.
There are three sections to this case study: Merging and Cleaning (20 points), Data Analysis (60 points), and Visualization (20 points), totaling 100 points.
Important Note: Make sure to keep the three time periods separate for the following analysis, i.e., perform a separate analysis for each time period.
Merging and Cleaning (20 Points)
The first objective is to combine the yearly files and stack them into three large files, one for each time period. Run basic EDA and descriptive statistics on some columns and clean any obvious outliers from each time period. Make sure that no more than 1% of the data is removed from each time period in this process. Clearly document the details of the outlier detection and descriptive analysis.
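The stacking and capped outlier removal described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the file names (`epa_2010.csv`, ...), the columns `MODEL_YEAR` and `CO2`, and the helper `trim_outliers()` are hypothetical, and the yearly files are simulated so the snippet runs on its own. Only `RND_ADJ_FE` is a column name taken from the assignment. The outlier rule (a robust z-score with a hard row budget) is one possible choice, not a prescribed method.

```r
set.seed(1)

# Simulate three yearly files (ASSUMPTION: real file names and columns differ).
demo_dir <- file.path(tempdir(), "epa_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (yr in 2010:2012) {
  d <- data.frame(
    MODEL_YEAR = yr,                    # hypothetical column
    RND_ADJ_FE = rnorm(500, 25, 5),     # response column named in the assignment
    CO2        = rnorm(500, 350, 60)    # hypothetical column
  )
  d$CO2[1:3] <- 5000                    # inject a few obvious outliers
  write.csv(d, file.path(demo_dir, sprintf("epa_%d.csv", yr)), row.names = FALSE)
}

# Stack the yearly files of one time period into a single data frame.
files <- file.path(demo_dir, sprintf("epa_%d.csv", 2010:2012))
period_2010_12 <- do.call(rbind, lapply(files, read.csv))

# Basic descriptive statistics before cleaning.
print(summary(period_2010_12[c("RND_ADJ_FE", "CO2")]))

# Hypothetical helper: drop the most extreme rows by robust z-score,
# but never more than max_frac of the rows.
trim_outliers <- function(df, cols, max_frac = 0.01, z_cut = 4) {
  z <- sapply(df[cols], function(x) {
    abs(x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
  })
  score   <- apply(z, 1, max, na.rm = TRUE)   # worst column per row
  budget  <- floor(max_frac * nrow(df))       # hard cap on removed rows
  flagged <- which(score > z_cut)
  drop    <- head(flagged[order(score[flagged], decreasing = TRUE)], budget)
  if (length(drop) > 0) df[-drop, , drop = FALSE] else df
}

cleaned <- trim_outliers(period_2010_12, c("RND_ADJ_FE", "CO2"))
removed_frac <- 1 - nrow(cleaned) / nrow(period_2010_12)
cat(sprintf("Removed %.2f%% of rows\n", 100 * removed_frac))
```

Because the row budget is computed from `nrow(df)`, the 1% limit holds regardless of how many rows exceed the cutoff. The same steps would be repeated separately for the 2014-16 and 2018-20 files.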
Analysis (60 Points)
This section is further broken down into two parts:
Part A: (30 Points)
There are several numeric columns listed in the datasets. Use the dimension reduction tools learned during the course to condense the columns to a smaller number of dimensions, for each time period separately.
Use the reduced dimensions to perform "grouping" of similar vehicles. Keep the number of groups between 5 and 8 for each time period. Clearly define the groups based on their characteristics by running descriptive analytics on each group. Now compare the groups across the three time periods and point out any vehicles that jumped from one group to another over time. Also explain what that jump means.
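One way to approach Part A is sketched below. The assignment does not name specific techniques, so the use of PCA (`prcomp`) for dimension reduction and k-means (`kmeans`) for grouping is an assumption, as are the 80% variance cutoff, the synthetic data, and every column name other than `RND_ADJ_FE`.

```r
set.seed(42)

# Synthetic stand-in for the numeric columns of one time period
# (ASSUMPTION: ENGINE_DISP, CYLINDERS and CO2 are hypothetical names).
n    <- 600
size <- rnorm(n)  # latent "vehicle size" driving all columns
veh  <- data.frame(
  ENGINE_DISP = 3   + size       + rnorm(n, sd = 0.3),
  CYLINDERS   = 6   + 1.5 * size + rnorm(n, sd = 0.5),
  CO2         = 350 + 60 * size  + rnorm(n, sd = 15),
  RND_ADJ_FE  = 25  - 4 * size   + rnorm(n, sd = 1.5)
)

# Dimension reduction: PCA on centered, standardized columns.
pca <- prcomp(veh, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the fewest components reaching 80% cumulative variance (assumed cutoff).
k_pc   <- which(var_explained >= 0.80)[1]
scores <- pca$x[, seq_len(k_pc), drop = FALSE]

# Grouping: k-means on the retained component scores, 5 groups here
# (the assignment allows anything between 5 and 8).
km <- kmeans(scores, centers = 5, nstart = 25)
veh$group <- km$cluster

# Describe each group by its column means.
group_profile <- aggregate(. ~ group, data = veh, FUN = mean)
print(round(var_explained, 3))
print(group_profile)
```

Cluster labels from `kmeans` are arbitrary, so groups from different time periods should be matched by their profiles (for example, "large engine, low fuel economy") rather than by label number before checking which vehicles moved between groups.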
Part B: (30 Points)
This part is about predictive modeling where you are asked to try several modeling techniques separately for each time period. You will then compare the results from the best models for each time period.
The response variable for this problem is fuel economy in miles per gallon (column name: RND_ADJ_FE). You will create the best predictive model of miles per gallon for each time period. You will then compare those models in terms of their predictors and accuracy (R², MSE, etc.) and describe the results.
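A minimal sketch of how candidate models might be compared on held-out data is shown below. It assumes an 80/20 train/test split and uses two linear models purely as placeholders; the predictors `ENGINE_DISP` and `VEH_WEIGHT`, the simulated data, and the helper `eval_model()` are hypothetical. Other model families would be scored with the same helper.

```r
set.seed(7)

# Synthetic stand-in for one time period (ASSUMPTION: predictor names are made up).
n   <- 800
dat <- data.frame(
  ENGINE_DISP = runif(n, 1, 6),
  VEH_WEIGHT  = runif(n, 2500, 6000)
)
dat$RND_ADJ_FE <- 45 - 3 * dat$ENGINE_DISP - 0.003 * dat$VEH_WEIGHT +
  rnorm(n, sd = 1.5)

# Hold out 20% of the rows for testing.
test_idx <- sample(n, size = 0.2 * n)
train    <- dat[-test_idx, ]
test     <- dat[test_idx, ]

# Hypothetical helper: test-set MSE and R-squared for any model with predict().
eval_model <- function(fit, newdata, response = "RND_ADJ_FE") {
  pred <- predict(fit, newdata)
  obs  <- newdata[[response]]
  mse  <- mean((obs - pred)^2)
  r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  c(MSE = mse, R2 = r2)
}

# Two candidate models fitted on the training rows only.
fit_full    <- lm(RND_ADJ_FE ~ ENGINE_DISP + VEH_WEIGHT, data = train)
fit_reduced <- lm(RND_ADJ_FE ~ ENGINE_DISP, data = train)

results <- rbind(
  full    = eval_model(fit_full, test),
  reduced = eval_model(fit_reduced, test)
)
print(round(results, 3))
```

Running the same split-and-score routine separately for each time period yields one row per model and period, which makes the final comparison of predictors and accuracy straightforward to tabulate.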
Visualization (20 Points)
This is not a separate section of the analysis, but you are required to create several visual depictions of the analysis in both descriptive statistics and modeling parts of the report. Your grade will depend upon the uniqueness and description of the visuals.