Question
Find a data set online that you find interesting and that has between 5 and 10 variables per subject/item (that is, per row/record), preferably related to current events. You cannot use data that you have already used in another course.
Possible web sites:
https://www.kaggle.com/datasets
https://archive.ics.uci.edu/ml/datasets.php
https://data.fivethirtyeight.com/
https://www.vdh.virginia.gov/data/
https://data.worldbank.org/
https://data.gov/
(a) Plot the histograms of primary (important) continuous variables and probability distributions of categorical variables.
(b) Plot the box plots of all the primary variables in the data. Identify and delete a couple of extreme outliers from the data if there are any.
(c) Compute the matrix of sample correlations between every pair of variables.
(d) Choose two variables that are highly correlated (positively or negatively) and plot their scatter plot and the best regression line.
(e) Write a brief description of your data and summarize your findings on the variables.
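The numerical parts of (b), (c) and (d) can be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive solution: the DataFrame is synthetic stand-in data, the column names (`height_cm`, `weight_kg`, `age`) and the helper `drop_extreme_outliers` are hypothetical, and the 3 x IQR cutoff for an "extreme" outlier is one common convention that the question itself does not prescribe.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption; replace with your own data set).
rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, n)
weight = 0.9 * height - 85 + rng.normal(0, 5, n)
age = rng.integers(18, 70, n).astype(float)
df = pd.DataFrame({"height_cm": height, "weight_kg": weight, "age": age})
df.loc[0, "weight_kg"] = 400.0  # plant one obviously extreme value


def drop_extreme_outliers(data, cols, k=3.0):
    """Keep rows where every column in `cols` lies within k * IQR of its quartiles."""
    q1 = data[cols].quantile(0.25)
    q3 = data[cols].quantile(0.75)
    iqr = q3 - q1
    inside = ((data[cols] >= q1 - k * iqr) & (data[cols] <= q3 + k * iqr)).all(axis=1)
    return data[inside]


# (b) remove extreme outliers from the numeric columns
numeric_cols = ["height_cm", "weight_kg", "age"]
clean = drop_extreme_outliers(df, numeric_cols)

# (c) matrix of sample (Pearson) correlations between every pair of variables
corr = clean[numeric_cols].corr()

# (d) pick the pair with the largest absolute correlation, ignoring the diagonal
abs_corr = corr.abs().to_numpy().copy()
np.fill_diagonal(abs_corr, 0.0)
i, j = np.unravel_index(np.argmax(abs_corr), abs_corr.shape)
x_name, y_name = corr.columns[i], corr.columns[j]

# Least-squares regression line y = slope * x + intercept for that pair
slope, intercept = np.polyfit(clean[x_name], clean[y_name], 1)

print(corr.round(2))
print(f"Most correlated pair: {x_name}, {y_name} (r = {corr.loc[x_name, y_name]:.2f})")
print(f"Fitted line: {y_name} = {slope:.3f} * {x_name} + {intercept:.3f}")
```

With real data, only numeric columns should be passed to `corr()`, and it is worth inspecting each flagged row before deleting it, since the question asks for a couple of extreme outliers to be removed rather than every point beyond the whiskers.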
Please submit your data as a csv file.
Do this in a Jupyter notebook.
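The plots requested in (a), (b) and (d) might be produced in a notebook cell along these lines. This is a hedged sketch that assumes matplotlib is installed; the data frame `demo`, its columns (`income`, `spending`, `region`) and the helper `plot_overview` are hypothetical names invented for illustration, and the data are randomly generated rather than taken from any real source.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (an assumption; load your own CSV with pd.read_csv instead).
rng = np.random.default_rng(1)
income = rng.normal(50, 12, 150)
demo = pd.DataFrame({
    "income": income,
    "spending": 0.6 * income + rng.normal(0, 4, 150),
    "region": rng.choice(["north", "south", "west"], 150),
})


def plot_overview(data, x, y, cat):
    """Draw a histogram, a categorical distribution, box plots, and a fitted scatter plot."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # (a) histogram of a continuous variable
    axes[0, 0].hist(data[x], bins=20)
    axes[0, 0].set_title(f"Histogram of {x}")

    # (a) empirical probability distribution of a categorical variable
    probs = data[cat].value_counts(normalize=True).sort_index()
    axes[0, 1].bar(list(probs.index), probs.to_numpy())
    axes[0, 1].set_title(f"Relative frequency of {cat}")

    # (b) box plots of the continuous variables
    axes[1, 0].boxplot([data[x], data[y]])
    axes[1, 0].set_xticks([1, 2])
    axes[1, 0].set_xticklabels([x, y])
    axes[1, 0].set_title("Box plots")

    # (d) scatter plot with the least-squares regression line
    slope, intercept = np.polyfit(data[x], data[y], 1)
    xs = np.linspace(data[x].min(), data[x].max(), 100)
    axes[1, 1].scatter(data[x], data[y], s=10)
    axes[1, 1].plot(xs, slope * xs + intercept, color="red")
    axes[1, 1].set_xlabel(x)
    axes[1, 1].set_ylabel(y)
    axes[1, 1].set_title("Scatter plot with regression line")

    fig.tight_layout()
    return fig


fig = plot_overview(demo, "income", "spending", "region")
```

In a Jupyter notebook the figure typically renders inline when the cell finishes. To produce the CSV file that the question asks for, `demo.to_csv(...)` with a file name of your choice would write the data frame out.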