Question

Posted on Jul 09, 2024

Please use RStudio for these questions, and submit your answers in R Markdown with screenshots.

Problem 1:

For this question, we will use the US census dataset from 1994, which is in adult.csv.

a. Show the descriptive and summary statistics for this dataset. Based on those metrics, what can we say about the distribution of age and education-num? Hint: Create a histogram of each variable.
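As a rough illustration of part (a), the sketch below runs on a small made-up data frame rather than on adult.csv itself. The column names `age` and `education.num` are assumptions (by default `read.csv` would turn a header such as `education-num` into `education.num`), and every value is invented.

```r
# Toy stand-in for adult.csv; the values are invented for illustration.
# With the real file you would start from: adult <- read.csv("adult.csv")
adult <- data.frame(
  age = c(25, 38, 28, 44, 18, 34, 63, 24),
  education.num = c(7, 9, 12, 10, 10, 6, 15, 10)
)

# Descriptive statistics: min, quartiles, median, mean and max per column.
summary(adult)

# Spread of each variable.
sd(adult$age)
sd(adult$education.num)

# Histograms show the shape (skew, peaks) that the summary only hints at.
hist(adult$age, main = "Age", xlab = "age")
hist(adult$education.num, main = "Years of education", xlab = "education.num")
```

Comparing the mean with the median in the `summary()` output is one quick way to judge skew before looking at the histograms.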

b. Now create a scatterplot matrix of the numerical variables. Are there any strong correlations between any two variables? If so, what are they?
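One possible approach to part (b) is sketched here with base R only. The data frame is invented, and the column names (`age`, `education.num`, `hours.per.week`, `sex`) are assumptions about how adult.csv looks after import.

```r
# Toy stand-in for adult.csv; the values are invented for illustration.
adult <- data.frame(
  sex = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  age = c(25, 38, 28, 44, 18, 34, 63, 24),
  education.num = c(7, 9, 12, 10, 10, 6, 15, 10),
  hours.per.week = c(40, 50, 40, 40, 30, 30, 32, 40)
)

# Keep only the numeric columns; pairs() and cor() cannot use text columns.
adult_num <- adult[sapply(adult, is.numeric)]

# Scatterplot matrix: one panel for each pair of variables.
pairs(adult_num)

# Correlation matrix to back up what the panels suggest.
cor_mat <- cor(adult_num)
round(cor_mat, 2)
```

Values near 1 or -1 off the diagonal of `cor_mat` would indicate a strong linear relationship; values near 0 indicate little or none.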

c. Based on descriptive and summary statistics and box plots for age, education-num and hours-per-week, are there any differences between males and females? Hint: Use the "filter" function from the Tidyverse to create male and female subsets, and create box plots.
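The hint for part (c) might translate into something like the sketch below. It assumes a column named `sex` with the labels "Male" and "Female"; both the column name and the labels are guesses about adult.csv, so checking `unique(adult$sex)` on the real data first is advisable. The values are invented.

```r
library(dplyr)

# Toy stand-in for adult.csv; the values are invented for illustration.
adult <- data.frame(
  sex = c("Male", "Female", "Male", "Female", "Male", "Female"),
  age = c(25, 38, 28, 44, 18, 34),
  hours.per.week = c(40, 50, 40, 40, 30, 30)
)

# Split into two subsets with filter().
males <- adult %>% filter(sex == "Male")
females <- adult %>% filter(sex == "Female")

# Summary statistics for each group.
summary(males$age)
summary(females$age)

# Side-by-side box plots for one variable; repeat for the other variables.
boxplot(males$age, females$age, names = c("Male", "Female"), ylab = "age")
```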

Problem 2:

In this question, you will integrate data from different years into one table and use some reshaping to get a visualization. There are two data files: population_even.csv and population_odd.csv. These contain population data for even and odd years respectively.

a. Join the two tables together so that you have one table with each state's population for years 2010-2019. If you are unsure about what variable to use as the key for the join, consider what variable the two original tables have in common. (Show a head of the resulting table.)
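A minimal sketch of the join, assuming both files share a state-name column called `NAME` and a state ID column called `STATE`, and that the yearly columns are named like `POPESTIMATE2010`. All of these names are assumptions for illustration, and the numbers are invented.

```r
library(dplyr)

# Toy stand-ins for population_even.csv and population_odd.csv.
pop_even <- data.frame(
  STATE = c(1, 2),
  NAME = c("Alabama", "Alaska"),
  POPESTIMATE2010 = c(100, 10),
  POPESTIMATE2012 = c(104, 12)
)
pop_odd <- data.frame(
  STATE = c(1, 2),
  NAME = c("Alabama", "Alaska"),
  POPESTIMATE2011 = c(102, 11),
  POPESTIMATE2013 = c(106, 13)
)

# Join on the column the two tables have in common.
# STATE exists in both tables but is not the key, so dplyr keeps both
# copies and names them STATE.x and STATE.y.
population <- inner_join(pop_even, pop_odd, by = "NAME")
head(population)
```

Joining on `NAME` alone is what produces the duplicated `STATE.x` and `STATE.y` columns; joining on both `STATE` and `NAME` would avoid the duplicate.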

b. Clean this data up a bit (show a head of the data after):

a. Remove the duplicate state ID column if your process created one. Hint: To remove the duplicate columns, use the syntax "select(-c(STATE.y, STATE.x))".

b. Reorder the columns to be in year order. Hint: Use the "relocate" function.

c. Rename columns to be just the year number. Hint: Make an array of existing columns and make another array of the renamed columns. Then use the "rename_at" function to change the existing column names to be the renamed column names.
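The three cleanup steps could be chained as in the sketch below. The joined table is recreated inline so the example stands alone; the column names (`STATE.x`, `STATE.y`, `NAME`, `POPESTIMATE2010`, and so on) are assumptions, and the numbers are invented.

```r
library(dplyr)

# Toy joined table with duplicated ID columns and years out of order.
population <- data.frame(
  STATE.x = c(1, 2),
  NAME = c("Alabama", "Alaska"),
  POPESTIMATE2010 = c(100, 10),
  POPESTIMATE2012 = c(104, 12),
  STATE.y = c(1, 2),
  POPESTIMATE2011 = c(102, 11),
  POPESTIMATE2013 = c(106, 13)
)

# Existing column names and the names they should become, in matching order.
old_names <- paste0("POPESTIMATE", 2010:2013)
new_names <- as.character(2010:2013)

population_clean <- population %>%
  # Step a: drop the duplicated state ID columns.
  select(-c(STATE.y, STATE.x)) %>%
  # Step b: move an out-of-place year next to its neighbour
  # (repeat for every year that is out of order in the real data).
  relocate(POPESTIMATE2011, .after = POPESTIMATE2010) %>%
  # Step c: rename the year columns to the bare year number.
  rename_at(vars(all_of(old_names)), ~ new_names)

head(population_clean)
```

Because the new names start with a digit, they need backticks when used in later code, for example `` population_clean$`2010` ``.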

c. Deal with missing values in the data by replacing them with the average of the surrounding years. For example, if you had a missing value for Georgia in 2016, you would replace it with the average of Georgia's 2015 and 2017 numbers. This may require some manual effort. Hint: Use the "mutate" function like the following: "mutate(column = ifelse(is.na(column), calculation, column))", where is.na(column) is TRUE for the rows in which the column value is missing.
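The Georgia example can be sketched as follows, assuming the year columns have already been renamed to bare year numbers. The table and its values are invented for illustration.

```r
library(dplyr)

# Toy table with one missing value (Georgia, 2016).
population_clean <- data.frame(
  NAME = c("Georgia", "Alaska"),
  `2015` = c(100, 10),
  `2016` = c(NA, 11),
  `2017` = c(104, 12),
  check.names = FALSE  # keep "2015" etc. as column names
)

# Replace a missing 2016 value with the mean of 2015 and 2017;
# rows that already have a 2016 value are left unchanged.
population_filled <- population_clean %>%
  mutate(`2016` = ifelse(is.na(`2016`), (`2015` + `2017`) / 2, `2016`))

population_filled
```

The same `mutate()` line would be repeated, with the neighbouring years adjusted, for each year column that contains a missing value.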

d. We can use some tidyverse aggregation to learn about the population. You can use the original dataset and not the one from (c).

a. Get the maximum population for a single year for each state. Note that because you are using an aggregation function (max) across a row, you will need the rowwise() command in your tidyverse pipe. If you do not, the max value will not be individual to the row.

b. Now get the total population across all years for each state. This should be possible with a very minor change to the code from (a).
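One way the two row-wise aggregations might look is sketched below. The table is invented, `year_cols` is a helper name introduced here for illustration, and `na.rm = TRUE` is an assumption made because the original (uncleaned) data may still contain missing values.

```r
library(dplyr)

# Toy table with year columns; one value is deliberately missing.
population <- data.frame(
  NAME = c("Alabama", "Alaska"),
  `2010` = c(100, 10),
  `2011` = c(102, 11),
  `2012` = c(104, NA),
  check.names = FALSE
)

# Hypothetical helper: the columns to aggregate across.
year_cols <- c("2010", "2011", "2012")

# Maximum single-year population per state.
# rowwise() makes max() work within each row instead of over whole columns.
state_max <- population %>%
  rowwise() %>%
  mutate(max_pop = max(c_across(all_of(year_cols)), na.rm = TRUE)) %>%
  ungroup()

# Total across all years per state: same pipe, with sum() in place of max().
state_total <- population %>%
  rowwise() %>%
  mutate(total_pop = sum(c_across(all_of(year_cols)), na.rm = TRUE)) %>%
  ungroup()
```

Without `rowwise()`, `max()` would return a single value computed over every row, which is the pitfall the question warns about.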

e. Finally, get the total US population from the total population across all years for each state that you calculated in the previous step. Keep in mind that this can be done with a single line of code even without the tidyverse, so keep it simple.
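In base R this comes down to a single `sum()` over the per-state totals. The sketch assumes those totals sit in a column named `total_pop`, a name chosen here for illustration; the values are invented.

```r
# Toy per-state totals standing in for the result of the previous step.
state_total <- data.frame(
  NAME = c("Alabama", "Alaska"),
  total_pop = c(306, 21)
)

# Total US population across all states and years.
us_total <- sum(state_total$total_pop)
us_total
```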

Link to the attached files for each question:

adult.csv: https://drive.google.com/file/d/1Iq5BynJfBeFd9P7dODCIiYrVXY4DE-SX/view?usp=sharing

population_even.csv: https://drive.google.com/file/d/1h4y1Szdy7chHYbaMf80SKwqKvuhFS-Vg/view?usp=sharing

population_odd.csv: https://drive.google.com/file/d/1b1ZxIStrZhy3l2csVUUHpOeWXChvu25-/view?usp=sharing