
Question



1. Start a new R Markdown file (.Rmd) as you learned in the previous module.
Copy the following YAML header and the first code chunk to set up the global environment for knitting. You must submit the HTML report for me to grade your work. I will use your HTML file as the primary document for grading. The YAML header and the first code chunk below help you organize your work and let you knit the Rmd file even when a chunk contains an error.
---
title: "Final Exam with two data sets"
author: "Jae Jung"
date: '`r Sys.time()`'
output:
  html_document:
    toc: yes
    toc_depth: 5
    highlight: espresso
    theme: journal
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
error = TRUE
)
```
Save the file with an appropriate name, starting with your name (e.g., "Jung, Jae-Finals.Rmd"), in the "Test" folder.
2. Implement the following tasks in the R Markdown file.
Do not forget to copy and paste your questions and create code chunks in your R Markdown file for each question. For example, for this assignment, you must have 10 code chunks (You have 10 questions). Insert your code chunk below each question.
Part 1: Airquality Data
Questions 1 through 5 will be about the same dataset available in the R program: "airquality."
# Question 1: (1) Get a local copy of the dataset "airquality" and name it "df" so that you can use it later. (2) Next, show the first 7 rows of it. Pay attention to the names of the variables. (3) Write code that reveals how many variables and observations are in the dataset. (4) Also, write code that gives you some basic descriptive statistics. You will notice that two variables have missing values.
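One possible way to approach Question 1 is sketched below using base R only. The choice of `dim()` and `summary()` is an assumption; `str()` or `nrow()`/`ncol()` would serve equally well.

```r
# (1) Local copy of the built-in dataset
df <- airquality

# (2) First 7 rows; note the variable names in the header
head(df, 7)

# (3) Number of observations (rows) and variables (columns)
dim(df)

# (4) Basic descriptive statistics; columns with missing values report an NA count
summary(df)
```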
# Question 2: Write code that tells you (1) where the missing values are located, (2) the number of missing values in the dataset (df), (3) the number of missing values in the Solar.R column, and (4) all the rows that include at least one missing value. (5) Lastly, write the code that returns the number of rows that include at least one missing value. Hint: there are rows that have more than one missing value.
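A minimal sketch for Question 2 in base R follows. It starts from a fresh copy of `airquality` so the block runs on its own; `complete.cases()` is one of several ways to flag rows with missing values.

```r
df <- airquality

# (1) Row and column positions of every missing value
which(is.na(df), arr.ind = TRUE)

# (2) Total number of missing values in df
sum(is.na(df))

# (3) Number of missing values in the Solar.R column
sum(is.na(df$Solar.R))

# (4) All rows that contain at least one missing value
df[!complete.cases(df), ]

# (5) Number of such rows; a row with two NAs is counted once
sum(!complete.cases(df))
```

Because step (5) counts rows rather than cells, its result can be smaller than the total from step (2).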
# Question 3: (1) Replace all the missing values in the Solar.R column with the median of the values in the column. (2) Also, get the standard deviation and average of all columns.
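Question 3 could be sketched as follows, again in base R. Passing `na.rm = TRUE` to `sd()` and `mean()` is an assumption made here because the Ozone column still contains missing values at this point.

```r
df <- airquality

# (1) Replace missing Solar.R values with the column median
solar_median <- median(df$Solar.R, na.rm = TRUE)
df$Solar.R[is.na(df$Solar.R)] <- solar_median

# (2) Standard deviation and average of every column,
#     ignoring the remaining NAs in Ozone
sapply(df, sd, na.rm = TRUE)
sapply(df, mean, na.rm = TRUE)
```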
# Question 4: The goal is to create a new column filled with "low", "average", and "high" based on information from Ozone and Solar.R columns.
(1) Create a new column called "newCol," which is full of NA values.
(2) If both values in the first two columns (i.e., Ozone and Solar.R) of the df dataset in each row are less than the averages of the respective columns, put "low" in the new column; if they are the same as the averages, put "average"; and if both values are greater than the averages, put "high" in the new column (use the pipe operator).
*Hint*: You will need to replace the missing values in the Ozone column with the mean of the column before creating the new variable.
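The steps of Question 4 might look like the sketch below, which assumes the dplyr package is installed and uses `case_when()` inside a pipe. Leaving mixed rows (one value above its average, the other below) as NA is an assumption, since the question does not say how to label them.

```r
library(dplyr)

df <- airquality

# Impute missing values first: median for Solar.R (Question 3),
# mean for Ozone (the hint in Question 4)
df$Solar.R[is.na(df$Solar.R)] <- median(df$Solar.R, na.rm = TRUE)
df$Ozone[is.na(df$Ozone)] <- mean(df$Ozone, na.rm = TRUE)

# (1) New column full of NA values
df$newCol <- NA_character_

# (2) Fill the column with the pipe operator.
# Note: exact equality (==) on floating-point averages rarely matches,
# so "average" may never appear in practice.
df <- df %>%
  mutate(newCol = case_when(
    Ozone < mean(Ozone) & Solar.R < mean(Solar.R) ~ "low",
    Ozone == mean(Ozone) & Solar.R == mean(Solar.R) ~ "average",
    Ozone > mean(Ozone) & Solar.R > mean(Solar.R) ~ "high",
    TRUE ~ NA_character_  # mixed rows stay NA (assumption)
  ))

# Distribution of the new labels, including NA
table(df$newCol, useNA = "ifany")
```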
# Question 5: Rename the column "newCol" to "Air_Rate". Find the pairs of variables with the highest and lowest correlation in df, and assign them to the variables highest_cor and lowest_cor, respectively.
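A hedged sketch for Question 5 follows. It rebuilds a small `df` from the raw `airquality` data so that it runs on its own, so the numbers may differ from a `df` that has already been imputed. Two assumptions are made: "lowest" is read as the most negative correlation (not the one closest to zero), and missing values are handled with `use = "pairwise.complete.obs"`.

```r
df <- airquality
df$newCol <- NA_character_

# Rename newCol to Air_Rate
names(df)[names(df) == "newCol"] <- "Air_Rate"

# Correlation matrix of the numeric columns only
num_df <- df[sapply(df, is.numeric)]
cor_mat <- cor(num_df, use = "pairwise.complete.obs")

# Keep each pair once: blank out the diagonal and the lower triangle
cor_mat[!upper.tri(cor_mat)] <- NA

# Locate the largest and smallest remaining correlations
hi_idx <- which(cor_mat == max(cor_mat, na.rm = TRUE), arr.ind = TRUE)
lo_idx <- which(cor_mat == min(cor_mat, na.rm = TRUE), arr.ind = TRUE)

# Store each pair as a character vector of two variable names
highest_cor <- c(rownames(cor_mat)[hi_idx[1, "row"]],
                 colnames(cor_mat)[hi_idx[1, "col"]])
lowest_cor <- c(rownames(cor_mat)[lo_idx[1, "row"]],
                colnames(cor_mat)[lo_idx[1, "col"]])

highest_cor
lowest_cor
```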
Part 2: Gapminder Data
Questions 6 through 10 will be about the same dataset, available from the "gapminder" package, which you may have to install and load before you can use it.
# Question 6: From the "gapminder" dataset, select the columns, "country", "continent", "year", and "lifeExp" and save the subset of gapminder data as "data." Tidy this data set using the pipe operator such that there is only one country in each row and many years in the columns and life expectancy as a value for year columns. Save this new tidy data as "wide_data" (should contain 13 columns in the end)
Hint: Since the data is not built into R or RStudio, you need to install the package called "gapminder" and then load it into the computer's short-term memory.
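One way Question 6 could be sketched is shown below, assuming the dplyr, tidyr, and gapminder packages are installed. Dropping `continent` before widening is an assumption made here so that the result has the 13 columns the question asks for (one for country plus one per year).

```r
# install.packages("gapminder")  # run once if the package is missing
library(dplyr)
library(tidyr)
library(gapminder)

# Subset of the four requested columns
data <- gapminder %>%
  select(country, continent, year, lifeExp)

# One row per country, one column per year, lifeExp as the cell value
wide_data <- data %>%
  select(-continent) %>%  # assumption: dropped to reach 13 columns
  pivot_wider(names_from = year, values_from = lifeExp)

dim(wide_data)
```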
# Question 7: Choose only the cases for the U.S. (Hint: 12 rows). Next, pipe the data into plotting a line chart with the variables "year" and "lifeExp" on the x-axis and y-axis, respectively. Improve the legibility of the chart. First, label the variables on the chart as "Years" for the x-axis and "Life Expectancy" for the y-axis. Next, provide the chart with the title "Life Expectancy over Years in the United States" and set the line color to red.
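Question 7 might be approached as in the sketch below, assuming dplyr, ggplot2, and gapminder are installed. The names `us_data` and `us_plot` and the use of `theme_minimal()` are illustrative choices, not part of the assignment.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Keep only the rows for the United States
us_data <- gapminder %>%
  filter(country == "United States")

# Pipe the filtered data into a red line chart with readable labels
us_plot <- us_data %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(color = "red") +
  labs(
    title = "Life Expectancy over Years in the United States",
    x = "Years",
    y = "Life Expectancy"
  ) +
  theme_minimal()

us_plot
```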
# Question 8:
(1) From the data set "gapminder," use the data for the most recent year only. Next, create a chart that shows the life expectancies by continent. Make the plot as beautiful and professional as you can. This includes adding color(s) to the bars and giving appropriate labels for the title, x-axis, and y-axis. What can you tell about the pattern of life expectancies across the continents? Hint: Due to a large number of countries in the data, using countries as x- or y-axis will be problematic in making the charts interpretable.
(2) This
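For part (1) of Question 8, a bar chart of the average life expectancy per continent is one reasonable reading of "life expectancies by continent"; a box plot per continent would be another. The sketch below assumes dplyr, ggplot2, and gapminder are installed, and the names `latest` and `continent_life` are illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Keep only the most recent year in the data
latest <- gapminder %>%
  filter(year == max(year))

# Average life expectancy per continent (assumption: the mean is the summary of interest)
continent_life <- latest %>%
  group_by(continent) %>%
  summarise(mean_lifeExp = mean(lifeExp), .groups = "drop")

# Bars ordered from lowest to highest average, one color per continent
continent_plot <- ggplot(
  continent_life,
  aes(x = reorder(continent, mean_lifeExp), y = mean_lifeExp, fill = continent)
) +
  geom_col(show.legend = FALSE) +
  labs(
    title = paste("Average Life Expectancy by Continent,", max(latest$year)),
    x = "Continent",
    y = "Average Life Expectancy (years)"
  ) +
  theme_minimal()

continent_plot
```

Aggregating to the continent level follows the hint: plotting individual countries would put far too many categories on one axis.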
