Question

Posted on Aug 24, 2024

1. Start a new R Markdown file (.Rmd) as you learned in the previous module.

Copy the following YAML header and the first code chunk to set up the global environment for knitting. You must submit the HTML report for me to grade your work. I will use your HTML file as the primary document for grading. The YAML header and the first code chunk below allow you to organize your work well and to knit the Rmd file even when there might be an error.

---
title: "Final Exam with two data sets"
author: "Jae Jung"
date: '`r Sys.time()`'
output:
  html_document:
    toc: yes
    toc_depth: 5
    highlight: espresso
    theme: journal
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, error = TRUE)
```

Save the file with an appropriate name, starting with your name (e.g., "Jung, Jae-Finals.Rmd"), in the "Test" folder.

2. Implement the following tasks in the R Markdown file.

Do not forget to copy and paste your questions and create code chunks in your R Markdown file for each question. For example, for this assignment, you must have 10 code chunks (you have 10 questions). Insert your code chunk below each question.

Part 1: Airquality Data

Questions 1 through 5 will be about the same dataset available in the R program: "airquality."

# Question 1

(1) Get a local copy of the dataset "airquality" and name it "" so that you can use it later.

(2) Next, show the first 7 rows of it. Pay attention to the names of the variables.

(3) Write code that reveals how many variables and observations are in the dataset.

(4) Also, write code that gives you some basic descriptive statistics. You will notice that two variables have missing values.
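One way the four steps of Question 1 might look is sketched below. The name of the local copy is blank in the question text; `df` is an assumption here, taken from the later references to "the df dataset".

```r
# Assumption: the local copy is called `df` (the name is blank in the question).
df <- airquality

# (2) First 7 rows, including the variable names in the header.
head(df, 7)

# (3) Number of observations (rows) and variables (columns).
dim(df)

# (4) Basic descriptive statistics; summary() also reports NA counts per column.
summary(df)
```

`str(df)` is an alternative for step (3) that also shows each column's type.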

# Question 2: Write the code that tells you (1) where the missing values are located, (2) the number of missing values in the dataset ( ), (3) the number of missing values in the Solar.R column, and (4) all the rows that include at least one missing value. (5) Lastly, write the code that returns the number of rows that include at least one missing value. Hint: there are rows that have more than one missing value.
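The five parts of Question 2 can be sketched with base R's `is.na()` and `complete.cases()`. The data frame name `df` and the result names (`na_positions`, `n_na_total`, and so on) are assumptions for illustration only.

```r
# Assumption: `df` is the local copy of airquality from Question 1.
df <- airquality

# (1) Row and column index of every missing value.
na_positions <- which(is.na(df), arr.ind = TRUE)
na_positions

# (2) Total number of missing values in the dataset.
n_na_total <- sum(is.na(df))

# (3) Number of missing values in the Solar.R column.
n_na_solar <- sum(is.na(df$Solar.R))

# (4) All rows with at least one missing value.
rows_with_na <- df[!complete.cases(df), ]
rows_with_na

# (5) Number of such rows. This is smaller than the total NA count
#     because some rows are missing more than one value.
n_rows_with_na <- sum(!complete.cases(df))

c(total = n_na_total, solar = n_na_solar, rows = n_rows_with_na)
```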

# Question 3

(1) Replace all the missing values in the Solar.R column with the median of the values in the column.

(2) Also, get the standard deviation and average of all columns.
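A minimal sketch of median imputation followed by column-wise summaries, again assuming the data frame is called `df`. Because Ozone still contains missing values at this point, `na.rm = TRUE` is passed so its statistics are not returned as `NA`.

```r
# Assumption: `df` is the local copy of airquality.
df <- airquality

# (1) Replace missing Solar.R values with the column median.
solar_median <- median(df$Solar.R, na.rm = TRUE)
df$Solar.R[is.na(df$Solar.R)] <- solar_median

# (2) Standard deviation and mean of every column.
col_sd   <- sapply(df, sd,   na.rm = TRUE)
col_mean <- sapply(df, mean, na.rm = TRUE)

col_sd
col_mean
```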

# Question 4: The goal is to create a new column filled with "low", "average", and "high" based on information from the Ozone and Solar.R columns.

(1) Create a new column called "newCol," which is full of NA values.

(2) If both values in the first two columns (i.e., Ozone and Solar.R) of the df dataset in each row are less than the averages of the respective columns, put "Low" in the new column; if they are the same as the averages, put "Average"; and if both values are greater than the averages, put "high" in the new column (use the pipe operator).

*Hint*: You will need to replace the missing values in the Ozone column with the mean of the column before creating the new variable.
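The labelling rule in Question 4 might be sketched as follows. This version uses base R with the native pipe `|>` (an assumption: R 4.1 or later) rather than any particular package, and it uses the lower-case labels from the question's first sentence. Rows where Ozone and Solar.R fall on different sides of their averages match none of the three conditions and stay `NA`.

```r
# Assumption: `df` carries the Solar.R median imputation from Question 3.
df <- airquality
df$Solar.R[is.na(df$Solar.R)] <- median(df$Solar.R, na.rm = TRUE)

# Hint from the question: fill missing Ozone values with the column mean.
df$Ozone[is.na(df$Ozone)] <- mean(df$Ozone, na.rm = TRUE)

# (1) New column full of NA values.
df$newCol <- NA_character_

# (2) Label each row by comparing both columns with their averages.
#     Note: exact equality on doubles is fragile, so "average" may never match.
df <- df |>
  transform(newCol = ifelse(
    Ozone < mean(Ozone) & Solar.R < mean(Solar.R), "low",
    ifelse(
      Ozone == mean(Ozone) & Solar.R == mean(Solar.R), "average",
      ifelse(
        Ozone > mean(Ozone) & Solar.R > mean(Solar.R), "high",
        NA_character_
      )
    )
  ))

table(df$newCol, useNA = "ifany")
```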

# Question 5: Rename the column "newCol" to "Air_Rate". Find a pair of variables with the highest and lowest correlation in df, and assign it to the variables highest_cor and lowest_cor.
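A possible sketch for Question 5 is shown below. It reads "lowest" as the most negative correlation, which is an assumption; if the weakest relationship is meant, rank by `abs(Freq)` instead. The helper names `num_cols`, `cm`, and `pairs` are illustrative, and `newCol` is a placeholder standing in for the column built in Question 4.

```r
# Assumption: `df` holds airquality plus a `newCol` column (placeholder here).
df <- airquality
df$newCol <- NA_character_

# Rename newCol to Air_Rate.
names(df)[names(df) == "newCol"] <- "Air_Rate"

# Correlation matrix over the numeric columns, ignoring NAs pair by pair.
num_cols <- df[sapply(df, is.numeric)]
cm <- cor(num_cols, use = "pairwise.complete.obs")

# Blank the diagonal and upper triangle so each pair appears once.
cm[upper.tri(cm, diag = TRUE)] <- NA

# Long format: one row per variable pair, correlation in `Freq`.
pairs <- as.data.frame(as.table(cm))
pairs <- pairs[!is.na(pairs$Freq), ]

highest_cor <- pairs[which.max(pairs$Freq), ]
lowest_cor  <- pairs[which.min(pairs$Freq), ]

highest_cor
lowest_cor
```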

Part 2: Gapminder Data

Questions 6 through 10 will be about the same dataset available from the "gapminder" package that you may have to install and load up to be able to use.

# Question 6: From the "gapminder" dataset, select the columns "country", "continent", "year", and "lifeExp" and save the subset of gapminder data as "data." Tidy this data set using the pipe operator such that there is only one country in each row, with many years in the columns and life expectancy as the value for the year columns. Save this new tidy data as "wide_data" (should contain 13 columns in the end).

Hint: Since the data is not built into R or RStudio, you will need to install the package called "gapminder" and then load it into the computer's short-term memory.
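The reshaping in Question 6 could be sketched with `tidyr::pivot_wider()`, assuming the gapminder, dplyr, and tidyr packages are installed. Keeping "continent" would give 14 columns (2 identifier columns plus 12 years), so this sketch drops it before pivoting to reach the 13 columns stated in the question; that reading of the column count is an assumption.

```r
# install.packages(c("gapminder", "dplyr", "tidyr"))  # once, if not installed
library(gapminder)
library(dplyr)
library(tidyr)

# Subset of the four requested columns.
data <- gapminder %>%
  select(country, continent, year, lifeExp)

# One row per country, one column per year, life expectancy as the values.
# Assumption: continent is dropped so the result has 13 columns.
wide_data <- data %>%
  select(-continent) %>%
  pivot_wider(names_from = year, values_from = lifeExp)

dim(wide_data)
```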

# Question 7: Choose only the cases for the U.S. (Hint: 12 rows). Next, pipe the data into plotting a line chart with the variables "year" and "lifeExp" on the x-axis and y-axis, respectively. Improve the legibility of the chart. First, label the variables on the chart as "Years" for the x-axis and "Life Expectancy" for the y-axis. Next, provide the chart with the title "Life Expectancy over Years in the United States" and set the line color to red.
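A sketch of the filter-then-plot pipeline for Question 7, assuming the gapminder, dplyr, and ggplot2 packages are installed. The intermediate names `us` and `p` are illustrative; the country is stored in gapminder as "United States".

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Keep only the rows for the United States (12 rows, one per survey year).
us <- gapminder %>%
  filter(country == "United States")

# Pipe the filtered data into ggplot; layers are then added with `+`.
p <- us %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(color = "red") +
  labs(
    title = "Life Expectancy over Years in the United States",
    x = "Years",
    y = "Life Expectancy"
  )

p
```

Because `%>%` binds more tightly than `+`, the data is piped into `ggplot()` first and the remaining layers are added to the resulting plot object.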

# Question 8

(1) From the data set "gapminder," use the data for the most recent year only. Next, create a chart that shows the life expectancies by continent. Make the plot as beautiful and professional as you can. This includes adding color ( ) to the bars and giving appropriate labels for the title, x-axis, and y-axis. What can you tell about the pattern of life expectancies across the continents? Hint: Due to a large number of countries in the data, using countries as x