Question

Often when you get a dataset, it is not in the exact format you want or need, so you have to refine it into something more useful. This is often called data munging. In this lab exercise, you will read a dataset into a dataframe and clean it. Then, you will explore the distribution within the dataset.
An interactive tutorial on basic to advanced R features is also available within R itself. It is called Statistics With Interactive R Learning (SWIRL). More information about installing and using this feature can be found at http://swirlstats.com/students.html
Review the Readings/Resources in the unit. Complete the following steps:
Create a function (named readStates) to read a CSV file into R
You need to read from a URL, not from a file stored locally on your computer.
The file is a dataset on state populations (within the United States)
The URL is: https://www2.census.gov/programs-surveys/popest/tables/2010-2011/state/totals/nst-est2011-01.csv (Note that you might need to use https:// rather than http://)
Clean the dataframe
Note the issues that need to be fixed (removing columns, removing rows, changing column names).
Within your function, make sure there are 51 rows (one per state plus the District of Columbia).
Make sure there are only 5 columns with the columns having the following names (stateName, Census, Estimates, Pop2010, Pop2011).
Make sure the last four columns are numbers (i.e. not strings).
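One possible shape for the read-and-clean steps above is sketched here. It is not the official solution: `cleanStates` is a hypothetical helper name, and the row offsets (8 leading non-state rows, then 51 state rows), the leading "." on state names, and the comma thousands separators are assumptions about the file layout that should be checked with head() and tail() on the raw data.

```r
# Hypothetical helper: tidy the raw frame returned by read.csv().
# ASSUMPTION: rows 1-8 are header/region rows and rows 9-59 hold the
# 50 states plus the District of Columbia; inspect the raw frame to confirm.
cleanStates <- function(raw) {
  df <- raw[9:59, 1:5]                      # keep state rows, first five columns
  names(df) <- c("stateName", "Census", "Estimates", "Pop2010", "Pop2011")
  rownames(df) <- NULL                      # renumber rows 1..51

  # ASSUMPTION: state names carry a leading "." (e.g. ".Alabama")
  df$stateName <- sub("^\\.", "", df$stateName)

  # ASSUMPTION: numbers are stored as text with commas (e.g. "4,779,736")
  for (col in names(df)[-1]) {
    df[[col]] <- as.numeric(gsub(",", "", df[[col]]))
  }

  # Fail loudly if the layout assumptions above do not hold
  stopifnot(nrow(df) == 51, ncol(df) == 5, !anyNA(df$Pop2011))
  df
}

# Reads the CSV from the URL given in the assignment, then cleans it.
readStates <- function(
    url = "https://www2.census.gov/programs-surveys/popest/tables/2010-2011/state/totals/nst-est2011-01.csv") {
  raw <- read.csv(url, stringsAsFactors = FALSE)
  cleanStates(raw)
}

# Usage (requires network access):
# dfStates <- readStates()
```

Keeping the cleaning in its own function means it can be tried on a small hand-made frame without downloading anything; the stopifnot() call is there so a wrong guess about the layout stops with an error instead of silently returning bad rows.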
Store and explore the dataset
Store the dataset into a dataframe, called dfStates.
Test your dataframe by calculating the mean for the 2011 data, by doing: mean(dfStates$Pop2011).
Find the state with the highest population
Based on the 2011 data, what is the population of the state with the highest population? What is the name of that state?
Sort the data, in increasing order, based on the 2011 data.
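The mean, maximum, and sorting steps can be sketched on a small stand-in dataframe. `dfToy` and its values are made up purely for illustration and are not census figures; with the real data, the same calls would be applied to dfStates.

```r
# Toy stand-in for dfStates; names and numbers are invented for illustration.
dfToy <- data.frame(
  stateName = c("A", "B", "C", "D"),
  Pop2011   = c(300, 1200, 500, 800),
  stringsAsFactors = FALSE
)

# Mean of the 2011 column
meanPop <- mean(dfToy$Pop2011)

# Row with the highest 2011 population: which.max() returns its index
topRow <- dfToy[which.max(dfToy$Pop2011), ]
topRow$stateName   # name of the most populous entry
topRow$Pop2011     # its population

# Sort in increasing order of the 2011 column: order() returns the row permutation
dfSorted <- dfToy[order(dfToy$Pop2011), ]
```

Using which.max() keeps the name and the population together in one row, so both parts of the question come from the same lookup.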
Explore the distribution of the states
Write a function that takes two parameters. The first is a vector and the second is a number.
The function will return the proportion of elements within the vector that are less than the given number (i.e. the cumulative distribution below the value provided). For example, if the vector had 5 elements (1,2,3,4,5), with 2 being the number passed into the function, the function would return 0.2 (since 20% of the numbers were below 2).
Test the function with the vector dfStates$Pop2011 and the mean of dfStates$Pop2011.
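A minimal sketch of such a function follows. The name `percentBelow` and its parameter names are hypothetical, since the assignment does not specify them. It relies on the fact that taking the mean of a logical vector in R gives the fraction of TRUE values.

```r
# Hypothetical name: returns the fraction of elements in v that are
# strictly less than threshold.
percentBelow <- function(v, threshold) {
  # v < threshold is a logical vector; its mean is the share of TRUE values
  mean(v < threshold)
}

# The example from the assignment: only 1 of the 5 values is below 2
percentBelow(c(1, 2, 3, 4, 5), 2)   # 0.2

# With the real data (assuming dfStates has been created):
# percentBelow(dfStates$Pop2011, mean(dfStates$Pop2011))
```

The comparison is strict, so a value equal to the threshold is not counted, which matches the example where 2 itself is excluded.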
