Question

R code:

## 4. __Scrape baseball-reference.com with rvest__

You will use the rvest package to scrape data from the website baseball-reference.com.

Begin at the [teams page](http://www.baseball-reference.com/teams/).

For each of the 30 active teams, visit the team's page and download the "Franchise History" table. The node you will want to use is "#franchise_years". Combine all the tables into one. Note that some franchises have changed names and locations. To keep track of the team, add a column to the data frame called "current" which will contain the current name of the team. (For example, in the 'current' column, the row for the 1965 Milwaukee Braves will contain the value 'Atlanta Braves'.)

__Hint:__ When I ran my code, my table had 2624 rows and 22 columns.

__Hint:__ _I used the function `html_table()` to extract the table from each team's page._

    library(rvest)

    # starting page
    teampage <- read_html("http://www.baseball-reference.com/teams/")

    # write your r code here
    # create a table called baseball that contains all of the teams' franchise histories

    # at the end, be sure to print out the dimensions of your baseball table.
    dim(baseball)
    head(baseball)
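One way the placeholder could be filled in is sketched here; treat it as a starting point under stated assumptions, not a verified solution. The function names (`active_team_links`, `franchise_table`, `scrape_franchise_histories`) are hypothetical. The selector `#teams_active a` and the `/teams/XXX/` link pattern are assumptions about the page layout and may need adjusting. Only `#franchise_years` and `html_table()` come from the assignment. The pause between requests is likewise an assumed courtesy setting.

```r
library(rvest)

base_url <- "http://www.baseball-reference.com/teams/"

# Collect the current name and page URL of each active franchise.
# ASSUMPTION: active franchises are linked from an element with id
# "teams_active", and their hrefs look like "/teams/ATL/".
active_team_links <- function(page, base = base_url) {
  links <- html_elements(page, "#teams_active a")
  href  <- html_attr(links, "href")
  name  <- html_text2(links)
  keep  <- !is.na(href) & grepl("^/teams/[A-Z]{3}/$", href)
  teams <- data.frame(
    current = name[keep],
    url     = xml2::url_absolute(href[keep], base),
    stringsAsFactors = FALSE
  )
  teams[!duplicated(teams$url), ]
}

# Pull the "#franchise_years" table out of one parsed team page and
# record the franchise's current name in a "current" column.
franchise_table <- function(team_page, current) {
  tbl <- html_table(html_element(team_page, "#franchise_years"))
  tbl <- as.data.frame(tbl, stringsAsFactors = FALSE)
  tbl$current <- current
  tbl
}

# Visit every active team's page and stack the tables into one data frame.
# ASSUMPTION: a few seconds between requests is an acceptable pace.
scrape_franchise_histories <- function(start_url = base_url, pause = 4) {
  teams  <- active_team_links(read_html(start_url), start_url)
  tables <- lapply(seq_len(nrow(teams)), function(i) {
    Sys.sleep(pause)
    franchise_table(read_html(teams$url[i]), teams$current[i])
  })
  do.call(rbind, tables)
}
```

Calling `baseball <- scrape_franchise_histories()` followed by `dim(baseball)` would then produce the table the assignment asks for, provided the assumed selectors match the live page. Splitting the work into small functions keeps the parsing steps testable on saved or hand-written HTML without hitting the site.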

```{r baseball_cleanup, error = TRUE}
# you should not need to modify this code, but you will probably need to run it.
library(stringr)

# This code checks to see if text in table has regular space character
# Because the text from the web uses a non-breaking space, we expect there to be a mismatch
# I'm converting to raw because when displayed on screen, we cannot see the difference between
# a regular breaking space and a non-breaking space.
all.equal(charToRaw(baseball$Tm[1]), charToRaw("Arizona Diamondbacks"))

# identify which columns are character columns
char_cols <- which(lapply(baseball, typeof) == "character")

# for each character column, convert to UTF-8
# then replace the non-breaking space with a regular space
for(i in char_cols){
  baseball[[i]] <- str_conv(baseball[[i]], "UTF-8")
  baseball[[i]] <- str_replace_all(baseball[[i]],"\\s"," ")
  # baseball[[i]] <- str_replace_all(baseball[[i]],"[:space:]"," ") # you might have to use this depending on your operating system and which meta characters it recognizes
}

# check to see if the conversion worked
## should now be TRUE
all.equal(charToRaw(baseball$Tm[1]), charToRaw("Arizona Diamondbacks"))
```
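To see why the raw-byte comparison is needed, here is a small standalone sketch using made-up strings, not scraped data. It uses base R `gsub()` in place of the stringr calls in the chunk, purely to avoid extra dependencies. The two strings print identically on screen but differ in their bytes.

```r
# Hypothetical example strings, not taken from the scraped table.
scraped  <- "Arizona\u00a0Diamondbacks"  # non-breaking space (U+00A0) between words
expected <- "Arizona Diamondbacks"       # ordinary space (U+0020)

# The strings look the same when printed but are not equal.
identical(scraped, expected)             # FALSE

# The byte values show the difference in a UTF-8 encoded string.
charToRaw(" ")                           # 20
charToRaw("\u00a0")                      # c2 a0

# Replace the non-breaking space with a regular space, then compare again.
cleaned <- gsub("\u00a0", " ", scraped, fixed = TRUE)
identical(charToRaw(cleaned), charToRaw(expected))  # TRUE
```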
