Question
Scraping tables
Tables are a common source of data on web pages. We begin by extracting a simple HTML table from a website. For the first example, we will use https://www.imdb.com/chart/boxoffice/, which is a box-office summary of the top 10 movies along with their gross profits for the current weekend and their total gross profit. We would like to make a data frame with that information. We could use the method introduced in the previous section to read the page into R and use an anchor to find the part of the data we want, but that approach requires extensive data cleaning before the information can be wrapped into a data frame. Luckily, R has packages that help scrape tables from web pages. To read an HTML table we use the R package rvest, a convenient wrapper that parses an HTML page and retrieves its table elements.
> library(rvest)
> movie_url = "https://www.imdb.com/chart/boxoffice/"
> movie_table = read_html(movie_url)
> # count the <table> elements found on the page
> length(html_nodes(movie_table, "table"))
> # inspect the first two tables to see which one holds the data
> html_table(html_nodes(movie_table, "table")[[1]])
> html_table(html_nodes(movie_table, "table")[[2]])
Often more tables are picked up than we intend. In this case there is some Amazon affiliate data at the bottom of the page that gets picked up as a table. You will need to look at the content of each table to see what it holds and to pick out the correct one. As another example, let's read the Canadian National Parks HTML table from Wikipedia (https://en.wikipedia.org/wiki/List_of_National_Parks_of_Canada).
> park_url = "https://en.wikipedia.org/wiki/List_of_National_Parks_of_Canada"
> parks = read_html(park_url)
> length(html_nodes(parks, "table"))
> html_table(html_nodes(parks, "table")[[1]])
> park_table = html_table(html_nodes(parks, "table")[[1]])
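Choosing a table by position (such as [[1]]) can break when a page gains or loses a table. As a minimal sketch of one alternative, the example below parses a small inline HTML string (toy data, not the real Wikipedia page) and keeps the first table that has a given column name. The column name "Name", the park names, and the years are all assumptions made up for illustration.

```r
library(rvest)

# Toy page with an unwanted table followed by the table we want.
# All content here is invented for illustration.
page <- read_html('
<html><body>
<table><tr><th>Ad</th></tr><tr><td>Sponsored link</td></tr></table>
<table>
<tr><th>Name</th><th>Established</th></tr>
<tr><td>Park A</td><td>1900</td></tr>
<tr><td>Park B</td><td>1950</td></tr>
</table>
</body></html>')

# html_table() on a set of nodes returns a list with one table per node
tables <- html_table(html_nodes(page, "table"))

# Flag the tables that contain the column we expect
has_name <- vapply(tables, function(tb) "Name" %in% names(tb), logical(1))

# Keep the first match as a plain data frame
park_demo <- as.data.frame(tables[[which(has_name)[1]]])
park_demo
```

On a real page the header text may differ from what you expect, so it is worth printing names() for each table before relying on a match like this.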
Question
1. Scrape the Box office performance and the Critical and public response tables for all three phases of the Marvel Cinematic Universe films https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films. Obtain those tables as a single R data frame. Hint: The R function merge will allow you to make a single data frame by joining your two tables in the style of an SQL join. By default merge keeps only rows that match in both tables; setting all.x = TRUE gives a left join. Try something like this:
> merge(x, y, by.x = "NameOfMovieNameColumnFromX",
+ by.y = "NameOfMovieNameColumnFromY", all.x = TRUE)
You'll need to clean up a couple of rows of data before using such a command (5pts).
2. Make a clean table of the movie name, Worldwide Box office gross (make sure it is numeric), budget (use net budget and make it numeric), release year (no need for month or day) and the (numeric values) Rotten Tomatoes and Metacritic Scores. Print out the first 10 rows of your data frame. (5pts)
3. Let's look at the moving averages of Box Office Gross Worldwide and Budget over time. Put these two lines on a single plot. For clarity report dollar amounts as log10 dollars. (5pts)
4. What is the distribution of revenue for Marvel movies? Make a plot of the log base 10 difference between Box Office Gross Worldwide and budget for these movies. (3pts)
5. What is the relationship between budget and ratings? On a single plot, show the log base 10 budget and the log base 10 Box office gross vs Rotten Tomatoes score. Include a moving average for ratings with respect to budget (not time). (6pts)
6. How have the ratings evolved over time? Make a plot of ratings over time. (3pts)
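The questions above rely on three generic operations: joining two tables on a name column, converting scraped strings such as "$1,234" or "91%" to numbers, and computing a moving average. The sketch below shows each one on toy data using base R only. The data frames, column names (Film, Title, Gross, Score), helper names (to_number, moving_average), and the window size of 3 are all hypothetical; the real Wikipedia tables will need their own cleaning first.

```r
# Toy stand-ins for two scraped tables (invented values)
box <- data.frame(
  Film  = c("Movie A", "Movie B", "Movie C"),
  Gross = c("$100,000,000", "$250,500,000", "$75,000,000"),
  stringsAsFactors = FALSE
)
reviews <- data.frame(
  Title = c("Movie A", "Movie B"),
  Score = c("91%", "67%"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of box, fill unmatched scores with NA
films <- merge(box, reviews, by.x = "Film", by.y = "Title", all.x = TRUE)

# Hypothetical helper: drop everything except digits and the decimal point
to_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))
films$Gross <- to_number(films$Gross)
films$Score <- to_number(films$Score)
films$LogGross <- log10(films$Gross)

# Hypothetical helper: trailing moving average over the last k values.
# stats::filter is named explicitly because dplyr masks filter().
moving_average <- function(x, k = 3) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}
ma <- moving_average(c(1, 2, 3, 4, 5))

films
ma
```

The first k - 1 entries of a trailing moving average are NA because there are not yet enough earlier values to fill the window. For a moving average with respect to budget rather than time, one option is to sort the rows by budget before applying the same helper.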