Question

Posted on Jul 05, 2024

Scraping tables

Tables are a common source of data on web pages. We begin by extracting a simple HTML table from a website. For the first example, we will use http://www.imdb.com/chart, which is a box-office summary of the top 10 movies along with their gross profits for the current weekend and their total gross profit. We would like to make a data frame with that information.

We could use the method introduced in the previous section to read the page into R and use the anchor to find the part of the data we want, but we would then have to do extensive data cleaning before wrapping the result into a data frame. Luckily, R has packages that help scrape tables from web pages. To read an HTML table we use the R package rvest, a convenient wrapper for parsing an HTML page and retrieving its table elements.

> library(rvest)

> movie_url = "https://www.imdb.com/chart/boxoffice/"

> movie_table = read_html(movie_url)

> # Number of <table> elements found on the page

> length(html_nodes(movie_table, "table"))

> # Convert the first and second tables to data frames

> html_table(html_nodes(movie_table, "table")[[1]])

> html_table(html_nodes(movie_table, "table")[[2]])

Often more tables are picked up than we intend. In this case, some Amazon affiliate data at the bottom of the page gets picked up as a table. You will need to look at the content of each table to find the one you want.

As another example, let's read the Canadian National Parks HTML table from Wikipedia (https://en.wikipedia.org/wiki/List_of_National_Parks_of_Canada).

> park_url = "https://en.wikipedia.org/wiki/List_of_National_Parks_of_Canada"

> parks = read_html(park_url)

> length(html_nodes(parks, "table"))

> html_table(html_nodes(parks, "table")[[1]])

> park_table = html_table(html_nodes(parks, "table")[[1]])
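Since a page often yields more tables than intended, one way to pick the right one is to inspect each table's column names rather than guess its index. The following is a minimal sketch under stated assumptions: the inline HTML, the column names and the values are all invented for illustration and do not come from IMDb or Wikipedia.

```r
library(rvest)

# Hypothetical inline page standing in for a real site: one data table
# and one unrelated table (for example an advertising widget).
page <- read_html('
<html><body>
  <table>
    <tr><th>Title</th><th>Weekend</th><th>Gross</th></tr>
    <tr><td>Movie A</td><td>$10.5M</td><td>$30.0M</td></tr>
    <tr><td>Movie B</td><td>$7.2M</td><td>$15.1M</td></tr>
  </table>
  <table>
    <tr><th>Sponsor</th></tr>
    <tr><td>Some ad</td></tr>
  </table>
</body></html>')

# Convert every <table> node to a data frame (the result is a list).
tables <- html_table(html_nodes(page, "table"))

# Look at the column names of each table to see which one we want.
headers <- lapply(tables, names)

# Keep the first table whose header contains a "Gross" column.
has_gross <- vapply(headers, function(h) "Gross" %in% h, logical(1))
box_office <- tables[[which(has_gross)[1]]]
```

On a real page, printing `headers` first and then choosing the match by eye is usually safer than hard-coding `[[1]]` or `[[2]]`, since the position of a table can change when the page layout changes.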

Question

1. Scrape the Box office performance and the Critical and public response tables for all three phases of the Marvel Cinematic Universe films (https://en.wikipedia.org/wiki/List_of_Marvel_Cinematic_Universe_films). Obtain those tables as a single R data frame. Hint: The R function merge will allow you to make a single data frame by specifying your two tables using SQL behaviour (left join). Try something like this:

> merge(x, y, by.x = "NameOfMovieNameColumnFromX",

+ by.y = "NameOfMovieNameColumnFromY", all.x = TRUE)

You'll need to clean up a couple of rows of data before using such a command (5pts).

2. Make a clean table of the movie name, Worldwide Box office gross (make sure it is numeric), budget (use net budget and make it numeric), release year (no need for month or day) and the (numeric values) Rotten Tomatoes and Metacritic Scores. Print out the first 10 rows of your data frame. (5pts)

3. Let's look at the moving averages of Box Office Gross Worldwide and Budget over time. Put these two lines on a single plot. For clarity report dollar amounts as log10 dollars. (5pts)

4. What is the distribution of revenue for Marvel movies? Make a plot of the log base 10 difference between Box Office Gross Worldwide and budget for these movies. (3pts)

5. What is the relationship between budget and ratings? On a single plot, show the log base 10 budget and the log base 10 Box office gross vs Rotten Tomatoes score. Include a moving average for ratings with respect to budget (not time). (6pts)

6. How have the ratings evolved over time? Make a plot of ratings over time. (3pts)
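For questions 1 to 3, the merge, cleaning and moving-average steps might be sketched on toy data as follows. Everything here is an assumption made for illustration: the film names, the column names (Film, Title, Gross, Budget, RT) and the figures are invented, the real Wikipedia tables will need their own cleaning, and taking log10 of the averaged dollars (rather than averaging the logs) is only one reading of question 3. Note that merge performs an inner join by default; all.x = TRUE is what gives left-join behaviour.

```r
# Toy stand-ins for the two scraped tables (all values are invented).
box <- data.frame(
  Film   = c("Film A", "Film B", "Film C", "Film D"),
  Year   = c("May 2, 2008", "June 13, 2008", "May 7, 2010", "May 6, 2011"),
  Gross  = c("$500,000,000", "$250,000,000[5]", "$600,000,000", "$450,000,000"),
  Budget = c("$140 million", "$137.5 million", "$170 million", "$150 million"),
  stringsAsFactors = FALSE
)
critics <- data.frame(
  Title = c("Film A", "Film B", "Film C"),
  RT    = c("94% (280 reviews)", "67% (230 reviews)", "72% (300 reviews)"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of `box`, even when `critics` has no match.
films <- merge(box, critics, by.x = "Film", by.y = "Title", all.x = TRUE)

# Helper (hypothetical name): drop footnote markers such as [5],
# then strip everything that is not a digit or a decimal point.
to_number <- function(x) {
  x <- gsub("\\[[^]]*\\]", "", x)
  as.numeric(gsub("[^0-9.]", "", x))
}

films$Gross  <- to_number(films$Gross)
films$Budget <- to_number(films$Budget) * 1e6   # "$140 million" -> 1.4e8
films$Year   <- as.integer(sub(".*([0-9]{4}).*", "\\1", films$Year))
films$RT     <- as.numeric(sub("%.*", "", films$RT))

# Trailing moving average over k films, ordered by release year.
# The first k - 1 entries are NA because the window is incomplete.
moving_avg <- function(x, k = 2) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}
films <- films[order(films$Year), ]
films$GrossMA  <- log10(moving_avg(films$Gross))
films$BudgetMA <- log10(moving_avg(films$Budget))
```

With the columns in this shape, the two lines for question 3 could be drawn with `plot(films$Year, films$GrossMA, type = "l")` followed by `lines(films$Year, films$BudgetMA)`, after setting a `ylim` wide enough to cover both series.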
