Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Using R Programming, answer the below For the data file, look under Data download the movies_merged file, and look at pr1.Rmd (https://piazza.com/gatech/spring2017/cse6242/resources) Answer the following:

Using R Programming, answer the below

For the data file, look under "Data" download the movies_merged file, and look at pr1.Rmd (https://piazza.com/gatech/spring2017/cse6242/resources)

Answer the following:

## 1. Remove non-movie rows

The variable `Type` captures whether the row is a movie, a TV series, or a game. Remove all rows from `df` that do not correspond to movies.

# TODO: Remove all rows from df that do not correspond to movies

## 2. Process `Runtime` column

The variable `Runtime` represents the length of the title as a string. Write R code to convert it to a numeric value (in minutes) and replace `df$Runtime` with the new numeric column.

Now investigate the distribution of `Runtime` values and how it changes over years (variable `Year`, which you can bucket into decades) and in relation to the budget (variable `Budget`). Include any plots that illustrate.

# TODO: Investigate the distribution of Runtime values and how it varies by Year and Budget

## 3. Encode `Genre` column

The column `Genre` represents a list of genres associated with the movie in a string format. Write code to parse each text string into a binary vector with 1s representing the presence of a genre and 0s the absence, and add it to the dataframe as additional columns. Then remove the original `Genre` column.

For example, if there are a total of 3 genres: Drama, Comedy, and Action, a movie that is both Action and Comedy should be represented by a binary vector <0, 1, 1>. Note that you need to first compile a dictionary of all possible genres and then figure out which movie has which genres (you can use the R `tm` package to create the dictionary).

# TODO: Investigate if Gross Revenue is related to Budget, Runtime or Genre

Plot the relative proportions of movies having the top 10 most common genres.

## 4. Eliminate mismatched rows

The dataframe was put together by merging two different sources of data and it is possible that the merging process was inaccurate in some cases (the merge was done based on movie title, but there are cases of different movies with the same title). The first sources release time was represented by the column `Year` (numeric representation of the year) and the second by the column `Released` (string representation of release date).

Find and remove all rows where you suspect a merge error occurred based on a mismatch between these two variables. To make sure subsequent analysis and modeling work well, avoid removing more than 10% of the rows that have a `Gross` value present.

# TODO: Remove rows with Released-Year mismatch

## 5. Explore `Gross` revenue

For the commercial success of a movie, production houses want to maximize Gross revenue. Investigate if Gross revenue is related to Budget, Runtime or Genre in any way.

# TODO: Investigate if Gross Revenue is related to Budget, Runtime or Genre

Note: To get a meaningful relationship, you may have to partition the movies into subsets such as short vs. long duration, or by genre, etc.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Main Memory Database Systems

Authors: Frans Faerber, Alfons Kemper, Per-Åke Alfons

1st Edition

1680833243, 978-1680833249

More Books

Students also viewed these Databases questions

Question

11-3. Why are customer email addresses so useful?

Answered: 1 week ago