Question

Posted on Aug 27, 2024

Consider 3 files, warc.csv, wat.csv and wet.csv, without headers.

Consider the data in the warc.csv file. An indicative line is:

2017-03-22T22:13:57Z, urn, response, 40802, 213.155.18.48, http:coco, Apache, html

Columns in order: first the WARC date, the WARC record ID, the WARC type (e.g. metadata, response, etc.), the content length, the public IP address, the target URL, the server running the site (e.g. Apache, nginx, etc.), and finally the overall content of the page with the entire HTML DOM.
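
The column order above can be turned into a small line parser. This is only a sketch: the names WarcRecord and parse_warc_line are made up for illustration, and it assumes each record sits on one line, with any field that contains commas (such as the HTML) quoted in the usual CSV way.

```python
import csv
from typing import NamedTuple, Optional


class WarcRecord(NamedTuple):
    # Hypothetical field names; the order follows the column list in the question.
    date: str
    record_id: str
    warc_type: str
    content_length: int
    ip: str
    target_url: str
    server: str
    html: str


def parse_warc_line(line: str) -> Optional[WarcRecord]:
    """Parse one headerless warc.csv line; return None if it is malformed or has empty fields."""
    try:
        fields = next(csv.reader([line], skipinitialspace=True))
    except (csv.Error, StopIteration):
        return None
    if len(fields) != 8:
        return None
    fields = [f.strip() for f in fields]
    if any(f == "" for f in fields):
        return None  # treat empty fields as nulls to be filtered out
    try:
        length = int(fields[3])
    except ValueError:
        return None
    return WarcRecord(fields[0], fields[1], fields[2], length, *fields[4:])
```

With an RDD of lines this could be applied as lines.map(parse_warc_line).filter(lambda r: r is not None), which also takes care of the null filtering the task asks for.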

Consider the data in the wat.csv file. An indicative line is:

urn:uuid, 1053, http:coco

In order, the columns are: first the WARC record ID, the content length of the metadata, and finally the target URL (it can be different from the target URL of the warc data).

Consider the data in the wet.csv file. An indicative line is:

urn:uuid, "extracted plaintext"

In order, the columns are: first the WARC record ID and then the plaintext extracted from the URL (can be in ASCII).

Using RDDs, write Python code to answer the following.

Task 1:

Find the most popular target URL (i.e., the record target URL that can be found most often in the HTML DOM of other records).

Tips: You will need to join datasets to get the desired result. For this query you will need to filter out the records that have null values. You should first find, for each WARC record, what its target URL is and what URLs are in its HTML DOM, so you get an intermediate result: targetURL -> list(URLs in HTML DOM). You could simplify the URLs and keep a simpler format/subdomain to get even more results.
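
One possible reading of these tips, as an RDD sketch. The helper names (simplify_url, links_in_html, most_popular_target) are invented here. It assumes links appear as quoted href attributes, that "simplifying" means keeping only the host without a leading "www.", and that the input RDD already holds (target_url, html) pairs with nulls removed. A regular expression is a shortcut; a real HTML parser would be more robust.

```python
import re
from urllib.parse import urlparse

# Assumption: links of interest appear as quoted href attributes.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def simplify_url(url):
    """Reduce a URL to its lowercase host without a leading 'www.'; None if it has no host."""
    try:
        host = urlparse(url.strip()).hostname
    except ValueError:
        return None
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def links_in_html(html):
    """Return the distinct simplified hosts linked from href attributes in an HTML string."""
    hosts = {simplify_url(u) for u in HREF_RE.findall(html)}
    hosts.discard(None)
    return sorted(hosts)


def most_popular_target(records):
    """records: RDD of (target_url, html) pairs with null values already filtered out."""
    # Intermediate result: targetURL -> list(URLs in HTML DOM), both simplified.
    links = (records
             .map(lambda r: (simplify_url(r[0]), links_in_html(r[1])))
             .filter(lambda kv: kv[0] is not None))
    targets = links.keys().distinct().map(lambda t: (t, None))
    # One (linked_host, 1) pair per source record and linked host, skipping self-links.
    mentions = links.flatMap(lambda kv: [(u, 1) for u in kv[1] if u != kv[0]])
    counts = mentions.reduceByKey(lambda a, b: a + b)
    # The join keeps only hosts that are themselves the target URL of some record.
    popular = counts.join(targets).mapValues(lambda v: v[0])
    return popular.takeOrdered(1, key=lambda kv: -kv[1])
```

The join at the end is what restricts the answer to URLs that are also record targets; without it the result would simply be the most linked host overall.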

Remember to restart the Spark cluster before each measurement to avoid hot caches, or clear the cache instead.
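
If restarting the cluster is not practical, a small timing helper along these lines could be used. The helper name timed is made up; spark.catalog.clearCache() drops cached tables and DataFrames in the session, while RDDs cached with cache() or persist() would still need unpersist(), and neither touches the operating system's file cache.

```python
import time


def timed(action, spark=None):
    """Run action() once and return (result, seconds); clear Spark's cache first if a session is given."""
    if spark is not None:
        # Drops cached tables/DataFrames in this session; cached RDDs need rdd.unpersist().
        spark.catalog.clearCache()
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start
```

For example, timed(lambda: rdd.count(), spark) would time a single action on a cold session cache.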

Task 2:

Perform Task 1 using DataFrames/Spark SQL and a parquet file.
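
A rough DataFrame/Spark SQL counterpart might look like the following. The column names, view names and the function most_popular_target_sql are assumptions, not part of the question. It matches URLs as exact strings (no simplification), assumes double-quoted href attributes, and relies on the SQL function regexp_extract_all, which needs Spark 3.1 or later.

```python
# Hypothetical column names, in the order given for warc.csv.
WARC_COLUMNS = ["warc_date", "record_id", "warc_type", "content_length",
                "ip", "target_url", "server", "html"]

TOP_URL_SQL = """
SELECT l.linked_url AS url, COUNT(*) AS mentions
FROM links l
JOIN (SELECT DISTINCT target_url FROM warc) t
  ON l.linked_url = t.target_url
WHERE l.linked_url <> l.target_url
GROUP BY l.linked_url
ORDER BY mentions DESC
LIMIT 1
"""


def most_popular_target_sql(spark, csv_path, parquet_path):
    """Convert warc.csv to parquet, then find the most linked target URL with Spark SQL."""
    from pyspark.sql import functions as F  # imported lazily so this module loads without Spark

    # Write the headerless CSV out as parquet once, then query the parquet copy.
    raw = spark.read.csv(csv_path, header=False, multiLine=True, escape='"')
    raw.toDF(*WARC_COLUMNS).write.mode("overwrite").parquet(parquet_path)

    warc = spark.read.parquet(parquet_path).dropna(subset=["target_url", "html"])
    warc.createOrReplaceTempView("warc")

    # One row per (target_url, linked_url); distinct() counts each pair once.
    extract = F.expr(r"""regexp_extract_all(html, 'href="([^"]+)"', 1)""")
    links = warc.select("target_url", F.explode(extract).alias("linked_url")).distinct()
    links.createOrReplaceTempView("links")

    return spark.sql(TOP_URL_SQL).collect()
```

Reading from parquet rather than CSV lets Spark load only the target_url and html columns, which is usually where the measured difference from the RDD version comes from.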
