
Question


Learn to use Apache Spark with Python

Q1c to Q3d


Load the datasets:

    gameclicks = spark.read.csv("game-clicks.csv.gz", header='true')
    adclicks = spark.read.csv("ad-clicks.csv.gz", header='true')

Check the schemas:

    gameclicks.printSchema()

    root
     |-- timestamp: string (nullable = true)
     |-- clickId: string (nullable = true)
     |-- userId: string (nullable = true)
     |-- userSessionId: string (nullable = true)
     |-- isHit: string (nullable = true)
     |-- teamId: string (nullable = true)
     |-- teamLevel: string (nullable = true)

    adclicks.printSchema()

    root
     |-- timestamp: string (nullable = true)
     |-- txId: string (nullable = true)
     |-- userSessionId: string (nullable = true)
     |-- teamId: string (nullable = true)
     |-- userId: string (nullable = true)
     |-- adId: string (nullable = true)
     |-- adCategory: string (nullable = true)

The team_counts DataFrame is small enough to be moved to Pandas:

    team_counts_pd = team_counts.toPandas()
    team_counts_pd.head()

       teamID  UserCount
    0      22         33
    1      53         26
    2      64         25
    3      18         25
    4      35         23

Plot a histogram using the number of users by team:

    # your code goes here (Q1c: 4 points)

    df.select(num_cols).describe().show()

Question 3: Use map and reduce with RDD

Keywords: RDD map reduce

Now we want to find the number of clicks for each adCategory using map and reduce with an RDD. Since we only care about adCategory, let's select this column only.

    categoryOnly = adclicks.select("adCategory")
    categoryOnly.show(10)

    +-----------+
    | adCategory|
    +-----------+
    |electronics|
    |     movies|
    |  computers|
    |    fashion|
    |   clothing|
    |      games|
    |     movies|
    |electronics|
    |electronics|
    |  computers|
    +-----------+
    only showing top 10 rows

Now use explicit Map/Reduce format with RDDs. First we emit a pair in the format (adCategory, 1) by using a map() function:

    all_categories = # your code goes here (Q3a: 3 points)
    all_categories.take(5)

    [('electronics', 1), ('movies', 1), ('computers', 1), ('fashion', 1), ('clothing', 1)]

Then we sum the counts by using a reduce() function and sort by count:

    category_counts_rdd = # your code goes here (Q3b: 4 points)
    category_counts_rdd.take(3)

    [('computers', 2638), ('games', 2601), ('clothing', 2340)]

Now we want to convert category_counts_rdd to a DataFrame, category_counts, by mapping the corresponding columns.

    category_counts = # your code goes here (Q3c: 4 points)
    category_counts.show()

    +--------------+-----------+
    |CategoriyCount| adCategory|
    +--------------+-----------+
    |          2638|  computers|
    |          2601|      games|
    |          2340|   clothing|
    |          2074|     sports|
    |          1727|    fashion|
    |          1692|     movies|
    |          1588|   hardware|
    |          1097|electronics|
    |           566| automotive|
    +--------------+-----------+

Now use SparkSQL to verify the above result:

    #print(adclicks.count())
    adclicks.registerTempTable("adclicks")
    query = # your code goes here (Q3d: 1 point)
    category_counts = spark.sql(query)
    category_counts.show()
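
The map, reduce, sort, and SQL-check steps of Question 3 can be illustrated without a Spark cluster. The sketch below is not the graded answer: it emulates the pattern in plain Python on the ten adCategory rows shown in the question, and it uses the standard-library sqlite3 module as a stand-in for spark.sql. The helper reduce_by_key and the column alias CategoryCount are hypothetical names chosen for illustration. The comments name the PySpark calls that would likely play the same role on the real adclicks data.

```python
import sqlite3

# First ten adCategory values, copied from the categoryOnly.show(10) output
# in the question (an illustrative sample, not the full dataset).
sample_categories = [
    "electronics", "movies", "computers", "fashion", "clothing",
    "games", "movies", "electronics", "electronics", "computers",
]

# Map step: emit one (adCategory, 1) pair per click.
# In PySpark this is roughly: adclicks.rdd.map(lambda row: (row.adCategory, 1))
all_categories = [(category, 1) for category in sample_categories]


def reduce_by_key(pairs, func):
    """Hypothetical local helper that mimics what RDD.reduceByKey does:
    merge all values that share a key using a binary function."""
    combined = {}
    for key, value in pairs:
        combined[key] = func(combined[key], value) if key in combined else value
    return list(combined.items())


# Reduce step: sum the 1s per category, then sort by count, largest first.
# In PySpark this is roughly:
#   .reduceByKey(lambda a, b: a + b).sortBy(lambda kv: kv[1], ascending=False)
category_counts = sorted(
    reduce_by_key(all_categories, lambda a, b: a + b),
    key=lambda kv: kv[1],
    reverse=True,
)

# SQL check: the same aggregation as a GROUP BY query. SQLite is only a
# stand-in here; a query of this shape could be passed to spark.sql instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adclicks (adCategory TEXT)")
conn.executemany(
    "INSERT INTO adclicks VALUES (?)", [(c,) for c in sample_categories]
)
query = """
    SELECT adCategory, COUNT(*) AS CategoryCount
    FROM adclicks
    GROUP BY adCategory
    ORDER BY CategoryCount DESC
"""
sql_counts = conn.execute(query).fetchall()
conn.close()

# Categories with equal counts may come back in a different order from each
# approach, so compare the results as dictionaries rather than as lists.
print(category_counts)
print(dict(sql_counts) == dict(category_counts))
```

On the real data, the sorted RDD of pairs could then be turned into a DataFrame with something like category_counts_rdd.toDF([...]) and a list of column names, which is what Q3c asks for; the exact column names and order expected by the assignment are not confirmed here.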

