Question
Description: Given time series data, a clickstream of user activity stored in flat files, the task is to enrich the data with a session id.

Session definition:
1. A session expires after 30 minutes of inactivity. During inactivity, no clickstream record is generated.
2. A session remains active for a total duration of at most 2 hours.

Steps:
1. Load the data in any flat file format.
2. Read the data and use Spark batch (PySpark/Scala) to do the computation.
3. Save the enriched results in Parquet.

Note: Please do not use Spark SQL directly.
Given dataset:

timestamp               userid
2018-01-01T11:00:00Z    u1
2018-01-01T12:00:00Z    u1
2018-01-01T11:00:00Z    u2
2018-01-02T11:00:00Z    u2
2018-01-01T12:15:00Z    u1
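One possible way to apply the session rules above is sketched below in PySpark, using the RDD API rather than Spark SQL. It is an illustration under stated assumptions, not a reference solution: the input is assumed to be a CSV file with a header row (`timestamp`, `userid`), a gap strictly longer than 30 minutes is assumed to start a new session, and a click arriving 2 hours or more after the session's first click is assumed to start a new one. The names `assign_sessions`, `enrich_with_spark`, and the `<userid>_s<n>` session id format are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed boundary handling (the problem statement leaves it open):
INACTIVITY_TIMEOUT = timedelta(minutes=30)  # a longer gap starts a new session
MAX_SESSION_LENGTH = timedelta(hours=2)     # a click this late after session start starts a new one
TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def assign_sessions(userid, timestamps):
    """Return (timestamp, session_id) pairs for one user's clicks, in time order."""
    result = []
    session_no = 0
    session_start = None
    previous = None
    for ts in sorted(timestamps):
        if (previous is None
                or ts - previous > INACTIVITY_TIMEOUT
                or ts - session_start >= MAX_SESSION_LENGTH):
            session_no += 1
            session_start = ts
        result.append((ts, f"{userid}_s{session_no}"))
        previous = ts
    return result


def enrich_with_spark(input_path, output_path):
    """Read a headered CSV (timestamp, userid), add session_id, write Parquet."""
    # Imported here so the pure-Python helper above can be used without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sessionize").getOrCreate()
    clicks = spark.read.csv(input_path, header=True)
    enriched = (
        clicks.rdd
        .map(lambda r: (r["userid"], datetime.strptime(r["timestamp"], TS_FORMAT)))
        .groupByKey()  # collects all clicks of one user together
        .flatMap(lambda kv: [
            (kv[0], ts.strftime(TS_FORMAT), sid)
            for ts, sid in assign_sessions(kv[0], kv[1])
        ])
        .toDF(["userid", "timestamp", "session_id"])
    )
    enriched.write.mode("overwrite").parquet(output_path)
```

Under these assumptions, the sample rows for u1 (11:00, 12:00, 12:15) would receive `u1_s1`, `u1_s2`, `u1_s2`, and the two rows for u2 would receive `u2_s1` and `u2_s2`. The 2-hour cap depends on where the current session started, so the sketch walks each user's clicks sequentially. Note that `groupByKey` holds all clicks of a single user in memory at once, which may not suit very heavy users.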
QUESTION 3
Description: In addition to the problem statement given in question 2, assume the scenarios below as well and design a schema based on them:
1. Get the number of sessions generated in a day.
2. Get the total time spent by a user in a day.
3. Get the total time spent by a user over a month.

Guidelines and instructions for solving the above queries:
1. Design the table in any flat file format.
2. Write the script to create the file.
3. Load the data into the file.
4. Write all the queries in Spark SQL.
5. Think in the direction of using partitioning, bucketing, etc.
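A rough sketch of one schema and query set for the scenarios above follows, written as Spark SQL strings driven from Python. Everything named here is an assumption for illustration: the table `clickstream_sessions`, the temp view `enriched_clicks`, the bucket count of 8, and an enriched Parquet input with columns `userid`, `timestamp`, and `session_id`. Time spent is assumed to be the span between the first and last click of a session within a day, so a single-click session contributes zero seconds. The statements have not been validated against a specific Spark version.

```python
TABLE = "clickstream_sessions"  # hypothetical table name

# Partition by day (most queries are per day); bucket by user (most group by user).
CREATE_TABLE = f"""
CREATE TABLE IF NOT EXISTS {TABLE} (
    userid     STRING,
    click_ts   TIMESTAMP,
    session_id STRING,
    event_date DATE
)
USING PARQUET
PARTITIONED BY (event_date)
CLUSTERED BY (userid) INTO 8 BUCKETS
"""

# The partition column goes last in the SELECT list.
LOAD_TABLE = f"""
INSERT OVERWRITE TABLE {TABLE}
SELECT userid,
       to_timestamp(`timestamp`)          AS click_ts,
       session_id,
       to_date(to_timestamp(`timestamp`)) AS event_date
FROM enriched_clicks
"""

# Per-session duration within a day, reused by the two time-spent queries.
SESSION_DURATIONS = f"""
SELECT userid, event_date, session_id,
       unix_timestamp(MAX(click_ts)) - unix_timestamp(MIN(click_ts)) AS duration_sec
FROM {TABLE}
GROUP BY userid, event_date, session_id
"""

QUERIES = {
    "sessions_per_day": f"""
        SELECT event_date, COUNT(DISTINCT userid, session_id) AS sessions
        FROM {TABLE}
        GROUP BY event_date
    """,
    "time_per_user_per_day": f"""
        SELECT userid, event_date, SUM(duration_sec) AS seconds_spent
        FROM ({SESSION_DURATIONS}) s
        GROUP BY userid, event_date
    """,
    "time_per_user_per_month": f"""
        SELECT userid, date_format(event_date, 'yyyy-MM') AS month,
               SUM(duration_sec) AS seconds_spent
        FROM ({SESSION_DURATIONS}) s
        GROUP BY userid, date_format(event_date, 'yyyy-MM')
    """,
}


def run_all(spark, enriched_path):
    """Create and load the table, then print each query's result."""
    spark.read.parquet(enriched_path).createOrReplaceTempView("enriched_clicks")
    spark.sql(CREATE_TABLE)
    spark.sql(LOAD_TABLE)
    for name, query in QUERIES.items():
        print(name)
        spark.sql(query).show()
```

Here `run_all` expects an existing `SparkSession` and the path of the enriched Parquet output. A session that crosses midnight is counted once in each day it touches, which may or may not match the intended definition of "sessions generated in a day".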