Question
School of Computing and Information Technology Session: Spring 2023 University of Wollongong Lecturer: Janusz R. Getta ISIT912 Big Data Management Assignment 2 Published on 21
School of Computing and Information Technology Session: Spring 2023 University of Wollongong Lecturer: Janusz R. Getta ISIT912 Big Data Management Assignment 2 Published on 21 August 2023 Scope The tasks included in Assignment 2 include conceptual modelling of a data warehouse, implementation of \"star schema\" as a collection of external tables in HQL, querying data cube in SQL and discovering a conceptual data warehouse schema from an operational database. This assignment is due on Saturday, 23 September 2023, 7:00pm (sharp). This assignment is worth 20% of the total evaluation in the subject. The assignment consists of 4 tasks and specification of each task starts from a new page. Only electronic submission through Moodle at: https://moodle.uowplatform.edu.au/login/index.php will be accepted. A submission procedure is explained at the end of Assignment 1 specification. A policy regarding late submissions is included in the subject outline. Only one submission of Assignment 2 is allowed and only one submission per student is accepted. A submission marked by Moodle as \"late\" is always treated as a late submission no matter how many seconds it is late. A submission that contains an incorrect file attached is treated as a correct submission with all consequences coming from the evaluation of the file attached. All files left on Moodle in a state \"Draft(not submitted)\" will not be evaluated. A submission of compressed files (zipped, gzipped, rared, tared, 7-zipped, lhzed, etc) is not allowed. The compressed files will not be evaluated. The implementations that do not compile well due to one or more syntactical and/or run time errors score no marks. The second assignment is an individual assignment and it is expected that all its tasks will be solved individually without any cooperation with the other students. However, it is allowed to declare in the submission comments that a particular component or task of this assignment has been implemented in cooperation with another student. In such a case evaluation of a task or component may be shared with another student. In all other cases plagiarism will result in a FAIL grade being recorded for entire assignment. If you have any doubts, questions, etc. please consult your lecturer or tutor during laboratory/tutorial classes or over e-mail. Task 1 (5 marks) Conceptual modelling of a data warehouse An objective of this task is to create a conceptual schema of a sample data warehouse domain described below. Read and analyse the following specification of a data warehouse domain. A university would like to create a data warehouse to store information about the submissions of student assignments and later on to analyse the contents of a data warehouse. It is expected that the planned data warehouse will contain historical information collected over a long period of time. A data warehouse supposed to contain information about the students, courses enrolled by students, degrees, specifications of assignments, lecturers who prepared assignments, submitted assignments, and assessments of submitted assignments. The students are described by a student number, first name, last name, and email address. A student number and email address separately identify each student. A university offers the courses. A course is described by a unique code, unique title, credits, session, and year when offered. A course is offered one time per year in either Autumn or Spring session. A specification of assignment is described by an assignment number, maximum evaluation, release date, and deadline for submission. A specification of assignment is prepared by a lecturer. A lecturer is described by a unique staff number, first name, last name, building and office number. The lecturers belong to the departments. A department is described by a unique name and building number where it is located. The courses enrolled by the students belong to the degrees. A degree is described by a unique name and total number of credit points required for its completion The students submit the assignments prepared by the lecturers and related to the courses enrolled by the students. A submission is described by a submission date and time, evaluation done by an assessor, amount of time spent on its implementation declared by a student, and complexity level declared by a lecturer. A data warehouse must be designed such it should be possible to easily implement of the following classes of applications. (1) It should be possible to find the total number of submitted assignments per year, per session, per month and per day, per student, per course, per lecturer who prepared the assignments, etc. For example, it should be to find the total number of submitted assignment in each month of 2023 in each course. (2) It should be possible to find an average evaluation of assignment per student, per course, per year, per session, per month, per department, and per degree, etc. For example, it should be possible to find an average mark from assignment 1 in a given subject in the last 5 years. (3) It should be possible to find the largest evaluation of assignment per course, per year, per students, per course, etc. For example, it should be possible to find the largest evaluations of each assignment in a given subject in the last 5 sessions. (4) It should be possible to find the total amount of time spent on implementation of assignments per student, per course, per session, per year, etc. For example, it should be possible to find a total amount of time spent on implementation of each assignment in a given subject over a period of last 5 sessions. To create a conceptual schema of a sample data warehouse domain, follow the steps listed below. Step 1 Find a fact entity, Step 2 Find the measures describing a fact entity. Step 3 Find the dimensions. Step 4 Find the hierarchies over the dimensions. Step 5 Find the descriptions (attributes) of all entity types. Step 5 Draw a conceptual schema. To draw a conceptual schema, use a graphical notation explained to you in a presentation 10 Conceptual Data Warehouse Design. To draw your diagram, you must use UMLetlet 14.3 diagram drawing tool and apply ISIT312 Palette. The palette is available from the right upper corner of the main menu of UMLet 14.3. Deliverables A file solution1.bmp with a drawing of a conceptual schema of a sample data warehouse domain. Task 2 (5 marks) Implementation of a data warehouse as a collection of external tables in Hive Consider the following conceptual schema of a four-dimensional data cube. Perform a logical design to transform a conceptual schema given above into a logical schema (\"star schema\"). Use UMLet 14.3 diagram drawing tool and apply Logical modelling Palette to draw a logical schema. The palette is available from the right upper corner of the main menu of UMLet 14.3. Save a diagram of logical schema in a file solution2.bmp. Download a file task2.zip and unzip it. You should obtain a folder TASK2 with the following files: customer.tbl, order_details.tbl, order.tbl, product.tbl, salesperson.tbl. Use editor to examine the contents of *.tbl files. Note, that each file has a header with information about the meanings of data in each column. A header is not a data component of each file. Before transferring the files to HDFS you may have to perform some ETL actions on the files, like for example remove a header from each file. Transfer the files into HDFS. Implement HQL script solution2.hql that creates the external tables obtained from a step of logical design performed earlier. The external tables must overlap on the files transferred to HDFS in the previous step. Note, that a header in each *.tbl file must be removed before creating the external tables. Include into solution2.hql script SELECT statements that lists any 3 rows from each one of the external tables implemented in the previous step and the total number of rows included in each table. When ready, use a command line interface beeline to process a script solution2.hql and to save a report from processing in a file solution2.txt. To save a report from processing of a file solution2.hql copy the contents of Terminal window into a text file solution2.txt. Processing of your script must return NO ERRORS! A solution with errors is worth no marks ! Deliverables A file solution2.bmp with a drawing of a logical schema and a file solution2.txt with a report from processing of HQL script solution2.hql. The report MUST NOT include any errors, and the report must list all SQL statements processed. Task 3 (5 marks) Querying a data cube Download a file task3.zip and unzip the file. You should obtain a folder TASK3 with the following files: dbcreate.hql, dbdrop.hql, partsupp.tbl, lineitem.tbl, and orders.tbl. A file orders.tbl contains information about the orders submitted by the customers. A file lineitem.tbl contains information about the items included in the orders. A file partsupp.tbl contains information about the items and suppliers of items included in the orders. Open Terminal window and use cd command to navigate to a folder with the just unzipped files. Start Hive Server 2 in the terminal window (remember to start Hadoop first). When ready process a script file dbcreate.hql to create the internal relational tables and to load data into the tables. You can use either beeline or SQL Developer. A script dbdrop.hql can be used to drop the tables. The relational tables PARTSUPP, LINEITEM, ORDERS implement a simple twodimensional data cube. The relational tables PARTSUPP and ORDERS implement the dimensions of parts supplied by suppliers and orders. A relational table LINEITEM implements a fact entity of a data cube. (1) Implement the following query using GROUP BY clause with ROLLUP operator. For the parts with the keys (L_PARTKEY) 7, 8,9 find the largest discount applied (L_DISCOUNT) per part key (L_PARTKEY) and per part key and supplier key (L_PARTKEY, L_SUPPKEY) and the largest discount applied at all. (2) Implement the following query using GROUP BY clause with GROUPING SETS operator. Find the smallest price (L_EXTENDEDPRICE) per order year (O_ORDERDATE), and order clerk (O_CLERK). Implement the following SQL queries as SELECT statements using window partitioning technique. (3) For each part list its key (PS_PARTKEY), all its available quantities (PS_AVAILQTY), the smallest available quantity, and the average available quantity. Consider only the parts with the keys 5 and 15. (4) For each part list its key (PS_PARTKEY) and all its available quantities (PS_AVAILQTY) sorted in descending order and a rank (position number in an ascending order) of each quantity. Consider only the parts with the keys 10 and 20. Use an analytic function ROW_NUMBER(). (5) For each part list its key (PS_PARTKEY), its available quantity, and an average available quantity (PS_AVAILQTY) of the current quantity and all previous quantities in the ascending order of available quantities. Consider only the parts with the keys 15 and 25. Use ROWS UNBOUNDED PRECEEDING sub-clause within PARTITION BY clause. When ready, save your SELECT statements in a file solution3.hql. Then, process a script file solution3.hql and save the results in a report solution3.txt. To save a report from processing of a file solution3.hql copy the contents of Terminal window into a text file solution3.txt. Processing of your script must return NO ERRORS! A solution with errors is worth no marks ! Deliverables A file solution3.txt that contains a report from processing of SELECT statements included in HQL script solution3.hql. The report MUST NOT include any errors, and the report must list all SQL statements processed. Task 4 (5 marks) Discovering a conceptual data warehouse schema from an operational database The Research and Innovative Technology Administration (RITA) coordinates the US Department of Transportations (DOT) research programs. It collects several statistics about many kinds of transportation means, including the information about flight segments between airports summarized by month. There is a set of tables T T100I Segment All Carrier XXXX, one by year, ranging from 1990 up until now. These tables include information about the scheduled and actually departured flights, the number of seats sold, the freight transported, and the distance travelled, among other ones. The schema and description of these tables are given below. From the information given below, construct an appropriate conceptual data warehouse schema. Analyze the input data and motivate the choice of your schema. Summaries Departures scheduled Departures performed Available payload (pounds) Available seats Non-stop segment passengers transported Non-stop segment freight transported (pounds) Non-stop segment mail transported (pounds) Distance between airports (miles) Ramp to ramp time (minutes) Airborne time (minutes) Carrier Unique Carrier Airline ID Unique Carrier Name Carrier Region Carrier Code Carrier Name Carrier Group Origin Origin Airport ID Origin airport Origin City Name Origin Country ID Origin Country Name Destination Dest Airport ID Dest Airport Dest City Name Dest Country ID Dest Country Name Aircraft Aircraft ID Aircraft Group Aircraft Type Aircraft Config Time period Year Quarter Month Flight Flight number Date Departure time To draw a conceptual schema, use a graphical notation explained to you in a presentation 10 Conceptual Data Warehouse Design. To draw your diagram, you must use UMLetlet 14.3 diagram drawing tool and apply ISIT312 Palette. The palette is available from the right upper corner of the main menu of UMLet 14.3. Deliverables A file solution4.bmp with a drawing of a conceptual schema of a sample data warehouse domain. Submission of Assignment 2 Note, that you have only one submission. So, make it absolutely sure that you submit the correct files with the correct contents. No other submission is possible ! Submit the files solution1.bmp, solution2.bmp, solution2.txt, solution3.txt, and solution4.bmp through Moodle in the following way: (1) Access Moodle at http://moodle.uowplatform.edu.au/ (2) To login use a Login link located in the right upper corner the Web page or in the middle of the bottom of the Web page (3) When logged select a site ISIT312/912 (S223) Big Data Management (4) Scroll down to a section Assessment items (Assignments) (5) Click at In this place you can submit the outcomes of your work on the tasks included in Assignment 2 link. (6) Click at a button Add Submission (7) Move a file solution1.bmp into an area You can drag and drop files here to add them. You can also use a link Add (8) Repeat step (7) for the remaining files solution2.bmp, solution2.txt, solution3.txt, and solution4.bmp (9) Click at a button Save changes (10)Click at the checkbox with a text attached: By checking this box, I confirm that this submission is my own work, in order to confirm the authorship of your submission. (11)Click at a button Continue (12)Check if Submission status is Submitted for grading. End of specification
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started