Question

The final project is an opportunity to broaden or deepen your knowledge of big data topics that have not been covered in class. It could be one of the following options: 1. learn new functionality in big data libraries you already know; 2. learn new big data tools or software libraries; 3. learn new scalable models, algorithms, or frameworks. Note that "new" here only means new to you or to all members of your group. It does not necessarily mean newly open-sourced software or newly published frameworks.

For the project, you will work individually or in a team of 2-3 students on a project of your choosing that is interesting, significant, and relevant to big data. The goal of your final project is to research something new, dig deep into it, and share what you have learned with the rest of the class. All members of a group will receive the same grade on group work. Therefore, it is in your interest to choose group members (ideally in the first week of class) who have the same goals in the class as you do. It is also in your interest to work together and ensure that all tasks are completed effectively. Your score on group work may be adjusted based on your contribution.

Project Idea

We provide one concrete idea here as an example to demonstrate what the project should look like. We also provide some project ideas for your consideration.

Sample project:

We will learn the Spark library (week 5) in class as well as how to handle data streams (week 7), but not all built-in libraries will be covered in detail in class or in assignments. For example, we will not cover the Spark Streaming library in class. Your project could be to study the Spark Streaming library and practice some of the functions it offers on a dataset you choose. Here is more detailed documentation of Spark Streaming. The documentation is based on Spark 3.3, which can be used on ODU's cluster. Please email the instructor to set up the environment for Spark 3.3. Here we only briefly mention the overall idea; your project abstract should go into more detail than what we provide here. For example, the detailed problem could be to figure out how to use DataFrames and SQL queries on streaming data by working with some dataset. Note that in this project you need to actually implement and test the functions using data and code. Reading through the documentation should be part of the project, but it should not be the only thing in the project.
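To make the "DataFrame and SQL queries on streaming data" idea more concrete, here is a minimal sketch of what a first experiment might look like. It assumes PySpark 3.x is installed locally and uses the Structured Streaming API (the DataFrame-based streaming API in `pyspark.sql`) rather than the older DStream API. The built-in `rate` source stands in for a real dataset, and the view name, query name, window length, and helper function names are all hypothetical choices made for illustration.

```python
# Hypothetical names chosen for this sketch; they are not part of the assignment.
VIEW_NAME = "rate_events"
SINK_NAME = "window_counts"


def build_window_query(view_name: str, window_seconds: int) -> str:
    """Return a SQL string that counts rows per fixed-length time window."""
    window_expr = f"window(timestamp, '{window_seconds} seconds')"
    return (
        f"SELECT {window_expr} AS w, count(*) AS n_events, sum(value) AS value_sum "
        f"FROM {view_name} GROUP BY {window_expr}"
    )


def run_demo(run_seconds: int = 15) -> None:
    """Run a short local streaming job and print the windowed counts."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[2]")
        .appName("streaming-sql-sketch")
        .getOrCreate()
    )

    # The built-in "rate" source emits rows with `timestamp` and `value` columns.
    stream_df = (
        spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    )

    # Register the streaming DataFrame as a view so it can be queried with SQL.
    stream_df.createOrReplaceTempView(VIEW_NAME)
    counts = spark.sql(build_window_query(VIEW_NAME, 10))

    # Write the running aggregate to an in-memory table for inspection.
    query = (
        counts.writeStream.outputMode("complete")
        .format("memory")
        .queryName(SINK_NAME)
        .start()
    )

    # Let the stream run for a while, then read the results with ordinary SQL.
    query.awaitTermination(run_seconds)
    spark.sql(
        f"SELECT w.start AS window_start, n_events, value_sum "
        f"FROM {SINK_NAME} ORDER BY window_start"
    ).show(truncate=False)

    query.stop()
    spark.stop()
```

Calling `run_demo()` on a machine with PySpark and Java available should print a few 10-second windows with their row counts. A real project would replace the `rate` source with a file, socket, or Kafka source backed by the chosen dataset and go well beyond a single aggregate query.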

Project ideas:

You may wonder what types of big data systems you could study. Below, I summarize popular big data systems and group them into categories:

  1. Database or data warehouse: Amazon Redshift, Vertica, Google BigQuery, Hive, Presto, Amazon Athena, Azure SQL Data Warehouse, Snowflake
  2. Big data processing: Apache Spark, AWS EMR (Spark), Azure HDInsight, and Amazon SageMaker
  3. Machine learning (ML) at large scale: Amazon SageMaker, BigQuery ML, Spark MLlib
  4. Deep learning: Azure Batch AI, TensorFlow, PyTorch

Other popular systems:

  1. Graph database service: Amazon Neptune, Neo4j
  2. NoSQL (not only SQL) database: Amazon DynamoDB
  3. Streaming data: Spark Streaming
  4. Natural language processing: Amazon Comprehend
  5. Graph processing: Spark GraphX

You may not have heard of some of the systems mentioned above. The following notes provide more context for some of them:

  • TensorFlow or PyTorch: a sample project could be to code and test a multi-layer perceptron neural network that recognizes handwritten digits.
  • Presto: supports interactive SQL queries

  • Spark (e.g. Spark ML and Spark SQL): to use it, you can choose among ODU's cluster, Azure HDInsight, AWS EMR, and Amazon SageMaker

  • Amazon Athena: interactive query service to query data and analyze big data in Amazon S3

  • Spark Streaming: a built-in library in Spark that allows data engineers and data scientists to process real-time data from various sources including (but not limited to) Kafka, Flume, and Amazon Kinesis

The above are all ideas for your consideration. There are many other interesting topics, such as studying views in SQL or studying a hyperparameter tuning library for deep neural networks.
