Question
Applied Data Mining Project Overview
Introduction
The course project is the culminating learning experience in this class. The goal of the project is to conduct a data analysis in a setting very similar to real-world applications. It will allow you to apply skills covered throughout the entire course. The application domain of the analysis is open for the student to choose; however, the chosen topic must first be approved by the instructor. The project requires you to identify data sources (Unit 4), convert the data into a format suitable for analysis, prepare the data, and then analyze it. The preliminary results of the analysis should be reported during Unit 7, followed by a full report and an analysis presentation (Unit 8). To meet expectations, students will be asked to discuss how parallel computing can benefit the analysis. To exceed expectations, one of the parallelization methods should be applied to the data analysis.
Directions
The project consists of the following stages:
1) Identify the datasets to be used for the analysis. The project should utilize at least two publicly available datasets that have not been used in any other assignment in the class. By Unit 4, submit a short description of the datasets, links to the datasets, and a couple of paragraphs on how those datasets can be studied together and how this study will contribute to a business or to society. This submission does not require specific research questions; however, an explanation of the problems that analyzing the combination of the selected datasets may solve is required. Cities and states usually provide public datasets that could be used, for example https://data.kcmo.org/ for Kansas City, MO (or the one for your city or state). You can also find links to many other datasets on this page: https://www.datasciencecentral.com/profiles/blogs/great-github-list-of-public-data-sets
2) Identify the best technology for data conversion, data cleaning, and data munging. Apply those techniques to the selected datasets to produce a single merged dataset for further analysis.
3) Identify the research question and the characteristics (variables) you will need to study it.
4) Identify the need, or potential need, for distributed computing to store, manipulate, or analyze the data.
5) Conduct the preliminary analysis by running one of the data mining techniques (e.g., clustering or regression).
6) Interpret and report the preliminary results of the analysis (Unit 7, Sunday 11:59pm). Use any appropriate format (e.g., tables, charts) to report the results; the write-up must include a results-based response to the research question.
7) Prepare the full report, which must include:
a. Research question
b. Description of the datasets
c. Description of the specific data preparation process conducted
d. Description of analytical techniques
e. Description of the parallelization technologies used, or of a potential need for using those technologies
f. Results of the analysis, including tables and charts that follow the basics of data visualization
g. Conclusions about the results, the limitations, and the process of the data analysis conducted
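As a rough illustration of stages 2, 4, and 5, the sketch below merges two small datasets on a shared key, then fits a simple least-squares regression whose sums are computed per chunk in parallel. Everything in it is an assumption made for illustration: the records, the column names (`zip`, `violations`, `population_k`), and the helper functions are hypothetical and do not come from the assignment. A thread pool from the Python standard library stands in for a real parallel or distributed framework; for large, CPU-bound workloads a process pool or a cluster tool would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy records standing in for two public datasets
# that share a "zip" column. The values are invented.
inspections = [
    {"zip": "64101", "violations": 4},
    {"zip": "64102", "violations": 8},
    {"zip": "64105", "violations": 6},
    {"zip": "64108", "violations": 14},
]
demographics = [
    {"zip": "64101", "population_k": 2},
    {"zip": "64102", "population_k": 4},
    {"zip": "64105", "population_k": 3},
    {"zip": "64108", "population_k": 7},
    {"zip": "64199", "population_k": 1},  # no matching inspection row
]


def merge_on_key(left, right, key):
    """Inner join: keep only rows whose key appears in both datasets."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


def partial_sums(chunk):
    """Map step: the sums least squares needs, computed on one chunk."""
    xs = [row["population_k"] for row in chunk]
    ys = [row["violations"] for row in chunk]
    return (
        len(chunk),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )


def fit_line(rows, n_chunks=2):
    """Split rows into chunks, sum each chunk in parallel, then combine."""
    chunks = [rows[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(partial_sums, chunks))
    # Reduce step: add the per-chunk sums component by component.
    n, sx, sy, sxx, sxy = (sum(values) for values in zip(*parts))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


merged = merge_on_key(inspections, demographics, "zip")
slope, intercept = fit_line(merged)
print(f"violations ~ {slope:.2f} * population_k + {intercept:.2f}")
```

The split-then-combine structure is the part that carries over to larger tools: each chunk produces only five numbers, so the same pattern could in principle run across processes or machines without moving the raw rows.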