
Question


1. Business Understanding

You are expected to identify a data analytics task of your choice. Detail the Business Understanding part of your problem under this heading by addressing the following questions.

What is the business problem that you are trying to solve?

What data do you need to answer the above problem?

What are the different sources of data?

What kind of analytics task are you performing?

Score: 1 Mark in total (0.25 mark each)

 
--------------Type the answers below this line-------------- 

2. Data Acquisition

For the problem identified, find an appropriate data set (your data set must be unique) from any public data source.

2.1 Download the data directly

 
##---------Type the code below this line------------------##

2.2 Code for converting the above downloaded data into a dataframe

 
##---------Type the code below this line------------------##

2.3 Confirm the data has been downloaded correctly by displaying the first 5 and last 5 records.

 
##---------Type the code below this line------------------##

2.4 Display the column headings, statistical information, description and statistical summary of the data.

 
##---------Type the code below this line------------------##
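A minimal sketch of steps 2.2 to 2.4, assuming Python with pandas (the assignment does not name a language or library). The CSV text and its column names are made-up placeholders standing in for the downloaded file; in practice `pd.read_csv` would be given the path or URL of the chosen data set.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file; replace with the real path or URL.
raw_csv = io.StringIO(
    "age,income,city,purchased\n"
    "25,32000,Pune,no\n"
    "41,58000,Delhi,yes\n"
    "37,,Mumbai,yes\n"
    "29,41000,Pune,no\n"
    "52,75000,Delhi,yes\n"
    "33,47000,Mumbai,no\n"
)

# 2.2 Convert the downloaded data into a dataframe.
df = pd.read_csv(raw_csv)

# 2.3 First 5 and last 5 records.
print(df.head(5))
print(df.tail(5))

# 2.4 Column headings, data types, and statistical summary.
print(df.columns.tolist())
df.info()                          # dtypes and non-null counts per column
print(df.describe(include="all"))  # summary for numeric and categorical columns

# Inputs for the observations asked for in 2.5.
n_rows, n_cols = df.shape
null_counts = df.isnull().sum()
print(n_rows, n_cols)
print(null_counts)
```

The shape, the dtypes reported by `info()`, and the per-column null counts map directly onto the three observation questions in 2.5.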

2.5 Write your observations from the above.

Size of the dataset

What type of data attributes are there?

Is there any null data that has to be cleaned?

Score: 2 Marks in total (0.25 marks for 2.1, 0.25 marks for 2.2, 0.5 marks for 2.3, 0.25 marks for 2.4, 0.75 marks for 2.5)

 
--------------Type the answers below this line--------------

3. Data Preparation

 
If input data is numerical or categorical, do 3.1, 3.2 and 3.4
If input data is text, do 3.3 and 3.4

3.1 Check for

duplicate data

missing data

data inconsistencies

 
##---------Type the code below this line------------------##

3.2 Apply techniques

to remove duplicate data

to impute or remove missing data

to remove data inconsistencies

 
##---------Type the code below this line------------------##
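The checks in 3.1 and the fixes in 3.2 could look like the sketch below, again assuming pandas. The dataframe is invented for illustration, and median imputation is only one possible choice; the report in this section is where the chosen method would be justified.

```python
import pandas as pd

# Hypothetical records with one duplicate row, one missing value,
# and inconsistent spellings of the same city.
df = pd.DataFrame(
    {
        "age": [25, 41, 41, 37, 29],
        "income": [32000.0, 58000.0, 58000.0, None, 41000.0],
        "city": ["Pune", "delhi ", "delhi ", "Mumbai", "PUNE"],
    }
)

# 3.1 Check for duplicates, missing values, and inconsistent labels.
n_duplicates = int(df.duplicated().sum())
n_missing = df.isnull().sum()
print(n_duplicates)
print(n_missing)
print(df["city"].unique())

# 3.2 Remove exact duplicate rows.
clean = df.drop_duplicates().reset_index(drop=True)

# Impute missing income with the column median (dropping the rows is the alternative).
clean["income"] = clean["income"].fillna(clean["income"].median())

# Remove inconsistencies: trim whitespace and use one capitalisation.
clean["city"] = clean["city"].str.strip().str.title()
print(clean)
```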

3.3 Encode categorical data

 
##---------Type the code below this line------------------##

3.4 Text data

Remove special characters

Change the case (up-casing and down-casing).

Tokenization process of discretizing words within a document.

Filter Stop Words.

 
##---------Type the code below this line------------------##
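The four text-preparation steps listed above can be sketched with the Python standard library alone. The stop-word list and the sample sentence are illustrative assumptions; a real project would normally use a fuller stop-word list from an NLP library.

```python
import re

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it"}


def preprocess(text):
    """Return (all_tokens, filtered_tokens) for one document."""
    # Remove special characters: keep letters, digits, and whitespace.
    no_special = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Change the case: down-case everything.
    lowered = no_special.lower()
    # Tokenization: split on runs of whitespace.
    tokens = lowered.split()
    # Filter stop words.
    filtered = [t for t in tokens if t not in STOP_WORDS]
    return tokens, filtered


doc = "The delivery was late, and the packaging is damaged!"
tokens, filtered = preprocess(doc)
print(len(tokens), tokens)
print(len(filtered), filtered)
```

The two lengths printed here are the token counts before and after stop-word filtering that the report asks for.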

 
##---------Type the code below this line------------------##

3.4 Report

Mention and justify the method adopted

to remove duplicate data, if present

to impute or remove missing data, if present

to remove data inconsistencies, if present

OR for text data

How many tokens after step 3?

How many tokens after stop-word filtering?

If any of the above are not present, state that in the report below as well.

Score: 2 Marks (based on the dataset you have, the data preparation you had to do, and the report typed, marks will be distributed between 3.1, 3.2, 3.3 and 3.4)

 
##---------Type the code below this line------------------##

 
##---------Type the code below this line------------------##

3.5 Identify the target variables.

Separate the data from the target such that the dataset is in the form of (X,y) or (Features, Label)

Discretize / Encode the target variable or perform one-hot encoding on the target or any other as and if required.

Report the observations

Score: 1 Mark

 
##---------Type the code below this line------------------##
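One way to picture step 3.5 is the plain-Python sketch below, which uses no library. The records and the choice of `purchased` as the target column are hypothetical; with a dataframe the same split is usually done by dropping the target column from the features.

```python
# Hypothetical records; "purchased" is assumed to be the target column.
records = [
    {"age": 25, "income": 32000, "purchased": "no"},
    {"age": 41, "income": 58000, "purchased": "yes"},
    {"age": 37, "income": 52000, "purchased": "maybe"},
    {"age": 29, "income": 41000, "purchased": "no"},
]
TARGET = "purchased"

# Separate the features (X) from the label (y).
X = [{k: v for k, v in row.items() if k != TARGET} for row in records]
y_raw = [row[TARGET] for row in records]

# Label-encode the target: each class gets an integer, in sorted class order.
classes = sorted(set(y_raw))
class_to_index = {c: i for i, c in enumerate(classes)}
y = [class_to_index[label] for label in y_raw]

# One-hot encode the target, if the chosen model needs it.
y_one_hot = [[1 if i == idx else 0 for i in range(len(classes))] for idx in y]

print(classes)
print(y)
print(y_one_hot)
```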

4. Data Exploration using various plots

4.1 Scatter plot of each quantitative attribute with the target.

Score: 1 Mark

 
##---------Type the code below this line------------------##

4.2 EDA using visuals

Use (minimum) 2 plots (pair plot, heat map, correlation plot, regression plot...) to identify the optimal set of attributes that can be used for classification.

Name them, explain why you think they can be helpful in the task, and produce the plots as well. Unless proper justification for the choice of plots is given, no credit will be awarded.

Score: 2 Marks

 
##---------Type the code below this line------------------##
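A correlation plot or heat map is a picture of a correlation matrix, so it can help to see how those numbers arise before plotting them. The sketch below computes Pearson correlations in plain Python; the column names and values are invented, and a plotting library would then be used to draw the matrix.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attributes; "passed" plays the role of the target.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "practice_tests": [2, 4, 6, 8, 10],
    "absences": [5, 4, 3, 2, 1],
    "passed": [0, 0, 1, 1, 1],
}

names = list(columns)
corr = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Each printed row is one row of the matrix a heat map would colour.
for a in names:
    print(a, [round(corr[a][b], 2) for b in names])
```

Reading such a matrix supports the attribute choice: attributes strongly correlated with the target are candidates to keep, while two attributes that are almost perfectly correlated with each other (here `hours_studied` and `practice_tests`) carry largely the same information.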

5. Data Wrangling

5.1 Univariate Filters

Numerical and Categorical Data

Identify top 5 significant features by evaluating each feature independently with respect to the target variable by exploring

Mutual Information (Information Gain)

Gini index

Gain Ratio

Chi-Squared test

Fisher Score (From the above 5 you are required to use only any two)
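As an illustration of two of these filters, the sketch below scores each categorical feature independently against the target in plain Python. Mutual information is computed as the entropy of the target minus its conditional entropy given the feature; the Gini index is read here as the reduction in Gini impurity, which is an assumption about what the assignment intends. The feature names and values are made up.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_score(feature, target, impurity):
    """Impurity of the target minus the weighted impurity within each feature value."""
    n = len(target)
    groups = {}
    for value, label in zip(feature, target):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(target) - weighted


# Hypothetical categorical features and a binary target.
features = {
    "outlook": ["sunny", "sunny", "rain", "rain", "overcast", "overcast"],
    "windy": ["yes", "no", "yes", "no", "yes", "no"],
}
target = ["no", "no", "yes", "no", "yes", "yes"]

info_gain = {name: split_score(col, target, entropy) for name, col in features.items()}
gini_gain = {name: split_score(col, target, gini) for name, col in features.items()}

# Rank features and keep at most the top 5, as the task asks.
TOP_K = 5
ranked = sorted(info_gain, key=info_gain.get, reverse=True)[:TOP_K]
print(info_gain)
print(gini_gain)
print(ranked)
```

Numeric attributes would need to be binned before this kind of scoring, or scored with a library routine designed for continuous inputs.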

For Text data

Stemming / Lemmatization.

Forming n-grams and storing them in the document vector.

TF-IDF (From the above 3 you are required to use any two)
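For text data, n-gram formation and TF-IDF weighting can be sketched in plain Python as follows. The tokenised documents are invented, and the weighting is the basic term frequency times log(N / document frequency) variant; library implementations often add smoothing and normalisation, so their numbers will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (joined by a space) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical documents, already cleaned and tokenised.
docs = [
    ["late", "delivery", "damaged", "box"],
    ["fast", "delivery", "good", "box"],
    ["late", "refund"],
]

# Document vectors hold unigrams plus bigrams.
doc_terms = [ngrams(d, 1) + ngrams(d, 2) for d in docs]

n_docs = len(doc_terms)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for terms in doc_terms for term in set(terms))


def tf_idf(terms):
    """Unsmoothed TF-IDF weights for one document's terms."""
    counts = Counter(terms)
    total = len(terms)
    return {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}


vectors = [tf_idf(terms) for terms in doc_terms]
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in fewer documents, such as `damaged`, receive a higher weight than terms shared across documents, such as `delivery`.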

Score: 3 Marks

 
##---------Type the code below this line------------------##

5.2 Report observations

Write your observations from the results of each method. Clearly justify your choice of the method.

Score: 1 Mark

 
##---------Type the code below this line------------------##

6. Implement Machine Learning Techniques

Use any 2 ML algorithms

Classification -- Decision Tree classifier

Clustering -- kmeans

Association Analysis

Anomaly detection

Textual data -- Naive Bayes classifier (not taught in this course)

A clear justification has to be given for why a certain algorithm was chosen to address your problem.

Score: 4 Marks (2 marks each for each algorithm)

6.1 ML technique 1 + Justification

 
##---------Type the code below this line------------------##
 

6.2 ML technique 2 + Justification

 
##---------Type the code below this line------------------##
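To make one of the listed options concrete, here is a small plain-Python sketch of k-means (Lloyd's algorithm). It is illustrative only: the points are invented, the first k points are used as initial centroids to keep the run deterministic, and an actual submission would more likely call a library implementation.

```python
import math


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; the first k points serve as initial centroids (a simplification)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

A justification for choosing clustering would typically rest on the data having no labelled target, whereas a classifier such as a decision tree presupposes one.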

7. Conclusion

Compare the performance of the ML techniques used.

Derive values for performance metrics such as accuracy, precision, recall, F1 score, AUC-ROC, etc. to compare the ML algorithms, and plot them. Compare the algorithms on several metrics rather than accuracy alone; only then is the comparison sound. You may use a confusion matrix, classification report, word cloud, etc. as your application/problem requires.

Score: 1 Mark

 
##---------Type the code below this line------------------##
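The point about not relying on accuracy alone can be seen in a small plain-Python sketch. The true labels and the two sets of predictions are invented; they stand in for the outputs of whichever two algorithms were used in Section 6.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
pred_a = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical predictions of model A
pred_b = [1, 1, 1, 1, 0, 1, 1, 0]  # hypothetical predictions of model B

result_a = metrics(y_true, pred_a)
result_b = metrics(y_true, pred_b)
print(result_a)
print(result_b)
```

Both hypothetical models reach the same accuracy, yet model B trades precision for perfect recall, which is exactly the kind of difference a single-metric comparison would hide.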

8. Solution

What is the solution proposed to solve the business problem discussed in Section 1? Also share your learnings from working through the problem, in terms of challenges, observations, decisions made, etc.

Score: 2 Marks

--------------Type the answers below this line--------------

 
##---------Type the answer below this line------------------##
