Question
1. Business Understanding
Students are expected to identify a data analytics task of their choice. Detail the Business Understanding part of your problem under this heading by addressing the following questions.
What is the business problem that you are trying to solve?
What data do you need to answer the above problem?
What are the different sources of data?
What kind of analytics task are you performing?
Score: 1 Mark in total (0.25 mark each)
--------------Type the answers below this line--------------
2. Data Acquisition
For the problem identified, find an appropriate data set (your data set must be unique) from any public data source.
2.1 Download the data directly
##---------Type the code below this line------------------##
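A minimal sketch of a direct download using only the standard library. The helper name `download_file`, the example URL and the destination path are placeholders for illustration, not part of the assignment.

```python
import urllib.request
from pathlib import Path


def download_file(url, dest):
    """Download `url` to the local path `dest` and return that path."""
    dest_path = Path(dest)
    # Create the target folder if it does not exist yet.
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest_path))
    return dest_path


# Hypothetical usage; replace the URL with the real public source:
# download_file("https://example.org/data.csv", "data/raw.csv")
```

Keeping the download in code (rather than downloading by hand) makes the notebook reproducible from a clean environment.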
2.2 Code for converting the above downloaded data into a dataframe
##---------Type the code below this line------------------##
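One way to load a downloaded CSV into a dataframe, sketched with pandas. The in-memory text and its column names are made up so the block runs on its own; with a real file, pass the local path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the downloaded file (illustrative values only).
raw_csv = io.StringIO(
    "age,income,city,purchased\n"
    "25,32000,Pune,no\n"
    "41,58000,Delhi,yes\n"
    "33,45000,Pune,no\n"
    "52,61000,Mumbai,yes\n"
    "29,39000,Delhi,no\n"
    "47,72000,Mumbai,yes\n"
)

# With a real download this would be e.g. pd.read_csv("data/raw.csv").
df = pd.read_csv(raw_csv)
print(df.shape)
```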
2.3 Confirm the data has been downloaded correctly by displaying the first 5 and last 5 records.
##---------Type the code below this line------------------##
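A short sketch of the check, assuming the data is already in a pandas dataframe; the toy dataframe below is illustrative.

```python
import pandas as pd

# Toy dataframe standing in for the loaded data set.
df = pd.DataFrame({"id": range(1, 13), "value": [x * 1.5 for x in range(1, 13)]})

first_five = df.head(5)  # first 5 records
last_five = df.tail(5)   # last 5 records

print(first_five)
print(last_five)
```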
2.4 Display the column headings, statistical information, description and statistical summary of the data.
##---------Type the code below this line------------------##
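The calls below are one possible way to produce the column headings, type information and statistical summary. The toy dataframe and its deliberate missing values are assumptions for illustration. The same outputs also answer the questions in 2.5 (size, attribute types, null data).

```python
import pandas as pd

# Toy data with one missing value per column.
df = pd.DataFrame({
    "age": [25, 41, None, 33],
    "city": ["Pune", "Delhi", "Pune", None],
})

print(df.columns.tolist())   # column headings
print(df.shape)              # (rows, columns)
df.info()                    # dtypes and non-null counts

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds categorical columns
null_counts = df.isna().sum()              # missing values per column

print(full_summary)
print(null_counts)
```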
2.5 Write your observations from the above.
Size of the dataset
What type of data attributes are there?
Is there any null data that has to be cleaned?
Score: 2 Marks in total (0.25 marks for 2.1, 0.25 marks for 2.2, 0.5 marks for 2.3, 0.25 marks for 2.4, 0.75 marks for 2.5)
--------------Type the answers below this line--------------
3. Data Preparation
If input data is numerical or categorical, do 3.1, 3.2 and 3.4
If input data is text, do 3.3 and 3.4
3.1 Check for
duplicate data
missing data
data inconsistencies
##---------Type the code below this line------------------##
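A sketch of the three checks with pandas. The toy data, and the two inconsistency rules (negative ages, differently spelled city names), are assumptions; the relevant rules depend on the actual data set.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 25, 41, None, -3],
    "city": ["Pune", "Pune", "delhi ", "Delhi", "Pune"],
})

# Duplicate data: rows identical to an earlier row.
n_duplicates = int(df.duplicated().sum())

# Missing data: count of nulls per column.
missing_per_column = df.isna().sum()

# Data inconsistencies (dataset-specific, two illustrative rules):
negative_ages = int((df["age"] < 0).sum())     # impossible values
city_variants = sorted(df["city"].unique())    # spelling/case variants

print(n_duplicates, negative_ages, city_variants)
print(missing_per_column)
```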
3.2 Apply techniques
to remove duplicate data
to impute or remove missing data
to remove data inconsistencies
##---------Type the code below this line------------------##
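One possible cleaning pass, sketched on toy data. Median imputation and dropping rows with negative ages are illustrative choices, not the required method; the report should justify whatever is actually used.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 25, 41, None, -3],
    "city": ["Pune", "Pune", "delhi ", "Delhi", "Pune"],
})

# 1. Remove exact duplicate rows.
clean = df.drop_duplicates().copy()

# 2. Remove inconsistencies: normalise text, drop impossible ages.
clean["city"] = clean["city"].str.strip().str.title()
clean = clean[clean["age"].isna() | (clean["age"] >= 0)].copy()

# 3. Impute missing ages with the median of the remaining values.
clean["age"] = clean["age"].fillna(clean["age"].median())

clean = clean.reset_index(drop=True)
print(clean)
```

Normalising text can create new duplicates, so it may be worth re-running `duplicated()` after this step.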
3.3 Encode categorical data
##---------Type the code below this line------------------##
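A sketch of two common encodings: one-hot for a nominal column and an explicit integer mapping for an ordinal one. The column names and the `size_order` mapping are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],      # nominal (no order)
    "size": ["small", "large", "medium", "small"],    # ordinal (ordered)
})

# Nominal column: one indicator column per category.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

# Ordinal column: map categories to integers that preserve the order.
size_order = {"small": 0, "medium": 1, "large": 2}
encoded["size"] = encoded["size"].map(size_order)

print(encoded)
```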
3.4 Text data
Remove special characters
Change the case (up-casing and down-casing).
Tokenization process of discretizing words within a document.
Filter Stop Words.
##---------Type the code below this line------------------##
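The four text steps can be sketched with the standard library alone. The stop-word set below is a tiny hand-written assumption; a library list (for example from NLTK) would be fuller. Whitespace splitting stands in for a proper tokenizer.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it", "this"}


def preprocess(text):
    """Return (all tokens, tokens after stop-word filtering)."""
    no_special = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # 1. remove special characters
    lowered = no_special.lower()                        # 2. change the case
    tokens = lowered.split()                            # 3. tokenize on whitespace
    filtered = [t for t in tokens if t not in STOP_WORDS]  # 4. filter stop words
    return tokens, filtered


tokens, filtered = preprocess("The delivery was late, and the box is damaged!")
print(len(tokens), len(filtered))
```

The two lengths are the token counts the report asks for: after tokenization and after stop-word filtering.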
3.4 Report
Mention and justify the method adopted
to remove duplicate data, if present
to impute or remove missing data, if present
to remove data inconsistencies, if present
OR for text data
How many tokens after step 3?
How many tokens after stop-word filtering?
If any of the above are not present, state that in the report below as well.
Score: 2 Marks (based on the dataset you have, the data preparation you had to do and the report typed, marks will be distributed between 3.1, 3.2, 3.3 and 3.4)
##---------Type the code below this line------------------##
3.5 Identify the target variables.
Separate the data from the target such that the dataset is in the form of (X,y) or (Features, Label)
Discretize / Encode the target variable or perform one-hot encoding on the target or any other as and if required.
Report the observations
Score: 1 Mark
##---------Type the code below this line------------------##
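A minimal sketch of separating features from the target and encoding a categorical target with scikit-learn's `LabelEncoder`. The column names, including the target column `purchased`, are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "age": [25, 41, 33, 52],
    "income": [32000, 58000, 45000, 61000],
    "purchased": ["no", "yes", "no", "yes"],
})

target_column = "purchased"  # assumed target for this toy example

# Features (X) are every column except the target.
X = df.drop(columns=[target_column])

# Label (y): encode the class names as integers.
encoder = LabelEncoder()
y = encoder.fit_transform(df[target_column])

print(X.shape, y, list(encoder.classes_))
```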
4. Data Exploration using various plots
4.1 Scatter plot of each quantitative attribute with the target.
Score: 1 Mark
##---------Type the code below this line------------------##
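One way to draw a scatter plot of each quantitative attribute against the target, sketched with matplotlib on toy data. The `Agg` backend line only makes the block run without a display; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "age": [25, 41, 33, 52, 29, 47],
    "income": [32000, 58000, 45000, 61000, 39000, 72000],
    "purchased": [0, 1, 0, 1, 0, 1],  # already-encoded target (assumed)
})
target = "purchased"

# Every numeric column except the target gets its own panel.
numeric_cols = [c for c in df.select_dtypes("number").columns if c != target]

fig, axes = plt.subplots(1, len(numeric_cols),
                         figsize=(5 * len(numeric_cols), 4), squeeze=False)
for ax, col in zip(axes[0], numeric_cols):
    ax.scatter(df[col], df[target])
    ax.set_xlabel(col)
    ax.set_ylabel(target)
fig.tight_layout()
```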
4.2 EDA using visuals
Use (minimum) 2 plots (pair plot, heat map, correlation plot, regression plot...) to identify the optimal set of attributes that can be used for classification.
Name them, explain why you think they can be helpful in the task and perform the plot as well. Unless proper justification for the choice of plots given, no credit will be awarded.
Score: 2 Marks
##---------Type the code below this line------------------##
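A sketch of two such plots: a correlation heat map (to spot redundant attributes and attributes correlated with the target) and a pair plot via `pandas.plotting.scatter_matrix`. The synthetic columns `f1`, `f2`, `f3` are assumptions, with `f2` built to be nearly redundant with `f1`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({"f1": rng.normal(size=n)})
df["f2"] = df["f1"] * 0.9 + rng.normal(scale=0.1, size=n)  # near-copy of f1
df["f3"] = rng.normal(size=n)
df["target"] = (df["f1"] + df["f3"] > 0).astype(int)

# Plot 1: correlation heat map.
corr = df.corr()
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)

# Plot 2: pair plot of all numeric columns.
pair_axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
```

A pair of attributes with correlation close to 1 (here `f1` and `f2`) carries largely the same information, so one of them is a candidate to drop.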
5. Data Wrangling
5.1 Univariate Filters
Numerical and Categorical Data
Identify top 5 significant features by evaluating each feature independently with respect to the target variable by exploring
Mutual Information (Information Gain)
Gini index
Gain Ratio
Chi-Squared test
Fisher Score
(From the above 5, you are required to use any two.)
For Text data
Stemming / Lemmatization.
Forming n-grams and storing them in the document vector.
TF-IDF
(From the above 3, you are required to use any two.)
Score: 3 Marks
##---------Type the code below this line------------------##
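A sketch of two of the listed filters, mutual information and the chi-squared test, using scikit-learn. The synthetic data is an assumption: six noise columns plus two columns (`signal_a`, `signal_b`) built to depend on the target, so the ranking can be sanity-checked.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

rng = np.random.default_rng(42)
n = 300
y = rng.integers(0, 2, size=n)

# Six uninformative columns plus two that depend on the target.
X = pd.DataFrame({f"noise_{i}": rng.integers(0, 10, size=n) for i in range(6)})
X["signal_a"] = y * 5 + rng.integers(0, 3, size=n)
X["signal_b"] = (1 - y) * 4 + rng.integers(0, 3, size=n)

k = 5  # number of top features to keep

# Method 1: mutual information (information gain) per feature.
mi_scores = pd.Series(
    mutual_info_classif(X, y, discrete_features=True, random_state=0),
    index=X.columns,
).sort_values(ascending=False)
top_mi = mi_scores.head(k).index.tolist()

# Method 2: chi-squared test (requires non-negative feature values).
chi_selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
chi_scores = pd.Series(chi_selector.scores_, index=X.columns).sort_values(ascending=False)
top_chi = chi_scores.head(k).index.tolist()

print(top_mi)
print(top_chi)
```

Both filters score each feature independently of the others, which is what makes them univariate; they will not detect redundancy between two informative features.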
5.2 Report observations
Write your observations from the results of each method. Clearly justify your choice of the method.
Score: 1 Mark
##---------Type the code below this line------------------##
6. Implement Machine Learning Techniques
Use any 2 ML algorithms
Classification -- Decision Tree classifier
Clustering -- kmeans
Association Analysis
Anomaly detection
Textual data -- Naive Bayes classifier (not taught in this course)
A clear justification has to be given for why a certain algorithm was chosen to address your problem.
Score: 4 Marks (2 marks each for each algorithm)
6.1 ML technique 1 + Justification
##---------Type the code below this line------------------##
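A minimal decision tree classifier sketch with scikit-learn. The bundled iris data is only a stand-in so the block runs offline (the assignment requires your own data set), and `max_depth=3` is an illustrative setting.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; replace with the (X, y) prepared in section 3.5.
X, y = load_iris(return_X_y=True)

# Hold out 30% for testing, keeping class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A shallow tree is easier to interpret and less prone to overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, tree.predict(X_test))
print(round(test_accuracy, 3))
```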
6.2 ML technique 2 + Justification
##---------Type the code below this line------------------##
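A minimal k-means sketch with scikit-learn. The two synthetic blobs and the choice of `n_clusters=2` are assumptions; on real data the number of clusters needs its own justification (for example an elbow or silhouette analysis).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two well-separated synthetic groups of 50 points each.
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# k-means is distance based, so features are standardised first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(np.bincount(labels))
```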
7. Conclusion
Compare the performance of the ML techniques used.
Derive values for performance metrics such as accuracy, precision, recall, F1 score and AUC-ROC to compare the ML algorithms, and plot them. The comparison should be based on several metrics and not on accuracy alone; only then is it meaningful. You may use a confusion matrix, classification report, word cloud etc. as required by your application/problem.
Score: 1 Mark
##---------Type the code below this line------------------##
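A sketch of a multi-metric comparison between two classifiers. The bundled breast cancer data and the choice of a decision tree versus Gaussian Naive Bayes are assumptions for illustration. These metrics apply to classifiers; if one of the chosen techniques is clustering, a different measure (such as the silhouette score) would be needed for that model.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Stand-in binary classification data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "naive_bayes": GaussianNB(),
}

rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
    rows[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "auc_roc": roc_auc_score(y_test, proba),
    }

# One row per model, one column per metric.
results = pd.DataFrame.from_dict(rows, orient="index")
print(results.round(3))

# Grouped bar chart of all metrics for both models.
ax = results.plot.bar(rot=0, figsize=(7, 4), ylim=(0, 1))
```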
8. Solution
What is the solution proposed to solve the business problem discussed in Section 1? Also share your learnings from working through the problem, in terms of challenges, observations, decisions made etc.
Score: 2 Marks
--------------Type the answers below this line--------------
##---------Type the answer below this line------------------##