Question
This is a data science project.
The dataset is from Kaggle and is titled "Youth Tobacco Dataset (2 Decades)".
The Pandas library in Python should be used for the work.
The project is divided into four main stages:
Data Cleaning: Clean the dataset by handling missing data appropriately, removing duplicates and outliers, and ensuring a consistent data format. Apply further cleaning if the dataset requires it.
Exploratory Data Analysis (EDA): After cleaning the data, display the basic statistics of the dataset. Perform EDA to understand the distributions, correlations, and relationships between variables. Visualize the findings in at least five ways, including but not limited to scatter plots, bar charts, histograms, and heat maps.
Feature Selection: Based on the EDA findings, select the relevant features for analysis. Any suitable feature selection method can be used, as long as you explain why the selected features were chosen and justify why the other features were excluded.
Predictive Modeling: Use linear or multiple regression to predict the output variable for new inputs. Divide the dataset into training and test sets, train the model on the training set, and validate the results on the test set. Report the accuracy of the model, explain the rationale for selecting the regression method, and interpret the results.
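One way to approach the feature selection stage described above is to rank the numeric candidate features by their absolute Pearson correlation with the target. The sketch below is only an illustration on synthetic data: the helper `rank_features`, the column names (`year`, `sample_size`, `noise`, `data_value`) and the 0.3 threshold are assumptions, not taken from the Kaggle file.

```python
import numpy as np
import pandas as pd


def rank_features(df, target, threshold=0.3):
    """Return numeric features whose |Pearson r| with `target` is >= threshold.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target).abs()
    ranked = corr.sort_values(ascending=False)
    return ranked[ranked >= threshold]


# Synthetic stand-in for the real dataset (made-up values)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "year": rng.integers(1999, 2018, size=n),
    "sample_size": rng.integers(100, 5000, size=n),
    "noise": rng.normal(size=n),
})
# The target depends on year only, so year should be the feature that survives
toy["data_value"] = 40 - 1.2 * (toy["year"] - 1999) + rng.normal(scale=2, size=n)

selected = rank_features(toy, target="data_value")
print(selected)
```

Features below the threshold are the ones to justify excluding in the report. Correlation only captures linear, pairwise relationships, so it is worth cross-checking the ranking against the EDA plots.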
Include a detailed explanation of the problem statement, the data cleaning process, the EDA findings, the feature selection process, and the regression model. Include the visualizations used to communicate the findings.
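The predictive modeling stage could be sketched as follows, assuming scikit-learn is available. The data is synthetic and the column names (`year`, `is_male`, `data_value`) are placeholders for whatever features are selected from the real file. For a regression model, "accuracy" is usually reported as R² and RMSE on the test set rather than as a classification accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned dataset (made-up values)
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "year": rng.integers(1999, 2018, size=n),
    "is_male": rng.integers(0, 2, size=n),
})
toy["data_value"] = (
    35 - 0.9 * (toy["year"] - 1999) + 3 * toy["is_male"]
    + rng.normal(scale=2, size=n)
)

X = toy[["year", "is_male"]]   # selected features
y = toy["data_value"]          # output variable

# Hold out 20% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Multiple linear regression fitted on the training set only
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print("Coefficients:", dict(zip(X.columns, model.coef_)))
print("Test R^2:", round(r2, 3), "Test RMSE:", round(rmse, 3))

# Prediction for a new, unseen input
new_input = pd.DataFrame({"year": [2018], "is_male": [1]})
print("Prediction:", model.predict(new_input))
```

Each coefficient reads as the expected change in the output for a one-unit change in that feature with the others held fixed, which is the main reason to prefer a linear model when interpretability matters.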
I started with the code below. Please verify that it is valid and complete the missing parts.
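import pandas as pd

# Load the dataset (adjust the file name to match the downloaded CSV)
data = pd.read_csv("Youth Tobacco Dataset (2 Decades).csv")
print(data.head())

# Missing values and duplicates
print("Missing values per column:")
print(data.isnull().sum())
data.dropna(inplace=True)
data.drop_duplicates(inplace=True)

# IQR bounds, computed once on the numeric columns only
# (calling quantile() on text columns raises an error in recent pandas)
numeric_cols = data.select_dtypes(include=["int64", "float64"]).columns
Q1 = data[numeric_cols].quantile(0.25)
Q3 = data[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Drop rows that fall outside the bounds in any numeric column
for col in numeric_cols:
    data = data[~((data[col] < lower_bound[col]) | (data[col] > upper_bound[col]))]

# Consistent formats
data["YEAR"] = pd.to_datetime(data["YEAR"], format="%Y")
data["Gender"] = data["Gender"].str.lower()  # lowercase
data = data.apply(lambda x: x.astype(str).str.lower() if x.dtype == "object" else x)

data.to_csv("Cleaned_Youth_Tobacco_Dataset.csv", index=False)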
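A note on "handling missing data appropriately": calling `dropna()` on the whole frame removes every row that has a gap in any column, which can discard most of a dataset when a few columns are sparse. A gentler alternative, sketched below on made-up data, is to drop only the columns that are mostly empty and fill the remaining gaps. The helper name `handle_missing`, the 50% threshold, and the toy columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def handle_missing(df, max_missing_frac=0.5):
    """Drop sparse columns, then fill the remaining gaps (hypothetical helper)."""
    # Keep only columns whose share of missing values is within the limit
    keep = df.columns[df.isna().mean() <= max_missing_frac]
    out = df[keep].copy()
    for col in out.columns:
        if out[col].isna().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                # Numeric gaps: fill with the column median
                out[col] = out[col].fillna(out[col].median())
            else:
                # Text gaps: fill with the most frequent value
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Made-up rows standing in for the real file
toy = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "gender": ["male", None, "female", "male"],
    "data_value": [10.0, np.nan, 14.0, 12.0],
    "footnote": [None, None, None, "x"],  # 75% missing, so it is dropped
})

cleaned = handle_missing(toy)
print(cleaned)
```

Whether median or mode filling is appropriate depends on the column, so the choice should be stated and justified in the data cleaning section of the report.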
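
import pandas as pd

data = pd.read_csv("Youth Tobacco Dataset (2 Decades).csv")  # adjust to the actual file name
numerical_columns = data.select_dtypes(include=["int64", "float64"]).columns

# Report the IQR outliers column by column
for column in numerical_columns:
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print("Outliers in column", column, ":", outliers)

# Per-column bounds as Series, so each column is compared with its own limits
# (after the loop, lower_bound and upper_bound only hold the last column's values)
Q1 = data[numerical_columns].quantile(0.25)
Q3 = data[numerical_columns].quantile(0.75)
IQR = Q3 - Q1
is_outlier = (data[numerical_columns] < Q1 - 1.5 * IQR) | (data[numerical_columns] > Q3 + 1.5 * IQR)
data_no_outliers = data[~is_outlier.any(axis=1)]

# Save the filtered frame, not the original one
data_no_outliers.to_csv("Cleaned_Youth_Tobacco_Dataset.csv", index=False)
print(data["Data_Value"].unique())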

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("Youth Tobacco Dataset (2 Decades).csv")  # adjust to the actual file name
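The EDA stage stops at the imports above. A minimal sketch of how it might continue is shown below, using matplotlib only and synthetic data so that it runs without the Kaggle file. The column names (`year`, `gender`, `sample_size`, `data_value`) and the output file name `eda_overview.png` are assumptions. The correlation heat map is drawn with `imshow`; `sns.heatmap` from seaborn would be a common alternative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots on screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned dataset (made-up values)
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "year": rng.integers(1999, 2018, size=n),
    "gender": rng.choice(["male", "female"], size=n),
    "sample_size": rng.integers(100, 5000, size=n),
})
toy["data_value"] = 35 - 0.9 * (toy["year"] - 1999) + rng.normal(scale=3, size=n)

# Basic statistics
print(toy.describe())

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# 1. Histogram: distribution of the target
axes[0, 0].hist(toy["data_value"], bins=20)
axes[0, 0].set_title("Histogram of data_value")

# 2. Bar chart: mean target per gender
means = toy.groupby("gender")["data_value"].mean()
axes[0, 1].bar(list(means.index), means.to_numpy())
axes[0, 1].set_title("Mean data_value by gender")

# 3. Scatter plot: target against year
axes[0, 2].scatter(toy["year"], toy["data_value"], s=10)
axes[0, 2].set_title("data_value vs year")

# 4. Heat map: correlation matrix of the numeric columns
corr = toy.select_dtypes(include="number").corr()
im = axes[1, 0].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 0].set_xticks(range(len(corr.columns)))
axes[1, 0].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[1, 0].set_yticks(range(len(corr.columns)))
axes[1, 0].set_yticklabels(corr.columns)
axes[1, 0].set_title("Correlation heat map")
fig.colorbar(im, ax=axes[1, 0])

# 5. Box plot: spread of the target per gender
names, groups = zip(*[(k, g["data_value"].to_numpy()) for k, g in toy.groupby("gender")])
axes[1, 1].boxplot(list(groups))
axes[1, 1].set_xticks(range(1, len(names) + 1))
axes[1, 1].set_xticklabels(names)
axes[1, 1].set_title("data_value by gender")

# 6. Line plot: yearly mean of the target
yearly = toy.groupby("year")["data_value"].mean()
axes[1, 2].plot(yearly.index, yearly.to_numpy(), marker="o")
axes[1, 2].set_title("Yearly mean data_value")

fig.tight_layout()
fig.savefig("eda_overview.png")
```

With the real file, `toy` would be replaced by the cleaned frame and the column names by the ones actually present in it.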