
Question



Data science.
The dataset is from Kaggle, titled Youth Tobacco Dataset (2 Decades).
The work should be done in Python using the Pandas library.
The project will be divided into four main stages:
1. Data Cleaning: Clean the dataset by handling missing data appropriately, removing duplicates and outliers, and ensuring a consistent data format.
Further cleaning can be applied if the dataset requires it.
2. Exploratory Data Analysis (EDA): After cleaning the data, display basic statistics about the dataset. Perform EDA to understand the dataset's distributions, correlations, and relationships between variables. Visualize the findings in at least five ways, including but not limited to scatter plots, bar charts, histograms, and heatmaps.
3. Feature Selection: Based on the EDA findings, select the relevant features for analysis. Any suitable feature selection method can be used, provided you explain why the selected features were chosen and justify why the others were excluded.
4. Predictive Modeling: Use linear or multiple regression to predict the output variable for new inputs. Divide the dataset into training and test sets, train the model on the training set, and validate the results on the test set. Report the accuracy of the model, explain the rationale for choosing the regression method, and interpret the results.
5. Include a detailed explanation of the whole problem statement, the data cleaning process, the EDA findings, the feature selection process, and the regression model. Include the visualizations used to communicate the findings.
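Stages 3 and 4 above can be sketched on synthetic data as follows. This is a minimal illustration, not the solution for the Kaggle file: every column name (`year`, `sample_size`, `noise`, `target`) and the 0.2 correlation threshold are assumptions made up for the example, and plain NumPy least squares stands in for whatever regression library the project ends up using.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; every column name here is hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "year": rng.integers(1999, 2018, size=n).astype(float),
    "sample_size": rng.normal(1000, 200, size=n),
    "noise": rng.normal(0, 1, size=n),
})
# The target depends on year and sample_size, not on noise.
df["target"] = (50 - 0.8 * (df["year"] - 1999)
                + 0.01 * df["sample_size"] + rng.normal(0, 1, size=n))

# Stage 3: rank candidate features by absolute correlation with the target
# and keep those above an (arbitrary, assumed) threshold.
corr = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
selected = corr[corr > 0.2].index.tolist()

# Stage 4: shuffle the rows, then hold out 20% of them as a test set.
order = rng.permutation(n)
test_idx, train_idx = order[: n // 5], order[n // 5:]
X = df[selected].to_numpy()
y = df["target"].to_numpy()


def add_intercept(a):
    """Prepend a column of ones so the fit includes an intercept term."""
    return np.column_stack([np.ones(len(a)), a])


# Ordinary least squares fit on the training rows only.
coef, *_ = np.linalg.lstsq(add_intercept(X[train_idx]), y[train_idx], rcond=None)

# Evaluate on the held-out rows with R^2 and RMSE.
pred = add_intercept(X[test_idx]) @ coef
ss_res = np.sum((y[test_idx] - pred) ** 2)
ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y[test_idx] - pred) ** 2))

print("selected features:", selected)
print("test R^2:", round(r2, 3), "test RMSE:", round(rmse, 3))
```

For a regression model, "accuracy" is usually reported as R^2 and an error measure such as RMSE on the test set, since there are no class labels to count as right or wrong.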
...
I started with the code below. Please verify that it is valid and complete the remaining stages.
..
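import pandas as pd

# Load the raw data and take a first look
data = pd.read_csv("Youth Tobacco Dataset (2 Decades).csv")
print(data.head())
print("Missing values per column:")
print(data.isnull().sum())

# dropna() with no arguments drops a row if ANY column is missing; pass subset=[...] to limit the check
data = data.dropna().drop_duplicates()

# IQR bounds on numeric columns only (DataFrame.quantile can raise on text columns in recent pandas)
numeric_cols = data.select_dtypes(include="number").columns
Q1 = data[numeric_cols].quantile(0.25)
Q3 = data[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only rows whose numeric values all lie inside the bounds
outside = (data[numeric_cols] < lower_bound) | (data[numeric_cols] > upper_bound)
data = data[~outside.any(axis=1)].copy()

# Consistent formats: YEAR as datetime, all text columns (including Gender) lowercased
data['YEAR'] = pd.to_datetime(data['YEAR'].astype(str), format='%Y')
text_cols = data.select_dtypes(include="object").columns
data[text_cols] = data[text_cols].apply(lambda s: s.astype(str).str.lower())

data.to_csv("Cleaned_Youth_Tobacco_Dataset.csv", index=False)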
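
One point worth checking in the cleaning step is how many rows survive the `dropna` call. The toy frame below is a hedged illustration with made-up column names (`Footnote` is a hypothetical mostly-empty column, not something read from the CSV): when any one column is sparse, dropping on all columns can discard nearly everything, whereas restricting the check to the columns the analysis needs keeps far more rows.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are hypothetical stand-ins, not read from the CSV.
toy = pd.DataFrame({
    "YEAR": [2000, 2001, 2002, 2003],
    "Data_Value": [12.5, np.nan, 9.8, 11.1],
    "Footnote": [np.nan, "suppressed", np.nan, np.nan],  # mostly empty
})

# A bare dropna() keeps only rows that are complete in every column.
all_columns = toy.dropna()

# Restricting the check to the needed columns keeps more rows.
needed_only = toy.dropna(subset=["Data_Value"])

print(len(toy), len(all_columns), len(needed_only))  # prints: 4 0 3
```

Comparing `len(data)` before and after the call on the real file shows whether this applies there.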
..
import pandas as pd

data = pd.read_csv("Youth Tobacco Dataset (2 Decades).csv")
numerical_columns = data.select_dtypes(include="number").columns

# Start with no row flagged, then flag outliers column by column
is_outlier = pd.Series(False, index=data.index)
for column in numerical_columns:
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    column_outliers = (data[column] < lower_bound) | (data[column] > upper_bound)
    print("Outliers in column", column, ":", data[column_outliers])
    is_outlier |= column_outliers

# Each column uses its own bounds; save the filtered frame, not the original
data_no_outliers = data[~is_outlier]
data_no_outliers.to_csv("Cleaned_Youth_Tobacco_Dataset.csv", index=False)
..
print(data['Data_Value'].unique())
..
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv("Youth Tobacco Dataset (2 Decades).csv")
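
From here, the EDA stage could continue along the lines of the sketch below. It runs on a synthetic frame so that it is self-contained; the column names (`year`, `gender`, `sample_size`, `value`) are assumptions, not the real schema. It uses plain matplotlib and pandas plotting for six views; seaborn, already imported above, offers shortcuts such as `sns.heatmap` for the correlation matrix.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in frame; the column names are hypothetical.
rng = np.random.default_rng(1)
n = 300
eda = pd.DataFrame({
    "year": rng.integers(1999, 2018, size=n),
    "gender": rng.choice(["male", "female", "overall"], size=n),
    "sample_size": rng.normal(1000, 200, size=n),
})
eda["value"] = 30 - 0.7 * (eda["year"] - 1999) + rng.normal(0, 3, size=n)

print(eda.describe())                          # basic statistics
corr = eda.select_dtypes("number").corr()      # pairwise correlations

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# 1. Histogram: distribution of the target value
axes[0, 0].hist(eda["value"], bins=20)
axes[0, 0].set_title("Histogram of value")

# 2. Scatter plot: value against year
axes[0, 1].scatter(eda["year"], eda["value"], s=10)
axes[0, 1].set_title("value vs year")

# 3. Bar chart: mean value per gender group
eda.groupby("gender")["value"].mean().plot.bar(
    ax=axes[0, 2], title="Mean value by gender")

# 4. Line chart: mean value per year
eda.groupby("year")["value"].mean().plot.line(
    ax=axes[1, 0], title="Mean value per year")

# 5. Box plot: spread of value within each gender group
eda.boxplot(column="value", by="gender", ax=axes[1, 1])

# 6. Heatmap of the correlation matrix
im = axes[1, 2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 2].set_xticks(range(len(corr)))
axes[1, 2].set_xticklabels(corr.columns, rotation=45)
axes[1, 2].set_yticks(range(len(corr)))
axes[1, 2].set_yticklabels(corr.columns)
axes[1, 2].set_title("Correlation heatmap")
fig.colorbar(im, ax=axes[1, 2])

fig.tight_layout()
fig.savefig("eda_overview.png")
```

On the real data, the same calls would be pointed at the cleaned frame and its actual column names.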
