Question

Posted on Sep 06, 2024

Data science project. The dataset is from Kaggle, titled "Youth Tobacco Dataset (2 Decades)". The work should be done in Python using the Pandas library.

The project will be divided into four main stages, followed by a written report:

1. Data Cleaning: Clean the dataset by handling missing data appropriately, removing duplicates and outliers, and ensuring a consistent data format. Depending on the dataset, apply further cleaning if required.

2. Exploratory Data Analysis (EDA): After cleaning the data, display the basic statistics of the dataset. Perform EDA to understand the dataset's distributions, correlations, and relationships between variables. Visualize the findings in at least five ways, including but not limited to scatter plots, bar charts, histograms, and heat maps, or any other format you prefer.

3. Feature Selection: Based on the EDA findings, select the relevant features for analysis. Any suitable feature selection method can be used, provided you explain why the chosen features were selected and justify why the other features were excluded.

4. Predictive Modeling: Use linear or multiple regression to predict the value of the output variable for new inputs. Divide the dataset into training and test sets, train the model on the training set, and validate the results on the test set. Report the accuracy of the model, explain the rationale behind the chosen regression method, and interpret the results.

5. Report: Include a detailed explanation of the whole problem statement, the data cleaning process, the EDA findings, the feature selection process, and the regression model. Include the visualizations used to communicate the findings.
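Stages 3 and 4 above can be sketched roughly as follows. This is only an illustration on synthetic data, not on the Kaggle file: the column names (`feature_a`, `feature_b`, `noise`, `target`), the 0.3 correlation threshold, and the 80/20 split are all assumptions. It fits the regression with NumPy least squares to stay self-contained; scikit-learn's `LinearRegression` with `train_test_split` would be a common alternative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a cleaned numeric dataset (hypothetical columns).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
df["target"] = 3.0 * df["feature_a"] - 2.0 * df["feature_b"] + rng.normal(scale=0.5, size=n)

# Stage 3: keep features whose absolute correlation with the target exceeds a threshold.
THRESHOLD = 0.3  # illustrative cut-off, not a recommended value
correlations = df.corr()["target"].drop("target")
selected = correlations[correlations.abs() > THRESHOLD].index.tolist()

# Stage 4: shuffle, then split 80/20 into training and test sets.
shuffled = df.sample(frac=1.0, random_state=0).reset_index(drop=True)
split = int(0.8 * len(shuffled))
train, test = shuffled.iloc[:split], shuffled.iloc[split:]


def design_matrix(frame: pd.DataFrame, columns: list) -> np.ndarray:
    """Return the feature matrix with a leading column of ones for the intercept."""
    return np.column_stack([np.ones(len(frame)), frame[columns].to_numpy()])


# Fit multiple linear regression by ordinary least squares on the training set.
coefficients, *_ = np.linalg.lstsq(
    design_matrix(train, selected), train["target"].to_numpy(), rcond=None
)

# Validate on the held-out test set with R^2 (coefficient of determination).
predictions = design_matrix(test, selected) @ coefficients
actual = test["target"].to_numpy()
residual_ss = np.sum((actual - predictions) ** 2)
total_ss = np.sum((actual - actual.mean()) ** 2)
r_squared = 1.0 - residual_ss / total_ss

print("Selected features:", selected)
print("Intercept and coefficients:", coefficients.round(2))
print("Test R^2:", round(r_squared, 3))
```

For a regression task, "accuracy" is usually reported as R^2 or an error measure such as RMSE on the test set; a correlation filter is only one of several selection methods that could be justified in the report.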

I started with the code below. Please verify that it is valid and complete the remaining stages.

import pandas as pd

# Load the raw dataset and take a first look.
data = pd.read_csv("Youth Tobacco Dataset (2 Decades).csv")
print(data.head())
print("Missing values per column:")
print(data.isnull().sum())

# Drop rows with any missing value, then exact duplicate rows.
# Caution: dropna() removes a row if any single column is empty, so check the counts above first.
data.dropna(inplace=True)
data.drop_duplicates(inplace=True)

# IQR bounds, restricted to the numeric columns because quantiles are undefined for text.
numeric_cols = data.select_dtypes(include=['int64', 'float64']).columns
Q1 = data[numeric_cols].quantile(0.25)
Q3 = data[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only the rows that fall inside the bounds for every numeric column.
for col in numeric_cols:
    data = data[(data[col] >= lower_bound[col]) & (data[col] <= upper_bound[col])]

# Consistent formats: YEAR as datetime (assumes four-digit years), text in lowercase.
data['YEAR'] = pd.to_datetime(data['YEAR'], format='%Y')
data['Gender'] = data['Gender'].str.lower()
data = data.apply(lambda x: x.astype(str).str.lower() if x.dtype == 'object' else x)

data.to_csv("Cleaned_Youth_Tobacco_Dataset.csv", index=False)
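The `dropna` call above discards a row as soon as any column is empty. If a few columns turn out to be mostly empty, one possible alternative is to drop those columns first and then fill or drop what remains. The sketch below uses a toy frame; the column names, the 50% threshold, and the median fill are illustrative assumptions, not properties of the Kaggle file.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names and the 50% threshold are illustrative assumptions.
df = pd.DataFrame({
    "value": [10.0, np.nan, 12.0, 11.0, 50.0],
    "group": ["a", "b", None, "a", "b"],
    "footnote": [None, None, None, "x", None],  # mostly empty column
})

# Share of missing values per column.
missing_share = df.isna().mean()
print(missing_share)

# Drop columns that are more than half empty.
sparse_columns = missing_share[missing_share > 0.5].index
cleaned = df.drop(columns=sparse_columns)

# Fill the numeric gap with the median, then drop rows still missing a category.
cleaned["value"] = cleaned["value"].fillna(cleaned["value"].median())
cleaned = cleaned.dropna(subset=["group"])

print(cleaned)
```

Whether to fill, drop rows, or drop columns depends on how much is missing and why, which is worth stating in the report.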

. .

import pandas as pd

data = pd.read_csv("Youth Tobacco Dataset (2 Decades).csv")
numerical_columns = data.select_dtypes(include='number').columns

# Report the outliers of each numeric column using that column's own IQR bounds.
for column in numerical_columns:
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print("Outliers in column", column)
    print(outliers)

# Per-column bounds for the final filter; the scalars left by the loop describe only the last column.
Q1 = data[numerical_columns].quantile(0.25)
Q3 = data[numerical_columns].quantile(0.75)
IQR = Q3 - Q1
is_outlier = (data[numerical_columns] < Q1 - 1.5 * IQR) | (data[numerical_columns] > Q3 + 1.5 * IQR)
data_without_outliers = data[~is_outlier.any(axis=1)]

# Save the filtered frame, not the unfiltered one.
data_without_outliers.to_csv("Cleaned_Youth_Tobacco_Dataset.csv", index=False)

. .

print(data['Data_Value'].unique())
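Printing the unique values is a quick way to spot entries that are not numbers. If `Data_Value` does turn out to contain text or blanks (an assumption here, not something confirmed for this dataset), one possible follow-up is to coerce the column to numeric. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Toy stand-in for data['Data_Value']; these values are made up for illustration.
raw = pd.Series(["12.5", "7", "", "n/a", "30.1"], name="Data_Value")

# errors="coerce" turns anything unparseable into NaN instead of raising an error.
values = pd.to_numeric(raw, errors="coerce")

print(values.isna().sum(), "entries could not be parsed")
print(values.describe())
```

The resulting NaN entries can then be handled together with the other missing values during cleaning.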

. .

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("Youth Tobacco Dataset (2 Decades).csv")
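The snippet above stops after loading the data. As a rough sketch of how the five required visualizations might be laid out, the example below builds them on synthetic data with plain matplotlib. Every column name (`year`, `gender`, `sample_size`, `value`) is hypothetical and would need to be mapped to the real columns; seaborn's `heatmap` could replace the `imshow` call for the correlation panel.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data; every column name here is hypothetical.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "year": rng.integers(2000, 2020, size=n),
    "gender": rng.choice(["male", "female", "overall"], size=n),
    "sample_size": rng.integers(500, 5000, size=n),
})
df["value"] = 30 - 0.8 * (df["year"] - 2000) + rng.normal(scale=3, size=n)

print(df.describe())  # basic statistics of the numeric columns

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# 1. Histogram: distribution of the measured value.
axes[0, 0].hist(df["value"], bins=20)
axes[0, 0].set_title("Histogram of value")

# 2. Scatter plot: value against year.
axes[0, 1].scatter(df["year"], df["value"], s=10)
axes[0, 1].set_title("Value vs. year")

# 3. Bar chart: mean value per gender category.
df.groupby("gender")["value"].mean().plot.bar(ax=axes[0, 2], title="Mean value by gender")

# 4. Line plot: mean value per year.
df.groupby("year")["value"].mean().plot(ax=axes[1, 0], title="Mean value per year")

# 5. Heat map: correlation matrix of the numeric columns.
corr = df.select_dtypes(include="number").corr()
image = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks(range(len(corr)))
axes[1, 1].set_xticklabels(corr.columns, rotation=45)
axes[1, 1].set_yticks(range(len(corr)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("Correlation heat map")
fig.colorbar(image, ax=axes[1, 1])

axes[1, 2].axis("off")  # unused sixth panel
fig.tight_layout()
fig.savefig("eda_overview.png")
```

Each panel should be paired with a sentence in the report on what it shows, since the EDA findings feed directly into the feature selection step.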
