
Question


Table 1. Data Description
Field      Description
ID         The ID of the patient is automatically assigned
Age        The recorded Age of the patient
sex        The Gender of the patient
bmi        The recorded Body Mass Index of the patient
children   The number of children
smoker     Identifier if a person is smoker or not
region     The geographical area where the individual resides
charges    The total medical charge
Using the given data do the following:
B-1.[3 marks]: Read and display the dataset provided. Determine the number of rows and columns present. Additionally, identify the columns containing missing data and list their names, if any.
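One possible way to approach B-1 is sketched below. The inline CSV text is a made-up stand-in for the provided file (its values and lower-case column names are assumptions); with the real file, the frame would be loaded by passing the file path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the provided .csv file.
csv_text = """ID,age,sex,bmi,children,smoker,region,charges
1,19,female,27.9,0,yes,southwest,16884.92
2,18,male,,1,no,southeast,1725.55
3,28,male,33.0,3,no,,4449.46
"""
df = pd.read_csv(io.StringIO(csv_text))

# Display the data, then report its shape.
print(df)
n_rows, n_cols = df.shape
print(f"rows: {n_rows}, columns: {n_cols}")

# Columns with at least one missing value.
missing_cols = df.columns[df.isna().any()].tolist()
print("columns with missing data:", missing_cols)
```

df.isna().sum() gives the per-column count of missing values if the counts are wanted as well as the names.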
B-2.[5 marks]: Type Consistency: Using a Python script, identify the columns in the given dataset that contain categorical data, and report the type of every column. Indicate any type inconsistencies in the dataset. Convert the "Id" column from numerical to object type to ease numeric operations such as normalization.
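A minimal sketch of the type checks in B-2, using a small made-up frame (the values are assumptions). Converting "Id" to object type keeps the identifier out of any later selection of numeric columns, so it is not scaled along with the real measurements.

```python
import pandas as pd

# Hypothetical sample; the real data comes from the provided file.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
})

# Type of every column.
print(df.dtypes)

# Non-numeric columns are treated as categorical here.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("categorical columns:", categorical_cols)

# Store the identifier as object so numeric operations skip it.
df["Id"] = df["Id"].astype(object)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print("numeric columns after conversion:", numeric_cols)
```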
B-3.[6 marks]: Filter noise: Using Python commands, filter out negative values in the "bmi" and "children" columns. Some values in the "age" column are decimals by mistake; correct them using an appropriate method. Also, find the unique categorical values and remove "unknown" values, if any (NaN is not considered unknown).
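The noise filtering in B-3 could look roughly like the sketch below. The sample values are made up, and truncating decimal ages to whole years is only one reasonable reading of "appropriate method" (rounding would be another).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with each kind of noise named in the task.
df = pd.DataFrame({
    "age": [19.0, 18.6, 28.0, 33.0, 45.0, 52.0],
    "bmi": [27.9, 33.8, -33.0, 22.7, 30.1, 25.0],
    "children": [0, 1, 3, -1, 2, 0],
    "region": ["southwest", "southeast", "southeast",
               "northwest", np.nan, "unknown"],
})

# Drop rows with negative values. Negating "< 0" (rather than testing
# ">= 0") keeps NaN rows, which are handled in a later step.
df = df[~(df["bmi"] < 0) & ~(df["children"] < 0)].copy()

# Assumption: ages are counted in completed years, so truncate decimals.
# The nullable Int64 type tolerates missing ages.
df["age"] = np.floor(df["age"]).astype("Int64")

# Inspect the categories, then drop literal "unknown" entries (NaN stays).
print(df["region"].unique())
df = df[~df["region"].isin(["unknown"])]
print(df)
```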
B-4.[6 marks]: Handling NaN values: Drop all columns containing 23% or more missing values. Then impute the columns having missing values using median if the column is numerical and mode if the column is categorical.
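As a sketch of B-4 under assumed sample data, the share of missing values per column can be compared against the 23% threshold, and the remaining gaps filled with the median or the mode. This version uses plain pandas; scikit-learn's SimpleImputer is an alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "bmi" is 30% missing, the other columns 10%.
df = pd.DataFrame({
    "age": [19, 18, np.nan, 33, 45, 52, 23, 61, 40, 37],
    "bmi": [27.9, np.nan, np.nan, 22.7, np.nan, 25.0, 31.2, 29.8, 24.3, 26.1],
    "smoker": ["yes", "no", "no", None, "no", "yes", "no", "no", "yes", "no"],
})

# Fraction of missing values per column; drop columns at or above 23%.
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac >= 0.23].index)

# Impute what is left: median for numeric columns, mode for the rest.
for col in df.columns[df.isna().any()]:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```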
B-5.[6 marks]: Normalization/Transformation: Transform the "age" and "charges" columns to have a mean of zero and a standard deviation of one. Moreover, transform the "bmi" column such that the minimum value is 0 and maximum value is 1. Print only the transformed columns.
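The two transformations in B-5 correspond to scikit-learn's StandardScaler and MinMaxScaler. The sketch below applies them to a made-up frame; the sample values are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical sample; the real data comes from the provided file.
df = pd.DataFrame({
    "age": [19.0, 18.0, 28.0, 33.0, 45.0],
    "bmi": [27.9, 33.8, 33.0, 22.7, 30.1],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86],
})

# Standardize: mean 0 and standard deviation 1. StandardScaler uses the
# population standard deviation (ddof=0), unlike pandas' default ddof=1.
df[["age", "charges"]] = StandardScaler().fit_transform(df[["age", "charges"]])

# Rescale bmi linearly so its minimum maps to 0 and its maximum to 1.
df[["bmi"]] = MinMaxScaler().fit_transform(df[["bmi"]])

# Print only the transformed columns.
print(df[["age", "charges", "bmi"]])
```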
B-6.[4 marks]: Discretize the "age" column into the following five bins using only Pandas. Save the result in another column as the age group.
Age            Bin
Below 20       Teen
20-29          Twenties
30-39          Thirties
40-49          Forties
50 and above   Fifties
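The binning can be sketched with pd.cut. The sample ages and the column name "age_group" are assumptions, and the label spellings are normalized here ("Forties", "Fifties"); adjust them if exact labels are expected.

```python
import pandas as pd

# Hypothetical ages chosen to hit every bin and both sides of each edge.
df = pd.DataFrame({"age": [18, 19, 20, 29, 30, 45, 50, 64]})

# With right=False each bin includes its left edge and excludes its right:
# [-inf, 20), [20, 30), [30, 40), [40, 50), [50, inf).
bins = [-float("inf"), 20, 30, 40, 50, float("inf")]
labels = ["Teen", "Twenties", "Thirties", "Forties", "Fifties"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)

print(df)
```

Open-ended outer edges avoid having to know the minimum and maximum age in advance.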
B-7.[3 marks]: Encoding: Convert "region" using a one-hot encoder. The new column names should start with "region" (e.g., region_northeast). Remove the original column.
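One way to get the requested column names is pandas' get_dummies, sketched below on made-up rows. It is used here in place of scikit-learn's OneHotEncoder because it names the new columns with the prefix and drops the original column in one call.

```python
import pandas as pd

# Hypothetical sample; the real data comes from the provided file.
df = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "charges": [16884.92, 1725.55, 21984.47, 3866.86],
})

# One indicator column per region, named "region_<value>"; the original
# "region" column is removed because it is listed in `columns`.
df = pd.get_dummies(df, columns=["region"], prefix="region", dtype=int)

print(df)
```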
B-8.[4 marks]: Encoding: For the column "sex", convert male to 1 and female to 0. The column name should remain unchanged.
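A short sketch of the binary encoding in B-8, on made-up values. It assumes the column holds exactly the strings "male" and "female"; any other value would become NaN under Series.map.

```python
import pandas as pd

# Hypothetical sample; the real data comes from the provided file.
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Overwrite the column in place so its name stays "sex".
df["sex"] = df["sex"].map({"male": 1, "female": 0})

print(df)
```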
B-9.[3 marks]: Provide a reasonable data aggregate table for the table given below. (all figures are in SAR millions unless stated otherwise)
Month       Year 2020   Year 2021   Year 2022   Year 2023
January            50          80          90          70
February           90         140         210         190
March              70          40         100         100
April              30          70          60          50
May                40         120          40          30
June               60          50          80          90
July               35          40          70          60
August             75          80          90          80
September          45          70         120         110
October            55          50          80         100
November           40         150          75          95
December          100         110          95         105
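What counts as a "reasonable" aggregate is open to interpretation. The sketch below shows two common choices, quarterly totals and per-year summary statistics, computed from the figures in the table; the choice of quarters and of these statistics is an assumption, not something the task prescribes.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Monthly revenue in SAR millions, one column per year, from the table.
revenue = pd.DataFrame({
    2020: [50, 90, 70, 30, 40, 60, 35, 75, 45, 55, 40, 100],
    2021: [80, 140, 40, 70, 120, 50, 40, 80, 70, 50, 150, 110],
    2022: [90, 210, 100, 60, 40, 80, 70, 90, 120, 80, 75, 95],
    2023: [70, 190, 100, 50, 30, 90, 60, 80, 110, 100, 95, 105],
}, index=months)

# Roll the twelve months up into four quarters.
quarters = [f"Q{i // 3 + 1}" for i in range(12)]
quarterly = revenue.groupby(quarters).sum()
print(quarterly)

# Per-year summary: total, average, lowest and highest month.
yearly = revenue.agg(["sum", "mean", "min", "max"]).T
print(yearly)
```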
B-10.[10 marks]: General questions (write your answers in a Jupyter notebook):
(i) When is discrete data useful? [2 marks]
(ii) List three data collection methods. [2 marks]
(iii) What is a min-max scaler? [2 marks]
(iv) Provide two sources each of structured data and unstructured data. [2 marks]
(v) Why is data preprocessing needed? [2 marks]
Problem B [50 Marks]: Consider the data given in the "HW2_Data_B" Microsoft Excel (.csv) file and described in Table 1. Note: Solve all the questions above using Python. Use the Pandas and Sklearn libraries for all the analyses.
