Question

Data cleaning is a very important step in Data Science to get meaningful analytic results or beneficial prediction outcomes. This assignment aims to help students develop the knowledge and analytic skills needed to properly clean unprocessed data for data analysis and model training.

The given dataset is a collection of airline data containing 8 attributes and 6150 records. A data dictionary describing each attribute is given in the table below. You are expected to complete the steps in the following section to clean the dataset and to answer the questions in Parts A and B.

Attribute: Description
Airline ID: Unique OpenFlights identifier for this airline.
Name: Name of the airline.
Alias: Alias of the airline.
IATA: 2-letter IATA code, if available.
ICAO: 3-letter ICAO code, if available.
Callsign: Airline callsign.
Country: The airline's incorporated country or territory.
Active: "Y" if the airline is or has until recently been operational, "N" if it is defunct.

The dataset (airlines_2022.csv) is dirty. Assume that you have received instructions from a senior data scientist to develop an appropriate solution to clean the dataset. The following steps are proposed to clean the dataset after loading it into the selected tool (e.g. Jupyter Notebook), in order to meet the requirements of the senior data scientist:

Step 1. Check the original dataset for any duplicate tuples based on the value of Airline ID.
  o Remove any duplicate tuples in the dataset.

Step 2. The Airline ID should range from 0 to any positive integer.
  o If the original Airline ID is negative, set it to zero.
  o Add a new column to store the cleaned data. Do not overwrite the original Airline ID.

Step 3. A valid airline Name should start with an English letter or a digit only.
  o If the original airline Name starts with a character other than an English letter or digit, replace it with "unknown".
  o Add a new column to store the cleaned data. Do not overwrite the original airline Name.

Step 4. A valid Alias should start with an English letter or a digit only. Only six symbols are acceptable between letters and digits: "-", "&", ".", [space], "(", and ")".
  o If the Alias value is "\N", " " or missing, replace it with "unknown".
  o If other symbols appear in the Alias value, replace it with "unknown".
  o Add a new column to store the cleaned data. Do not overwrite the original Alias value.

Step 5. A valid IATA is composed of two English letters, two digits, or a combination of one English letter and one digit.
  o If the IATA value is not valid or missing, replace it with "unknown".
  o Add a new column to store the cleaned data. Do not overwrite the original IATA value.

Step 6. A valid ICAO should contain three characters only. The characters can be English letters or digits.
  o If the ICAO value is "\N", " " or missing, replace it with "unknown".
  o Add a new column to store the cleaned data. Do not overwrite the original ICAO value.

Step 7. A valid Callsign should contain three characters only. The characters can be English letters or digits.
  o If the Callsign value is "\N", " " or missing, replace it with "unknown".
  o Add a new column to store the cleaned data. Do not overwrite the original Callsign value.

Questions

Part A

Answer the following questions based on the steps previously described:

1. How many unique tuples are there in the dataset after Step 1?
2. How many unique values are there in the Airline ID attribute after Step 2?
3. How many unique values are there in the Name attribute after Step 3?
4. How many unique values are there in the Alias attribute after Step 4?
5. How many unique values are there in the IATA attribute after Step 5?
6. How many unique values are there in the ICAO attribute after Step 6?
7. How many unique values are there in the Callsign attribute after Step 7?
8. How many unique values are there in the Country attribute after cleaning?
9. How many unique values are there in the Active attribute after cleaning?
10. How many "unknown" values are included in the Name attribute after cleaning?
11. How many "unknown" values are included in the Alias attribute after cleaning?
12. How many "unknown" values are included in the IATA attribute after cleaning?
13. How many "unknown" values are included in the ICAO attribute after cleaning?
14. How many "unknown" values are included in the Callsign attribute after cleaning?
15. How many "unknown" values are included in the Country attribute after cleaning?
16. How many "unknown" values are included in the Active attribute after cleaning?
17. How many tuples are pending removal based on the Name attribute after cleaning?
18. How many tuples are pending removal based on the Alias attribute after cleaning?
19. How many tuples are pending removal based on the IATA attribute after cleaning?
20. How many tuples are pending removal based on the ICAO attribute after cleaning?
21. How many tuples are pending removal based on the Callsign attribute after cleaning?
22. How many tuples are pending removal based on the Country attribute after cleaning?
23. How many unique tuples are included in the cleaned dataset?
24. What percentage of the tuples is removed from the dataset after cleaning?
25. Which attribute causes the most tuples to be removed from the dataset?
26. After the data cleaning process, which attribute(s) should you go back to your client about, either to ask for more detail or to discuss possible solutions?

Part B

According to the cleaned data, which country owns the most active operation routes? How many active operation routes are currently owned by that country?
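The validity rules in Steps 3 to 6 can be read as regular expressions. The sketch below is one possible interpretation, not the assignment's reference solution. The pattern names and the helper `clean_value` are hypothetical, and the Alias pattern assumes the six permitted symbols are the hyphen, ampersand, period, space, and the two parentheses. Steps 6 and 7 only ask for "\N", " " or missing values to be replaced, so the pattern argument is optional.

```python
import re

# Hypothetical patterns: one reading of the rules in Steps 3 to 6.
NAME_OK = re.compile(r"^[A-Za-z0-9]")                        # starts with a letter or digit
ALIAS_OK = re.compile(r"^[A-Za-z0-9][A-Za-z0-9\-&. ()]*$")    # letters, digits, six symbols
IATA_OK = re.compile(r"^[A-Za-z0-9]{2}$")                    # exactly two letters/digits
ICAO_OK = re.compile(r"^[A-Za-z0-9]{3}$")                    # exactly three letters/digits

# Placeholders treated as missing; "\\N" is the literal two-character text \N.
MISSING = {"", " ", "\\N"}


def clean_value(value, pattern=None):
    """Return value unchanged if acceptable, otherwise the string "unknown"."""
    # Non-strings (None, NaN) and placeholder strings count as missing.
    if not isinstance(value, str) or value in MISSING:
        return "unknown"
    # With no pattern, only missing values are replaced.
    if pattern is not None and not pattern.search(value):
        return "unknown"
    return value
```

For instance, `clean_value("A1", IATA_OK)` keeps the code, while `clean_value("\\N", IATA_OK)` and `clean_value("A-", IATA_OK)` both yield "unknown".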

Step by Step Solution


Step: 1

To answer the questions, we need to clean the dataset by following the steps described.

Step 1

import pandas as pd
df = pd.read_csv("airlines_2022.csv")
# Remove duplicate tuples based on the value of Airl...
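The visible part of the solution stops partway through Step 1. As a rough sketch of how Steps 1 to 3 might look in pandas, the example below runs on a tiny invented DataFrame rather than airlines_2022.csv. The column names are assumed to match the data dictionary, and the new column names `Airline ID_clean` and `Name_clean` are hypothetical.

```python
import pandas as pd

# Toy rows invented for illustration; they are not taken from airlines_2022.csv.
df = pd.DataFrame({
    "Airline ID": [1, 1, -1, 2],
    "Name": ["Alpha Air", "Alpha Air", "Unknown", "#Beta"],
})

# Step 1: drop duplicate tuples based on Airline ID, keeping the first occurrence.
df = df.drop_duplicates(subset="Airline ID", keep="first").reset_index(drop=True)

# Step 2: negative IDs become zero, stored in a new column (original kept).
df["Airline ID_clean"] = df["Airline ID"].clip(lower=0)

# Step 3: names not starting with an English letter or digit become "unknown".
starts_ok = df["Name"].str.match(r"[A-Za-z0-9]", na=False)
df["Name_clean"] = df["Name"].where(starts_ok, "unknown")

# Counts of the kind asked for in Part A.
n_rows = len(df)                                         # tuples after Step 1
n_ids = df["Airline ID_clean"].nunique()                 # unique IDs after Step 2
n_unknown_names = int((df["Name_clean"] == "unknown").sum())
```

The same pattern (validate, then write to a new `*_clean` column) would extend to Alias, IATA, ICAO and Callsign; the actual answers depend on the real file and on how the ambiguous rules are interpreted.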
