Question

1 Approved Answer

Posted on May 16, 2024

Taxi Trip Records in New York City: Data Analysis Assignment

Introduction

In this data analysis assignment, we will explore, clean, analyze, and visualize the Taxi Trip Records in New York City dataset to derive meaningful insights and provide actionable recommendations. The dataset contains information about taxi trips in New York City, including the pickup and dropoff locations, trip duration, and trip fare.

Objective

The objective of this assignment is to gain insights into taxi trip patterns in New York City. Specifically, we will investigate the following questions:

- What are the most popular pickup and dropoff locations?
- What are the average trip duration and fare?
- How do trip duration and fare vary by time of day and day of the week?
- What factors contribute to variations in trip duration and fare?

Tools and Libraries

We will use the following tools and libraries to perform the data analysis:

- Jupyter Notebook: a web-based interactive programming environment for Python
- Pandas: a Python library for data manipulation and analysis
- Matplotlib: a Python library for data visualization

Data Exploration

Load the dataset: Use the pandas library to load the taxi trip records CSV file into a DataFrame.

Python

    import pandas as pd

    # Load the dataset
    data = pd.read_csv('taxi_trip_records.csv')

Inspect the data: Use the head() method to view the first few rows of the DataFrame and the info() method to check the data types and non-null counts.

Python

    # Inspect the data
    print(data.head())
    data.info()  # info() prints its own summary and returns None

Handle missing values: Identify and handle missing values in the dataset. If necessary, impute or remove missing values depending on their impact on the analysis.

Python

    # Check for missing values
    print(data.isnull().sum())

    # Handle missing values
    if 'pickup_location' in data.columns:
        # Impute or remove missing values in pickup_location column
        pass
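The missing-value step above leaves the actual handling open. A minimal sketch of one possible approach follows, using a small made-up sample rather than the real file. The choice to impute the categorical column with its mode and to drop rows that lack a duration is an assumption made for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the real trip records.
sample = pd.DataFrame({
    "pickup_location": ["Midtown", None, "Midtown", "SoHo", None],
    "trip_duration": [600.0, 900.0, np.nan, 300.0, 1200.0],
})

# Count missing values per column before changing anything.
missing_before = sample.isnull().sum()

# Categorical column: fill gaps with the most frequent value (the mode).
most_common = sample["pickup_location"].mode()[0]
sample["pickup_location"] = sample["pickup_location"].fillna(most_common)

# Numeric column under analysis: drop rows where the value is missing.
cleaned = sample.dropna(subset=["trip_duration"]).reset_index(drop=True)

print(missing_before)
print(cleaned)
```

Imputing keeps the row count intact, which suits a descriptive column; dropping avoids inventing values for the quantity being analyzed. Which trade-off is right depends on how many rows are affected.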
Data cleaning: Clean and prepare the data for further analysis. This may involve converting data types, standardizing values, and removing outliers.

Python

    # Convert data types
    if 'trip_duration' in data.columns:
        # Convert trip_duration to a numeric data type
        # (assumes the values are strings wrapped in stray double quotes)
        data['trip_duration'] = data['trip_duration'].str.replace('"', '').astype(int)

Data Analysis

Popular pickup and dropoff locations: Identify the most popular pickup and dropoff locations based on the frequency of occurrences.

Python

    # Calculate the frequency of pickup locations
    pickup_frequency = data['pickup_location'].value_counts()

    # Identify the most popular pickup locations
    top_pickup_locations = pickup_frequency.head(10)

Average trip duration and fare: Calculate the average trip duration, average trip fare, and standard deviation for both.

Python

    # Calculate average trip duration
    avg_trip_duration = data['trip_duration'].mean()

    # Calculate average trip fare
    avg_trip_fare = data['trip_fare'].mean()

    # Calculate standard deviation for trip duration and fare
    std_trip_duration = data['trip_duration'].std()
    std_trip_fare = data['trip_fare'].std()

Trip duration and fare by time of day and day of the week: Analyze how trip duration and fare vary by time of day and day of the week. This can be done by grouping the data by hour and by weekday and calculating the average trip duration and fare for each group.

Python

    # Group data by time of day (hour of pickup, not the full timestamp)
    pickup_time = pd.to_datetime(data['pickup_datetime'])
    time_groups = data.groupby(pickup_time.dt.hour)[['trip_duration', 'trip_fare']].mean()

    # Group data by day of the week
    day_groups = data.groupby(pickup_time.dt.day_name())[['trip_duration', 'trip_fare']].mean()

Factors contributing to variations in trip duration and fare: Identify and analyze the factors that contribute to variations in trip duration and fare. This may involve using correlation analysis or other statistical techniques.
Python

    # Calculate correlation between trip duration and other factors
    correlation_matrix = data[['trip_duration', 'pickup_distance', 'trip_fare']].corr()

Data Visualization

Create visualizations to illustrate the findings from the data analysis. This may include bar charts, maps, or other types of visualizations.

Python

    # Create a bar chart showing the top 10 pickup locations
    pickup_frequency.head(10).plot(kind='bar')

    # Create a map showing the distribution of pickup and dropoff

Here is the code to load the NYC taxi data set for the months of January, March, and June into a pandas DataFrame:

Python

    import pandas as pd

    # January data
    df_jan = pd.read_parquet('yellow_tripdata_2023-01.parquet')

    # March data
    df_mar = pd.read_parquet('yellow_tripdata_2023-03.parquet')

    # June data
    df_jun = pd.read_parquet('yellow_tripdata_2023-06.parquet')

    # Combine the data frames (ignore_index avoids repeated row labels)
    df = pd.concat([df_jan, df_mar, df_jun], ignore_index=True)

This code loads the data from the three Parquet files into separate pandas DataFrames, then combines them into a single DataFrame named df.

Compare the 3 months of data and identify and discuss 3 different trends in it.

Python

    # Compare the number of trips in each month
    month = df['dropoff_datetime'].dt.month
    print(month.value_counts().sort_index())

    # Compare the average trip duration in each month
    print(df.groupby(month)['trip_duration'].mean())

    # Compare the average trip fare in each month
    print(df.groupby(month)['trip_fare'].mean())
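The month comparison above relies on columns named dropoff_datetime, trip_duration, and trip_fare. The sketch below shows how such a comparison might look if the files instead carry TLC-style names (tpep_pickup_datetime, tpep_dropoff_datetime, fare_amount) and no ready-made duration column. Those names are an assumption here and should be checked against df.columns; the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows using TLC-style column names (an assumption; check df.columns).
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-05 08:00", "2023-01-20 18:30",
        "2023-03-02 09:15", "2023-06-11 23:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-01-05 08:10", "2023-01-20 19:00",
        "2023-03-02 09:35", "2023-06-11 23:15",
    ]),
    "fare_amount": [9.0, 21.0, 15.0, 12.0],
})

# Derive trip duration in minutes from the two timestamps.
trips["trip_duration_min"] = (
    trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# One row per pickup month: trip count, mean duration, mean fare.
monthly = trips.groupby(trips["tpep_pickup_datetime"].dt.month).agg(
    trips=("fare_amount", "size"),
    avg_duration_min=("trip_duration_min", "mean"),
    avg_fare=("fare_amount", "mean"),
)
monthly.index.name = "month"
print(monthly)
```

Putting the three measures side by side in one table makes month-to-month trends easier to read than three separate printouts.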
Task 2: Data Exploration and Pre-processing

Check for missing values in the dataset. Handle them appropriately and explain why you used a certain strategy.

Python

    # Check for missing values in the dataset
    print(df.isnull().sum())

    # Handle missing values in the 'pickup_location' column by imputing them with the most common pickup location
    df['pickup_location'] = df['pickup_location'].fillna(df['pickup_location'].mode()[0])

    # Handle missing values in the 'trip_duration' column by removing them
    df = df.dropna(subset=['trip_duration'])

Identify two columns that have noisy (erroneous) values. Explain why you think they are noisy. Identify how many such values exist in the dataset.

Python

    # Identify noisy values in the 'passenger_count' column
    print(df['passenger_count'].describe())

    # Identify noisy values in the 'trip_distance' column
    print(df['trip_distance'].describe())

Identify 2 columns that are highly correlated and explain their correlation.

Python

    # Calculate the correlation matrix (numeric columns only)
    correlation_matrix = df.corr(numeric_only=True)

    # Identify two highly correlated columns
    print(correlation_matrix[['trip_duration', 'trip_distance']])

Task 3: Featurization

Create a feature which is a flag indicating if the trip is in rush-hour or not.

Python

    # Create a rush_hour flag (1 if the pickup hour is 7, 8, 17, or 18, else 0)
    rush_hour_flag = df['pickup_datetime'].dt.hour.isin([7, 8, 17, 18]).astype(int)
    df['rush_hour_flag'] = rush_hour_flag

Create a feature that encodes the complexity of the trip by comparing the actual distance of the trip to the straight-line distance of the trip.
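One way to sketch the straight-line part of this comparison without extra dependencies is the haversine formula, applied to whole columns at once. The coordinate column names and sample values below are assumptions made for illustration, as is the choice of miles as the unit.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))


# Made-up trips; the coordinate column names are assumptions.
trips = pd.DataFrame({
    "pickup_latitude": [40.7580, 40.7128],
    "pickup_longitude": [-73.9855, -74.0060],
    "dropoff_latitude": [40.6413, 40.7128],
    "dropoff_longitude": [-73.7781, -74.0060],
    "trip_distance": [17.5, 0.4],
})

trips["straight_line_distance"] = haversine_miles(
    trips["pickup_latitude"], trips["pickup_longitude"],
    trips["dropoff_latitude"], trips["dropoff_longitude"],
)

# A zero straight-line distance (same start and end point) would divide by
# zero, so treat it as missing instead.
straight = trips["straight_line_distance"].replace(0, np.nan)
trips["trip_complexity"] = trips["trip_distance"] / straight

print(trips[["straight_line_distance", "trip_complexity"]])
```

A ratio near 1 suggests a nearly direct route, while larger values suggest detours. Round trips that end where they started have no meaningful ratio and are left as missing.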
Python

    # Uses geopy's great_circle as the straight-line distance
    # (an assumption; any great-circle distance function would do)
    from geopy.distance import great_circle

    # Calculate the straight-line distance in miles between pickup and dropoff locations
    # (miles assumes trip_distance is also recorded in miles)
    straight_line_distance = df.apply(
        lambda row: great_circle(
            (row['pickup_latitude'], row['pickup_longitude']),
            (row['dropoff_latitude'], row['dropoff_longitude']),
        ).miles,
        axis=1,
    )
    df['straight_line_distance'] = straight_line_distance

    # Create a feature that encodes the complexity of the trip
    trip_complexity = df['trip_distance'] / df['straight_line_distance']
    df['trip_complexity'] = trip_complexity

Calculate the pickup and drop-off frequency in each taxi zone.

Python

    # Calculate the pickup frequency in each taxi zone
    pickup_frequency = df['pickup_location'].value_counts()

    # Calculate the dropoff frequency in each taxi zone
    dropoff_frequency = df['dropoff_location'].value_counts()

Task 4: Data Analysis

Rank the vendors by popularity.

Python

    # Rank the vendors by the number of trips (rank 1 = most trips)
    vendor_rank = df['vendor_id'].value_counts().rank(ascending=False, method='min')

    # Map each trip's vendor to that vendor's rank
    df['vendor_rank'] = df['vendor_id'].map(vendor_rank)

What are the peak travel hours?

Python

    # Calculate the number of trips for each hour of the day
    hour_of_day = df['pickup_datetime'].dt.hour
    trip_count_by_hour = hour_of_day.value_counts()

    # Identify the peak travel hours
    peak_travel_hours = trip
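As a minimal sketch of the peak-hour step, the example below counts pickups per hour on made-up timestamps and keeps the busiest hours. The cutoff top_n is a hypothetical parameter chosen for illustration; the source does not say how many hours count as "peak".

```python
import pandas as pd

# Made-up pickup timestamps standing in for df['pickup_datetime'].
pickups = pd.Series(pd.to_datetime([
    "2023-01-05 08:05", "2023-01-05 08:40", "2023-01-05 17:10",
    "2023-01-06 08:15", "2023-01-06 17:45", "2023-01-06 23:30",
]))

# Count trips per hour of day; value_counts sorts from busiest to quietest.
trip_count_by_hour = pickups.dt.hour.value_counts()

# Keep the top_n busiest hours; top_n is an arbitrary illustrative cutoff.
top_n = 2
peak_hours = trip_count_by_hour.nlargest(top_n)

print(peak_hours)
```

An alternative to a fixed top_n would be a threshold relative to the mean hourly count, which adapts better when the data covers very different volumes.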

Attachments:

data-