Question

1 Approved Answer

Posted on Sep 24, 2024


Problem to solve

You are managing a taxi fleet in NYC and you would like to identify the best waiting areas for your vehicles. To solve this problem, you have at your disposal a large dataset of taxi trip records; more specifically, you will use the 2009 records available here. Each record of this dataset includes the GPS coordinates of the starting point and end point of the corresponding trip (note that for the most recent years, locations are specified by zone number, which is not useful for us). Since we want to identify the best waiting areas, we are interested in the starting points.

The dataset is contained in a simple comma-separated values (csv) file, each line corresponding to a trip record. The columns correspond to the attributes of each trip and are named as follows (if needed, more details can be found on the website):

vendor_name, Trip_Pickup_DateTime, Trip_Dropoff_DateTime, Passenger_Count, Trip_Distance, Start_Lon, Start_Lat, Rate_Code, store_and_forward, End_Lon, End_Lat, Payment_Type, Fare_Amt, surcharge, mta_tax, Tip_Amt, Tolls_Amt, Total_Amt

We therefore ask you to use the DB-SCAN algorithm to cluster the starting point locations of the trip records in the NYC 2009 taxi dataset. The centers of the largest clusters will become the waiting areas for your taxi fleet. Note that in order to apply the DB-SCAN algorithm accurately, the distance between two GPS locations should be measured in meters. However, as a simplification, and because the GPS locations are all located in a relatively small area, you can simply use the Euclidean distance between the GPS coordinates.

1. Object-Oriented Part (Java) [8 points] [8% of your final grade]

We ask you to implement the DB-SCAN algorithm in order to cluster the various trip records using the GPS coordinates of the starting points. Your program must be a Java application, named TaxiClusters, that is run by specifying the dataset filename and the values of the parameters minPts and eps.
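The clustering step described above can be sketched as follows. This is only a minimal illustration of the standard DBSCAN procedure with a brute-force neighbour search and the Euclidean distance on raw (longitude, latitude) pairs, which is the simplification the assignment allows. The class name DbscanSketch, the method names, the use of plain double arrays, and the sample coordinates are all assumptions made for the example, not part of the assignment; a real submission would work on TripRecord objects instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper class; the assignment does not prescribe this name.
class DbscanSketch {
    static final int UNVISITED = 0; // no label assigned yet
    static final int NOISE = -1;    // outlier label

    // Euclidean distance between two {lon, lat} pairs (not meters).
    static double distance(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    // Indices of all points within eps of points[i], including i itself.
    static List<Integer> rangeQuery(double[][] points, int i, double eps) {
        List<Integer> neighbours = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            if (distance(points[i], points[j]) <= eps) {
                neighbours.add(j);
            }
        }
        return neighbours;
    }

    // Returns one label per point: NOISE (-1) or a cluster id starting at 1.
    static int[] dbscan(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length]; // all UNVISITED initially
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) continue;
            List<Integer> neighbours = rangeQuery(points, i, eps);
            if (neighbours.size() < minPts) {
                labels[i] = NOISE; // may later become a border point
                continue;
            }
            clusterId++;
            labels[i] = clusterId;
            Deque<Integer> seeds = new ArrayDeque<>(neighbours);
            while (!seeds.isEmpty()) {
                int q = seeds.poll();
                if (labels[q] == NOISE) labels[q] = clusterId; // border point
                if (labels[q] != UNVISITED) continue;          // already handled
                labels[q] = clusterId;
                List<Integer> qNeighbours = rangeQuery(points, q, eps);
                if (qNeighbours.size() >= minPts) {
                    seeds.addAll(qNeighbours); // q is a core point: expand
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up {lon, lat} pairs for illustration only.
        double[][] pts = {
            {-73.9900, 40.7500}, {-73.9905, 40.7505}, {-73.9895, 40.7495},
            {-73.9500, 40.7800}, {-73.9505, 40.7805}, {-73.9495, 40.7795},
            {-73.9000, 40.7000}
        };
        System.out.println(Arrays.toString(dbscan(pts, 0.002, 3)));
    }
}
```

The brute-force range query makes this quadratic in the number of points, which is probably acceptable for the reduced one-hour dataset but not for the full 2009 records. Because eps is compared against coordinate differences, it is expressed in degrees here rather than in meters.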
This program should produce as output the list of clusters in a csv file specifying, for each cluster, its position (the average value of the GPS coordinates of its point set) and the number of points it contains. Outlier points are discarded. Since this dataset is very large, we give you a reduced version of it containing all the trip records for January 15, 2009 between 12pm and 1pm.

Since this solution must follow the object-oriented paradigm, your program must be composed of a set of classes. Specifically, it must include, among others, the following classes:

class GPScoord

class TripRecord, having the following attributes:
    pickup_DateTime (String)
    pickup_Location (GPScoord)
    dropoff_Location (GPScoord)
    trip_Distance (float)

class Cluster
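One possible shape for the required classes is sketched below. Only the class names and the four TripRecord attributes come from the assignment; the GPScoord fields, the constructors, the fromCsvLine, center and toCsvRow methods, the ClusterDemo class, and the order of the output columns are assumptions made for illustration. The parser also assumes that fields contain no quoted commas and that the columns appear in the order listed in the problem statement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class GPScoord {
    final double longitude; // assumed field names
    final double latitude;

    GPScoord(double longitude, double latitude) {
        this.longitude = longitude;
        this.latitude = latitude;
    }

    // Euclidean distance on raw coordinates (the allowed simplification).
    double distanceTo(GPScoord other) {
        return Math.hypot(longitude - other.longitude, latitude - other.latitude);
    }
}

class TripRecord {
    final String pickup_DateTime;
    final GPScoord pickup_Location;
    final GPScoord dropoff_Location;
    final float trip_Distance;

    TripRecord(String pickupDateTime, GPScoord pickup, GPScoord dropoff, float distance) {
        this.pickup_DateTime = pickupDateTime;
        this.pickup_Location = pickup;
        this.dropoff_Location = dropoff;
        this.trip_Distance = distance;
    }

    // Column indices follow the listed order: 1 = Trip_Pickup_DateTime,
    // 4 = Trip_Distance, 5/6 = Start_Lon/Start_Lat, 9/10 = End_Lon/End_Lat.
    static TripRecord fromCsvLine(String line) {
        String[] f = line.split(",", -1); // -1 keeps empty trailing fields
        GPScoord start = new GPScoord(Double.parseDouble(f[5]), Double.parseDouble(f[6]));
        GPScoord end = new GPScoord(Double.parseDouble(f[9]), Double.parseDouble(f[10]));
        return new TripRecord(f[1], start, end, Float.parseFloat(f[4]));
    }
}

class Cluster {
    private final List<GPScoord> points = new ArrayList<>();

    void add(GPScoord p) { points.add(p); }

    int size() { return points.size(); }

    // Average of the member coordinates; NaN coordinates if the cluster is empty.
    GPScoord center() {
        double sumLon = 0.0;
        double sumLat = 0.0;
        for (GPScoord p : points) {
            sumLon += p.longitude;
            sumLat += p.latitude;
        }
        return new GPScoord(sumLon / points.size(), sumLat / points.size());
    }

    // One output row: longitude,latitude,count (column order is an assumption).
    String toCsvRow() {
        GPScoord c = center();
        return String.format(Locale.ROOT, "%.6f,%.6f,%d", c.longitude, c.latitude, size());
    }
}

class ClusterDemo {
    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        cluster.add(new GPScoord(-74.0, 40.7)); // made-up coordinates
        cluster.add(new GPScoord(-73.9, 40.8));
        System.out.println(cluster.toCsvRow());
    }
}
```

Locale.ROOT is passed to String.format so that the decimal separator stays a period regardless of the machine's locale, which keeps the output a valid csv row.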