Question

1 Approved Answer

Posted on May 19, 2024

Yourcabs.com was an aggregation portal for car rentals launched in the early 2010s in the South Indian city of Bengaluru (formerly Bangalore), a major tech center in India. The tech industry was transforming many things, transportation among them, and Yourcabs offered customers several ways of ordering a taxi or car: online, by phone (landline), or on a mobile device. (Uber did not launch its service in Bengaluru until mid-2014.) Cars could be ordered at the time of travel or in advance.

One of the issues Yourcabs faced was that drivers using its platform would sometimes cancel their rides; if the cancellation did not occur sufficiently in advance, the customer's trip would be delayed, harming the company's reputation and business. Yourcabs had collected data on its rides between 2011 and 2013 and posted a challenge, in coordination with the Indian School of Business, to see what could be learned about the factors affecting ride cancellations.

The file Taxicancellationsoriginaldata.csv has the raw data on approximately 43,000 rides booked through Yourcabs, along with various features of each ride. Approximately 10,000 of these rides were between Bengaluru and another city; I decided to remove these and focus only on rides within the city. I deleted variables such as customer id, vehicle model id, etc., that I felt very clearly had no bearing on driver cancellation. I then reformatted the data and created the features described below (the reformatted data with these features can be seen in the file TaxiCancellationsDataToFitTree.csv).
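The cleaning step described above (dropping inter-city rides and identifier columns) might look roughly like the sketch below. The column names (`ride_id`, `customer_id`, `vehicle_model_id`, `trip_type`, `cancelled`) and the three sample rows are hypothetical and invented purely for illustration; the actual headers in Taxicancellationsoriginaldata.csv are not given in this question and may well differ.

```python
import csv
import io

# Hypothetical raw rows for illustration only; the real column names and
# values in Taxicancellationsoriginaldata.csv are not stated in the question.
RAW_CSV = """ride_id,customer_id,vehicle_model_id,trip_type,cancelled
1,501,12,within_city,0
2,502,28,intercity,1
3,503,12,within_city,1
"""

# Assumed identifier columns judged to have no bearing on driver cancellation.
DROP_COLUMNS = {"customer_id", "vehicle_model_id"}


def clean_rides(csv_text):
    """Keep within-city rides only and drop the identifier columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cleaned = []
    for row in reader:
        if row["trip_type"] != "within_city":
            continue  # skip rides between Bengaluru and another city
        cleaned.append({k: v for k, v in row.items() if k not in DROP_COLUMNS})
    return cleaned


rides = clean_rides(RAW_CSV)
print(rides)
```

Reading the real file would only require replacing `io.StringIO(csv_text)` with an open file handle and adjusting the assumed column names.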
I recommend that you look at both the original file and the reformatted data to appreciate that, in the vast majority of ML applications, a great deal of time and effort has to be spent on cleaning and formatting the data into a usable form, as well as potentially creating new features from the existing ones in the raw data (the latter generally needs some brainstorming with domain experts).

Features:
1) Location where the customer is picked up (latitude and longitude)
2) Destination (latitude and longitude). Note that unlike with Uber/Lyft etc., the driver here knows the destination for which the booking has been made, so it makes sense to include this information in the modelling. There may be some unpopular destinations that drivers do not want to go to, especially during rush hour.
3) The day the booking was made
4) The time within the day that the booking was made (expressed in minutes, from 0 to 1439, the latter being one minute before the next day starts)
5) The day of travel
6) The time of travel within the day of travel (expressed in minutes, from 0 to 1439, the latter being one minute before the next day starts)
7) The difference, in minutes, between when the booking was made and when the ride starts (this feature was not present in the original raw data and had to be computed)
8) How the booking was made (by landline phone, on a computer, or on a mobile device)

There were two potential features that I decided not to use in the modelling:
i) The month in which the trip is made. There do seem to be a couple of months during which cancellations are unusually high relative to the others, but in the absence of more knowledge about some structural reason for this, I decided not to use this feature.
ii) The distance between the pick-up and drop-off points.
In general, a reasonable way to compute this when one has the precise GPS location of both is the Manhattan distance, but the layout of cities in India is far removed from the grid-like layout of cities in the U.S., so I felt that this distance might be a very noisy measure of actual distance.

I split the data randomly into two pieces: 80% of the data was used as the training set to build a model, including using CV to pick the right model. The remaining 20% was used to evaluate the model built on the training set (through ROC curves, etc.). We can expect that if location (of both the pickup and the destination) is related to cancellations, the relationship is not going to be linear. Though non-linearity can be modelled via logistic regression by incorporating squares of features, etc., I chose a classification tree as the modelling approach (note that one could also have tried random forests, k-NN, neural nets, etc.). The tree that was built on the training set (80% of the data) is shown on the next page. The depth of the tree (its complexity) was chosen by using CV.

a) What would best describe rides that have a high probability of being cancelled? (There could be multiple things that together describe this. Note that lower latitude values imply a location that is further South in the city; a location with latitude 12 is further South than one with latitude 13. Similarly, a larger longitude value implies a location further East.)
b) A ride has been booked online and the driver is supposed to pick up the passenger at a location with latitude 12.53 and drop them off at a location with latitude 12.38. The ride is supposed to start at 3:30pm (i.e. 15.5 hours from midnight, so travelstart = 930 minutes) and was booked 2 hours before the pick-up time.
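The time-based inputs in part (b) follow from the minutes-from-midnight convention used for the booking-time and travel-time features. The sketch below works through that arithmetic only; it does not evaluate the tree, which is not reproduced here. Apart from `travelstart`, which appears in the question, the function and variable names are assumptions made for illustration.

```python
MINUTES_PER_DAY = 24 * 60  # 1440, so valid times run from 0 to 1439


def minutes_from_midnight(hour, minute=0):
    """Convert a 24-hour clock time to minutes since midnight (0..1439)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0..23 and minute 0..59")
    return hour * 60 + minute


def booking_time(travelstart, lead_minutes):
    """Return (day_offset, minute_of_day) of the booking.

    day_offset is 0 if the booking was made on the day of travel,
    -1 if it was made the previous day, and so on.
    """
    return divmod(travelstart - lead_minutes, MINUTES_PER_DAY)


travelstart = minutes_from_midnight(15, 30)  # 3:30pm
lead_minutes = 2 * 60                        # booked 2 hours ahead
day_offset, booking_minute = booking_time(travelstart, lead_minutes)

print(travelstart, lead_minutes, day_offset, booking_minute)  # 930 120 0 810
```

So for this ride the travel time is 930 minutes, the booking-to-travel difference is 120 minutes, and the booking was made at minute 810 (1:30pm) on the day of travel.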
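For reference, the Manhattan distance that was considered (and rejected) as a feature could be computed from latitude and longitude roughly as follows. This is a minimal sketch, assuming a locally flat approximation in which east-west degrees are scaled by the cosine of the mean latitude; the function names and the two sample coordinates are made up for illustration and are not taken from the data.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def _legs_km(lat1, lon1, lat2, lon2):
    """North-south and east-west separations in km (flat-earth approximation)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    north_south = abs(math.radians(lat2 - lat1)) * EARTH_RADIUS_KM
    east_west = abs(math.radians(lon2 - lon1)) * EARTH_RADIUS_KM * math.cos(mean_lat)
    return north_south, east_west


def manhattan_km(lat1, lon1, lat2, lon2):
    """Grid-style distance: travel along north-south, then east-west."""
    north_south, east_west = _legs_km(lat1, lon1, lat2, lon2)
    return north_south + east_west


def straight_line_km(lat1, lon1, lat2, lon2):
    """Approximate straight-line distance between the same two points."""
    return math.hypot(*_legs_km(lat1, lon1, lat2, lon2))


# Two made-up points, roughly in the Bengaluru area.
pickup = (12.97, 77.59)
dropoff = (13.00, 77.63)
print(manhattan_km(*pickup, *dropoff), straight_line_km(*pickup, *dropoff))
```

The Manhattan figure is never smaller than the straight-line one, and actual road distance in a city without a grid layout need not track either of them closely, which is the concern raised about this feature.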
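The random 80/20 split and the ROC-based evaluation described above can be sketched with the standard library alone. This is an illustrative sketch, not the procedure actually used for the tree in this question: the tree fitting and the CV-based choice of depth are not reproduced, and the function names, the seed, and the toy labels and scores are assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def roc_auc(labels, scores):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen positive (label 1)
    gets a higher score than a randomly chosen negative (label 0),
    counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy stand-in for the ride records: 100 row indices.
train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20

# Toy cancellation labels and predicted cancellation probabilities.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to scores that do not separate cancelled from completed rides at all, and 1.0 to perfect separation, so the value computed on the held-out 20% gives a single-number summary of the ROC curve.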