Question

Posted on Oct 16, 2024

INSTRUCTIONS

This assignment tests your ability to create a simple regression solution to a prediction problem. You will use a dataset of bike rentals from the Capital Bikeshare system, Washington D.C., USA, which is publicly available at http://capitalbikeshare.com/system-data. You will build a regression model that predicts the number of bike rentals on a particular day given the weather conditions.

There are two questions in this assignment and one optional True and False section.

The file bikeshare_data.csv in Vocareum contains an extract of the data downloaded from the link provided in the description.

About the data

In any machine learning problem, it helps to get a feel for the data in its entirety. Although we might not use all the columns in the data provided, it is good practice to know what the different columns are, because understanding the data is key. The data consists of the following fields:

- instant: record index

- dteday : date

- season : season (1:spring, 2:summer, 3:fall, 4:winter)

- yr : year (0: 2011, 1:2012)

- mnth : month ( 1 to 12)

- hr : hour (0 to 23)

- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)

- weekday : day of the week

- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0

- weathersit :

- 1: Clear, Few clouds, Partly cloudy

- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist

- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds

- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

- temp : Normalized temperature in Celsius. The values are divided by 41 (max)

- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)

- hum: Normalized humidity. The values are divided by 100 (max)

- windspeed: Normalized wind speed. The values are divided by 67 (max)

- casual: count of casual users

- registered: count of registered users

- cnt: count of total rental bikes including both casual and registered
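
To make the normalized columns easier to read, here is a minimal sketch that scales them back up. It takes the field descriptions above at face value (each column was divided by the stated maximum); the `SCALE` mapping and the `denormalize` helper are names made up for this illustration and are not part of the assignment.

```python
import pandas as pd

# Divisors copied from the field descriptions above (assumed to be exact).
SCALE = {"temp": 41, "atemp": 50, "hum": 100, "windspeed": 67}


def denormalize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the normalized columns multiplied back by their maximum."""
    out = df.copy()
    for col, max_value in SCALE.items():
        if col in out.columns:
            out[col] = out[col] * max_value
    return out


# Tiny hand-made sample, not real rows from the dataset.
sample = pd.DataFrame({"temp": [0.5], "hum": [0.43], "windspeed": [0.2537]})
print(denormalize(sample))
```

This is only for inspecting the data. The regression in Question 2 is trained on the normalized values exactly as they are stored in the CSV.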

Goal: use this data to do regression analysis that focuses on predicting the number of bike rentals for a particular day.

Question 1: Preprocessing

In this question, you will prepare the data before building your regression model. After preparing the data, you will save it as a CSV.

Follow these steps to prepare the data:

1. Read the data with the pandas read_csv function.
2. As a sanity check, look at how many rows and columns the data has. There should be 17379 rows and 17 columns.
3. Our objective here is to cleanse this data. We aim to do linear regression analysis only on the data for working days between 9 AM and 6 PM. This means that we have to remove the rows that are not of interest. The following steps show what needs to be done.
4. First, remove all the data for holidays. This means removing all the rows where the 'holiday' field is 1. Check how many such rows there are first. There should be 500 such rows. Now remove them.
5. Next, remove all the days that were not working days. This means removing all the rows where the 'workingday' field is 0. There should be 5014 such rows. Remove all of them.
6. Now keep only the data for times from 9 AM to 6 PM. This means keeping only the rows where the value of the 'hr' column is in the range [9,17], since 17 denotes the time frame 5 PM to 6 PM. Remove all the rows that do not satisfy this condition. You should get 4477 rows.
7. Since we want to see the impact of weather conditions on the number of bookings, create a subset of this data that contains temp, hum, windspeed, and cnt.
8. Save this subset of data as a CSV called 'filtered.csv'. The output should look something like this (don't forget the header):

temp,hum,windspeed,cnt

0.16,0.43,0.3881,88

0.18,0.43,0.2537,44

0.20,0.40,0.3284,51

0.22,0.35,0.2985,61

WARNING: Do not change the order of the rows. If you do, the grader won't recognize the data and you will get a low grade.

HINT: The file 'filtered.csv' should have 4478 rows, counting the header.
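
The preprocessing steps above could be sketched with pandas roughly as follows. This is one possible approach, not an official solution: the function names `filter_working_hours` and `preprocess` are invented for this example, and the default file paths assume the CSV sits in the working directory.

```python
import pandas as pd


def filter_working_hours(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the row and column filters from the steps above, keeping row order."""
    # Drop holiday rows (the assignment expects 500 of them).
    df = df[df["holiday"] != 1]
    # Drop the remaining non-working days (the assignment expects 5014).
    df = df[df["workingday"] != 0]
    # Keep hours 9..17 inclusive; hr == 17 covers 5 PM to 6 PM.
    df = df[df["hr"].between(9, 17)]
    # Keep only the weather columns and the target.
    return df[["temp", "hum", "windspeed", "cnt"]]


def preprocess(in_path: str = "bikeshare_data.csv",
               out_path: str = "filtered.csv") -> pd.DataFrame:
    """Read the raw extract, filter it, and save it with a header but no index."""
    raw = pd.read_csv(in_path)
    print(raw.shape)  # sanity check; the assignment states 17379 rows, 17 columns
    filtered = filter_working_hours(raw)
    filtered.to_csv(out_path, index=False)
    return filtered
```

Boolean-mask filtering keeps the rows in their original order, which matters because of the warning above. Passing `index=False` to `to_csv` keeps the pandas row index out of the file, so the header is exactly `temp,hum,windspeed,cnt`.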

Question 2: Prediction

Use linear regression to predict the count in the 'topredict.csv' dataset. To do so, train a model on the data you saved in the 'filtered.csv' file from the preprocessing step.

1. Load the data and use sklearn to train a linear regression model on it. Use temperature (temp), humidity (hum), and windspeed as the independent variables and cnt as the dependent variable.

2. Save the predictions as a CSV called 'predictions.csv'.

You should have two columns: the row number ('index') of the row in the 'topredict.csv' file and the prediction ('final_prediction') you make for that index.

We predict a high demand day if the number of predicted bookings is greater than or equal to 170, and a low demand day if the number of predicted bookings is less than 170. Assuming that a high demand day is tagged as 1 and a low demand day is tagged as 0, convert your predicted demand to a high or low demand tag and generate the resulting CSV as specified below.

The results should look like the following (don't forget the header):

index,final_prediction

0,0

1,0

2,1

3,0

4,0

5,1
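
A minimal sketch of the prediction step, using scikit-learn's `LinearRegression`, might look like this. It assumes that 'topredict.csv' contains `temp`, `hum`, and `windspeed` columns and that 'index' is the 0-based row position, as the sample output suggests; neither is confirmed by the assignment text. The helper names `train_model`, `label_demand`, and `run` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["temp", "hum", "windspeed"]  # independent variables
THRESHOLD = 170                          # predictions >= 170 are tagged as high demand


def train_model(train: pd.DataFrame) -> LinearRegression:
    """Fit ordinary least squares with cnt as the dependent variable."""
    model = LinearRegression()
    model.fit(train[FEATURES], train["cnt"])
    return model


def label_demand(model: LinearRegression, to_predict: pd.DataFrame) -> pd.DataFrame:
    """Predict counts, then tag each row 1 (high demand) or 0 (low demand)."""
    predicted = model.predict(to_predict[FEATURES])
    return pd.DataFrame({
        # Assumption: 'index' is the 0-based row position in topredict.csv.
        "index": range(len(to_predict)),
        "final_prediction": (predicted >= THRESHOLD).astype(int),
    })


def run(train_path: str = "filtered.csv",
        predict_path: str = "topredict.csv",
        out_path: str = "predictions.csv") -> pd.DataFrame:
    """Train on the filtered data and write the tagged predictions to a CSV."""
    model = train_model(pd.read_csv(train_path))
    labels = label_demand(model, pd.read_csv(predict_path))
    labels.to_csv(out_path, index=False)
    return labels
```

The comparison uses `>=` so that a prediction of exactly 170 counts as high demand, matching the rule stated above. Writing with `index=False` keeps the file to the two required columns.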
