DSO 530: Applied Modern Statistical Learning Methods
Logistic Regression

# Dataset

In this document, we will use the Smarket dataset, which is part of our textbook's R package, ISLR. This dataset consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, the percentage returns for each of the five previous trading days were recorded (Lag1, Lag2, Lag3, Lag4, Lag5). Other variables have been recorded as well: Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question), and Direction (whether the market was Up or Down on this date).

# Splitting into Training and Testing

The following command attaches the Smarket data frame to R's search path (memory). This lets us access variables in the data set directly, without having to specify the name of the data frame: instead of typing Smarket$Year we can simply type Year. Notice that we are not splitting the dataset randomly, because we are dealing with time series data, and it does not make sense to randomly select observations.

```r
library(ISLR)
attach(Smarket)
```

The following commands create two Boolean vectors. A Boolean vector has either TRUE or FALSE in each cell. In the train vector, TRUE is assigned to every cell whose index matches an observation in Smarket with Year < 2005. The test vector is exactly the opposite of train; the ! negates what is in train. We then select the observations in Smarket that go into the training and testing data sets. Notice that we drop the 8th variable, Today, because it carries the same information as Direction; in fact, that is how Direction is computed: "Up" if Today is positive, and "Down" if Today is negative. For model assessment purposes, we also create a vector that holds all the y values of the testing data set. The model assessment will happen later on, after we create our model using the training data. We use Direction because this is what we are trying to predict (our y variable). Notice that we did not put a comma when we indexed Direction, because it is a vector (one column) and not a data frame!

```r
train = Year < 2005
test = !train
training_data = Smarket[train, -8]
testing_data = Smarket[test, -8]
testing_y = Direction[test]
```
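As a quick sanity check (a minimal sketch, not part of the original write-up, assuming the commands above were run on the stock Smarket data), the training set should contain the 998 trading days from 2001 through 2004 and the test set the 252 trading days of 2005:

```r
# Verify the split sizes produced by the indexing above
sum(train)          # 998 training days (Year < 2005)
dim(training_data)  # 998 rows, 8 columns (Today was dropped)
length(testing_y)   # 252 test-set labels, all from 2005
```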
# Training the Model

Now it is time to train our model using the training data set:

```r
logistic_model = glm(Direction ~ . , data = training_data, family = "binomial")
summary(logistic_model)
## 
## Call:
## glm(formula = Direction ~ ., family = "binomial", data = training_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.382  -1.184   1.030   1.146   1.451  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.990e+02  1.185e+02  -2.523   0.0116 *
## Year         1.495e-01  5.922e-02   2.525   0.0116 *
## Lag1        -5.824e-02  5.200e-02  -1.120   0.2627  
## Lag2        -5.378e-02  5.210e-02  -1.032   0.3019  
## Lag3        -1.059e-03  5.190e-02  -0.020   0.9837  
## Lag4        -2.359e-03  5.199e-02  -0.045   0.9638  
## Lag5        -1.074e-02  5.139e-02  -0.209   0.8344  
## Volume      -2.665e-01  2.481e-01  -1.074   0.2828  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1374.7  on 990  degrees of freedom
## AIC: 1390.7
## 
## Number of Fisher Scoring iterations: 3
```

```r
library(coefplot)
## Warning: package 'coefplot' was built under R version 3.1.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.1.3

coefplot(logistic_model)
# with all coefficients except the intercept
coefplot(logistic_model, coefficients=c("Year", "Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume"))
```

In the above logistic regression model, we are using the glm() function, which fits generalized linear models. The first argument is our regression formula, which specifies that we are predicting Direction using all predictor variables in our data set (the . means "use all the other variables"). If you want to use specific variables, say Lag1 and Lag2, the formula would be Direction ~ Lag1 + Lag2. We are using the training data set to train our model, and we specified family = "binomial" because we are running logistic regression; if we don't specify the family, the model will be a regular linear regression. Looking at the predictor variables, we can see that only Year is statistically significant. From the coefficient plot, we can probably leave Year, Lag1, Lag2 and Volume in the model. We can do this because we are interested in prediction, not in interpreting the variables. Remember that the coefficient plot helps us distinguish which of the coefficients are close to zero and which are not.

```r
logistic_model = glm(Direction ~ Year + Lag1 + Lag2 + Volume , data = training_data, family = "binomial")
summary(logistic_model)
## 
## Call:
## glm(formula = Direction ~ Year + Lag1 + Lag2 + Volume, family = "binomial", 
##     data = training_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.375  -1.186   1.033   1.146   1.449  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -297.39983  117.89180  -2.523   0.0116 *
## Year           0.14871    0.05891   2.524   0.0116 *
## Lag1          -0.05821    0.05201  -1.119   0.2630  
## Lag2          -0.05362    0.05206  -1.030   0.3030  
## Volume        -0.26273    0.24597  -1.068   0.2855  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.3  on 997  degrees of freedom
## Residual deviance: 1374.7  on 993  degrees of freedom
## AIC: 1384.7
## 
## Number of Fisher Scoring iterations: 3
```

```r
library(coefplot)
# with all coefficients except the intercept
# (only the four predictors kept in the reduced model)
coefplot(logistic_model, coefficients=c("Year", "Lag1", "Lag2", "Volume"))
```

# Assessing the Model

Next, we want to assess our model logistic_model. To do so, we will predict the y values for the testing data set and then compare the predicted y's with the real ones that we saved under the name testing_y earlier. When the predict() function is used on a logistic regression model with type = "response", it computes the predicted probabilities of being in one class or the other (Down or Up in our case).

```r
logistic_probs = predict(logistic_model, testing_data, type = "response")
head(logistic_probs)
##       999      1000      1001      1002      1003      1004 
## 0.6368295 0.6031295 0.6035583 0.5957632 0.5860797 0.5895771
```
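An aside that is not in the original write-up: predict() returns the probability of the class that R codes as 1, and since the levels of Direction are ordered alphabetically (Down before Up), these are probabilities of the market going Up. A minimal check with base R's contrasts():

```r
# Confirm the dummy coding of the response factor:
# Down is the baseline (0) and Up is coded as 1, so the
# fitted probabilities above are P(Direction = "Up")
contrasts(Smarket$Direction)
```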
Since predict() computes probabilities, we have to convert them to the actual classes (Up or Down): in logistic regression, the predict() function does not produce the categories itself. We will first create a vector to hold those classes. This vector will have the same length as testing_y (252 in this example). We initialize all of its cells to "Down" and then update the vector to "Up" in the cells where the corresponding predicted probability is greater than 0.5 (this threshold could change based on the application).

```r
logistic_pred_y = rep("Down", length(testing_y))
logistic_pred_y[logistic_probs > 0.5] = "Up"
```

The function rep() repeats "Down" length(testing_y) times (i.e., 252 times). R first evaluates logistic_probs > 0.5, which is a Boolean vector of TRUE and FALSE: TRUE when the value of the cell in logistic_probs is greater than 0.5, otherwise FALSE. R then replaces the "Down" values in logistic_pred_y with "Up" wherever logistic_probs > 0.5 is TRUE.

The last few steps in assessment include finding the confusion matrix for the model using the table() command. table() uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

```r
conf_matrix = table(testing_y, logistic_pred_y)
```

The misclassification error rate is the sum of the off-diagonal counts of the confusion matrix divided by the number of test observations. Here the model predicted Up 249 times, of which 109 days were actually Down, and predicted Down 3 times, of which 1 day was actually Up:

```r
error_rate = (1 + 109)/ 252
error_rate
## [1] 0.4365079

## OR
mean(testing_y != logistic_pred_y)
## [1] 0.4365079
```

The misclassification error rate is about 43.65%. The model predicted 109 times that the stock would go up when in fact it did not, and at the same time it predicted 140 times that the stock would go up when it did go up. Is this profitable for us? Let's assume that the stock price is $1, and that it increases by $0.05 when the market goes up and decreases by $0.05 otherwise. Over the 140 + 109 = 249 days on which the model predicts Up, the expected value per stock is

$$E(X) = p_1 X_1 + p_2 X_2 = \frac{140}{249}(0.05) + \frac{109}{249}(-0.05) \approx +0.0062$$

In the long run, if we use this model to predict the direction of the stock, we will end up with a profit of about $0.0062 per stock.
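To make the arithmetic concrete, here is a minimal R sketch of the same expected-value calculation; it simply reuses the counts reported above, and the variable names are ours for illustration:

```r
# Expected profit per $1 stock under the +/- $0.05 assumption,
# conditional on the model predicting "Up" (140 + 109 = 249 such days)
p_win  = 140 / 249            # proportion of "Up" calls that were correct
p_lose = 109 / 249            # proportion of "Up" calls that were wrong
p_win * 0.05 + p_lose * -0.05 # approximately +0.0062
```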

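Since the 0.5 cutoff is an application-dependent choice, one might also rerun the conversion at a different threshold and see how the error rate moves. A sketch under that assumption (the 0.52 value is arbitrary, purely for illustration):

```r
# Reclassify with an arbitrary, stricter cutoff of 0.52
pred_y_52 = ifelse(logistic_probs > 0.52, "Up", "Down")
table(testing_y, pred_y_52)   # confusion matrix at the new threshold
mean(testing_y != pred_y_52)  # misclassification rate at the new threshold
```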