Question


Posted on Sep 11, 2024



The data can be downloaded from https://www3.nd.edu/~busiforc/problems/DataMining/Accidents.xls

The file Accidents.xls contains information on 42,183 actual automobile accidents in 2001 in the US that involved one of three levels of injury: "no injury", "injury", or "fatality". For each accident, additional information is recorded, such as day of week, weather conditions, and road type. A firm might be interested in developing a system for quickly classifying the severity of an accident, based upon initial reports and associated data in the system (some of which rely on GPS-assisted reporting).

Our goal here is to predict whether a new accident just reported will involve an injury (MAX_SEV_IR = 1 or 2) or not (MAX_SEV_IR = 0). For this purpose, create a new variable called INJURY that takes the value 1, meaning "with injury", if MAX_SEV_IR = 1 or 2, and otherwise takes the value 0, meaning "no injury". Partition the data into training (60%) and validation (40%) sets.

a) Compute the accuracy rate of each class for the validation set based on the Naive Rule. You can present the accuracy rates using a matrix (called a confusion matrix) as shown in the example below:

                        Predicted
                        Class-1   Class-2
    Actual   Class-1      10         3
             Class-2       2        12

Here 10 out of 13 data points in Class-1 are correctly predicted and 12 out of 14 data points in Class-2 are correctly predicted. The overall accuracy rate is 22/27 = 81%. The accuracy rate in Class-1 is 10/13 = 77%, and the accuracy rate in Class-2 is 12/14 = 86%.

b) Assume that no information or initial reports about the accident itself are available at the time of prediction (only location characteristics, weather conditions, road conditions, etc.). Which predictors can we include in the analysis? (Please read the Data_Codes sheet.)

c) Run a Naive Bayes classifier on the complete training set by choosing the relevant predictors (continue from part b), using INJURY as the response variable. Notice that all predictors are categorical. Show the classification matrix (confusion matrix) for the training and validation data.

d) Is there any percent improvement relative to the Naive Rule?

e) Run a Naive Bayes classifier using all predictors and INJURY as the response variable. Report your error rates again for both the training and validation sets using a confusion matrix.

f) Which analysis, the one in part b or the one in part e, would be appropriate if you consider applying the Naive Bayes model that you created to future accidents? Please explain your reasoning.

g) Run a Naive Bayes classifier with the variables from part b and the response variable INJURY after partitioning the data into training (60%) and validation (40%) sets. Does a different partitioning have any effect on the accuracy results? If you observe a change, please explain the possible reason.

Note: I have posted a guideline for the usage of Naive Bayes in XLMiner. It might be helpful while using XLMiner.
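
For anyone working outside XLMiner, the confusion-matrix arithmetic in part a) can be checked with a few lines of Python. This is only a sketch, not the course's prescribed method: the function names `class_accuracies` and `naive_rule_matrix` are made up for illustration, and the small label lists at the bottom are toy values, not records from Accidents.xls. The first function reproduces the accuracy figures from the example matrix in the question; the second builds the matrix the Naive Rule would give (every validation record is assigned the majority class of the training set).

```python
from collections import Counter


def class_accuracies(matrix, labels):
    """Return (overall accuracy, per-class accuracy).

    matrix[i][j] is the number of records whose actual class is labels[i]
    and whose predicted class is labels[j].
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    per_class = {
        labels[i]: matrix[i][i] / sum(matrix[i])
        for i in range(len(labels))
        if sum(matrix[i]) > 0  # skip classes with no actual records
    }
    return correct / total, per_class


def naive_rule_matrix(train_labels, valid_labels, labels):
    """Naive Rule: predict the training-set majority class for every record."""
    majority = Counter(train_labels).most_common(1)[0][0]
    matrix = [[0] * len(labels) for _ in labels]
    col = labels.index(majority)
    for actual in valid_labels:
        matrix[labels.index(actual)][col] += 1
    return majority, matrix


# Example matrix from the question (rows = actual, columns = predicted).
example = [[10, 3], [2, 12]]
overall, per_class = class_accuracies(example, ["Class-1", "Class-2"])
print(f"overall={overall:.3f}", {k: round(v, 3) for k, v in per_class.items()})

# Toy INJURY labels (hypothetical, NOT taken from Accidents.xls).
toy_train = [1, 1, 1, 0, 0]
toy_valid = [1, 0, 1, 0]
majority, nr_matrix = naive_rule_matrix(toy_train, toy_valid, [0, 1])
nr_overall, nr_per_class = class_accuracies(nr_matrix, [0, 1])
print(majority, nr_matrix, nr_overall, nr_per_class)
```

Note that under the Naive Rule the majority class always gets 100% accuracy and the other class 0%, so the overall accuracy equals the share of the majority class in the validation set. That is the baseline part d) compares against.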