Question

1 Approved Answer

Posted on Oct 10, 2024

Application Case 2.4 Predicting NCAA Bowl Game Outcomes

Predicting the outcome of a college football game (or any sports game, for that matter) is an interesting and challenging problem. Therefore, challenge-seeking researchers from both academia and industry have spent a great deal of effort on forecasting the outcome of sporting events. Large quantities of historical data exist in different media outlets (often publicly available) regarding the structure and outcomes of sporting events, in the form of a variety of numerically or symbolically represented factors that are assumed to contribute to those outcomes.

The end-of-season bowl games are very important to colleges both financially (bringing in millions of dollars of additional revenue) and reputationally (helping them recruit quality students and highly regarded high school athletes for their athletic programs) (Freeman & Brewer, 2016). Teams that are selected to compete in a given bowl game split a purse, the size of which depends on the specific bowl (some bowls are more prestigious and have higher payouts for the two teams), and therefore securing an invitation to a bowl game is the main goal of any Division I-A college football program. The decision makers of the bowl games are given the authority to select and invite bowl-eligible (a team that has six wins against its Division I-A opponents in that season) successful teams (as per the ratings and rankings) that will play in an exciting and competitive game, attract fans of both schools, and keep the remaining fans tuned in via a variety of media outlets for advertising.

In a recent data mining study, Delen, Cogdell, and Kasap (2012) used 8 years of bowl game data along with three popular data mining techniques (decision trees, neural networks, and support vector machines) to predict both the classification-type outcome of a game (win versus loss) and the regression-type outcome (projected point difference between the scores of the two opponents). What follows is a shorthand description of their study.

Methodology

In this research, Delen and his colleagues followed a popular data mining methodology called CRISP-DM (Cross-Industry Standard Process for Data Mining), which is a six-step process. This methodology, which is covered in detail in Chapter 4, provided them with a systematic and structured way to conduct the underlying data mining study and hence improved the likelihood of obtaining accurate and reliable results. To objectively assess the prediction power of the different model types, they used a cross-validation methodology called k-fold cross-validation. Details on k-fold cross-validation can be found in Chapter 4. Figure 2.16 graphically illustrates the methodology employed by the researchers.
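To make the k-fold idea concrete, here is a minimal sketch of how a data set can be partitioned into k folds, with each fold serving once as the test set while the remaining folds are used for training. It is only an illustration, not the researchers' code: the function name `k_fold_indices`, the seed, and the choice of k = 10 are assumptions made for this example, and only the figure of 244 games comes from the case description.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Hypothetical helper for illustration; k and seed are assumed values.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Shuffle once so that folds are not tied to the original row order.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]

    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Example: 244 games split into an assumed 10 folds.
folds = list(k_fold_indices(244, 10))
for fold_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(fold_no, len(train_idx), len(test_idx))
```

Every game appears in exactly one test fold, so averaging a model's score over the k test folds uses each observation for evaluation once and never evaluates a model on data it was trained on.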

Figure 2.16 The Graphical Illustration of the Methodology Employed in the Study.

Data Acquisition and Data Preprocessing

The sample data for this study were collected from a variety of sports databases available on the Web, including jhowel.net, ESPN.com, Covers.com, ncaa.org, and rauzulusstreet.com. The data set included 244 bowl games, representing a complete set of eight seasons of college football bowl games played between 2002 and 2009. They also included an out-of-sample data set (2010-2011 bowl games) for additional validation purposes. Exercising one of the popular data mining rules of thumb, they included as much relevant information in the model as possible. Therefore, after an in-depth variable identification and collection process, they ended up with a data set that included 36 variables, of which the first 6 were the identifying variables (i.e., name and the year of the bowl game, home and away team names, and their athletic conferences; see variables 1-6 in Table 2.5), followed by 28 input variables (which included variables delineating a team's seasonal statistics on offense and defense, game outcomes, team composition characteristics, athletic conference characteristics, and how they fared against the odds; see variables 7-34 in Table 2.5), and finally the last two were the output variables (i.e., ScoreDiff, the score difference between the home team and the away team, represented with an integer number, and WinLoss, whether the home team won or lost the bowl game, represented with a nominal label).

Table 2.5 Description of the Variables Used in the Study

No  Cat  Variable Name   Description
1   ID   YEAR            Year of the bowl game
2   ID   BOWLGAME        Name of the bowl game
3   ID   HOMETEAM        Home team (as listed by the bowl organizers)
4   ID   AWAYTEAM        Away team (as listed by the bowl organizers)
5   ID   HOMECONFERENCE  Conference of the home team
6   ID   AWAYCONFERENCE  Conference of the away team
7   I1   DEFPTPGM        Defensive points per game
8   I1   DEFRYDPGM       Defensive rush yards per game
9   I1   DEFYDPGM        Defensive yards per game
10  I1   PPG             Average number of points a given team scored per game
11  I1   PYDPGM          Average total pass yards per game
12  I1   RYDPGM          Team's average total rush yards per game
13  I1   YRDPGM          Average total offensive yards per game
14  I2   HMWIN%          Home winning percentage
15  I2   LAST7           How many games the team won out of their last 7 games
16  I2   MARGOVIC        Average margin of victory
17  I2   NCTW            Nonconference team winning percentage
18  I2   PREVAPP         Did the team appear in a bowl game the previous year
19  I2   RDWIN%          Road winning percentage
20  I2   SEASTW          Winning percentage for the year
21  I2   TOP25           Winning percentage against AP top 25 teams for the year
22  I3   TSOS            Strength of schedule for the year
23  I3   FR%             Percentage of games played by freshmen class players for the year
24  I3   SO%             Percentage of games played by sophomore class players for the year
25  I3   JR%             Percentage of games played by junior class players for the year
26  I3   SR%             Percentage of games played by senior class players for the year
27  I4   SEASOvUn%       Percentage of times a team went over the O/U* in the current season
28  I4   ATSCOV%         Against the spread cover percentage of the team in previous bowl games
29  I4   UNDER%          Percentage of times a team went under in previous bowl games
30  I4   OVER%           Percentage of times a team went over in previous bowl games
31  I4   SEASATS%        Percentage of covering against the spread for the current season
32  I5   CONCH           Did the team win their respective conference championship game
33  I5   CONFSOS         Conference strength of schedule
34  I5   CONFWIN%        Conference winning percentage
35  O1   ScoreDiff^o     Score difference (HomeTeamScore - AwayTeamScore)
36  O2   WinLoss^o       Whether the home team wins or loses the game

* Over/Under: whether or not a team will go over or under the expected score difference.

^o Output variables: ScoreDiff for regression models and WinLoss for binary classification models.

I1: offense/defense
I2: game outcome
I3: team configuration
I4: against the odds
I5: conference stats
ID: identifier variables
O1: output variable for regression models
O2: output variable for classification models
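The relationship between the two output variables in Table 2.5 can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `derive_outputs`, the label strings "Win" and "Loss", and the sample scores are hypothetical, and the handling of a zero difference (labeled "Loss" here) is an arbitrary choice that the case description does not address.

```python
def derive_outputs(home_score, away_score):
    """Return (ScoreDiff, WinLoss) for one bowl game.

    ScoreDiff follows the table definition: HomeTeamScore - AwayTeamScore.
    The "Win"/"Loss" label strings are assumed, not taken from the study.
    """
    score_diff = home_score - away_score
    # Assumption: a non-positive difference is labeled "Loss" for the home team.
    win_loss = "Win" if score_diff > 0 else "Loss"
    return score_diff, win_loss


# Hypothetical scores, used only to show the two output types side by side.
print(derive_outputs(31, 24))
print(derive_outputs(17, 20))
```

The integer ScoreDiff is the target for the regression-type models, while the nominal WinLoss label is the target for the classification-type models; both are derived from the same pair of final scores.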

In the formulation of the data set, each row (a.k.a. tuple, case, sample, example, etc.) represented a bowl game, and each column stood for a variable (i.e., identifier/input or output type). To represent the game-related comparative characteristics of the two opponent teams, in the input variables,