Question

1 Approved Answer

Posted on Jun 01, 2024

Purpose: Demonstrate your ability to apply the modeling tools and methods to a data set of your own interest. The project is a way to put it all together in terms of the material taught to you. The project is not something to agonize over. Learn, experience, and have fun with data!

Ultimately, the project should be about building a model. As you will learn in this course, there are two goals of modeling: explanation and/or prediction. However, I want you to focus on the predictive goal for the project.

The first step is the identification of variables. In this course, we will study predictive models for predicting a Y variable. Y variables come in two types:

1) Continuous Y variable. This is a Y variable that takes on a range of values (e.g., $, weight, rating scale, temperature, winning %). For such a variable, we will use the regression model and regression trees.

2) Binary Y variable. This is a Y variable that takes on two values (yes/no). Examples might be whether a bank customer defaults on a loan or not, a student graduates or not, a customer returns or not, etc. For such a variable, we will use the logistic regression model and classification trees.

Our coverage of the continuous Y variable will make up about 90% of the course. We will not get to the binary Y scenario until the very end of the course. So, I would like your project experience to be the building of models for a continuous Y variable (not a binary Y variable). Your project should minimally do a complete regression model build; you can consider adding regression trees if your data set is not a time series. I will expect your model searching to include some model selection approaches (e.g., AIC, data splitting, or k-fold cross validation).

Determine the Predictor Variables

After you have determined the target Y variable that you are trying to predict, the next step is to determine your X variables. X variables are the information that you think might be helpful in the prediction of the Y variable.
A useful approach to this problem is to draw up a wish list of X variables using subject matter considerations. Consult the subject matter research and knowledgeable experts in the area. For example, you might ask current brand managers how they forecast future sales, listening carefully to what factors they use in forming the forecasts. It is not necessary for your experts to understand regression analysis; you are merely milking them for information on the factors that might be predictive of your Y variable. Avoid the mistake of collecting many variables without thinking carefully about whether these variables could have any relevance ("I had that information so I threw it in anyway"). With this said, you should target at least 5 predictor variables.
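As an illustration of the k-fold cross validation approach named among the model selection options above, here is a minimal sketch in Python. Python is an assumption made for illustration; the course software is JMP or R. The data are synthetic, and the helper names `fit_line` and `kfold_rmse` are hypothetical. The sketch compares a mean-only model with a one-predictor regression by their pooled out-of-fold error. It shuffles rows before forming folds, which assumes the data are not a time series.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def kfold_rmse(xs, ys, k, use_x):
    """Pooled out-of-fold RMSE for a mean-only or one-predictor model."""
    n = len(xs)
    idx = list(range(n))
    random.Random(0).shuffle(idx)          # fixed seed so folds are repeatable
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    squared_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        if use_x:
            slope, intercept = fit_line(train_x, train_y)
        else:
            slope, intercept = 0.0, mean(train_y)  # predict the training mean
        squared_errors += [
            (ys[i] - (intercept + slope * xs[i])) ** 2 for i in fold
        ]
    return (sum(squared_errors) / n) ** 0.5


# Synthetic example data (not from any real project): y depends on x plus noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(60)]
ys = [3 + 2 * x + rng.gauss(0, 1) for x in xs]

rmse_mean = kfold_rmse(xs, ys, k=5, use_x=False)
rmse_line = kfold_rmse(xs, ys, k=5, use_x=True)
print(f"mean-only RMSE: {rmse_mean:.2f}, one-predictor RMSE: {rmse_line:.2f}")
```

The candidate model with the smaller out-of-fold RMSE is the one this criterion would favor. The same loop can be reused for any pair of candidate models by swapping the fitting step.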

Note: If your Y variable and X variables are time-oriented (e.g., Y is monthly sales), then the time frequency should be the same for the Y and X variables (month vs. month, week vs. week, daily vs. daily, etc.). Don't mix time frequencies (e.g., don't collect annual Y data and try to predict the Y variable with monthly X variables).

Examples

Let me emphasize that the goal is to show me your ability to model a set of data. I am open to your data being from your company or some personal interest (data gathered on your own or from the internet). Here are some examples of past projects:

Y variable: Monthly Claims Expenses of a particular insurance company. X variables: Variety of monthly economic indicators such as medical CPI.

Y variable: Selling price of used Acuras. X variables: Mileage, age, transmission type, and a variety of accessory information.

Y variable: MLB teams' winning percentages. X variables: A slew of statistics (RBI, saves, etc.).

There are millions of possibilities. I have never had a student fail to come up with a data set to analyze.

Finance applications: Some students explore stock-related data. For such data, we generally look at the changes in prices relative to changes to prices rather than the prices per se. Looking at a single stock series and its changes is not sufficient.

Given our lives dominated by the pandemic, it is tempting to do a COVID-based data project. I would suggest avoiding COVID analysis given the complexity, fluidity, and tons of uncertainty associated with the pandemic (both at the biological level and at the human behavioral level).

When selecting variables, remember that the project is a statistical investigation. How much insight do you gain if Y is the winning % of a team and your X's are the number of points scored by the team and the number of points given up by the team?
Or, I had a project for which I wrote to the student: "The questionable beginning of your analysis is to have (Rev/empl) as a Y vs Rev and Headcount as X variables. What are you learning from the regression of this Y on those X's? Nothing is learned because Y is defined by the X's. There is no statistical relationship to study. You do consider other Y variables thereafter but it is the same problem. Gross Margin is defined as Revenue minus COS, so what is learned by running a regression of Gross Margin on these variables? In the end, these regressions are not revealing."
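The Gross Margin point can be checked numerically. The sketch below is a minimal illustration in Python (an assumption; the course software is JMP or R) using made-up revenue and COS figures. The helper `fit_two_predictors` is a hypothetical name and solves the two-predictor least-squares problem directly from centered sums of squares. Because Gross Margin is computed as Revenue minus COS, the fitted coefficients should come out at essentially 1 and -1 with an R-squared of essentially 1, which is the accounting identity and not a statistical finding.

```python
import random
from statistics import mean


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2.

    Returns (b0, b1, b2, r_squared).
    """
    m1, m2, my = mean(x1), mean(x2), mean(y)
    d1 = [v - m1 for v in x1]   # centered predictor 1
    d2 = [v - m2 for v in x2]   # centered predictor 2
    dy = [v - my for v in y]    # centered response
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    det = s11 * s22 - s12 ** 2  # zero only if the predictors are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    sse = sum((c - b1 * a - b2 * b) ** 2 for a, b, c in zip(d1, d2, dy))
    sst = sum(c * c for c in dy)
    return b0, b1, b2, 1 - sse / sst


# Made-up figures for illustration only (not real company data).
rng = random.Random(42)
revenue = [rng.uniform(100, 500) for _ in range(40)]
cos = [r * rng.uniform(0.4, 0.8) for r in revenue]
gross_margin = [r - c for r, c in zip(revenue, cos)]  # Y is defined by the X's

b0, b1, b2, r2 = fit_two_predictors(revenue, cos, gross_margin)
print(f"b_revenue={b1:.3f}, b_cos={b2:.3f}, R^2={r2:.3f}")
```

A perfect or near-perfect fit of this kind is a warning sign that Y may be an arithmetic function of the chosen X's.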

Data Sources

If the data are from your company, then that is your source. If you are looking externally on the internet, there are tens of thousands of sites. I cannot claim to know where to find all data. Please do not email me to ask "Where can I find data on...?" I am probably not going to know the answer. Google is your best friend.

How Much Data?

In the software, each column represents a variable and each row represents an observation. The number of rows you have in the data table represents the sample size n. The question is how large should n be? In general, more is better. Software (JMP or R) is designed to do big data analysis, so you could literally have millions of data points. I would hope that you get as much data as you can, especially if you are building a prediction model. You will learn that one way to home in on a predictive model is to split the data between training and validation sets. This requires larger data sets, at least in the 100s of data points.

This does not mean that smaller data sets can't be analyzed, as I often do in the lectures. If you have an interesting application, then go for it. My suggestion here is purely a rule of thumb. I would minimally target 30-50 observations. So, if you are dealing with a time series on monthly sales, then get at least 3-5 years' worth, which will give you 36-60 observations; this gives the opportunity to see a repeat of a given month for seasonal estimation. But 30 observations with many X variables is pushing the limits of multiple regression. One very rough rule is to have 30 observations and then 3-5 observations for every additional X variable brought into the regression.

If you are looking at sports data, I would avoid collecting data over many seasons. For example, I have had students wanting to predict the annual winning % of the Bucks over the last 50 years. Such a project is going to be relatively useless because game styles, rules, etc. have dramatically changed over time.
It is better to study winning % of different teams over a short window of 2-3 seasons.
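The rough sample-size rule quoted above (30 observations, then 3-5 more for every additional X variable) can be turned into quick arithmetic. The sketch below is in Python (an assumption; the course software is JMP or R), and the function name `rough_min_sample_size` is hypothetical. It reads "additional" as every X variable beyond the first, which is one possible interpretation and not something the text states explicitly.

```python
def rough_min_sample_size(num_x, base=30, per_extra_x=(3, 5)):
    """Return a (low, high) range for a rough minimum sample size.

    Assumption: the base of 30 observations covers the first X variable,
    and each X variable beyond the first adds 3 to 5 observations.
    """
    if num_x < 1:
        raise ValueError("need at least one X variable")
    extra = num_x - 1
    low_per_x, high_per_x = per_extra_x
    return base + low_per_x * extra, base + high_per_x * extra


# Rough ranges for 1, 5, and 10 predictor variables.
for p in (1, 5, 10):
    low, high = rough_min_sample_size(p)
    print(f"{p} X variables: roughly {low}-{high} observations")
```

Under this reading, the suggested minimum of 5 predictor variables works out to roughly 42-50 observations, which sits inside the 30-50 range given as a minimal target. Treat the result as a floor for a small study, not as a substitute for collecting more data when it is available.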