Question:
As millions of people in the U.S. are crippled by student loan debt and high unemployment, policymakers are raising the question of whether college is even a good investment. Richard Clancy, a sociology graduate student, is interested in developing a model for his master’s thesis that predicts an individual’s income. He found a rich data set maintained by the U.S. Bureau of Labor Statistics, called the National Longitudinal Surveys (NLS), that follows over 12,000 individuals in the United States over time. The data set focuses on labor force activities of these individuals, but also includes information on a wide range of variables, such as income, education, sex, race, personality traits, health, and marital history.
According to the U.S. Bureau of the Census, the median personal income in 2000 was $29,998. Because an individual’s personal income is determined by a large number of factors, it would be useful to build a model that helps predict whether an individual will earn an income that is at or above the median personal income. Based on the literature concerning personal income and data available from the U.S. Bureau of Labor Statistics, the following predictor variables are used to explain an individual’s personal income:
- Years of education
- Mother’s years of education
- Father’s years of education
- Urban (equals 1 if the individual lived in an urban area at the age of 14, 0 otherwise)
- Black (equals 1 if the individual is black, 0 otherwise)
- Hispanic (equals 1 if the individual is Hispanic, 0 otherwise)
- White (equals 1 if the individual is white, 0 otherwise)
- Male (equals 1 if the individual is male, 0 otherwise)
- Self-Esteem (the individual’s self-esteem using the Rosenberg Self-Esteem Scale; a higher score indicates a higher self-esteem.)
- Outgoing kid (equals 1 if the individual was outgoing at the age of six, 0 otherwise)
- Outgoing adult (equals 1 if the individual is outgoing as an adult, 0 otherwise)
As the objective is to determine whether an individual has an income that is at or above the median personal income rather than to predict the individual’s actual income, the target variable income is converted into a categorical variable that assumes the value 1 if income is greater than or equal to $29,998 and 0 otherwise. The final data set includes 5,821 observations, after observations with missing values were removed from the analysis.
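The conversion of income into a binary target can be sketched in a few lines of Python. This is only an illustration of the rule stated above: the function names, the field names (`income`, `education`, `high_income`), and the sample records are assumptions made for the example and are not taken from the NLS data.

```python
MEDIAN_INCOME_2000 = 29_998  # median personal income in 2000, as cited in the text


def to_income_class(income):
    """Return 1 if income is at or above the median, 0 otherwise."""
    return 1 if income >= MEDIAN_INCOME_2000 else 0


def prepare(records):
    """Drop records with any missing value, then attach the binary target."""
    complete = [r for r in records if all(v is not None for v in r.values())]
    return [{**r, "high_income": to_income_class(r["income"])} for r in complete]


# Hypothetical records for illustration only (not NLS observations).
sample = [
    {"income": 29_998, "education": 16},  # exactly at the median, so class 1
    {"income": 18_500, "education": 12},  # below the median, so class 0
    {"income": None, "education": 14},    # missing income, so removed
]
prepared = prepare(sample)
```

Note that an income exactly equal to $29,998 falls in class 1, because the rule is "greater than or equal to."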
To build a predictive model and assess the model’s performance, the data are partitioned into training (60%) and validation (40%) sets. As the predictor variables include both numerical and categorical data types, the decision tree methodology is a suitable technique to build a classification model for this application. Figure 10.31 shows the best-pruned classification tree with eight decision nodes and nine leaf nodes. A number of conclusions can be drawn from the classification tree, which shows that personal income level can be predicted using an individual’s sex, education, race, and mother’s education. For example, if an individual is male and has more than 16 years of education, he has a high probability of earning an income that is at or above the U.S. median income. If a person is female and has fewer than 16 years of education, she has a high probability of earning an income that is below the U.S. median income.
FIGURE 10.31 Best-pruned classification tree
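The 60%/40% partition into training and validation sets can be sketched as a shuffled split. This is a minimal illustration, not the procedure used in the report: the function name `partition`, the seed value, and the rounding of the cut point are all assumptions, and integers stand in for the actual observations.

```python
import random


def partition(rows, train_fraction=0.6, seed=1):
    """Shuffle a copy of rows and split it into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(5821))  # placeholders for the 5,821 observations
train, validation = partition(rows)
```

Under this sketch's rounding, 5,821 observations split into 3,493 training and 2,328 validation rows. The report does not state the exact set sizes, so these counts should be read as a consequence of the assumptions here.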
The performance of the classification tree is evaluated based on how accurately the model classifies cases in the validation data. The results from the confusion matrix, shown in Table 10.16, can be used to calculate an overall accuracy rate of 69.5%, sensitivity of 0.5994, specificity of 0.7730, and precision of 0.6830. Overall, the model demonstrates reasonably good performance in predicting an individual’s income level.
TABLE 10.16 Confusion Matrix
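The four performance measures quoted above follow directly from the cells of a confusion matrix. The sketch below shows the standard formulas. Because the cell counts of Table 10.16 are not reproduced here, the counts passed to the function are hypothetical and chosen only to make the arithmetic easy to follow; the function and argument names are likewise assumptions.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard measures from confusion-matrix counts.

    tp: actual class 1 predicted as 1    fn: actual class 1 predicted as 0
    fp: actual class 0 predicted as 1    tn: actual class 0 predicted as 0
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,     # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # share of actual class 1 cases identified
        "specificity": tn / (tn + fp),     # share of actual class 0 cases identified
        "precision": tp / (tp + fp),       # share of predicted class 1 that is correct
    }


# Hypothetical counts, NOT the values in Table 10.16.
metrics = classification_metrics(tp=60, fn=40, fp=20, tn=80)
```

Here class 1 denotes an income at or above the median, so sensitivity measures how well the model identifies higher earners and specificity how well it identifies those below the median.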
In an attempt to improve the predictive performance, a random forest ensemble model is also developed. The ensemble model shows an accuracy rate of 69.72%, sensitivity of 0.6138, specificity of 0.7652, and precision of 0.6808. This represents only a slight improvement in accuracy and sensitivity over the single-tree model, while specificity and precision are marginally lower. Because the single-tree classification model is easier to interpret than the ensemble model, a single decision tree model is used as the final predictive model.
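The size of the difference between the two models can be checked with simple arithmetic on the reported measures. The snippet below uses the figures quoted in the text; the variable names are assumptions, and the single-tree accuracy is taken as 0.695 even though it is reported to only one decimal place.

```python
# Measures as reported in the text (accuracy expressed as a proportion).
single_tree = {
    "accuracy": 0.695,  # reported as 69.5%, so the fourth decimal is unknown
    "sensitivity": 0.5994,
    "specificity": 0.7730,
    "precision": 0.6830,
}
random_forest = {
    "accuracy": 0.6972,
    "sensitivity": 0.6138,
    "specificity": 0.7652,
    "precision": 0.6808,
}

# Positive values mean the random forest scored higher than the single tree.
differences = {
    name: round(random_forest[name] - single_tree[name], 4)
    for name in single_tree
}
```

Every difference is smaller than 0.015 in absolute value, which is consistent with choosing the more interpretable single tree.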
The conclusions drawn from the classification tree in Figure 10.31 are not surprising. For example, it has been well documented that higher education leads to a higher income. Similarly, while the salary difference based on sex and race has shrunk over the years, it still persists. According to a Pew Research Center analysis, women earned 85% of what men earned in 2018 and households headed by a black person earned, on average, little more than half of what the average white household earned.
What is surprising, however, is that an individual’s self-esteem level and whether the person was outgoing as a child or is outgoing as an adult do not appear to affect the individual’s income level in this model. Perhaps these psychological factors are related to indicators of later academic achievement. For example, an individual with high self-esteem may pursue higher education, which in turn impacts the individual’s income.
The classification model developed in this report offers important and actionable insights for policymakers. The model highlights that sex and racial gaps continue to exist. Males and nonblack people are more likely to earn an above-median income than females and black people are. Special attention should be given to groups identified by the model as likely to earn below-median income when developing equitable policies for improving economic prospects.
Business Analytics Communicating With Numbers
ISBN: 9781260785005
1st Edition
Authors: Sanjiv Jaggia, Alison Kelly, Kevin Lertwachara, Leida Chen