Questions and Answers of Business Analytics Data
Suppose we wish to test for difference in population means among three groups. a. Explain why it is not sufficient to simply look at the differences among the sample means, without taking into account
Our partition shows that 800 of the 2000 customers in our test set own a tablet, while 230 of the 600 customers in our training set own a tablet. Test whether the partition is valid for this
In Chapter 7, we will learn to split the data set into a training data set and a test data set. To test whether there exist unwanted differences between the training and test set, which hypothesis
How is the bias–variance trade-off related to the issue of overfitting and underfitting? Is high bias associated with overfitting and underfitting, and why? High variance?
Work with international minutes as follows: a. Construct a normal probability plot of international minutes. b. What is preventing this variable from being normally distributed? c. Construct a flag
Identify the range of customer service calls that should be considered outliers, using: a. the Z-score method; b. the IQR method.
Explain why we might not want to remove a variable just because it is highly correlated with another variable. Hands-On Analysis: Use the churn data set on the book series web site for
Clarify why each of the binning solutions above is not optimal.
Bin the data into three bins of two records each.
What are the four common methods for binning numerical predictors? Which of these are preferred? Use the following data set for Exercises 28–30: 1, 1, 1, 3, 3, 7
Investigate how the outlier affects the mean and median by doing the following: a. Find the mean score and the median score, with and without the outlier. b. State which measure, the mean or the
Identify all possible stock prices that would be outliers, using: a. The Z-score method. b. The IQR method.
Do the following. a. Identify the outlier. b. Verify that this value is an outlier, using the Z-score method. c. Verify that this value is an outlier, using the IQR method.
What do we look for in a normal probability plot to indicate nonnormality? Use the stock price data for Exercises 24–26.
Find the decimal scaling stock price for the stock price $20.
Compute the Z-score standardized stock price for the stock price $20.
Calculate the midrange stock price.
Find the min–max normalized stock price for the stock price $20.
Compute the SD of the stock price. Interpret what this number means.
Calculate the mean, median, and mode stock price.
Make up a classification scheme that is inherently flawed, and would lead to misclassification, as we find in Table 2.2. For example, classes of items bought in a grocery store.
Which of the four methods for handling missing data would tend to lead to an underestimate of the spread (e.g., SD) of the variable? What are some benefits to this method?
Discuss the similarities and differences with CRISP-DM.
CRISP-DM is not the only standard process for data mining. Research an alternative methodology (Hint: Sample, Explore, Modify, Model and Assess (SEMMA), from the SAS Institute).
For each of the following meetings, explain which phase in the CRISP-DM process is represented: a. Managers want to know by next week whether deployment will take place. Therefore, analysts meet to
On your own, recapitulate the trinary classification analysis undertaken in this chapter using the Loans4 data sets. (Note that the results may differ slightly due to different settings in the CART
On your own, recapitulate the trinary classification analysis undertaken in this chapter using the Loans3 data sets. (Note that the results may differ slightly due to different settings in the CART
Using the results in Tables 17.12 and 17.14, confirm the values for the evaluation measures in Table 17.15.
Adjust Table 17.13 so that there are zeroes on the diagonal and the matrix is scaled, similarly to Table 17.7.
Provide justifications for each of the direct costs given in Table 17.5.
When misclassification costs are involved, what is the best metric for comparing model performance?
Which cost matrix should we use when comparing models?
Why do we adjust our cost matrix so that there are zeroes on the diagonal?
Explain how we determine the principal and interest amounts for the Loans problem.
Express in your own words how we interpret the following measures: a. D-sensitivity, where D represents the denied class in the Loans problem b. False D rate c. Proportion of true Ds d. Proportion of
Use the term “diagonal elements of the contingency table” to define (i) accuracy and (ii) overall error rate.
Interpret the proportion of true As and the proportion of false As.
What is the relationship between the proportion of true As and the proportion of false As?
Why do we avoid the term positive predictive value in this book?
How are A-sensitivity and false A rate interpreted?
What is the relationship between false A rate and A-sensitivity?
Explain the Σ notation used in this chapter for the marginal totals and the grand total of the contingency tables.
Explain why the true positive/false positive/true negative/false negative usage is not applicable to classification models with trinary targets.
Finally, assume that 50% of those customers who are in danger of churning, and with whom the company intervenes, will stay with the company, and 50% will churn anyway. Redo Exercises 45–50 under
Next, assume the company’s intervention strategy is perfect, and that everyone the company intervenes with to stop churning will not churn. Redo Exercises 45–50 under this assumption.
Construct a table of evaluation measures for the two models, similarly to Table 16.13.
Using the training set, and the cost matrix, develop a CART model for predicting Churn. Call this Model 2.
Using the training set, develop a CART model for predicting Churn. Do not use misclassification costs. Call this Model 1.
Partition the Churn data set into a training data set and a test data set.
Why don’t we rebalance the test data set?
Suppose the classification algorithm of choice had no method of applying misclassification costs. a. What would be the resampling ratio for using rebalancing as a surrogate for misclassification
Revenue per customer.
Model cost.
Use Result 3 to readjust the adjusted misclassification costs, so that the readjusted false negative cost is $1. Interpret the readjusted false positive and false negative costs. For Exercises
Use Result 3 to readjust the adjusted misclassification costs, so that the readjusted false positive cost is $1. Interpret the readjusted false positive and false negative costs.
Calculate the positive confidence threshold. Use Result 2 to state when the model will make a positive classification.
Use Result 1 to construct the adjusted cost matrix. Interpret the adjusted costs.
Construct the cost matrix. Provide rationales for each cost.
Explain why (i) misclassification costs are needed in this scenario, and (ii) the overall error rate is not the best measure of a good model.
Why does rebalancing work as a surrogate for misclassification costs? Use the following information for Exercises 27–44. Suppose that our client is a retailer seeking to maximize revenue from a
What does it mean to say that the resampling ratio is data-driven?
Explain how we do such rebalancing when the adjusted false positive cost is greater than the adjusted false negative cost.
Why might we need rebalancing as a surrogate for misclassification costs?
What do we mean when we say that the misclassification costs in the case study are data-driven?
What are direct costs? Opportunity costs? Why should we not include both when constructing our cost matrix?
How might Result 3 be of use to an analyst making a presentation to a client?
Explain what is meant by decision invariance under scaling.
Clearly explain how Figure 16.1 demonstrates the positive classification criterion for a C5.0 binary classifier.
Explain the positive classification criterion.
What is the positive confidence threshold?
What is the adjusted false positive cost? The adjusted false negative cost?
What is the difference between confidence and positive confidence?
True or false: We can always adjust the costs in our cost matrix so that the two cells representing correct decisions have zero cost.
Explain decision invariance under row adjustment.
Describe what is meant by the minimum expected cost principle.
True or false: The overall error rate is always the best indicator of a good model.
Proportion of false negatives.
Proportion of true negatives.
Proportion of false positives.
Proportion of true positives.
False negative rate.
Specificity.
False positive rate.
For Exercises 1–8, state what you would expect to happen to the indicated classification evaluation measure, if we increase the false negative misclassification cost, while not increasing the
Recall the WEKA Logistic example for classifying cereals as either high or low. Compute the probability that the fourth instance from the test set is classified either high or low. Does your
Open the breast cancer data set. Investigate, for each significant predictor, whether the linearity assumption is warranted. If not, ameliorate the situation using the methods discussed in this
Open the data set, German, which is provided on the textbook website. The data set consists of 20 predictors, both continuous and categorical, and a single response variable, indicating whether the
Find the probability of high income for a 50-year-old married male with 16 years education working 40 hours per week with capital gains of $6000.
Find the probability of high income for a 20-year-old single female with 12 years education working 20 hours per week with no capital gains or losses.
Construct and interpret 95% confidence intervals for the coefficients for age, sex-male, and educ-squared. Verify that these predictors belong in the model.
Find the estimated logit.
For indicator categories that are not significant, collapse the categories with the reference category. (How are you handling the category with the 0.083 p-value?) Rerun the logistic regression with
Consider the results from Table 13.26. Construct the logistic regression model that produced these results.
Construct and interpret a 95% confidence interval for each coefficient. Use Table 13.26 for Exercises 29–31.
Find the probability of high income for someone working 30, 40, 50, and 60 hours per week.
Find the form of the estimated logit.
Construct the logistic regression model developed in the text, with the age² term and the indicator variable age 33–65. Verify that using the quadratic term provides a higher estimate of the
Clearly interpret the value of the coefficients for the following predictors: a. Bland chromatin b. Normal nucleoli
Calculate the 95% confidence intervals for the following predictor coefficients. a. Clump thickness b. Mitoses c. Comment as to the evidence provided by the confidence interval for the mitoses
Find the probability that a tumor is malignant, given the following: a. The values for all predictors are at the minimum (1). b. The values for all predictors are at a moderate level (5). c. The values