Question
Q. 2.4 & 2.11 (A+B)
OVERVIEW OF THE DATA MINING PROCESS

PROBLEMS

2.3 Consider the sample from a database of credit applicants in Table 2.5. Comment on the likelihood that it was sampled randomly, and whether it is likely to be a useful sample.

2.4 Consider the sample from a bank database shown in Table 2.6; it was selected randomly from a larger database to be the training set. Personal Loan indicates whether a solicitation for a personal loan was accepted and is the response variable. A campaign is planned for a similar solicitation in the future, and the bank is looking for a model that will identify likely responders. Examine the data carefully and indicate what your next step would be.

2.5 Using the concept of overfitting, explain why, when a model is fit to training data, zero error with those data is not necessarily good.

2.6 In fitting a model to classify prospects as purchasers or nonpurchasers, a certain company drew the training data from internal data that include demographic and purchase information. Future data to be classified will be lists purchased from other sources, with demographic (but not purchase) data included. It was found that "refund issued" was a useful predictor in the training data. Why is this not an appropriate variable to include in the model?

2.7 A dataset has 1000 records and 50 variables with 5% of the values missing, spread randomly throughout the records and variables. An analyst decides to remove records that have missing values. About how many records would you expect to be removed?

2.8 Normalize the data in Table 2.7, showing calculations. Confirm your results in JMP (create a JMP data table, then use the Formula Editor or the dynamic transformation feature).

TABLE 2.7
Age    Income ($)
       49,000
       156,000
       99,000
       192,000
       39,000
       57,000

2.9 Statistical distance between records can be measured in several ways. Consider Euclidean distance, measured as the square root of the sum of the squared differences. For the first two records in Table 2.7, it is $\sqrt{(25 - 56)^2 + (49{,}000 - 156{,}000)^2}$. Can normalizing the data change which two records are farthest from each other in terms of Euclidean distance?

2.10 Two models are applied to a dataset that has been partitioned. Model A is considerably more accurate than model B on the training data, but slightly less accurate than model B on the validation data. Which model are you more likely to consider for final deployment?

2.11 The dataset Toyota Corolla.jmp contains data on used cars on sale during the late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications.

a. Explore the data using the data visualization capabilities of JMP (e.g., Graph > Scatterplot Matrix and Graph > Graph Builder). Which of the pairs among the variables seem to be correlated? (Refer to the guides and videos at jmp.com/learn, under Graphical Displays and Summaries, for basic information on how to use these platforms.)

b. We plan to analyze the data using various data mining techniques described in future chapters. Prepare the dataset for data mining techniques of supervised learning by creating partitions using the JMP Pro Make Validation Column utility (from the Cols menu). Use the following partitioning percentages: training (50%), validation (30%), and test (20%). Describe the roles that these partitions will play in modeling.
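The Euclidean distance expression in Problem 2.9 can be checked numerically. The following is a minimal Python sketch, not part of the textbook or of JMP; the variable names and the `euclidean_distance` helper are illustrative assumptions, and the two records are taken from the numbers that appear in the Problem 2.9 expression (ages 25 and 56, incomes 49,000 and 156,000).

```python
import math

# First two records of Table 2.7 as (age, income), read off the
# expression quoted in Problem 2.9. Names here are illustrative only.
record_1 = (25, 49_000)
record_2 = (56, 156_000)


def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distance = euclidean_distance(record_1, record_2)

# Per-variable gaps, to see which variable drives the raw distance.
age_gap = abs(record_1[0] - record_2[0])
income_gap = abs(record_1[1] - record_2[1])

print(f"age gap      = {age_gap}")
print(f"income gap   = {income_gap}")
print(f"raw distance = {distance:.4f}")
```

On the raw scale the distance is almost identical to the income gap, because income differences are thousands of times larger than age differences. That scale imbalance is the reason Problem 2.9 asks whether normalizing can change which records are farthest apart.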