Questions and Answers of Statistics
A geneticist self-pollinated pink-flowered snapdragon plants and produced 97 progeny with the following colors: 22 red plants, 52 pink plants, and 23 white plants. The purpose of this experiment was
Consider the data of Exercise 13.2.5. Conduct an appropriate complete analysis of the data that also includes a graphical display and discussion of how the data do or do not meet the necessary
The effect of diet on heart disease has been widely studied. As part of this general area of investigation, researchers were interested in the short-term effect of diet on endothelial function, such
Biologists were interested in the distribution of trees in a wooded area. They intended to use the number of trees per 100-square meter plot as their unit of measurement. However, they were concerned
Consider the data of Exercise 13.2.8. Conduct an appropriate complete analysis of the data that also includes a graphical display and discussion of how the data do or do not meet the necessary
Discuss whether or not each of the following activities is a data mining task. (a) Dividing the customers of a company according to their gender. (b) Dividing the customers of a company according to
Suppose that you are employed as a data mining consultant for an Internet search engine company. Describe how data mining can help the company by giving specific examples of how techniques, such as
For each of the following data sets, explain whether or not data privacy is an important issue. (a) Census data collected from 1900-1950. (b) IP addresses and visit times of Web users who visit your
In the initial example of Chapter 2, the statistician says, "Yes, fields 2 and 3 are basically the same." Can you tell from the three lines of sample data that are shown why she says that?
Discuss the difference between the precision of a measurement and the terms single and double precision, as they are used in computer science, typically to represent floating-point numbers that
Give at least two advantages to working with data stored in text files instead of in a binary format.
Distinguish between noise and outliers. Be sure to consider the following questions. (a) Is noise ever interesting or desirable? Outliers? (b) Can noise objects be outliers? (c) Are noise objects
Consider the problem of finding the K nearest neighbors of a data object. A programmer designs Algorithm 2.1 for this task. Algorithm 2.1 Algorithm for finding K nearest neighbors. 1: for i = 1 to
The following attributes are measured for members of a herd of Asian elephants: weight, height, tusk length, trunk length, and ear area. Based on these measurements, what sort of similarity measure
You are given a set of m objects that is divided into K groups, where the ith group is of size mi. If the goal is to obtain a sample of size n < m, what is the difference between the following two
Consider a document-term matrix, where tfij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by where
This exercise compares and contrasts some similarity and distance measures. (a) For binary data, the L1 distance corresponds to the Hamming distance; that is, the number of bits that are different
For the following vectors, x and y, calculate the indicated similarity or distance measures. (a) x = (1, 1, 1, 1), y = (2, 2, 2, 2) cosine, correlation, Euclidean (b) x = (0, 1, 0, 1), y = (1, 0, 1,
Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one
Here, we further explore the cosine and correlation measures. (a) What is the range of values that are possible for the cosine measure? (b) If two objects have a cosine measure of 1, are they
Show that the set difference metric given by d(A, B) = size(A − B) + size(B − A) satisfies the metric axioms given on page 70. A and B are sets and A − B is the set
Discuss how you might map correlation values from the interval [−1,1] to the interval [0,1]. Note that the type of transformation that you use might depend on the application that you have in mind.
Proximity is typically defined between a pair of objects. (a) Define two ways in which you might define the proximity among a group of objects. (b) How might you define the distance between two sets
You are given a set of points S in Euclidean space, as well as the distance of each point in S to a point x. (It does not matter if x ∈ S.) (a) If the goal is to find all points within a specified
Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 70. Specifically, d(x, y) = 1 − J(x, y).
Show that the distance measure defined as the angle between two data vectors, x and y, satisfies the metric axioms given on page 70. Specifically, d(x, y) = arccos(cos(x, y)).
Explain why computing the proximity between two attributes is often simpler than computing the similarity between two objects.
You are approached by the marketing director of a local company, who believes that he has devised a foolproof way to measure customer satisfaction. He explains his scheme as follows: "It's so simple
A few months later, you are again approached by the same marketing director as in Exercise 3. This time, he has devised a better approach to measure the extent to which a customer prefers one product
An educational psychologist wants to use association analysis to analyze test results. The test consists of 100 questions with four possible answers each. (a) How would you convert this data into a
Which of the following quantities is likely to show more temporal autocorrelation: daily rainfall or daily temperature? Why?
Discuss why a document-term matrix is an example of a data set that has asymmetric discrete or asymmetric continuous features.
Many sciences rely on observation instead of (or in addition to) designed experiments. Compare the data quality issues involved in observational science with those of experimental science and data
Comment on the use of a box plot to explore a data set with four attributes: age, weight, height, and income.
Give a possible explanation as to why most of the values of petal length and width fall in the buckets along the diagonal in Figure 3.9.
Use Figures 3.14 and 3.15 to identify a characteristic shared by the petal width and petal length attributes.
Describe the types of situations that produce sparse or dense data cubes. Illustrate with examples other than those used in the book.
How might you extend the notion of multidimensional data analysis so that the target variable is a qualitative variable? In other words, what sorts of summary statistics or data visualizations would
Construct a data cube from Table 3.1. Is this a dense or sparse data cube? If it is sparse, identify the cells that are empty.
Discuss the differences between dimensionality reduction based on aggregation and dimensionality reduction based on techniques such as PCA and SVD.
Identify at least two advantages and two disadvantages of using color to visually represent information.
What are the arrangement issues that arise with respect to three-dimensional plots?
Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed. Would simple random sampling (without replacement) be a good approach to
Describe how you would create visualizations to display information that describes the following types of systems. Be sure to address the following issues: • Representation. How will you map
How might you address the problem that a histogram depends on the number and location of the bins?
Describe how a box plot can give information about whether the value of an attribute is symmetrically distributed. What can you say about the symmetry of the distributions of the attributes shown in
Compare sepal length, sepal width, petal length, and petal width, using Figure 3.12.
Draw the full decision tree for the parity function of four Boolean attributes, A, B, C, and D. Is it possible to simplify the tree?
While the .632 bootstrap approach is useful for obtaining a reliable estimate of model accuracy, it has a known limitation. Consider a two-class problem, where there are equal number of positive and
Consider the following approach for testing whether a classifier A beats another classifier B. Let N be the size of a given data set, pA be the accuracy of classifier A, pB be the accuracy of
Let X be a binomial random variable with mean Np and variance Np(1−p). Show that the ratio X/N also has a binomial distribution with mean p and variance p(1 − p)/N.
Consider the training examples shown in Table 4.1 for a binary classification problem. (a) Compute the Gini index for the overall collection of training examples. (b) Compute the Gini index for the
Consider the training examples shown in Table 4.2 for a binary classification problem. (a) What is the entropy of this collection of training examples with respect to the positive class? (b) What are
Show that the entropy of a node never increases after splitting it into smaller successor nodes.
Consider the following data set for a binary class problem. (a) Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose? (b)
Consider the following set of training examples. (a) Compute a two-level decision tree using the greedy approach described in this chapter. Use the classification error rate as the criterion for
The following table summarizes a data set with three attributes A, B, C and two class labels +, −. Build a two-level decision tree. (a) According to the classification error rate, which
Consider the decision tree shown in Figure 4.2. (a) Compute the generalization error rate of the tree using the optimistic approach. (b) Compute the generalization error rate of the tree using the
Consider the decision trees shown in Figure 4.3. Assume they are generated from a data set that contains 16 binary attributes and 3 classes, C1, C2, and C3. Compute the total description length of
Consider a binary classification problem with the following set of attributes and attribute values: • Air Conditioner = {Working, Broken} • Engine = {Good, Bad} •
Repeat the analysis shown in Example 5.3 for finding the location of a decision boundary using the following information: (a) The prior probabilities are P(Crocodile) = 2 × P(Alligator). (b) The
Figure 5.3 illustrates the Bayesian belief network for the data set shown in Table 5.3. (Assume that all the attributes are binary.) (a) Draw the probability table for each node in the network. (b)
Given the Bayesian network shown in Figure 5.4, compute the following probabilities: (a) P(B = good, F = empty, G = empty, S = yes). (b) P(B = bad, F = empty, G = not empty, S = no). (c) Given that
Consider the one-dimensional data set shown in Table 5.4. (a) Classify the data point x = 5.0 according to its 1-, 3-, 5-, and 9-nearest neighbors (using majority vote). (b) Repeat the previous
The nearest-neighbor algorithm described in Section 5.2 can be extended to handle nominal attributes. A variant of the algorithm called PEBLS (Parallel Exemplar-Based Learning System) by Cost and
For each of the Boolean functions given below, state whether the problem is linearly separable. (a) A AND B AND C (b) NOT A AND B (c) (A OR B) AND (A OR C) (d) (A XOR B) AND (A OR B)
(a) Demonstrate how the perceptron model can be used to represent the AND and OR functions between a pair of Boolean variables. (b) Comment on the disadvantage of using linear functions as activation
You are asked to evaluate the performance of two classification models, M1 and M2. The test set you have chosen contains 26 binary attributes, labeled as A through Z. Table 5.5 shows the posterior
Following is a data set that contains two attributes, X and Y, and two class labels, "+" and "−". Each attribute can take three different values: 0, 1, or 2. The concept for the "+" class
(a) Consider the cost matrix for a two-class problem. Let C(+, +) = C(−,−) = p, C(+,−) = C(−, +) = q, and q > p. Show that minimizing the cost function is equivalent to maximizing the
The RIPPER algorithm (by Cohen [1]) is an extension of an earlier algorithm called IREP (by Fürnkranz and Widmer [3]). Both algorithms apply the reduced-error pruning method to determine whether
Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains records from two classes,
Derive the dual Lagrangian for the linear SVM with nonseparable data where the objective function is
Given the data sets shown in Figures 5.6, explain how the decision tree, naive Bayes, and k-nearest neighbor classifiers would perform on these data sets.
C4.5rules is an implementation of an indirect method for generating rules from a decision tree. RIPPER is an implementation of a direct method for generating rules directly from data. (a) Discuss the
Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules, R1: A → + (covers 4 positive and 1 negative examples), R2: B
Figure 5.1 illustrates the coverage of the classification rules R1, R2, and R3. Determine which is the best and worst rule according to: (a) The likelihood ratio statistic. (b) The Laplace
Consider the data set shown in Table 5.1 (data set for Exercise 7). (a) Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−), P(B|−), and
Consider the data set shown in Table 5.2. (a) Estimate the conditional probabilities for P(A = 1|+), P(B = 1|+), P(C = 1|+), P(A = 1|−), P(B = 1|−), and P(C = 1|−) using the same approach as in
(a) Explain how naive Bayes performs on the data set shown in Figure 5.2. (b) If each class is further divided such that there are four classes (A1, A2, B1, and B2), will naive Bayes perform
For each of the following questions, provide an example of an association rule from the market basket domain that satisfies the following conditions. Also, describe whether such rules are
Consider the following set of candidate 3-itemsets: {1, 2, 3}, {1, 2, 6}, {1, 3, 4}, {2, 3, 4}, {2, 4, 5}, {3, 4, 6}, {4, 5, 6} (a) Construct a hash tree for the above candidate 3-itemsets. Assume
Given the lattice structure shown in Figure 6.4 and the transactions given in Table 6.3, label each node with the following letter(s): • M if the node is a maximal frequent
The original association rule mining formulation uses the support and confidence measures to prune uninteresting rules. (a) Draw a contingency table for each of the following rules using the
Given the rankings you had obtained in Exercise 12, compute the correlation between the rankings of confidence and the other five measures. Which measure is most highly correlated with confidence?
Answer the following questions using the data sets shown in Figure 6.6. Note that each data set contains 1000 items and 10,000 transactions. Dark cells indicate the presence of items and white cells
(a) Prove that the φ coefficient is equal to 1 if and only if f11 = f1+ = f+1. (b) Show that if A and B are independent, then P(A, B) × P(Ā, B̄) = P(A, B̄) × P(Ā, B). (c)
Consider the interestingness measure, M = (P(B|A) − P(B)) / (1 − P(B)), for an association rule A → B. (a) What is the range of this measure? When does the measure attain its maximum and minimum
Suppose we have market basket data consisting of 100 transactions and 20 items. If the support for item a is 25%, the support for item b is 90% and the support for itemset {a, b} is 20%. Let the
Table 6.5 shows a 2 × 2 × 2 contingency table for the binary variables A and B at different values of the control variable C. (a) Compute the φ coefficient for A
Consider the contingency tables shown in Table 6.6. (a) For table I, compute support, the interest measure, and the φ correlation coefficient for the association pattern {A, B}.
Consider the data set shown in Table 6.1. (a) Compute the support for itemsets {e}, {b, d}, and {b, d, e} by treating each transaction ID as a market basket. (b) Use the results in part (a) to compute
Consider the relationship between customers who buy high-definition televisions and exercise machines as shown in Tables 6.19 and 6.20. (a) Compute the odds ratios for both tables. (b) Compute the
(a) What is the confidence for the rules ∅ → A and A → ∅? (b) Let c1, c2, and c3 be the confidence values of the rules {p} → {q}, {p} → {q, r}, and {p, r} → {q}, respectively. If we
For each of the following measures, determine whether it is monotone, anti-monotone, or non-monotone (i.e., neither monotone nor anti-monotone). Example: Support, s = σ(X)/|T| is
Consider the market basket transactions shown in Table 6.2. (a) What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)? (b) What
Consider the following set of frequent 3-itemsets: {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}, {2, 3, 5}, {3, 4, 5}. Assume that there are only five items in the data set. (a)
The Apriori algorithm uses a generate-and-count strategy for deriving frequent itemsets. Candidate itemsets of size k + 1 are created by joining a pair of frequent itemsets of size k (this is known
The Apriori algorithm uses a hash tree data structure to efficiently count the support of candidate itemsets. Consider the hash tree for candidate 3-itemsets shown in Figure 6.2. (a) Given a
Consider the traffic accident data set shown in Table 7.1. (a) Show a binarized version of the data set. (b) What is the maximum width of each transaction in the binarized data? (c) Assuming that
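As an illustration of the Apriori candidate-generation step mentioned in the exercises above (joining pairs of frequent itemsets of size k to form candidates of size k + 1), here is a minimal Python sketch applied to the eight frequent 3-itemsets listed in this set of questions. The function names `join_candidates` and `prune` are hypothetical, and the sketch assumes the common convention of merging two sorted k-itemsets only when their first k − 1 items agree; the truncated question text does not confirm which variant the exercise intends.

```python
from itertools import combinations

# Frequent 3-itemsets as listed in the exercise, stored as sorted tuples.
frequent_3 = [
    (1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 3, 4),
    (1, 3, 5), (2, 3, 4), (2, 3, 5), (3, 4, 5),
]

def join_candidates(frequent):
    """Merge pairs of sorted k-itemsets that share their first k-1 items.

    Assumption: this is the prefix-based join; other join strategies exist.
    """
    items = sorted(frequent)
    candidates = []
    for a, b in combinations(items, 2):
        if a[:-1] == b[:-1]:
            # a < b in sorted order, so appending b's last item keeps it sorted.
            candidates.append(a + (b[-1],))
    return candidates

def prune(candidates, frequent):
    """Keep only candidates whose every k-item subset is frequent."""
    frequent_set = set(frequent)
    k = len(frequent[0])
    return [
        c for c in candidates
        if all(subset in frequent_set for subset in combinations(c, k))
    ]

candidates = join_candidates(frequent_3)
survivors = prune(candidates, frequent_3)

print("candidates:", candidates)
print("after pruning:", survivors)
```

Under these assumptions, the join yields five candidate 4-itemsets, and only {1, 2, 3, 4} and {1, 2, 3, 5} survive pruning, because the others contain {1, 4, 5} or {2, 4, 5}, which are not in the frequent list.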