Questions and Answers of
Data Mining Concepts And Techniques
A data warehouse can be modeled by either a star schema or a snowflake schema. Briefly describe the similarities and the differences of the two models, and then analyze their advantages and
Design a data warehouse for a regional weather bureau. The weather bureau has about 1000 probes, which are scattered throughout various land and ocean locations in the region to collect basic weather
A popular data warehouse implementation is to construct a multidimensional database, known as a data cube. Unfortunately, this may often generate a huge, yet very sparse, multidimensional matrix. a.
Regarding the computation of measures in a data cube: a. Enumerate three categories of measures, based on the kind of aggregate functions used in computing a data cube. b. For a data cube with the
Suppose a company wants to design a data warehouse to facilitate the analysis of moving vehicles in an online analytical processing manner. The company registers huge amounts of automovement data in
Radio-frequency identification is commonly used to trace object movement and perform inventory control. An RFID reader can successfully read an RFID tag from a limited distance at any scheduled time.
In many applications, new data sets are incrementally added to the existing large data sets. Thus, an important consideration is whether a measure can be computed efficiently in an incremental
Suppose that we need to record three measures in a data cube: min(), average(), and median(). Design an efficient computation and storage method for each measure given that the cube allows data to
In data warehouse technology, a multiple dimensional view can be implemented by a relational database technique (ROLAP), by a multidimensional database technique (MOLAP), or by a hybrid database
Suppose that a data warehouse contains 20 dimensions, each with about five levels of granularity. a. Users are mainly interested in four particular dimensions, each having three frequently accessed
A data cube, \(C\), has \(n\) dimensions, and each dimension has exactly \(p\) distinct values in the base cuboid. Assume that there are no concept hierarchies associated with the dimensions. a. What
Assume that a 10-D base cuboid contains only three base cells: (1) \(\left(a_{1}, d_{2}, d_{3}, d_{4}, \ldots, d_{9}, d_{10}\right)\), (2) \(\left(d_{1}, b_{2}, d_{3}, d_{4}, \ldots, d_{9},
There are several typical cube computation methods, such as MultiWay [ZDN97], BUC [BR99], and Star-Cubing [XHLW03]. Briefly describe these three methods (i.e., use one or two lines to outline the key
Suppose a data cube, \(C\), has \(D\) dimensions, and the base cuboid contains \(k\) distinct tuples. a. Present a formula to calculate the minimum number of cells that the cube, \(C\), may contain. b.
Suppose that a base cuboid has three dimensions, \(A, B, C\), with the following number of cells: \(|A|=1,000,000,|B|=100\), and \(|C|=1000\). Suppose that each dimension is evenly partitioned into
When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality problem: There exists a huge number of subsets of combinations of dimensions. a. Suppose that there are
Propose an algorithm that computes closed iceberg cubes efficiently.
Suppose that we want to compute an iceberg cube for the dimensions, \(A, B, C, D\), where we wish to materialize all cells that satisfy a minimum support count of at least \(v\), and where
Discuss how you might extend the Star-Cubing algorithm to compute iceberg cubes where the iceberg condition tests for an avg that is no bigger than some value, \(v\).
A flight data warehouse for a travel agent consists of six dimensions: traveler, departure (city), departure_time, arrival, arrival_time, and flight; and two measures: count() and avg_fare(), where
(Implementation project) There are four typical data cube computation methods: MultiWay [ZDN97], BUC [BR99], H-Cubing [HPDW01], and Star-Cubing [XHLW03]. a. Implement any one of these cube computation
The sampling cube was proposed for multidimensional analysis of sampling data (e.g., survey data). In many real applications, sampling data can be of high dimensionality (e.g., it is not unusual to
The ranking cube was designed to support top-\(k\) (ranking) queries in relational database systems. However, ranking queries are also posed to data warehouses, where ranking is on multidimensional
Recently, researchers have proposed another kind of query, called a skyline query. A skyline query returns all the objects \(p_{i}\) such that \(p_{i}\) is not dominated by any other object
The prediction cube is a good example of multidimensional data mining in cube space. a. Propose an efficient algorithm that computes prediction cubes in a given multidimensional database. b. For what
Multifeature cubes allow us to construct interesting data cubes based on rather sophisticated query conditions. Can you construct the following multifeature cube by translating the following user
Discovery-driven cube exploration is a desirable way to mark interesting points among a large number of cells in a data cube. Individual users may have different views on whether a point should be
Suppose you have the set \(\mathcal{C}\) of all frequent closed itemsets on a data set \(D\), as well as the support count for each frequent closed itemset. Describe an algorithm to determine whether
An itemset \(X\) is called a generator on a data set \(D\) if there does not exist a proper subitemset \(Y \subset\) \(X\) such that \(\operatorname{support}(X)=\operatorname{support}(Y)\). A
The Apriori algorithm makes use of prior knowledge of subset support properties. a. Prove that all nonempty subsets of a frequent itemset must also be frequent. b. Prove that the support of any
Let \(c\) be a candidate itemset in \(C_{k}\) generated by the Apriori algorithm. How many length-\((k-1)\) subsets do we need to check in the prune step? Per your previous answer, can you give an
Section 4.2.2 describes a method for generating association rules from frequent itemsets. Propose a more efficient method. Explain why it is more efficient than the one proposed there. (consider
A database has five transactions. Let min_sup \(=60 \%\) and min_conf \(=80 \%\). a. Find all frequent itemsets using Apriori and FP-growth, respectively. Compare the efficiency of the
(Implementation project) Using a programming language that you are familiar with, such as C++ or Java, implement three frequent itemset mining algorithms introduced in this chapter: (1)
A database has four transactions. Let min_sup \(=60 \%\) and min_conf \(=80 \%\). a. At the granularity of item_category (e.g., \(\text{item}_{i}\) could be "Milk"), for the
Suppose that a large store has a transactional database that is distributed among four locations. Transactions in each component database have the same format, namely \(T_{j}:\left\{i_{1}, \ldots,
Suppose that frequent itemsets are saved for a large transactional database, \(D B\). Discuss how to efficiently mine the (global) association rules under the same minimum support threshold, if a set
Most frequent pattern mining algorithms consider only distinct items in a transaction. However, multiple occurrences of an item in the same shopping basket, such as four cakes and three jugs of milk,
(Implementation project) Many techniques have been proposed to further improve the performance of frequent itemset mining algorithms. Taking FP-tree-based frequent pattern growth algorithms (e.g.,
Give a short example to show that items in a strong association rule actually may be negatively correlated.
The following contingency table summarizes supermarket transaction data, where hot dogs refers to the transactions containing hot dogs, \(\overline{\text { hot dogs }}\) refers to the transactions
(Implementation project) The DBLP data set (https://dblp.uni-trier.de/xml/) consists of over three million entries of research papers published in computer science conferences and journals. Among
Propose and outline a level-shared mining approach to mining multilevel association rules in which each item is encoded by its level position. Design it so that an initial scan of the database
Suppose, as manager of a chain of stores, you would like to use sales transactional data to analyze the effectiveness of your store's advertisements. In particular, you would like to study how
Quantitative association rules may disclose exceptional behaviors within a data set, where "exceptional" can be defined based on statistical theory. For example, Section 5.1.3 shows the association
In multidimensional data analysis, it is interesting to extract pairs of similar cell characteristics associated with substantial changes in measure in a data cube, where cells are considered similar
Section 5.1.5 presented various ways of defining negatively correlated patterns. Consider Definition 5.3: "Suppose that itemsets \(X\) and \(Y\) are both frequent, that is, \(\sup (X) \geq\) min_sup
Prove that each entry in the following table correctly characterizes its corresponding rule constraint for frequent itemset mining. Rule Constraint: a. \(v \in S\) b. \(S \subseteq V\) c. min(S) O d. range(S) U
The price of each item in a store is nonnegative. The store manager is only interested in mining the rules, following the constraints given below. For each of the following cases, identify the kinds
Section 5.1.4 introduced a core Pattern-Fusion method for mining high-dimensional data. Explain why a long pattern, if existing in the data set, is likely to be discovered by this method. Section
Section 5.2.1 defined a pattern distance measure between closed patterns \(P_{1}\) and \(P_{2}\) as\[\text { Pat_Dist }\left(P_{1}, P_{2}\right)=1-\frac{\left|T\left(P_{1}\right) \cap
Association rule mining often generates a large number of rules, many of which may be similar, thus not containing much novel information. Design an efficient algorithm that compresses a large set of
Frequent pattern mining may generate many superfluous patterns. Therefore, it is important to develop methods that mine compressed patterns. Suppose a user would like to obtain only \(k\) patterns
Sequential pattern mining is to mine sequential patterns for a set of items occurring in sequence order. In practice, people may like to find sequential patterns for types of items instead of for
When studying customer shopping sequences, one may find that if a customer buys a sequence of products from one company, the chance for him/her to buy the products of the similar kind from another company
Our study of subgraph pattern mining has been on how to mine frequent substructures from a collection of graph data sets. The current Web page structures (e.g., Wikipedia) or social networks may form
In this chapter, we introduce an effective method for mining copy-and-paste bugs in software programs. Typically, a software program may take different inputs which may lead to different program
Briefly outline the major steps of decision tree classification.
Why is tree pruning useful in decision tree induction? What is a drawback of using a separate set of tuples to evaluate pruning?
Given a decision tree, you have the option of (a) converting the decision tree to rules and then pruning the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to
It is important to calculate the worst-case computational complexity of the decision tree algorithm. Given data set, \(D\), the number of attributes, \(n\), and the number of training tuples, \(|D|\),
Given a 5-GB data set with 50 attributes (each containing 100 distinct values) and 512 MB of main memory in your laptop, outline an efficient method that constructs decision trees in such large
Why is naïve Bayesian classification called "naïve"? Briefly outline the major ideas of naïve Bayesian classification.
The following table consists of training data from an employee database. The data have been generalized. For example, "31 ... 35" for age represents the age range of 31 to 35. For a given row
Compare the advantages and disadvantages of eager classification (e.g., decision tree, Bayesian, neural network) vs. lazy classification (e.g., \(k\)-nearest neighbor, case-based reasoning).
Write an algorithm for \(k\)-nearest-neighbor classification given \(k\), the nearest number of neighbors, and \(n\), the number of attributes describing each tuple.
RainForest is a scalable algorithm for decision tree induction. Develop a scalable naïve Bayesian classification algorithm that requires just a single scan of the entire data set for most
Design an efficient method that performs effective naïve Bayesian classification over an infinite data stream (i.e., you can scan the data stream only once). If we wanted to discover the evolution
The perceptron model \(y=f(x)=\operatorname{sign}\left(w^{T} x+b\right)\) can be used to learn a binary classifier from training data. a. Assume there are two training samples. The positive one is \(x_{1}=(2,1)^{T}\);
Suppose we have three positive examples \(x_{1}=(1,0,0)\), \(x_{2}=(0,0,1)\), and \(x_{3}=(0,1,0)\) and three negative examples \(x_{4}=(-1,0,0)\), \(x_{5}=(0,-1,0)\) and
Suppose that we are training a naïve Bayes classifier and a logistic regression classifier \(f: X \rightarrow Y\), which maps a \(d\)-dimensional real-valued feature vector \(X \in \mathbb{R}^{d}\) to a binary class
Show that accuracy is a function of sensitivity and specificity, that is, prove Eq. (6.25): \[\text{accuracy}=\text{sensitivity} \cdot \frac{P}{P+N}+\text{specificity} \cdot \frac{N}{P+N} \qquad (6.25)\]
The harmonic mean is one of several kinds of averages. Chapter 2 discussed how to compute the arithmetic mean, which is what most people typically think of when they compute an average. The harmonic
The data tuples of Fig. 6.28 are sorted by decreasing probability value, as returned by a classifier. For each tuple, compute the values for the number of true positives (TP), false positives
It is difficult to assess classification accuracy when individual data objects may belong to more than one class at a time. In such cases, comment on what criteria you would use to compare different
Suppose that we want to select between two prediction models, \(M_{1}\) and \(M_{2}\). We have performed 10 rounds of 10-fold cross-validation on each model, where the same data partitioning in round \(i\) is
What is boosting? State why it may improve the accuracy of decision tree induction.
Outline methods for addressing the class imbalance problem. Suppose a bank wants to develop a classifier that guards against fraudulent credit card transactions. Illustrate how you can induce a
XGBoost is a scalable machine learning system for tree boosting. Its objective function has a training loss and a regularization term: \(L=\sum_{i} l\left(y_{i}, \hat{y}_{i}\right)+\sum_{k} \Omega\left(f_{k}\right)\). Read the
What is data mining? In your answer, address the following: (a) Is it a simple transformation or application of technology developed from databases, statistics, machine learning, and pattern
Define each of the following data mining functionalities: association and correlation analysis, classification, regression, clustering, and outlier analysis. Give examples of each data mining
Present an example where data mining is crucial to the success of a business. What data mining functionalities does this business need (e.g., think of the kinds of patterns that could be mined)? Can
Explain the difference and similarity between correlation analysis and classification, between classification and clustering, and between classification and regression.
Based on your observations, describe another possible kind of knowledge that needs to be discovered by data mining methods but has not been listed in this chapter. Does it require a mining
Outliers are often discarded as noise. However, one person’s garbage could be another’s treasure. For example, exceptions in credit card transactions can help us detect the fraudulent use of
Outline the major research challenges of data mining in one specific application domain, such as stream/sensor data analysis, spatiotemporal data analysis, or bioinformatics.
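The accuracy question in the list above asks for a proof that accuracy equals sensitivity weighted by P/(P+N) plus specificity weighted by N/(P+N). A numerical check is not a proof, but it can make the identity concrete before attempting one. The sketch below is only an illustration: the function name and the confusion-matrix counts are assumptions chosen for the example, not values from the textbook.

```python
def accuracy_two_ways(tp, fn, tn, fp):
    """Return accuracy computed via the weighted form and directly.

    tp, fn, tn, fp are hypothetical confusion-matrix counts.
    P is the number of actual positives, N the number of actual negatives.
    """
    p = tp + fn                      # actual positives
    n = tn + fp                      # actual negatives
    sensitivity = tp / p             # true positive rate
    specificity = tn / n             # true negative rate
    # Weighted form: sensitivity * P/(P+N) + specificity * N/(P+N)
    weighted = sensitivity * p / (p + n) + specificity * n / (p + n)
    # Direct form: correctly classified tuples over all tuples
    direct = (tp + tn) / (p + n)
    return weighted, direct


if __name__ == "__main__":
    # Illustrative counts only (assumed for this example).
    weighted, direct = accuracy_two_ways(tp=90, fn=210, tn=9560, fp=140)
    print(weighted, direct)
```

The two values agree up to floating-point rounding because sensitivity times P is just TP and specificity times N is just TN, which is the cancellation the proof relies on.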