Question 1: Which of the following are issues in data integration, i.e., which would actually cause conflicts? (Choose all that apply):
A. Two different databases may have different column names for the same actual information
B. Two databases on related subjects that you want to integrate may have different numbers of columns or rows
C. An attribute named "weight" may be in different databases
D. There may be discrepancies between entries in two different databases for the same actual real-life entity

Question 2: Match the type of normalization to its property:
Decimal Scaling
Min-Max Normalization
Z-score Normalization
1. The new values tell how many standard deviations the sample is from the mean of the original data
2. The result is guaranteed to be between -1 and 1, but original zeros stay zero
3. The values are linearly scaled from one interval into another, the middle value

Question 3: Which of the following are true about Forward Selection? (Select all that apply):
A. Forward Selection is a feature selection method, keeping a subset of the original variables to make a reduced-complexity model
C. The best results from forward selection will be the same as for PCA because it chooses the set of variables to keep based on variance

Question 4: Which of the following are ways to deal with missing data values? (Choose all that apply):
A. Use a special value like "unknown" to capture that there is meaning to the fact that the value is missing
B. All you can do is use only the data mining algorithms that can handle data with missing values
C. Replace with the average value of the attribute among data points with the same class
D. Predict missing values with a model based on the data that you have (Ex: Classification or regression)

Question 5: Text data can be stored in a matrix with the "bag-of-words" model. This means:
A. Each document is assigned a column to keep track of when it is needed
B. Each row represents a unit of text (Ex: Document) and each column represents a word
C. The words are all put in one set and the set information is held per unit of text (Ex: Document)
D. A graph is constructed to represent how one unit of text (Ex: Document) contains words

Question 6: Which of the following are true about Forward Selection? (Select all that apply):
A. Forward Selection is a feature selection method, keeping a subset of the original variables to make a reduced-complexity model
B. Forward Selection is a greedy algorithm that runs a classification algorithm over and over as part of evaluating subsets of features
C. The best results from forward selection will be the same as for PCA because it chooses the set of variables to keep based on variance
D. Using forward selection can result in a model that generalizes better (Ex: Is less subject to overfitting)

Question 7: Which of these are true of using clustering for smoothing?
A. Clustering is used for replacing missing values, not smoothing
B. We replace data points by an average or a representative of the points in their cluster
C. Each cluster must have the same number of data points
D. The best smoothing for a point uses centers of other clusters
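
To make the idea behind Question 7 concrete, here is a minimal sketch of smoothing by clustering. It assumes the cluster labels have already been produced by some clustering algorithm; the function name `smooth_by_cluster` and the sample data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean


def smooth_by_cluster(values, labels):
    """Replace each value with the mean of the values in its own cluster.

    `labels[i]` is the (assumed, precomputed) cluster id of `values[i]`.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    # One representative per cluster: here, the cluster mean.
    centers = {label: mean(members) for label, members in groups.items()}
    return [centers[label] for label in labels]


# Hypothetical 1-D data with two clusters of different sizes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
labels = [0, 0, 0, 1, 1, 1, 1]
smoothed = smooth_by_cluster(values, labels)
print(smoothed)
```

In this sketch every point is replaced by the representative of its own cluster, and the two clusters hold different numbers of points.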
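
The three normalizations named in Question 2 can be sketched as follows. This is a rough illustration, not a reference implementation: the function names are hypothetical, the z-score uses the population standard deviation (the question does not say which variant is meant), and the inputs are assumed not to be all identical.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumed non-zero
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |value| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


data = [-50.0, 0.0, 50.0, 100.0]  # hypothetical sample
mm = min_max(data)
zs = z_score(data)
ds = decimal_scaling(data)
print(mm, zs, ds, sep="\n")
```

Comparing what happens to the `0.0` entry and to the range of each output is a quick way to check the properties listed in the question.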