Question

Posted on Sep 10, 2024

1. Data preparation - integration

We want to merge the two datasets into one, in a step called data integration. Review the ARFF notation from the tutorial; ARFF is Weka's data representation language. Answer the following questions:

a. Define what data integration means.

b. Is there an entity identification or schema integration problem in this dataset? If yes, how would you fix it?

c. Is there a redundancy problem in this dataset? If yes, how would you fix it?

d. Are there data value conflicts in this dataset? If yes, how would you fix them?

e. Integrate the two datasets into one single dataset, which will be used as a starting point for the next questions, and load it in the Explorer. How many instances do you have? How many attributes?
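
As a rough illustration of what integrating two sources with the same schema involves, here is a minimal plain-Python sketch (it is not Weka code). The function name `integrate`, the attribute names, and the rows are hypothetical. It checks that both sources use the same attribute names, appends the rows, and drops exact duplicate rows. Note that dropping exact duplicates is itself an assumption: two different patients can share identical values.

```python
def integrate(first, second):
    """Append two lists of records sharing a schema, dropping exact duplicates."""
    # Schema check: every record must use the same attribute names.
    schema = set(first[0])
    for record in first + second:
        if set(record) != schema:
            raise ValueError(f"schema mismatch: {sorted(record)}")

    seen = set()
    merged = []
    for record in first + second:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            merged.append(record)
    return merged


# Hypothetical rows, for illustration only.
site_a = [{"age": 63, "sex": "male", "num": 0}, {"age": 67, "sex": "male", "num": 1}]
site_b = [{"age": 67, "sex": "male", "num": 1}, {"age": 41, "sex": "female", "num": 0}]

merged = integrate(site_a, site_b)
print(len(merged), "instances,", len(merged[0]), "attributes")
```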

2. Descriptive data summarization

Before preprocessing the data, an important step is to get acquainted with the data, also called data understanding in CRISP-DM.

a. Stay in the Preprocess tab for now. Study, for example, the age attribute. What is its mean? Its standard deviation? Its min and max?

b. Provide the five-number summary of this attribute. Is this figure provided in Weka?

c. Specify which attributes are numeric, which are ordinal, and which are categorical/nominal.

d. Interpret the graphic shown in the lower right corner of the Explorer. What would you call this graphic? What do the red and blue colors mean (pay attention to the pop-up messages that appear when dragging the mouse over the graphic)? What does this graphic represent?

e. Visualize all the attributes in graphic format. Paste a screenshot.

f. Comment on what you learn from these graphics.

g. Switch to the Visualize tab. What is the term used in the textbook to name the series of boxplots represented? By selecting the maximum jitter and looking at the num column (the last one), can you determine which attributes seem to be the most linked to heart disease? Paste the boxplot representing the attribute you find the most predictive of heart disease (Y) as a function of num (X).

h. Does any pair of different attributes seem correlated?
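
For reference, the five-number summary mentioned in question b consists of the minimum, first quartile, median, third quartile, and maximum. The sketch below computes it in plain Python, alongside the mean and standard deviation. The sample ages are hypothetical, and the quartile convention (the "inclusive" method of `statistics.quantiles`) is one assumption among several in common use, so other tools may report slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical ages, for illustration only.
ages = [29, 41, 45, 54, 56, 63, 67]

print("five-number summary:", five_number_summary(ages))
print("mean:", statistics.mean(ages))
print("sample standard deviation:", statistics.stdev(ages))
```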

3. Data preparation - selection

The datasets studied have already been processed by selecting a subset of attributes relevant for the data mining project.

a. From the documentation provided in the dataset, how many attributes were originally in these datasets?

b. With Weka, attribute selection can be achieved either from the specific Select attributes tab or within the Preprocess tab. List the different options in Weka for selecting attributes, with a short explanation of the corresponding method.

c. Compared with the methods for attribute selection detailed in the textbook, are any missing from Weka? Does Weka provide any that are not in the textbook?

4. Data preparation - cleaning

Data cleaning deals with such defects of real-world data as incompleteness, noise, and inconsistencies. In Weka, data cleaning can be accomplished by applying filters to the data in the Preprocess tab.

a. Missing values. List the methods seen in class for dealing with missing values, and which Weka filters implement them if available. Remove the missing values with the method of your choice, explaining which filter you are using and why you make this choice. If a filter is not available for your method of choice, develop a new one that you add to the available filters as a Java class.

b. Noisy data. List the methods seen in class for dealing with noisy data, and which Weka filters implement them if available.

c. Outlier detection. List the methods seen in class for detecting outliers. How would you detect outliers with Weka? Are there any outliers in this dataset? If yes, list some of them.

d. Save the cleaned dataset into heart-cleaned.arff, and paste here a screenshot showing at least the first 10 rows of this dataset with all the columns.
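
One common way to handle missing numeric values is to replace them with the mean of the observed values. The plain-Python sketch below illustrates that idea only; it is not a Weka filter, the function name `impute_mean` is hypothetical, and missing values are assumed to be represented as `None`.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical column with one missing entry.
print(impute_mean([1.0, None, 3.0]))
```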

5. Data preparation - transformation

Among the different data transformation techniques, explore those available through the Weka Filters. Stay in the Preprocess tab for now. Study the following data transformation only:

a. Attribute construction, for example adding an attribute representing the sum of two other ones. Which Weka filter lets you do this?

b. Normalize an attribute. Which Weka filter lets you do this? Can this filter perform min-max normalization? Z-score normalization? Decimal normalization? Provide detailed information about how to perform these in Weka.

c. Normalize all real attributes in the dataset using the method of your choice; state which one you choose.

d. Save the normalized dataset into heart-normal.arff, and paste here a screenshot showing at least the first 10 rows of this dataset with all the columns.
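
To make the three normalization methods named in question b concrete, here is a minimal plain-Python sketch of their usual textbook definitions. It does not show how Weka implements them. The function names and sample values are hypothetical, the z-score uses the population standard deviation (an assumption; some texts use the sample version), and the code assumes the attribute is not constant.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer so that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Hypothetical values, for illustration only.
print(min_max([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```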

6. Data preparation - reduction

Often, data mining datasets are too large to process directly. Data reduction techniques are used to preprocess the data. Once the data mining project has been successful on these reduced data, the larger dataset can be processed too.

a. Stay in the Preprocess tab for now. Besides attribute selection, another reduction method is to select rows from a dataset. This is called sampling. How do you perform sampling with Weka filters? Can they perform the two main methods: Simple Random Sample Without Replacement and Simple Random Sample With Replacement?
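
The difference between the two sampling methods can be sketched in plain Python with the standard `random` module. This illustrates the concepts only, not a Weka filter; the function names `srswor` and `srswr` and the `seed` parameter are hypothetical. Without replacement, a row can be drawn at most once; with replacement, the same row may be drawn several times.

```python
import random


def srswor(rows, n, seed=None):
    """Simple random sample WITHOUT replacement: each row appears at most once."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def srswr(rows, n, seed=None):
    """Simple random sample WITH replacement: rows may repeat."""
    rng = random.Random(seed)
    return rng.choices(rows, k=n)


# Hypothetical row identifiers, for illustration only.
rows = list(range(100))
print(srswor(rows, 10, seed=1))
print(srswr(rows, 10, seed=1))
```

Because `srswor` never repeats a row, it cannot draw more rows than the dataset contains, whereas `srswr` can return a sample larger than the original dataset.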
