Question

1 Approved Answer

Posted on Sep 23, 2024

* no imports please private int[] map; / * Constructor for VirtualDataSet. There are two important considerations here: * (1) Make sure that you keep

*** no imports please image text in transcribed

private int[] map; /** * Constructor for VirtualDataSet. There are two important considerations here: * (1) Make sure that you keep COPIES of the "rows" and "attributes" passed as * formal parameters. Do not, for example, say this.map = rows. Instead, create * a copy of rows before assigning that copy to this.map. (2) Prune the value * sets of the attributes. Since a virtual dataset is only a subset of an actual * dataset, it is likely that some or all of its attributes may have smaller * value sets. * * @param source is the source dataset (always an instance of ActualDataSet) * @param rows is the set of rows from the source dataset that belong to * this virtual dataset * @param attributes is the set of attributes belonging to this virtual dataset. * IMPORTANT: you need to recalculate the unique value sets * for these attributes according to the rows. Why? Because * this virtual set is only a subset of the source dataset and * its attributes potentially have fewer unique values. */ public VirtualDataSet(ActualDataSet source, int[] rows, Attribute[] attributes) { // WRITE YOUR CODE HERE! }

Task 4.1. Implement the constructor of VirtualDataSet. As you can see from the template code, VirtualDataSet has an instance variable named map (an array of integers). This variable will store the indices for the rows of the actual dataset that are included in the virtual dataset. You are prohibited from copying the data matrix of an actual dataset into a virtual dataset. Doing so is not only time-consuming but can also cause the Java Virtual Machine to quickly run out of memory if your data matrix is large. Important details about the implementation of the VirtualDataSet constructor have been provided as comments in the template code. Please read the comments carefully. A key detail is that the VirtualDataSet constructor is in charge of checking the value sets of its attributes and pruning these value sets (if necessary) according to the data rows contained in the dataset. This is illustrated in the example output of Figure 13

THE WEATHER-NOMINAL DATASET: (1) toStringo has different implementations in ActualDataSet and VirtualDataSet Actual dataset (weather-nominal.csv) with 5 attribute(s) and 14 row(s) Metadata for attributes (absolute index: 0] outlook (nominal): {'Sunny', 'overcast', 'rainy'} (absolute index: 1) temperature (nominal): {'hot', 'mild", 'cool'} (absolute index: 2] humidity (nominal): {'high', 'normal'} (absolute index: 3) windy (nominal): {"FALSE, TRUE) [absolute index: 4] class (nominal): {'no', yes') (2) A virtual dataset maintains a map to the rows of an actual dataset. JAVA IMPLEMENTATION OF THE SPLIT IN FIGURE 5: (3) When we split over a nominal (here, "humidity"), that nominal is no longer part of the resulting partitions. Partition 0: Virtual dataset with 4 attribute(s) and 5 row(s) - Dataset is a view over weather-nominal.csv - Row indices in this dataset (w.r.t. its source dataset) [0, 1, 7, 8, 10) Metadata for attributes: [absolute index: 1) temperature (nominal): {'hot', 'mild', 'cool') (absolute index: 2] humidity (nominal): {"high', 'normal'} (absolute index: 3] windy (nominal): {"FALSE", "TRUE) [absolute index: 4] class (nominal): {'no', 'yes'} Partition 1: Virtual dataset with 4 attribute(s) and 4 row(s) Dataset is a view over weather-nominal.csv - Row indices in this dataset (w.r.t. its source dataset): [2, 6, 11, 12 Metadata for attributes: absolute index temporetane (nominal): {'hot', 'cool', 'mild'} absolute index: 2 humidity (nominal): {"high', 'normal'} Tabsolute Tox.swindy (nominal): {"FALSE, TRUE" (absolute index: 4] class (nominal): {'yes'} Partition 2: Virtual dataset with 4 attribute(s) and 5 row(s) Dataset is a view over weather-nominal.csv Row indices in this dataset (w.r.t. its source dataset): [3, 4, 5, 9, 13] - Metadata for attributes: [absolute index: 1] temperature (nominal): {"mild', 'cool'} (absolute index: 2] humidity (nominal): {"high', 'normal'} [absolute index: 3] windy (nominal): {"FALSE', TRUE) (absolute index: 40 class (nominal): {'yes', 'no' (4) The Attribute class maintains an absolute index (as an instance variable). This absolute index is an actual column number in a data matrix. The -highlighted attribute (humidity) has an index of "1" in this partition's array of attributes (this is because "outlook" disappeared). However, the absolute index (absolutelndex variable) is "2". THE WEATHER-NUMERIC DATASET: (5) The value sets of ALL attributes need to be recomputed (pruned) after a split! Actual dataset (weather-numeric.csv) with 5 attribute(s) and 14 row(s) - Metadata for attributes: (absolute index: 0) outlook (nominal): {'Sunny', 'overcast', 'rainy} (absolute index: 1j temperature (numeric): (85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 81, 71) (absolute index: 2] humidity (numeric): (85, 90, 86, 96, 80, 70, 65, 95, 75, 91} (absolute index: 3] windy (nominal): {'FALSE', 'TRUE'} (absolute index: 4] play (nominal): ('no', 'yes' JAVA IMPLEMENTATION OF THE SPLIT IN FIGURE 9 (6) Unlike the situation for nominal attributes, when the split is over a numeric attribute, the attribute does NOT disappear from the resulting partitions. Partition 0: Virtual dataset with 5 attribute(s) and 7 row(s) - Dataset is a view over weather-numeric.csv - Row indices in this dataset (w.r.t. its source dataset) (4, 5, 6, 8, 9, 10, 12] - Metadata for attributes: [absolute index: 0) outlook (nominal): {"rainy', 'overcast', 'sunny'} (absolute index: 11 temperature (numeric): (68. 65. 64. 69, 75, 811 [absolute index: 2 humidity (numeric): {80, 70, 65, 75) (absolute index:3 windy nominal):{"FALSE", "TRUE) (absolute index: 4) play (nominal): {'yes', 'no' Partition 1: Virtual dataset with 5 attribute(s) and 7 ow(s) Dataset is a view over weather-numeric.csv - Row indices in this dataset (w.r.t. its source sataset) [0, 1, 2, 3, 7, 11, 13] - Metadata for attributes: (absolute index: 0) outlook (nominal): ('subny', 'overcast', 'rainy'} (absolute index: 1 temperature (numeric. 15 80 83 70 72 71} (absolute index: 2 humidity (numeric): {85, 90, 86, 96, 95, 91) (absolute index: 3) Windy (nominal):{"FALSE, "TRUE"} [absolute index: 4) play (nominal): {'no', 'yes'} (7) Notice that the row numbers in these two partitions are with respect to the dataset of Figure 3 (and not Figure 9). Recall that the dataset in Figure 9 was sorted by humidity for illustration (thus having a different order of data points). THE WEATHER-NOMINAL DATASET: (1) toStringo has different implementations in ActualDataSet and VirtualDataSet Actual dataset (weather-nominal.csv) with 5 attribute(s) and 14 row(s) Metadata for attributes (absolute index: 0] outlook (nominal): {'Sunny', 'overcast', 'rainy'} (absolute index: 1) temperature (nominal): {'hot', 'mild", 'cool'} (absolute index: 2] humidity (nominal): {'high', 'normal'} (absolute index: 3) windy (nominal): {"FALSE, TRUE) [absolute index: 4] class (nominal): {'no', yes') (2) A virtual dataset maintains a map to the rows of an actual dataset. JAVA IMPLEMENTATION OF THE SPLIT IN FIGURE 5: (3) When we split over a nominal (here, "humidity"), that nominal is no longer part of the resulting partitions. Partition 0: Virtual dataset with 4 attribute(s) and 5 row(s) - Dataset is a view over weather-nominal.csv - Row indices in this dataset (w.r.t. its source dataset) [0, 1, 7, 8, 10) Metadata for attributes: [absolute index: 1) temperature (nominal): {'hot', 'mild', 'cool') (absolute index: 2] humidity (nominal): {"high', 'normal'} (absolute index: 3] windy (nominal): {"FALSE", "TRUE) [absolute index: 4] class (nominal): {'no', 'yes'} Partition 1: Virtual dataset with 4 attribute(s) and 4 row(s) Dataset is a view over weather-nominal.csv - Row indices in this dataset (w.r.t. its source dataset): [2, 6, 11, 12 Metadata for attributes: absolute index temporetane (nominal): {'hot', 'cool', 'mild'} absolute index: 2 humidity (nominal): {"high', 'normal'} Tabsolute Tox.swindy (nominal): {"FALSE, TRUE" (absolute index: 4] class (nominal): {'yes'} Partition 2: Virtual dataset with 4 attribute(s) and 5 row(s) Dataset is a view over weather-nominal.csv Row indices in this dataset (w.r.t. its source dataset): [3, 4, 5, 9, 13] - Metadata for attributes: [absolute index: 1] temperature (nominal): {"mild', 'cool'} (absolute index: 2] humidity (nominal): {"high', 'normal'} [absolute index: 3] windy (nominal): {"FALSE', TRUE) (absolute index: 40 class (nominal): {'yes', 'no' (4) The Attribute class maintains an absolute index (as an instance variable). This absolute index is an actual column number in a data matrix. The -highlighted attribute (humidity) has an index of "1" in this partition's array of attributes (this is because "outlook" disappeared). However, the absolute index (absolutelndex variable) is "2". THE WEATHER-NUMERIC DATASET: (5) The value sets of ALL attributes need to be recomputed (pruned) after a split! Actual dataset (weather-numeric.csv) with 5 attribute(s) and 14 row(s) - Metadata for attributes: (absolute index: 0) outlook (nominal): {'Sunny', 'overcast', 'rainy} (absolute index: 1j temperature (numeric): (85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 81, 71) (absolute index: 2] humidity (numeric): (85, 90, 86, 96, 80, 70, 65, 95, 75, 91} (absolute index: 3] windy (nominal): {'FALSE', 'TRUE'} (absolute index: 4] play (nominal): ('no', 'yes' JAVA IMPLEMENTATION OF THE SPLIT IN FIGURE 9 (6) Unlike the situation for nominal attributes, when the split is over a numeric attribute, the attribute does NOT disappear from the resulting partitions. Partition 0: Virtual dataset with 5 attribute(s) and 7 row(s) - Dataset is a view over weather-numeric.csv - Row indices in this dataset (w.r.t. its source dataset) (4, 5, 6, 8, 9, 10, 12] - Metadata for attributes: [absolute index: 0) outlook (nominal): {"rainy', 'overcast', 'sunny'} (absolute index: 11 temperature (numeric): (68. 65. 64. 69, 75, 811 [absolute index: 2 humidity (numeric): {80, 70, 65, 75) (absolute index:3 windy nominal):{"FALSE", "TRUE) (absolute index: 4) play (nominal): {'yes', 'no' Partition 1: Virtual dataset with 5 attribute(s) and 7 ow(s) Dataset is a view over weather-numeric.csv - Row indices in this dataset (w.r.t. its source sataset) [0, 1, 2, 3, 7, 11, 13] - Metadata for attributes: (absolute index: 0) outlook (nominal): ('subny', 'overcast', 'rainy'} (absolute index: 1 temperature (numeric. 15 80 83 70 72 71} (absolute index: 2 humidity (numeric): {85, 90, 86, 96, 95, 91) (absolute index: 3) Windy (nominal):{"FALSE, "TRUE"} [absolute index: 4) play (nominal): {'no', 'yes'} (7) Notice that the row numbers in these two partitions are with respect to the dataset of Figure 3 (and not Figure 9). Recall that the dataset in Figure 9 was sorted by humidity for illustration (thus having a different order of data points)