Question
Consider the Communities and Crime Unnormalized Data Set available at the UCI Machine Learning Repository. Convert the dataset to the ARFF format. The ARFF header is provided on the dataset webpage. Load this dataset into Weka by opening your ARFF file from the "Explorer" window in Weka. Increase the memory available to Weka as needed.
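As a rough illustration of the conversion step, the sketch below builds ARFF text from a header and comma-separated rows. The function name `to_arff`, the relation name, and the toy rows are hypothetical and are not taken from the real dataset. Since the header is already supplied on the dataset webpage, in practice the conversion may amount to placing that header and an `@data` line above the raw rows.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from (name, type) attribute pairs and data rows."""
    lines = [f"@relation {relation}", ""]
    for name, arff_type in attributes:
        lines.append(f"@attribute {name} {arff_type}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match the header")
        # ARFF marks missing values with "?", so they pass through unchanged.
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Toy header and rows (hypothetical values, for illustration only).
header = [
    ("communityname", "string"),
    ("population", "numeric"),
    ("murdPerPop", "numeric"),
]
rows = [["Townville", 11980, 0.0], ["Cityburg", 23123, "?"]]
arff_text = to_arff("crime_toy", header, rows)
```

Writing `arff_text` to a file with the `.arff` extension should give something the Weka Explorer can open, assuming string values contain no spaces or commas (those would need quoting).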
Use Excel, Matlab, your own code, Weka, or other software, to complete the following parts. Please state in your report which tool from the above list you used for each part.
For the murdPerPop attribute:
(5 points) Calculate the percentiles (in increments of 10, as in Table 3.2 of the textbook, page 101), mean, median, range, and variance of the attribute.
(5 points) Plot a histogram of the attribute using 10 or 20 bins (you choose the best value for the attribute). For examples, see Figures 3.7 and 3.8 in the textbook, page 113.
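The two parts above can be sketched with the Python standard library alone. This is only one possible approach: the percentile rule used here (linear interpolation between order statistics) is one of several common conventions, the variance is the sample variance, and the values are toy numbers rather than the real murdPerPop column.

```python
import statistics


def percentile(sorted_vals, p):
    """Percentile by linear interpolation between order statistics."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def summarize(values):
    """Percentiles in steps of 10, mean, median, range, and sample variance."""
    vals = sorted(values)
    return {
        "percentiles": {p: percentile(vals, p) for p in range(0, 101, 10)},
        "mean": statistics.mean(vals),
        "median": statistics.median(vals),
        "range": vals[-1] - vals[0],
        "variance": statistics.variance(vals),  # n - 1 denominator
    }


def histogram(values, bins=10):
    """Counts per equal-width bin; the maximum falls in the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Toy values standing in for murdPerPop (not real data).
toy = [0.0, 0.0, 1.5, 2.0, 3.5, 5.0, 8.0, 10.0, 12.5, 20.0, 40.0]
summary = summarize(toy)
counts = histogram(toy, bins=10)
```

Comparing the counts for 10 and 20 bins is one way to decide which setting shows the shape of the attribute more clearly.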
For the following set of 21 continuous attributes, calculate (1) (10 points) the covariance matrix and (2) (10 points) the correlation matrix of these attributes.
-- population -- householdsize -- racepctblack -- racePctWhite -- racePctAsian -- racePctHisp -- agePct12t21 -- agePct12t29 -- agePct16t24 -- agePct65up -- numbUrban -- pctUrban -- medIncome -- pctWWage -- pctWFarmSelf -- pctWInvInc -- pctWSocSec -- pctWPubAsst -- pctWRetire -- medFamInc -- perCapInc
(5 points) If you had to remove 4 of the continuous attributes above from the dataset based on these two matrices, which attributes would you remove and why? Explain your answer.
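A minimal sketch of both matrices, and of one possible way to spot redundant attributes, follows. The helper names, the 0.9 threshold, and the toy columns are assumptions made for illustration. Flagging strongly correlated pairs is only one criterion for choosing attributes to remove, not necessarily the expected answer.

```python
import math


def covariance_matrix(columns):
    """Sample covariance (n - 1 denominator) for a dict of equal-length columns."""
    names = list(columns)
    n = len(columns[names[0]])
    means = {a: sum(columns[a]) / n for a in names}
    cov = {}
    for a in names:
        for b in names:
            cov[a, b] = sum(
                (x - means[a]) * (y - means[b])
                for x, y in zip(columns[a], columns[b])
            ) / (n - 1)
    return cov


def correlation_matrix(cov, names):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return {
        (a, b): cov[a, b] / math.sqrt(cov[a, a] * cov[b, b])
        for a in names
        for b in names
    }


def highly_correlated_pairs(corr, names, threshold=0.9):
    """Attribute pairs whose absolute correlation meets the threshold."""
    return [
        (a, b, corr[a, b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(corr[a, b]) >= threshold
    ]


# Toy columns (hypothetical values, not the real dataset).
toy = {
    "medIncome": [10, 20, 30, 40, 50],
    "medFamInc": [12, 22, 31, 43, 52],
    "pctWPubAsst": [9, 7, 8, 3, 4],
}
names = list(toy)
cov = covariance_matrix(toy)
corr = correlation_matrix(cov, names)
pairs = highly_correlated_pairs(corr, names)
```

In this toy example only the medIncome and medFamInc pair is flagged, which suggests that one of the two carries little extra information. Note that covariance values depend on each attribute's units, so the correlation matrix is usually the easier one to compare across pairs.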
Dimensionality Reduction. (10 points) Load the entire dataset into Weka. Apply Principal Components Analysis to reduce the dimensionality of the full dataset. For this, use Weka's PrincipalComponents option from the "Select attributes" tab. Use parameter values: centerData=True, varianceCovered=0.95. How many dimensions (= attributes) does the original dataset contain? How many dimensions are obtained after PCA? How much of the variance do they explain? Include in your report the linear combinations that define the first two new attributes (= components) obtained. Look at the results and elaborate on any interesting observations you can make about the results.
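To make the role of varianceCovered concrete, the sketch below counts how many principal components are needed before the cumulative share of variance reaches a given fraction. It assumes the eigenvalues of the covariance matrix are already available; the eigenvalues shown are made up, and Weka's own stopping rule may differ in its details.

```python
def components_needed(eigenvalues, variance_covered=0.95):
    """Return (k, fraction): the smallest k whose top-k eigenvalues
    reach the requested share of total variance, and that share."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= variance_covered:
            return k, cumulative / total
    return len(ordered), 1.0


# Hypothetical eigenvalues, for illustration only.
toy_eigenvalues = [5.0, 2.5, 1.5, 0.6, 0.3, 0.1]
k, fraction = components_needed(toy_eigenvalues, variance_covered=0.95)
```

With these toy eigenvalues the first three components cover 90% of the variance, so a fourth is needed to pass the 0.95 mark.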
Feature Selection. (10 points) Using the full original dataset, discretize the murdPerPop attribute into 10 equal-frequency bins using unsupervised discretization. Use this discretized attribute as the target classification attribute. Apply Correlation Based Feature Selection (see Witten and Frank's textbook slides, Chapter 7, Slides 5-6). For this, use Weka's CfsSubsetEval, available under the "Select attributes" tab, with default parameters. Include in your report which attributes were selected by this method. Look at the results and elaborate on any interesting observations you can make about the results.
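Equal-frequency discretization can be sketched as a rank-based assignment. This is a simplified illustration with a hypothetical helper and toy values. It splits tied values across bins, whereas tools that place cut points between distinct values keep ties together, so results on an attribute with many repeated values may differ from this sketch.

```python
from collections import Counter


def equal_frequency_bins(values, bins=10):
    """Assign each value a bin index so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # The rank of a value among all values decides its bin.
        labels[i] = rank * bins // len(values)
    return labels


# Toy values (not real data): 20 distinct numbers, so each bin gets 2.
toy = [13, 2, 7, 19, 0, 5, 11, 16, 3, 8, 14, 1, 18, 6, 10, 4, 17, 9, 15, 12]
labels = equal_frequency_bins(toy, bins=10)
bin_sizes = Counter(labels)
```

Unlike equal-width bins, the bin boundaries here follow the data, so a heavily skewed attribute still yields bins of similar size.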