Question:

University Rankings. The dataset (Universities.csv) on American College and University Rankings (available from www.dataminingbook.com) contains information on 1302 American colleges and universities offering an undergraduate program. For each university, there are 17 measurements, including continuous measurements (such as tuition and graduation rate) and categorical measurements (such as location by state and whether it is a private or public school).

Note that many records are missing some measurements. Our first goal is to estimate these missing values from “similar” records. This will be done by clustering the complete records and then finding the closest cluster for each of the partial records. The missing values will be imputed from the information in that cluster.

a. Remove all records with missing measurements from the dataset. (Hint: Use the Filter Examples operator. Remember to set College Name as id role.)

b. For all the continuous measurements, run hierarchical clustering using complete linkage and Euclidean distance. Make sure to normalize the measurements. From the dendrogram: How many clusters seem reasonable for describing these data?

c. Compare the summary statistics for each cluster, and describe each cluster in this context (e.g., “Universities with high tuition, low acceptance rate…”). (Hint: To obtain cluster statistics for hierarchical clustering, first use the Flatten Clustering operator to obtain the clustered data as an Example Set. Then, use the De-Normalize and Apply Model operators on the flattened clustered data, and then compute cluster characteristics using the Aggregate operator.)

d. Use the categorical measurements that were not used in the analysis (State and Private/Public) to characterize the different clusters. Is there any relationship between the clusters and the categorical information? (Hint: Use the Join operator (inner) to combine the de-normalized clustered data with the original data containing the State and Private/Public attributes. Then use Turbo Prep or the Pivot operator.)

e. What other external information can explain the contents of some or all of these clusters?

f. Consider Tufts University, which is missing some information. Compute the Euclidean distance of this record from each of the clusters that you found above (using only the measurements that you have). Which cluster is it closest to? Impute the missing values for Tufts by taking the average of the cluster on those measurements. (Hint: Use the Cross Distances operator to compute distance, and use the Replace Missing Values operator for missing value imputation.)
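For readers working outside the operator-based tool that the hints refer to, part (a) might be sketched in Python with pandas as follows. The toy table, its column names, and its values are made up for illustration and are not taken from Universities.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Universities.csv; names and values are invented.
toy = pd.DataFrame(
    {
        "College Name": ["A", "B", "C", "D"],
        "Tuition": [10000.0, np.nan, 25000.0, 18000.0],
        "Graduation rate": [60.0, 75.0, np.nan, 82.0],
    }
)

# Use the college name as the row identifier (the analogue of the id role).
toy = toy.set_index("College Name")

# Keep only the records that have every measurement.
complete = toy.dropna()

# Set the partial records aside; these are the ones to impute later.
partial = toy[toy.isna().any(axis=1)]

print(complete)
print(partial)
```

With the real file, the same two calls would presumably follow a `pd.read_csv("Universities.csv")`, with the identifier column named as it appears in that file.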
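Part (b) asks for hierarchical clustering with complete linkage and Euclidean distance on normalized measurements. A minimal sketch with SciPy, assuming z-score normalization and using synthetic data (two invented groups, hypothetical column names) in place of the real records:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic continuous measurements: two well-separated groups of five records.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(10, 1, (5, 2))]),
    columns=["Tuition", "Graduation rate"],  # hypothetical names
)

# Normalize each column to z-scores so no measurement dominates the distance.
X_norm = (X - X.mean()) / X.std()

# Agglomerative clustering: complete linkage, Euclidean distance.
Z = linkage(X_norm.to_numpy(), method="complete", metric="euclidean")

# Column 2 of Z holds the merge heights; a large jump between the last few
# heights is one informal cue for how many clusters to keep.
merge_heights = Z[:, 2]

# Cut the tree into a chosen number of clusters (2 here, because the toy
# data was built with two groups).
labels = fcluster(Z, t=2, criterion="maxclust")
print(merge_heights)
print(labels)
```

If matplotlib is installed, `scipy.cluster.hierarchy.dendrogram(Z)` draws the tree from the same linkage matrix. The number of clusters for the real data remains a judgment made from that plot.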
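For part (c), the summary statistics are easiest to read in the original units rather than on the normalized scale. One possible sketch, assuming the cluster labels have already been attached to the un-normalized records (all names and values below are invented):

```python
import pandas as pd

# Hypothetical clustered records in original units.
clustered = pd.DataFrame(
    {
        "Tuition": [10000.0, 12000.0, 30000.0, 34000.0],
        "Acceptance rate": [80.0, 70.0, 30.0, 20.0],
        "cluster": [1, 1, 2, 2],
    },
    index=["A", "B", "C", "D"],
)

# Per-cluster mean, minimum, and maximum of every measurement.
summary = clustered.groupby("cluster").agg(["mean", "min", "max"])

# Cluster centroids in original units, useful for describing each cluster.
centroids = clustered.groupby("cluster").mean()

print(summary)
print(centroids)
```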
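Part (d) combines the cluster labels with the categorical attributes through an inner join and then tabulates them. A small sketch under the assumption that both tables share the college name as their index; the column names `State` and `Private/Public` and all values are illustrative only:

```python
import pandas as pd

# Hypothetical cluster assignments for the complete records.
labels = pd.DataFrame({"cluster": [1, 1, 2, 2]}, index=["A", "B", "C", "D"])

# Hypothetical categorical attributes from the original data; "E" stands for
# a record that was dropped earlier because of missing measurements.
categorical = pd.DataFrame(
    {
        "State": ["MA", "NY", "MA", "CA", "TX"],
        "Private/Public": ["Private", "Private", "Public", "Public", "Public"],
    },
    index=["A", "B", "C", "D", "E"],
)

# Inner join keeps only the colleges present in both tables.
joined = labels.join(categorical, how="inner")

# Counts of private and public schools within each cluster.
by_type = pd.crosstab(joined["cluster"], joined["Private/Public"])

# Counts of states within each cluster.
by_state = pd.crosstab(joined["cluster"], joined["State"])

print(by_type)
print(by_state)
```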
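Part (f) can be read as three steps: normalize the partial record with the same scaling as the complete records, measure its Euclidean distance to each cluster centroid over the available measurements only, and fill the gaps with the closest cluster's averages. The sketch below follows that reading on invented numbers. The record called `partial` is a made-up example, not the actual Tufts row, and the choice of z-scores is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical complete records (original units) and their cluster labels.
complete = pd.DataFrame(
    {
        "Tuition": [10000.0, 12000.0, 30000.0, 34000.0],
        "Acceptance rate": [80.0, 70.0, 30.0, 20.0],
        "Graduation rate": [55.0, 60.0, 90.0, 95.0],
    }
)
cluster = pd.Series([1, 1, 2, 2], index=complete.index)

# Hypothetical partial record with one missing measurement.
partial = pd.Series(
    {"Tuition": 31000.0, "Acceptance rate": 28.0, "Graduation rate": np.nan}
)

# Scale with the mean and standard deviation of the complete records.
mu, sigma = complete.mean(), complete.std()
centroids_z = ((complete - mu) / sigma).groupby(cluster).mean()
record_z = (partial - mu) / sigma

# Euclidean distance to each centroid, using only the available measurements.
available = partial.notna()
diffs = centroids_z.loc[:, available] - record_z[available]
distances = np.sqrt((diffs ** 2).sum(axis=1))
closest = distances.idxmin()

# Impute the missing values with the closest cluster's mean in original units.
cluster_means = complete.groupby(cluster).mean()
imputed = partial.fillna(cluster_means.loc[closest])

print(distances)
print(closest)
print(imputed)
```

Computing the distance on the normalized scale keeps it consistent with the scale used for clustering, while imputing from the un-normalized cluster means returns values in the original units.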

Related Book

Machine Learning For Business Analytics

ISBN: 9781119828792

1st Edition

Authors: Galit Shmueli, Peter C. Bruce, Amit V. Deokar, Nitin R. Patel
