
Question

1 Approved Answer

Posted on Sep 26, 2024


16.1 University Rankings. The dataset mlba::Universities on American College and University Rankings contains information on 1302 American colleges and universities offering an undergraduate program. For each university, there are 17 measurements, including continuous variables (such as tuition and graduation rate) and categorical variables (such as location by state and whether it is a private or public school).

Note that many records are missing some measurements. Our first goal is to estimate these missing values from "similar" records. This will be done by clustering the complete records and then finding the closest cluster for each of the partial records. The missing values will be imputed from the information in that cluster.

a. Remove all records with missing measurements from the dataset.

b. For all the continuous variables, run hierarchical clustering using complete linkage and Euclidean distance. Make sure to normalize the variables. From the dendrogram: how many clusters seem reasonable for describing these data?

c. Compare the summary statistics for each cluster, and describe each cluster in this context (e.g., "Universities with high tuition, low acceptance rate..."). Hint: To obtain cluster statistics for hierarchical clustering, use the aggregate function.

d. Use the categorical variables that were not used in the analysis (State and Private/Public) to characterize the different clusters. Is there any relationship between the clusters and the categorical information?

e. What other external information can explain the contents of some or all of these clusters?

f. Consider Tufts University, which is missing some information. Compute the Euclidean distance of this record from each of the clusters that you found above (using only the variables that you have). Which cluster is it closest to? Impute the missing values for Tufts by taking the average of the cluster on those variables.

# a
data <- mlba::Universities
# Remove records with missing measurements
complete_data <- na.omit(data)
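Before dropping rows in part a, it can help to see how much is missing and where. The following is a minimal sketch on a made-up data frame (toy and its columns are hypothetical, not taken from the Universities data); it counts missing values per column and confirms that complete.cases() keeps the same rows as na.omit().

```r
# Hypothetical toy data frame with a few missing values
toy <- data.frame(
  tuition   = c(12000, NA, 9000, 15000),
  grad_rate = c(80, 65, NA, 90),
  state     = c("MA", "NY", "CA", "MA")
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep only rows with no missing values
complete_toy <- toy[complete.cases(toy), ]
print(complete_toy)

# na.omit() drops the same rows
same_rows <- identical(rownames(complete_toy), rownames(na.omit(toy)))
print(same_rows)
```

On the real data the same two calls would be applied to data instead of toy.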

# b
# The 17 continuous measurements (confirm the exact names with names(data))
continuous_vars <- complete_data[, c(
  "X..appli..rec.d", "X..appl..accepted", "X..new.stud..enrolled",
  "X..new.stud..from.top.10.", "X..new.stud..from.top.25.",
  "X..FT.undergrad", "X..PT.undergrad",
  "in.state.tuition", "out.of.state.tuition", "room", "board",
  "add..fees", "estim..book.costs", "estim..personal..",
  "X..fac..w.PHD", "stud..fac..ratio", "Graduation.rate"
)]
# Normalize the continuous variables
normalized_continuous <- scale(continuous_vars)
# Perform hierarchical clustering (complete linkage; dist() defaults to Euclidean distance)
hclust_result <- hclust(dist(normalized_continuous), method = "complete")
# Plot dendrogram
plot(hclust_result)
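Part b asks how many clusters look reasonable from the dendrogram. One common heuristic, sketched here on synthetic data rather than the university records (toy_points and its two groups are made up for illustration), is to look for a large drop between the last few merge heights and to outline a candidate cut with rect.hclust().

```r
# Synthetic stand-in for the normalized data: two loose groups of 20 points
set.seed(1)
toy_points <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
toy_scaled <- scale(toy_points)

hc <- hclust(dist(toy_scaled, method = "euclidean"), method = "complete")

# Largest merge heights first; a big gap between consecutive heights
# suggests cutting the tree between those two merges.
top_heights <- sort(hc$height, decreasing = TRUE)[1:5]
gaps <- -diff(top_heights)
print(round(top_heights, 2))
print(round(gaps, 2))

# Draw the dendrogram and outline a candidate 2-cluster cut
plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = 2, border = "red")

# Cluster sizes for that cut
k2 <- cutree(hc, k = 2)
print(table(k2))
```

This is only a guide: with real data the gaps are rarely clean, and the choice of k should also reflect whether the resulting clusters are interpretable.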

# c
num_clusters <- 4
cluster_assignments <- cutree(hclust_result, k = num_clusters)
# Combine cluster assignments with data
clustered_data <- cbind(complete_data, Cluster = cluster_assignments)
# Compute summary statistics for each cluster
cluster_stats <- aggregate(continuous_vars, by = list(Cluster = cluster_assignments), FUN = summary)
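For describing clusters in part c, per-cluster means in the original units are often easier to read than a full summary() per variable. The sketch below uses made-up data (the tuition and grad_rate columns of toy are hypothetical) and shows aggregate() with FUN = mean alongside the cluster sizes.

```r
# Hypothetical data: one low-tuition group and one high-tuition group
set.seed(42)
toy <- data.frame(
  tuition   = c(rnorm(10, 8000, 500), rnorm(10, 20000, 1000)),
  grad_rate = c(rnorm(10, 55, 5),     rnorm(10, 85, 4))
)

# Cluster on normalized values, as in part b
hc <- hclust(dist(scale(toy)), method = "complete")
cl <- cutree(hc, k = 2)

# Per-cluster means, reported in the original (unscaled) units
cluster_means <- aggregate(toy, by = list(Cluster = cl), FUN = mean)
cluster_sizes <- table(Cluster = cl)
print(cluster_means)
print(cluster_sizes)
```

Reading across a row of cluster_means gives the kind of description the exercise asks for, such as a cluster with high tuition and a high graduation rate.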

# d
categorical_vars <- complete_data[, c("State", "Public..1...Private..2.")]
clustered_categorical <- cbind(categorical_vars, Cluster = cluster_assignments)
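Binding the categorical columns to the cluster labels is only the first step of part d; the relationship itself shows up in a cross-tabulation. A minimal sketch with invented values (the toy data frame and its labels are hypothetical) using table() and prop.table():

```r
# Hypothetical cluster labels with two categorical variables
toy <- data.frame(
  Cluster = c(1, 1, 1, 2, 2, 2, 2, 3),
  Private = c("Public", "Public", "Private", "Private",
              "Private", "Private", "Public", "Private"),
  State   = c("MA", "NY", "MA", "CA", "MA", "NY", "CA", "MA")
)

# Counts of public vs. private schools within each cluster
counts <- table(Cluster = toy$Cluster, Private = toy$Private)
print(counts)

# Row proportions: share of each school type within a cluster
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))

# The same idea applied to State
state_counts <- table(Cluster = toy$Cluster, State = toy$State)
print(state_counts)
```

A cluster whose row is dominated by one category (for example, mostly private schools) indicates that the continuous variables alone already separate that category.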

# f
# Tufts has missing values, so na.omit() already dropped it from complete_data;
# look it up in the full dataset instead.
tufts <- data[data$College.Name == "Tufts University", ]
# Same continuous columns that were used for clustering
tufts_continuous <- tufts[, names(continuous_vars)]
tufts_distances <- sapply(1:num_clusters, function(i) {
  cluster_mean <- colMeans(continuous_vars[cluster_assignments == i,
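The posted code for part f breaks off inside the distance computation. The sketch below is not a recovery of the missing lines; it is one way the step could be carried out, shown on synthetic data (complete, partial, and their three columns are all hypothetical). It assumes the partial record is normalized with the means and standard deviations of the complete records, that distances use only the non-missing variables, and that missing values are filled with the closest cluster's mean in the original units.

```r
# Hypothetical complete records: two groups on three variables
set.seed(7)
complete <- data.frame(
  tuition   = c(rnorm(10, 8000, 500), rnorm(10, 20000, 1000)),
  grad_rate = c(rnorm(10, 55, 5),     rnorm(10, 85, 4)),
  sf_ratio  = c(rnorm(10, 18, 2),     rnorm(10, 10, 1.5))
)

# Normalize and keep the centering/scaling constants for reuse
scaled  <- scale(complete)
centers <- attr(scaled, "scaled:center")
sds     <- attr(scaled, "scaled:scale")
cl      <- cutree(hclust(dist(scaled), method = "complete"), k = 2)

# Hypothetical partial record with one missing measurement
partial <- c(tuition = 19500, grad_rate = 88, sf_ratio = NA)
known   <- !is.na(partial)

# Put the partial record on the same scale as the clustered data
partial_scaled <- (partial - centers) / sds

# Cluster centroids in normalized space
centroids <- aggregate(as.data.frame(scaled), by = list(Cluster = cl), FUN = mean)

# Euclidean distance to each centroid, using only the known variables
distances <- sapply(seq_len(nrow(centroids)), function(i) {
  centroid <- unlist(centroids[i, names(partial)])
  sqrt(sum((partial_scaled[known] - centroid[known])^2))
})
closest <- centroids$Cluster[which.min(distances)]

# Impute missing values with the closest cluster's mean (original units)
raw_means <- colMeans(complete[cl == closest, , drop = FALSE])
imputed <- partial
imputed[!known] <- raw_means[!known]

print(distances)
print(closest)
print(imputed)
```

Comparing raw, unscaled values against centroids would let large-valued variables such as tuition dominate the distance, which is why the record is rescaled first.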
