The Kinder Institute at Rice University in Houston has surveyed people in the Houston area for several decades on a wide range of issues. An explanation of the research, along with a “codebook” for the survey questions through the years, is available in the file Houston Area Survey Codebook (1982-2014).pdf. A subset of the data from the most recent survey in 2014 is available in the file C14_01.xlsx. This file contains data for about 1750 respondents on over 80 variables (survey questions).

An explanation of the variables is included in the Variables sheet. These variables have been color-coded according to several broad topics and categories within these topics. This sheet lists the survey questions and the possible responses. Most of the questions have only a few possible responses, such as Male/Female or Better off/About the same/Worse off, but a few have numeric responses with many possible values.

In the full data set from 2014, there were many more variables, but about a third of them were deleted to produce the Excel file for this case. Many of the deleted variables had a large number of missing values, which made them unsuitable for our purposes. However, a few of the remaining variables also have some missing values. In addition, most of the remaining variables have several RF/DK values. These correspond to “Refused to answer” or “Didn’t know” responses. The last two columns of the Variables sheet indicate the number of RF/DK and Blank cells for each variable.
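As a rough illustration of the RF/DK and Blank tallies described above, the following Python sketch counts both kinds of cells per variable. The column names and rows are hypothetical stand-ins, not taken from C14_01.xlsx, and it assumes the refusals and "don't know" answers appear in the cells as the literal text "RF/DK".

```python
from collections import Counter

# Hypothetical rows standing in for a few survey columns. The real
# variable names and response labels are listed in the Variables sheet.
rows = [
    {"gender": "Male", "outlook": "Better off"},
    {"gender": "Female", "outlook": "RF/DK"},
    {"gender": "Female", "outlook": ""},
    {"gender": "Male", "outlook": "Worse off"},
    {"gender": "RF/DK", "outlook": "About the same"},
]


def missing_summary(rows, rf_dk_label="RF/DK"):
    """Count valid, RF/DK, and blank cells for each column.

    rf_dk_label is an assumption about how refusals are coded.
    """
    summary = {}
    for row in rows:
        for column, value in row.items():
            counts = summary.setdefault(column, Counter())
            text = "" if value is None else str(value).strip()
            if text == "":
                counts["blank"] += 1
            elif text == rf_dk_label:
                counts["rf_dk"] += 1
            else:
                counts["valid"] += 1
    return summary


summary = missing_summary(rows)
for column, counts in summary.items():
    print(column, dict(counts))
```

Tallies like these help decide, variable by variable, whether to drop the affected rows or to treat RF/DK as a response category of its own.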

The last sheet in the file contains a simple pivot table for your convenience. It lets you see the counts of possible responses for any of the questions. In this way, you can quickly get some insights into the data.
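The counts that such a pivot table produces can also be reproduced outside Excel. The sketch below is a minimal Python version, using made-up rows and hypothetical column names: it counts the responses to one question and then breaks one question down by another, which is what a pivot table with a row field and a column field does.

```python
from collections import Counter

# Made-up responses; the column names are hypothetical placeholders.
rows = [
    {"gender": "Male", "outlook": "Better off"},
    {"gender": "Female", "outlook": "Better off"},
    {"gender": "Female", "outlook": "Worse off"},
    {"gender": "Male", "outlook": "About the same"},
    {"gender": "Female", "outlook": "Better off"},
    {"gender": "Male", "outlook": "Worse off"},
]


def response_counts(rows, question):
    """One-way count of responses, like a pivot table with one row field."""
    return Counter(row[question] for row in rows)


def cross_tab(rows, row_question, column_question):
    """Two-way count: one Counter of column responses per row response."""
    table = {}
    for row in rows:
        cell = table.setdefault(row[row_question], Counter())
        cell[row[column_question]] += 1
    return table


outlook_counts = response_counts(rows, "outlook")
by_gender = cross_tab(rows, "gender", "outlook")

print(dict(outlook_counts))
for gender, counts in sorted(by_gender.items()):
    print(gender, dict(counts))
```

Dividing each cell by its row total turns the counts into row percentages, which are usually easier to compare across groups of different sizes.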

In terms of real-world surveys of this type, this Excel file is not at all large. It has “only” about 1750 rows and about 80 columns. Still, this data set is large enough to raise the questions, “What data mining questions should I ask?” or “Where do I begin?” We won’t provide answers to these questions. The only guidelines we will provide are the following:

Use pivot tables (or PowerPivot) to find interesting breakdowns of the data.

Choose a dependent variable with only two possible values and a set of explanatory variables (which doesn’t need to be all of the potential variables). Then use one or more classification methods to “explain” the dependent variable.

Use a subset (of your choosing) to run a cluster analysis on the data. You can choose the number of clusters. Once you have created the clusters, try to explain what each one is all about.
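To make the classification guideline above concrete, here is a minimal Python sketch of one possible method, a categorical naive Bayes classifier with add-one smoothing. The case does not prescribe this method; it is chosen here only because most of the survey variables are categorical. The dependent variable (owns_home), the explanatory variables, and all rows are hypothetical, and rows with RF/DK or blank values are assumed to have been removed beforehand.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training rows: a two-valued dependent variable ("owns_home")
# and two categorical explanatory variables. None come from the real file.
rows = [
    {"age_group": "Older", "outlook": "Better off", "owns_home": "Yes"},
    {"age_group": "Older", "outlook": "About the same", "owns_home": "Yes"},
    {"age_group": "Older", "outlook": "Better off", "owns_home": "Yes"},
    {"age_group": "Younger", "outlook": "Worse off", "owns_home": "No"},
    {"age_group": "Younger", "outlook": "About the same", "owns_home": "No"},
    {"age_group": "Younger", "outlook": "Better off", "owns_home": "No"},
    {"age_group": "Older", "outlook": "Worse off", "owns_home": "No"},
    {"age_group": "Younger", "outlook": "Better off", "owns_home": "Yes"},
]
features = ["age_group", "outlook"]


def train_naive_bayes(rows, target, features):
    """Collect the counts needed for categorical naive Bayes."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (feature, class) -> value counts
    feature_values = defaultdict(set)    # feature -> distinct values seen
    for row in rows:
        for feature in features:
            value_counts[(feature, row[target])][row[feature]] += 1
            feature_values[feature].add(row[feature])
    return class_counts, value_counts, feature_values


def predict(model, row, features):
    """Return the class with the highest smoothed log posterior score."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for feature in features:
            count = value_counts[(feature, label)][row[feature]]
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((count + 1) / (n + len(feature_values[feature])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(rows, "owns_home", features)
print(predict(model, {"age_group": "Older", "outlook": "Better off"}, features))
print(predict(model, {"age_group": "Younger", "outlook": "Worse off"}, features))
```

With real data, the rows would be split into training and testing sets so that the classification accuracy is measured on respondents the model has not seen.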

Step by Step Solution

Step: 1

Question 1: Use pivot tables or PowerPivot to find interesting breakdowns of the data.

The simplest way to create a pivot table is to select all of the data in the sheet, including the headers, and then c...