Framingham Heart Study: Data Preparation Activity

This activity consists of two parts. Part one outlines how to explore the data to understand the variables for analysis. Part two outlines how to prepare the data for future analyses by creating new variables and subsetting the data.

Part 1: Understanding the Variables

Deciding on an appropriate path for analysis often requires many steps. An important first step is exploring and examining the data. An initial exploratory data analysis provides an understanding of the meaning of the study variables and can provide crucial clues about the data preparation needed before analysis.

1. Open and examine the dataset and its variables. Familiarize yourself with the context and meanings behind the variables and their values.
   a. How many observations are in the dataset? How many variables are in the dataset? How many are numeric? How many are character?

Exploring the assigned values of character variables can reveal patterns and inherent orderings. The default ordering of levels in SAS is alphabetical. The levels of many character variables, however, have an inherent ordering of magnitude. For example, non-smokers smoke less than light smokers, who smoke less than moderate smokers.

2. Tabulate the levels of the character variables in the dataset. For each of the character variables:
   a. What data values or levels are observed?
   b. Which variables have an inherent ordering of magnitude? Does alphabetical order of the levels correspond to ordering the levels by magnitude for any of these character variables?

Examining the values of numeric variables can provide insight into their magnitude, spread, and symmetry. Variables with a symmetric distribution have roughly equal mean and median, so they can be summarized with either statistic. A substantial difference between the mean and median indicates a non-symmetric distribution; such variables may be better summarized with the median.
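To make the mean-versus-median comparison concrete, here is a minimal sketch. The activity itself is carried out in SAS; Python's standard library is used here only to illustrate the logic. Both lists of values and the helper `summarize` are made up for illustration and are not taken from the Framingham data.

```python
from statistics import mean, median

# Hypothetical values, not Framingham data.
symmetric = [60, 65, 70, 75, 80]          # evenly spread around 70
right_skewed = [0, 0, 0, 5, 10, 20, 60]   # a few large values pull the mean up


def summarize(values):
    """Return (mean, median) so the two can be compared side by side."""
    return mean(values), median(values)


sym_mean, sym_median = summarize(symmetric)
skew_mean, skew_median = summarize(right_skewed)

print(f"symmetric:    mean={sym_mean:.1f}, median={sym_median}")
print(f"right-skewed: mean={skew_mean:.1f}, median={skew_median}")
```

In the symmetric list the two statistics coincide, while in the right-skewed list the mean sits well above the median, which is the pattern to look for in the descriptive statistics.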
Additionally, some numeric variables may have few unique values and so could be better summarized as categorical variables.

3. Generate descriptive statistics and histograms for the numeric variables in the dataset.
   a. What are the minimum, maximum, median, mean, standard deviation, skewness, and kurtosis of each variable?
   b. Describe the distribution of each variable.
   c. Are there any variables that may be better suited to analysis as categorical rather than continuous variables? Defend your choice(s).

The dataset contains several categorical variables whose levels were originally created from values of continuous variables in the dataset. Understanding the relationships between related continuous and categorical predictors in a dataset can inform choices of predictors in later statistical analyses.

4. Explore the variables Weight_Status, Smoking_Status, Chol_Status, and BP_Status as follows:
   a. Variables Weight_Status, MRW, and Weight:
      i. What are the ranges (minimum and maximum) of variables MRW and Weight for each level of Weight_Status?
      ii. Are the ranges of MRW for levels of Weight_Status overlapping?
      iii. Are the ranges of Weight for levels of Weight_Status overlapping?
      iv. Based on your answers to the previous two questions, which values, MRW or Weight, were used to create the levels of Weight_Status when this dataset was created?
   b. Variables Smoking_Status and Smoking:
      i. Which values of Smoking are categorized as Smoking_Status=Non-smoker? Light? Moderate? Heavy? Very Heavy?
      ii. Are any values of Smoking categorized into more than one level of Smoking_Status?
   c. Variables Chol_Status and Cholesterol:
      i. What are the ranges (minimum and maximum) of Cholesterol for each level of Chol_Status?
      ii. Are the ranges of Cholesterol for levels of Chol_Status overlapping?
   d. Variables BP_Status, Diastolic, and Systolic:
      i. What are the ranges (minimum and maximum) of Diastolic and Systolic for each level of BP_Status?
      ii. Are the ranges of Diastolic for levels of BP_Status overlapping?
      iii. Are the ranges of Systolic for levels of BP_Status overlapping?
      iv. Normal levels of blood pressure are usually defined as under 120 for systolic blood pressure and under 80 for diastolic blood pressure. Based on your answers to the previous questions, must one or both of systolic and diastolic blood pressure be high for an individual to be categorized as BP_Status=High?

Exploring patterns of missingness in a dataset gives insight into the data collection procedures of the study that generated the dataset and may also indicate data entry or data collection errors.

5. Examine missing data.
   a. Which variables have no missing data?
   b. Which variables have missing data?
   c. For each variable with missing data, what percent of the data is missing?
   d. Analyze DeathCause and AgeAtDeath grouped by Status.
      i. Are DeathCause and AgeAtDeath ever missing when Status=Dead?
      ii. Are DeathCause and AgeAtDeath ever non-missing when Status=Alive?

Missing values can also impact later statistical analyses.
By default, most SAS statistical procedures perform what is called a complete case analysis: any observation with a missing value for any variable involved in the analysis is excluded. Such exclusions can substantially decrease the number of observations from the dataset that are used in a later statistical analysis.
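
The effect of complete case analysis on sample size can be sketched as follows. This is an illustration of the logic only, written with Python's standard library rather than SAS. The five rows are hypothetical (not Framingham data), `None` stands in for a missing value, and the helpers `complete_cases` and `missing_percent` are names invented for this sketch.

```python
# Hypothetical rows, not Framingham data; None marks a missing value.
rows = [
    {"Cholesterol": 210, "Smoking": 0, "Weight": 150},
    {"Cholesterol": None, "Smoking": 20, "Weight": 180},
    {"Cholesterol": 250, "Smoking": None, "Weight": 165},
    {"Cholesterol": 190, "Smoking": 5, "Weight": None},
    {"Cholesterol": 230, "Smoking": 10, "Weight": 172},
]


def complete_cases(data, variables):
    """Keep only rows with no missing value among the listed variables."""
    return [row for row in data if all(row[v] is not None for v in variables)]


def missing_percent(data, variable):
    """Percent of rows in which the given variable is missing."""
    n_missing = sum(1 for row in data if row[variable] is None)
    return 100 * n_missing / len(data)


one_var = complete_cases(rows, ["Cholesterol"])
all_vars = complete_cases(rows, ["Cholesterol", "Smoking", "Weight"])

print("rows in dataset:", len(rows))
print("usable with Cholesterol only:", len(one_var))
print("usable with all three variables:", len(all_vars))
```

Each variable in this toy example is missing in only one of the five rows, yet an analysis involving all three variables keeps just two rows, because the missing values fall on different observations.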