
Question

CCST 4085 Biostatistics
Assignment One
Due date

Instruction

Task: You need to write up an experimental design based on the Golub Case Study. You may use the statistical techniques from Week 1 to 3.

Format:
1. Background
2. Experimental objective
3. Experimental Design
4. Descriptive Statistics
5. Inferential Statistics
6. Future direction

Grading: Refer to Rubric

Background information

The 1999 Golub leukemia study represents one of the earliest applications of microarray technology for diagnostic purposes. At the time of the Golub study, no single diagnostic test was sufficient for distinguishing between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). To investigate whether gene expression profiling could be a tool for classifying acute leukemia type, Golub and co-authors used Affymetrix DNA microarrays to measure the expression level of 7,129 genes from children known to have either AML or ALL.

The goal of the study was to develop a procedure for distinguishing between AML and ALL based only on the gene expression levels of a patient. There are two major issues to be addressed:

1. Which genes are the most informative for making a prediction? If a gene is differentially expressed between individuals with AML versus ALL, then measuring the expression level of that gene may be informative for diagnosing leukemia type. For example, if a gene tends to be highly expressed in AML individuals, but only expressed at low levels in ALL individuals, it is more likely to be a reliable predictor of leukemia type than a gene that is expressed at similar levels in both AML and ALL patients.

2. How can leukemia type be predicted from expression data? Suppose that a patient's expression profile is measured for a group of genes. In an ideal scenario, all the genes measured would exhibit AML-like expression, or ALL-like expression, making a prediction obvious. In reality, however, a patient's expression profile will not follow an idealized pattern.
Some of the genes may have expression levels more typical of AML, while others may suggest ALL. It is necessary to clearly define a strategy for translating raw expression data into a prediction of leukemia type.

All datasets used in this lab are available from the oibiostat package. Phenotypic and expression data have been collected for 72 patients. The expression data from the 62 patients in golub.train will be used to identify informative genes for making a prediction. The prediction strategy will then be tested on the remaining 10 patients in golub.test.

Identifying informative genes

The discussion in the text begins by illustrating concepts using a simplified version of the dataset (golub.small) that contains only data from 10 patients and 10 genes. Here, instead of starting with golub.small, we will examine a random sample of 100 genes for all patients in golub.train. The methods from the initial analysis can then be applied to the data from all 7,129 genes.

1. Run the following code to load golub.train and create gene.matrix, which contains only the expression data and not the phenotype information in the first 6 columns.

    #load the data
    library(oibiostat)
    data(golub.train)
    gene.matrix = as.matrix(golub.train[, -(1:6)])

By using the - in front of the column numbers, the matrix notation specifies that columns 1 through 6 should not be included. The same matrix could be created by specifying that columns 7 through 7,135 should be included, with [, 7:7135].

2. Draw a random sample of 100 genes from the dataset.

    #create a vector of integers from 1 to the total number of genes
    gene.columns = 1:ncol(gene.matrix)
    #set the seed for a pseudo-random sample
    set.seed(2401)
    #sample 100 numbers from gene.columns, without replacement
    gene.index.set = sample(gene.columns, size = 100, replace = FALSE)
    #create a matrix with expression data from the columns specified by gene.index.set
    gene.matrix.sample = gene.matrix[, gene.index.set]

a) What are the first five values of gene.index.set? How were the numbers in gene.index.set chosen?
b) Why is it important to sample without replacement?
c) View gene.matrix.sample; what does it contain? How is gene.matrix.sample related to gene.index.set?
d) What are the first five gene names of the 100 genes sampled?
e) Plot a histogram showing the distribution of the expression levels of the second gene across patients. Describe the distribution.

3. Create a logical variable, leuk.type, that has value TRUE for AML and value FALSE for anything that is not AML (i.e., allT and allB). For a logical variable, R interprets TRUE as 1 and FALSE as 0.

    #create logical variable
    leuk.type = (golub.train$cancer == "aml")
    #view table of leukemia types
    table(leuk.type)
    #calculate sum of leuk.type
    sum(leuk.type)

a) When creating the logical variable, why not write "allT" or "allB" instead of "aml"?
b) How many patients have AML? How many have ALL?

4. Summarize the data separately for AML patients and for ALL patients.

a) The following code calculates the mean expression level for each sampled gene across AML patients, storing it in the variable aml.mean.expression. The apply() function executes a function across a matrix; in this case, the function is mean, and the 2 in the argument indicates that the function should be applied on each column (replacing the 2 with a 1 would result in the mean being calculated across the rows).

    #calculate mean expression level for each sampled gene across AML patients
    aml.mean.expression = apply(gene.matrix.sample[leuk.type == TRUE, ], 2, mean)

Run the code to create aml.mean.expression, then create all.mean.expression, a vector containing the mean expression levels for each gene in ALL patients.
b) Explain the logic behind the code to generate aml.mean.expression and all.mean.expression. In other words, what do the separate components instruct R to do?
c) View the contents of aml.mean.expression.
What is the average expression level of the first sampled gene across AML patients?

5. For each gene, compare the mean expression value among AML patients to the mean among ALL patients; calculate the differences in mean expression levels between AML and ALL patients.

    #calculate the differences
    diff.mean.expression = (aml.mean.expression - all.mean.expression)
    #view list as a matrix
    diff.mean.expression.matrix = as.matrix(diff.mean.expression)
    diff.mean.expression.matrix

a) What is the difference in mean expression level between AML and ALL for the first gene on the list; on average, is this gene more highly expressed in AML patients or ALL patients? Does it seem like this gene could be a good predictor of leukemia type? Why or why not?
b) Using numerical and graphical summaries, describe the distribution of differences in mean expression levels.

6. Identify the outliers. Run the following code to set up the definition of outliers as specified in Chapter 1 of OpenIntro Biostatistics:

    #define 3rd and 1st quartiles
    quart.3 = quantile(diff.mean.expression.matrix[ , 1], 0.75, na.rm = TRUE)
    quart.1 = quantile(diff.mean.expression.matrix[ , 1], 0.25, na.rm = TRUE)
    #define interquartile range
    iqr = quart.3 - quart.1
    #define upper and lower bound for outliers
    lb.outlier = quart.1 - 1.5*iqr
    ub.outlier = quart.3 + 1.5*iqr

The following code creates a list of the large outliers, genes with expression differences larger than ub.outlier:

    #creates list of large outliers
    which.large.out = diff.mean.expression > ub.outlier
    large.out = as.matrix(diff.mean.expression.matrix[which.large.out, ])
    large.out
    #creates ordered list of large outliers, from largest to smallest
    order.large.out = order(large.out[ , 1], decreasing = TRUE)   #assigns ordering to rows
    ordered.large.out = as.matrix(large.out[order.large.out, ])   #sorts

a) What are the upper and lower outlier bounds?
b) How many large outliers are present in the sample?
c) How many rows and columns does large.out have? Explain why.
d) View order.large.out. What do these numbers represent?
e) Modify the code to find small outliers. How many small outliers are present in the sample?
f) Which gene has the largest positive difference in mean expression between AML and ALL samples? Which gene has the largest negative difference in mean expression between AML and ALL samples?
g) In a research setting, it can also be useful to inspect the entire list and examine genes that are close to the outlier cutoff. Run the following code to order the entire list of expression differences in decreasing order:

    order.decreasing = order(diff.mean.expression.matrix[ , 1], decreasing = TRUE)
    ordered.outliers = as.matrix(diff.mean.expression.matrix[order.decreasing, ])

Which gene just missed the cutoff to qualify as a large outlier? Which gene is closest to the cutoff for qualifying as a small outlier?

7. The genes previously identified as outliers are only outliers of the specific 100 genes chosen in the sample. From the complete set of data in golub.train, identify the five largest outliers and five smallest outliers out of all 7,129 genes. (Hint: this can be done with only a few modifications to the code run for the initial analysis.)

[SAMPLE] Physicians' Reactions to Patient Size

Research Question

Do physicians discriminate against overweight patients? This study indicates that, at least in one respect, they do.

Background

Currently, almost one in every two Americans is overweight and one in every five is obese. These individuals face discrimination on a daily basis in employment, education, and relationship contexts. They are viewed as having a physical, moral and emotional impairment and there is a tendency for others to hold them responsible for their condition.
Physicians, who are trained to treat all their patients warmly and have access to literature suggesting uncontrollable and hereditary aspects of obesity, also believe obese individuals are undisciplined and suffer from controllability issues. The current research, conducted by Mikki Hebl and Jingping Xu, examines physicians' treatment of obesity in their patients more systematically by extending past research to look at physicians' behavioral intentions as well as their expressed attitudes toward male and female patients who are of average weight, overweight, or obese. Although past studies tend to compare only overweight and average-weight individuals, this study provides a novel look at multiple increments of overweight by including both overweight and obesity. However, to simplify the presentation of this case study, only the average and overweight conditions will be presented.

Experimental Design

A total of 122 primary care physicians affiliated with one of three major hospitals in the Texas Medical Center of Houston participated in the study. These physicians were sent a packet containing a medical chart similar to the one they view upon seeing a patient. This chart portrayed a patient who was displaying symptoms of a migraine headache but was otherwise healthy. Two variables (the gender and the weight of the patient) were manipulated across six different versions of the medical charts. The weight of the patient, described in terms of Body Mass Index (BMI), was average (BMI = 23), overweight (BMI = 30), or obese (BMI = 36). Physicians were randomly assigned to receive one of the six charts, were asked to look over the chart carefully, and then complete two medical forms. The first form asked physicians which of 42 tests they would recommend giving to the patient (see materials section for a copy of the medical form).
The second form asked physicians to indicate how much time they believed they would spend with the patient, and to describe the reactions that they would have toward this patient. In this presentation, only the question on how much time the physicians believed they would spend with the patient is analyzed. Although three patient weight conditions were used in the study (average, overweight, and obese), only the average and overweight conditions will be analyzed. Therefore, there are two levels of patient weight (average and overweight) and one dependent variable (time spent).

[Figure: frequency histograms of Time for the Average Weight and Overweight groups]

Box plots comparing the time expected to be spent with the average-weight and overweight patients are shown below.

[Figure: box plots of Time Expected to Spend for the Average and Overweight groups]

Analysis

1. Expected time spent was generally higher for the average-weight patients or overweight patients
2. The means, median and outlier of the box plot
3. The highest expected time was for a patient in the average weight or overweight group
4. Approximately what proportion of the average weight patients had higher scores than the median for the overweight patients?
5. The percentage of standard deviation

    Statistics    Average    Overweight
    N             33         38
    Mean          31.3636    24.7368
    Median        30.0000    25.0000
    Trimean       31.2500    25.0000
    Minimum       15.0000     5.0000
    Maximum       50.0000    60.0000
    25th Perc     25.0000    20.0000
    75th Perc     40.0000    30.0000
    sd             9.8641     9.6526
    sem            1.7171     1.5659
    Skew           0.2541     1.1562
    Kurtosis      -0.8646     3.08?6

Inferential Statistics

    Statistics    Average    Overweight
    N             33         38
    Mean          31.3636    24.7368
    Median        30.0000    25.0000
    Trimean       31.2500    25.0000
    Minimum       15.0000     5.0000
    Maximum       50.0000    60.0000
    25th Perc     25.0000    20.0000
    75th Perc     40.0000    30.0000
    sd             9.8641     9.6526
    sem            1.7171     1.5659
    Skew           0.2541     1.1562
    Kurtosis      -0.8646     3.08?6

An independent t test was used to test for differences between groups.
This test assumes normality and homogeneity of variance. Although the distributions are not quite normal, they are not so deviant as to make the test invalid. The standard deviations for the average weight and overweight conditions are 9.86 and 9.65 respectively. Therefore there is no reason to suspect a violation of the homogeneity of variance assumption.

Conclusion

The difference between means is significant, t(69) = 2.856, p = 0.0057. The 95% confidence interval on the difference between means extends from 1.9980 to 11.2556. Therefore, there is strong evidence that physicians expect to spend less time with overweight patients.

Future studies

Your recommendation

Reference

https://onlinestatbook.com/case_studies_rvls/weight/index.html
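
The t statistic and confidence interval reported in the sample conclusion can be checked from the summary table alone. The following is a minimal sketch in R, assuming a pooled-variance (equal-variance) two-sample t test, which is consistent with the reported 69 degrees of freedom. The variable names are illustrative and do not come from the case study; the numbers are the N, mean, and sd rows of the table.

```r
# Summary statistics taken from the case study table
n.avg  <- 33; mean.avg  <- 31.3636; sd.avg  <- 9.8641   # average-weight group
n.over <- 38; mean.over <- 24.7368; sd.over <- 9.6526   # overweight group

# Degrees of freedom for a pooled-variance two-sample t test
df <- n.avg + n.over - 2

# Pooled variance: weighted average of the two sample variances
pooled.var <- ((n.avg - 1) * sd.avg^2 + (n.over - 1) * sd.over^2) / df

# Standard error of the difference between the two means
se.diff <- sqrt(pooled.var * (1 / n.avg + 1 / n.over))

# Test statistic and two-sided p-value
mean.diff <- mean.avg - mean.over
t.stat    <- mean.diff / se.diff
p.value   <- 2 * pt(-abs(t.stat), df)

# 95% confidence interval for the difference between means
t.crit <- qt(0.975, df)
ci     <- mean.diff + c(-1, 1) * t.crit * se.diff

# Print the results for comparison with the reported values
round(c(t = t.stat, df = df, p = p.value, lower = ci[1], upper = ci[2]), 4)
```

With the raw data available, `t.test(time ~ group, var.equal = TRUE)` would be the usual route; `time` and `group` are hypothetical column names here, since the case study data are not included in the question.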
