Question

Posted on Jul 03, 2024

I have an assignment that requires an experiment design study based on the Golub study, and it must use R programming. The R script for the assignment is below, followed by the sample write-up, the grading rubric, and the assignment itself. Please look at the sample and do the same for Assignment 1. I need to find a research question based on the R script and the data provided. Please help, I'm very new to this course.

Script:

#Q1
#load the data
library(oibiostat)
data(golub.train)
#keep only the expression data, dropping the phenotype information in columns 1 to 6
genematrix = as.matrix(golub.train[ , -(1:6)])

#Q2
#create vector of integers from 1 to the number of genes
gene.columns = 1:ncol(genematrix)

#set seed for a pseudo-random sample
set.seed(2401)

#sample 100 numbers from gene.columns, without replacement
gene.index.set = sample(gene.columns, size = 100, replace = FALSE)

#create matrix with expression data from the columns specified by gene.index.set
gene.matrix.sample = genematrix[ , gene.index.set]
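The sampling step above can be tried on a small stand-in matrix before running it on the real data. The following is a minimal sketch that assumes nothing beyond base R; the matrix `toy.matrix`, its values, and its patient and gene names are made up for illustration and are not part of golub.train.

```r
# Synthetic stand-in for the expression matrix: 6 patients (rows) x 20 genes (columns).
# The values and names are invented for illustration only.
set.seed(1)
toy.matrix = matrix(rnorm(6 * 20), nrow = 6, ncol = 20,
                    dimnames = list(paste0("patient", 1:6), paste0("gene", 1:20)))

# Draw 5 distinct column indices, mirroring the script's sample() call
toy.index.set = sample(1:ncol(toy.matrix), size = 5, replace = FALSE)

# Subset the columns (genes); every patient row is kept
toy.sample = toy.matrix[ , toy.index.set]

dim(toy.sample)                     # 6 rows, 5 columns
anyDuplicated(toy.index.set) == 0   # TRUE: no gene was drawn twice
```

Because the indices select columns, the sampled matrix keeps one row per patient and one column per sampled gene.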

#2(a)
#view gene.index.set
gene.index.set[1:5]
#The first five values of gene.index.set are 4841 6731 6535 6743 4804.
#The numbers in gene.index.set were randomly sampled from the integers between
#1 and 7129.

#2(b)
#Sampling without replacement ensures the same number is not chosen twice.
#Since the numbers identify the column numbers of particular genes,
#sampling without replacement allows 100 distinct genes to be drawn from the
#pool of 7129 genes.

#2(c)
#The matrix contains gene expression data for the 100 genes corresponding to the
#columns specified by the numbers in gene.index.set, for each of the 62 patients.

#2(d)
colnames(gene.matrix.sample)[1:5]
#The first five column names are "X95406_at", "S82297_at", "X82850_s_at", "HG2239-HT2324_at", "X92475_at"

#2(e)
#Plot a histogram showing the distribution of the expression levels of the second gene across patients. Describe the distribution.
hist(gene.matrix.sample[ , 2])

#3
#create logical variable
leuk.type = (golub.train$cancer == "aml")
#view table of leukemia types
table(leuk.type)
#calculate sum of leuk.type
sum(leuk.type)

#3A When creating the logical variable, why not write "allT" or "allB" instead of "aml"?
#The purpose of the logical variable is to separate patients with AML from patients with ALL.
#ALL patients are labelled either allT or allB, so testing for only one of those labels
#would leave the other ALL subtype grouped with AML. Testing for "aml" makes
#only the AML category TRUE and all other patients FALSE.

#3B How many patients have AML? How many have ALL?
#42 patients have ALL and 20 patients have AML (see the table above).
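The reasoning in 3A can be checked on a handful of made-up labels. This is a sketch only: the vector `toy.cancer` is hypothetical and simply reuses the three label values (aml, allT, allB) that the assignment mentions.

```r
# Hypothetical cancer labels using the three label values named in the assignment
toy.cancer = factor(c("aml", "allB", "allT", "aml", "allB", "allB"))

# TRUE for AML, FALSE for every ALL subtype
toy.leuk.type = (toy.cancer == "aml")
sum(toy.leuk.type)       # 2 AML patients (TRUE counts as 1)
sum(!toy.leuk.type)      # 4 ALL patients (allB and allT together)

# Testing for a single ALL subtype lumps the other subtype in with AML as FALSE
toy.allB = (toy.cancer == "allB")
sum(!toy.allB)           # 3, which mixes 2 AML patients and 1 allT patient
```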

#4 Summarize the data separately for AML patients and for ALL patients.
#calculate mean expression level for each sampled gene across AML patients
aml.mean.expression = apply(gene.matrix.sample[leuk.type == TRUE, ], 2, mean)

#calculate mean expression level for each sampled gene across ALL patients
all.mean.expression = apply(gene.matrix.sample[leuk.type == FALSE, ], 2, mean)

#4(b)
#The command for calculating aml.mean.expression instructs R to calculate the mean down each column
#of gene.matrix.sample, including only values from rows for which
#leuk.type is TRUE. The command for all.mean.expression does the same for rows where leuk.type is FALSE.

#4(c)
as.matrix(aml.mean.expression)[1, ]
#The average expression level of the first sampled gene across AML patients is -483.96.
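To see what the row filter and the margin argument of apply() do, here is a minimal sketch on a toy matrix small enough to check by hand. The names `toy.expr` and `toy.is.aml` and all values are invented for illustration; the comparison with colMeans() is an aside, not something the lab asks for.

```r
# Toy matrix: 4 patients (rows) x 3 genes (columns); values are made up.
# matrix() fills column by column, so geneA = 10, 20, 30, 40 and so on.
toy.expr = matrix(c(10, 20, 30, 40,
                     1,  2,  3,  4,
                   100, 50,  0, 50),
                  nrow = 4, ncol = 3,
                  dimnames = list(NULL, c("geneA", "geneB", "geneC")))
toy.is.aml = c(TRUE, FALSE, TRUE, FALSE)

# Keep only the AML rows, then average down each column (MARGIN = 2)
toy.aml.mean = apply(toy.expr[toy.is.aml, ], 2, mean)
toy.all.mean = apply(toy.expr[!toy.is.aml, ], 2, mean)

toy.aml.mean   # geneA 20, geneB 2, geneC 50
toy.all.mean   # geneA 30, geneB 3, geneC 50

# colMeans() gives the same column averages
all.equal(toy.aml.mean, colMeans(toy.expr[toy.is.aml, ]))
```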

#5
#mean expression for ALL patients must exist before the differences are taken
all.mean.expression = apply(gene.matrix.sample[leuk.type == FALSE, ], 2, mean)
#calculate the differences
diff.mean.expression = (aml.mean.expression - all.mean.expression)
#view list as a matrix
diff.mean.expression.matrix = as.matrix(diff.mean.expression)
diff.mean.expression.matrix

#5(a)
diff.mean.expression.matrix[1, ]
#The difference in mean expression level between AML and ALL for the first gene on the list is 129.51,
#so on average the gene is more highly expressed in AML patients. It is
#not possible to tell whether this gene could be a good predictor without knowing the distribution of differences:
#a difference of 129.51 could be a typical mean expression difference between AML and ALL, or it
#might be extreme.

#5(b)
summary(diff.mean.expression)
par(mfrow = c(1, 2))
hist(diff.mean.expression)
boxplot(diff.mean.expression)
#From the histogram, a vast majority of the observations are between -2500 and 2500;
#more precisely, the middle 50% of the data are between -90.44 and 129.34.
#However, the boxplot indicates the presence of both small and large outliers;
#these genes may be useful for differentiating between AML and ALL.

#6
#define 3rd and 1st quartiles
quart.3 = quantile(diff.mean.expression.matrix[ , 1], 0.75, na.rm = TRUE)
quart.1 = quantile(diff.mean.expression.matrix[ , 1], 0.25, na.rm = TRUE)

#define interquartile range
iqr = quart.3 - quart.1

#define lower and upper bound for outliers
lb.outlier = quart.1 - 1.5*iqr
ub.outlier = quart.3 + 1.5*iqr
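The outlier rule above (quartile minus or plus 1.5 times the IQR) is easy to verify by hand on a short vector. The sketch below uses nine made-up differences; `toy.diff` and the other `toy.` names are illustrative only, and the quartiles shown rely on R's default quantile() method.

```r
# Made-up differences in mean expression for nine genes
toy.diff = c(-900, -40, -20, -10, 0, 10, 20, 40, 1200)

toy.q1 = quantile(toy.diff, 0.25)   # -20
toy.q3 = quantile(toy.diff, 0.75)   #  20
toy.iqr = toy.q3 - toy.q1           #  40

toy.lb = toy.q1 - 1.5 * toy.iqr     # -80
toy.ub = toy.q3 + 1.5 * toy.iqr     #  80

toy.diff[toy.diff > toy.ub]         # 1200 is a large outlier
toy.diff[toy.diff < toy.lb]         # -900 is a small outlier
```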

#create list of large outliers, genes with expression differences larger than
#ub.outlier
which.large.out = diff.mean.expression > ub.outlier
large.out = as.matrix(diff.mean.expression.matrix[which.large.out, ])
large.out

#create ordered list of large outliers, from largest to smallest
#assigns ordering to rows
order.large.out = order(large.out[ , 1], decreasing = TRUE)
#sorts outlier list
ordered.large.out = as.matrix(large.out[order.large.out, ])
ordered.large.out

#6(a)
#The upper bound is 459.01 and the lower bound is -420.11.

#6(b)
table(which.large.out)
#There are 10 large outliers.

#6(c)
#The matrix large.out has one column and 10 rows: one row for each large outlier,
#and a single column holding that gene's difference in mean expression.

#6(d)
order.large.out
#These numbers are row positions in large.out, listed in the order the rows
#would take if sorted in decreasing order, from largest to smallest.
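The point in 6(d), that order() returns positions and not sorted values, can be seen on a three-row example. This is a sketch with invented gene names and values (`geneX`, `geneY`, `geneZ`), not output from the Golub data.

```r
# One-column matrix of made-up large outliers; row names are the gene names
toy.out = as.matrix(c(geneX = 510, geneY = 980, geneZ = 640))

# order() returns row positions, not the sorted values themselves
toy.order = order(toy.out[ , 1], decreasing = TRUE)
toy.order                  # 2 3 1: row 2 is largest, then row 3, then row 1

# Indexing with those positions rearranges the rows
toy.ordered = as.matrix(toy.out[toy.order, ])
rownames(toy.ordered)      # "geneY" "geneZ" "geneX"
```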

#(e)
#create list of small outliers, genes with expression differences smaller than
#lb.outlier
which.small.out = diff.mean.expression < lb.outlier
small.out = as.matrix(diff.mean.expression.matrix[which.small.out, ])

#create the ordered list of small outliers, from smallest to largest
order.small.out = order(small.out[ , 1], decreasing = FALSE)
ordered.small.out = as.matrix(small.out[order.small.out, ])

#count the small outliers
table(which.small.out)

#(f)
#largest large outlier
ordered.large.out[1, ]

#smallest small outlier
ordered.small.out[1, ]

#(g)
order.decreasing = order(diff.mean.expression.matrix[ , 1], decreasing = TRUE)
ordered.outliers = as.matrix(diff.mean.expression.matrix[order.decreasing, ])

#pull the 11th largest value from the list, where decreasing = TRUE
ordered.outliers[11, ]

#pull the 13th smallest value from the list, where decreasing = FALSE
order.increasing = order(diff.mean.expression.matrix[ , 1], decreasing = FALSE)
ordered.outliers.inc = as.matrix(diff.mean.expression.matrix[order.increasing, ])
ordered.outliers.inc[13, ]

#The gene U38980_at, with a difference of 457.43, just missed the cutoff
#to qualify as a large outlier, while U47414_at, with a difference of -333.81, is
#closest to the cutoff for qualifying as a small outlier.

#7
#calculate mean expression level for each gene across AML patients and ALL patients
aml.mean.expression = apply(genematrix[leuk.type == TRUE, ], 2, mean)
all.mean.expression = apply(genematrix[leuk.type == FALSE, ], 2, mean)

#calculate the difference
diff.mean.expression = (aml.mean.expression - all.mean.expression)
diff.mean.expression.matrix = as.matrix(diff.mean.expression)

#define outlier bounds
quart.3 = quantile(diff.mean.expression.matrix[ , 1], 0.75, na.rm = TRUE)
quart.1 = quantile(diff.mean.expression.matrix[ , 1], 0.25, na.rm = TRUE)

iqr = quart.3 - quart.1
lb.outlier = quart.1 - 1.5*iqr
ub.outlier = quart.3 + 1.5*iqr

#identify large outliers
which.large.out = diff.mean.expression.matrix > ub.outlier
large.out = as.matrix(diff.mean.expression.matrix[which.large.out, ])

#order from largest to smallest; the comma keeps the gene names as row names
order.large.out = order(large.out[ , 1], decreasing = TRUE)
ordered.large.out = as.matrix(large.out[order.large.out, ])
ordered.large.out[1:5, ]

#identify small outliers
which.small.out = diff.mean.expression.matrix < lb.outlier
small.out = as.matrix(diff.mean.expression.matrix[which.small.out, ])

#order from smallest to largest
order.small.out = order(small.out[ , 1], decreasing = FALSE)
ordered.small.out = as.matrix(small.out[order.small.out, ])
ordered.small.out[1:5, ]
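Since step 7 repeats the step 5 and 6 pipeline on a larger matrix, the whole procedure can be wrapped in one function. The following is only a sketch under stated assumptions: `top.outlier.genes` is a hypothetical helper (it is not part of the lab or of oibiostat), it uses colMeans() and sort() in place of the script's apply() and order() calls, and it is demonstrated on synthetic data with two genes deliberately shifted in the AML group.

```r
# Hypothetical helper: returns the n most extreme large and small outliers
# of the AML - ALL differences in mean expression.
top.outlier.genes = function(expr, is.aml, n = 5) {
  # difference in column means between the two patient groups
  diffs = colMeans(expr[is.aml, , drop = FALSE]) -
          colMeans(expr[!is.aml, , drop = FALSE])
  # quartiles and the 1.5 * IQR outlier bounds
  q = quantile(diffs, c(0.25, 0.75), na.rm = TRUE)
  iqr = q[2] - q[1]
  large = sort(diffs[diffs > q[2] + 1.5 * iqr], decreasing = TRUE)
  small = sort(diffs[diffs < q[1] - 1.5 * iqr], decreasing = FALSE)
  list(large = head(large, n), small = head(small, n))
}

# Synthetic data: 10 patients x 50 genes, with two genes shifted in AML patients
set.seed(42)
toy.expr = matrix(rnorm(10 * 50), nrow = 10,
                  dimnames = list(NULL, paste0("gene", 1:50)))
toy.is.aml = rep(c(TRUE, FALSE), each = 5)
toy.expr[toy.is.aml, "gene7"]  = toy.expr[toy.is.aml, "gene7"]  + 10
toy.expr[toy.is.aml, "gene31"] = toy.expr[toy.is.aml, "gene31"] - 10

toy.result = top.outlier.genes(toy.expr, toy.is.aml)
toy.result
```

With the real data, the equivalent call would presumably be `top.outlier.genes(genematrix, leuk.type)`, using the objects created earlier in the script.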

[SAMPLE] Physicians' Reactions to Patient Size

Research Question

Do physicians discriminate against overweight patients? This study indicates that, at least in one respect, they do.

Background

Currently, almost one in every two Americans is overweight and one in every five is obese. These individuals face discrimination on a daily basis in employment, education, and relationship contexts. They are viewed as having a physical, moral and emotional impairment and there is a tendency for others to hold them responsible for their condition. Physicians -- people who are trained to treat all their patients warmly and have access to literature suggesting uncontrollable and hereditary aspects of obesity -- also believe obese individuals are undisciplined and suffer from controllability issues.

The current research, conducted by Mikki Hebl and Jingping Xu, examines physicians' treatment of obesity in their patients more systematically by extending past research to look at physicians' behavioral intentions as well as their expressed attitudes toward male and female patients who are of average weight, overweight, or obese. Although past studies tend to compare only overweight and average-weight individuals, this study provides a novel look at multiple increments of overweight by including both overweight and obesity. However, to simplify the presentation of this case study, only the average and overweight conditions will be presented.

Experimental Design

A total of 122 primary care physicians affiliated with one of three major hospitals in the Texas Medical Center of Houston participated in the study. These physicians were sent a packet containing a medical chart similar to the one they view upon seeing a patient. This chart portrayed a patient who was displaying symptoms of a migraine headache but was otherwise healthy. Two variables (the gender and the weight of the patient) were manipulated across six different versions of the medical charts.
The weight of the patient, described in terms of Body Mass Index (BMI), was average (BMI = 23), overweight (BMI = 30), or obese (BMI = 36). Physicians were randomly assigned to receive one of the six charts, were asked to look over the chart carefully, and then complete two medical forms. The first form asked physicians which of 42 tests they would recommend giving to the patient (see materials section for a copy of the medical form). The second form asked physicians to indicate how much time they believed they would spend with the patient, and to describe the reactions that they would have toward this patient.

In this presentation, only the question on how much time the physicians believed they would spend with the patient is analyzed. Although three patient weight conditions were used in the study (average, overweight, and obese), only the average and overweight conditions will be analyzed. Therefore, there are two levels of patient weight (average and overweight) and one dependent variable (time spent).

Example of a Female Patient Chart Record (all weight manipulations included):

[Chart image; the header fields are only partly legible in this copy: White, Female, Single; Allergies: None known; Medication: Tylenol 3.]

CC: Recent (single) episode of severe migraine headache one week ago. Aura preceded by vomiting for several hours. Patient, who is 25, had an earlier migraine two years ago. Was concerned about level of pain and possible problem. Pain gone at present. Headaches not typical. No family history of migraine headaches and no single event seemed to precipitate onset.
Medical: No severe disease reported except episode of migraine headache. No past surgical history. No known allergies.
Family: No family history of migraines. No heart problems and cancer history.
Social: Social drinker. Non-smoker.

Medical Procedure Form

Directions: Indicate all the procedures you would recommend for the patient.
Office procedures: Problem focused history; Comprehensive history; Problem focused exam; Comprehensive physical; Menstrual cycle info; Pelvic exam; Dietary info; Eye test; Reflex test; Stress assessment; Hearing exam; Visual screen; Skin test

Lab procedures: Blood hormone level; Beta strep; Cholesterol; Triglycerides; CBC with diff; Body fat; Glucose; Metabolic panel; Blood typing; Pregnancy test; Urinalysis; X-ray; CT scan; MRI; Ultrasound

Treatments and referrals: Prophylactic therapy; Prescribe beta blockers; Prescribe anti-depressants; Prescribe pain pills; Refer to neurologist; Refer to cardiologist; Refer to exercise program; Refer for professional consultation; Genetic consultation; Dietary consultation; Weight loss consultation; Relaxation consultation; Psychological consultation; Mental health evaluation; Preventive medicine consultation

What is your overall medical evaluation of this patient?
Diagnosis; Medication; Follow-up schedule; Refer to M.D.; Physician signature

Descriptive Statistics

Histograms of the time expected to be spent with the average-weight and overweight patients are shown below.
[Figure: two histograms, "Average Weight" and "Overweight"; x-axis: Time (10 to 60), y-axis: Frequency.]

Box plots comparing the time expected to be spent with the average-weight and overweight patients are shown below.
[Figure: side-by-side box plots for the Average and Overweight groups; y-axis: Time Expected to Spend (0 to 60).]

Analysis
1. Was expected time spent generally higher for the average-weight patients or the overweight patients?
2. The means, medians and outliers of the box plots
3. Was the highest expected time for a patient in the average-weight or the overweight group?
4. Approximately what proportion of the average-weight patients had higher scores than the median for the overweight patients?
5. The percentage of standard deviation

Statistics    Average    Overweight
N             33         38
Mean          31.3636    24.7368
Median        30.0000    25.0000
Trimean       31.2500    25.0000
Minimum       15.0000    5.0000
Maximum       50.0000    60.0000
25th Perc     25.0000    20.0000
75th Perc     40.0000    30.0000
SD            9.8641     9.6526
SEM           1.7171     1.5659
Skew          0.2541     1.1562
Kurtosis      -0.8646    3.08?6

Inferential Statistics

An independent t test was used to test for differences between groups. This test assumes normality and homogeneity of variance. Although the distributions are not quite normal, they are not so deviant as to make the test invalid. The standard deviations for the average-weight and overweight conditions are 9.86 and 9.65 respectively. Therefore there is no reason to suspect a violation of the homogeneity of variance assumption.

Conclusion

The difference between means is significant, t(69) = 2.856, p = 0.0057. The 95% confidence interval on the difference between means extends from 1.9980 to 11.2556. Therefore, there is strong evidence that physicians expect to spend less time with overweight patients.

Future studies
Your recommendation

Reference
https://onlinestatbook.com/case_studies_rvls/weight/index.html

[RUBRIC]
Student's name:
Columns: Criterion; Poor; Acceptable; Excellent; Grade; Remark/Comments

Criterion: Identification of the Main Issues/Problems; Analysis of the Issues (total marks: 35%)
Poor: Identifies and understands few of the issues in the case study. Little or no analysis of the issues in the case study.
Acceptable: Identifies and understands some of the main issues in the case study. Incomplete or superficial analysis of the issues.
Excellent: Identifies and understands all of the main issues in the case study. Appropriate content for background information. Insightful and thorough analysis of all the issues.

Criterion: Accurate use and analysis of statistical techniques (total marks: 25%)
Poor: Limited use of the proposed statistical techniques accurately. Limited analysis of the proposed statistical techniques accurately.
Acceptable: Used some of the proposed statistical techniques accurately. Analyzed some of the proposed statistical techniques accurately.
Excellent: Used all the proposed statistical techniques accurately. Analyzed all the proposed statistical techniques accurately.

Criterion: Link to course materials and additional research; comments on effective solutions/strategies; comments on future work (total marks: 25%)
Poor: Incomplete research and links to any readings.
Acceptable: Limited research and documented links to any readings.
Excellent: Excellent research into the issues with clearly documented links to class (and/or outside) readings.

Criterion: Professionalism and accuracy of grammar, spelling, and formatting (total marks: 15%)
Poor: Numerous errors, unprofessional writing style, confusing formatting, and unstructured content.
Acceptable: Few errors, direct and concise writing style. Questionable choices in text and visual data formatting.
Excellent: No errors, direct and concise writing style. Professionally formatted with text and visual data.

[ASSIGNMENT]
CCST 4085 Biostatistics
Assignment One
Due date

Instruction

Task: You need to write up an experimental design based on the Golub Case Study. You may use the statistical techniques from Week 1 to 3.

Format:
1. Background
2. Experimental objective
3. Experimental Design
4. Descriptive Statistics
5. Inferential Statistics
6. Future direction

Grading: Refer to Rubric

Background information

The 1999 Golub leukemia study represents one of the earliest applications of microarray technology for diagnostic purposes. At the time of the Golub study, no single diagnostic test was sufficient for distinguishing between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).
To investigate whether gene expression profiling could be a tool for classifying acute leukemia type, Golub and co-authors used Affymetrix DNA microarrays to measure the expression level of 7,129 genes from children known to have either AML or ALL. The goal of the study was to develop a procedure for distinguishing between AML and ALL based only on the gene expression levels of a patient.

There are two major issues to be addressed:

1. Which genes are the most informative for making a prediction? If a gene is differentially expressed between individuals with AML versus ALL, then measuring the expression level of that gene may be informative for diagnosing leukemia type. For example, if a gene tends to be highly expressed in AML individuals, but only expressed at low levels in ALL individuals, it is more likely to be a reliable predictor of leukemia type than a gene that is expressed at similar levels in both AML and ALL patients.

2. How can leukemia type be predicted from expression data? Suppose that a patient's expression profile is measured for a group of genes. In an ideal scenario, all the genes measured would express AML-like expression, or ALL-like expression, making a prediction obvious. In reality, however, a patient's expression profile will not follow an idealized pattern. Some of the genes may have expression levels more typical of AML, while others may suggest ALL. It is necessary to clearly define a strategy for translating raw expression data into a prediction of leukemia type.

All datasets used in this lab are available from the oibiostat package. Phenotypic and expression data have been collected for 72 patients. The expression data from the 62 patients in golub.train will be used to identify informative genes for making a prediction. The prediction strategy will then be tested on the remaining 10 patients in golub.test.
Identifying informative genes

The discussion in the text begins by illustrating concepts using a simplified version of the dataset (golub.small) that contains only data from 10 patients and 10 genes. Here, instead of starting with golub.small, we will examine a random sample of 100 genes for all patients in golub.train. The methods from the initial analysis can then be applied to the data from all 7,129 genes.

1. Run the following code to load golub.train and create gene.matrix, which contains only the expression data and not the phenotype information in the first 6 columns.

    #load the data
    library(oibiostat)
    data(golub.train)
    gene.matrix = as.matrix(golub.train[ , -(1:6)])

By using the - in front of the column numbers, the matrix notation specifies that columns 1 through 6 should not be included. The same matrix could be created by specifying that columns 7 through 7,135 should be included, with [ , 7:7135].

2. Draw a random sample of 100 genes from the dataset.

    #create a vector of integers from 1 to the total number of genes
    gene.columns = 1:ncol(gene.matrix)
    #set the seed for a pseudo-random sample
    set.seed(2401)
    #sample 100 numbers from gene.columns, without replacement
    gene.index.set = sample(gene.columns, size = 100, replace = FALSE)
    #create a matrix with expression data from the columns specified by gene.index.set
    gene.matrix.sample = gene.matrix[ , gene.index.set]

a) What are the first five values of gene.index.set? How were the numbers in gene.index.set chosen?
b) Why is it important to sample without replacement?
c) View gene.matrix.sample; what does it contain? How is gene.matrix.sample related to gene.index.set?
d) What are the first five gene names of the 100 genes sampled?
e) Plot a histogram showing the distribution of the expression levels of the second gene across patients. Describe the distribution.

3. Create a logical variable, leuk.type, that has value TRUE for AML and value FALSE for anything that is not AML (i.e., allT and allB). For a logical variable, R interprets TRUE as 1 and FALSE as 0.

    #create logical variable
    leuk.type = (golub.train$cancer == "aml")
    #view table of leukemia types
    table(leuk.type)
    #calculate sum of leuk.type
    sum(leuk.type)

a) When creating the logical variable, why not write "allT" or "allB" instead of "aml"?
b) How many patients have AML? How many have ALL?

4. Summarize the data separately for AML patients and for ALL patients.

a) The following code calculates the mean expression level for each sampled gene across AML patients, storing it in the variable aml.mean.expression. The apply() function executes a function across a matrix; in this case, the function is mean, and the 2 in the argument indicates that the function should be applied on each column (replacing the 2 with a 1 would result in the mean being calculated across the rows).

    #calculate mean expression level for each sampled gene across AML patients
    aml.mean.expression = apply(gene.matrix.sample[leuk.type == TRUE, ], 2, mean)

Run the code to create aml.mean.expression, then create all.mean.expression, a vector containing the mean expression levels for each gene in ALL patients.
b) Explain the logic behind the code to generate aml.mean.expression and all.mean.expression. In other words, what do the separate components instruct R to do?
c) View the contents of aml.mean.expression. What is the average expression level of the first sampled gene across AML patients?

5. For each gene, compare the mean expression value among AML patients to the mean among ALL patients; calculate the differences in mean expression levels between AML and ALL patients.

    #calculate the differences
    diff.mean.expression = (aml.mean.expression - all.mean.expression)
    #view list as a matrix
    diff.mean.expression.matrix = as.matrix(diff.mean.expression)
    diff.mean.expression.matrix

a) What is the difference in mean expression level between AML and ALL for the first gene on the list; on average, is this gene more highly expressed in AML patients or ALL patients? Does it seem like this gene could be a good predictor of leukemia type? Why or why not?
b) Using numerical and graphical summaries, describe the distribution of differences in mean expression levels.

6. Identify the outliers. Run the following code to set up the definition of outliers as specified in Chapter 1 of OpenIntro Biostatistics:

    #define 3rd and 1st quartiles
    quart.3 = quantile(diff.mean.expression.matrix[ , 1], 0.75, na.rm = TRUE)
    quart.1 = quantile(diff.mean.expression.matrix[ , 1], 0.25, na.rm = TRUE)
    #define interquartile range
    iqr = quart.3 - quart.1
    #define upper and lower bound for outliers
    lb.outlier = quart.1 - 1.5*iqr
    ub.outlier = quart.3 + 1.5*iqr

The following code creates a list of the large outliers, genes with expression differences larger than ub.outlier:

    #creates list of large outliers
    which.large.out = diff.mean.expression > ub.outlier
    large.out = as.matrix(diff.mean.expression.matrix[which.large.out, ])
    large.out
    #creates ordered list of large outliers, from largest to smallest
    #assigns ordering to rows
    order.large.out = order(large.out[ , 1], decreasing = TRUE)
    #sorts outlier list
    ordered.large.out = as.matrix(large.out[order.large.out, ])
    ordered.large.out

a) What are the upper and lower outlier bounds?
b) How many large outliers are present in the sample?
c) How many rows and columns does large.out have? Explain why.
d) View order.large.out. What do these numbers represent?
e) Modify the code to find small outliers. How many small outliers are present in the sample?
f) Which gene has the largest positive difference in mean expression between AML and ALL samples? Which gene has the largest negative difference in mean expression between AML and ALL samples?
g) In a research setting, it can also be useful to inspect the entire list and examine genes that are close to the outlier cutoff. Run the following code to order the entire list of expression differences in decreasing order:

    order.decreasing = order(diff.mean.expression.matrix[ , 1], decreasing = TRUE)
    ordered.outliers = as.matrix(diff.mean.expression.matrix[order.decreasing, ])

Which gene just missed the cutoff to qualify as a large outlier? Which gene is closest to the cutoff for qualifying as a small outlier?

7. The genes previously identified as outliers are only outliers of the specific 100 genes chosen in the sample. From the complete set of data in golub.train, identify the five largest outliers and five smallest outliers out of all 7,129 genes. (Hint: this can be done with only a few modifications to the code run for the initial analysis.)