Question
Python Problem 6: Sanity-check the data
Modify dna_analysis.py so that it will calculate and print the following three quantities:
- the sum of: the A count, the C count, the G count, and the T count (store this in a new variable called sum_counts)
- the total_count variable (total number of nucleotides)
- the length of the nucleotides string variable. You can compute this with len(nucleotides).
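A minimal sketch of the three quantities listed above, not the assignment's actual code. The function name three_quantities and the per-letter count variables (a_count, c_count, g_count, t_count) are hypothetical, and the sketch assumes total_count is incremented once for every character in nucleotides. The print labels are placeholders, since the required output format is not reproduced here.

```python
def three_quantities(nucleotides):
    """Return (sum_counts, total_count, len(nucleotides)) for one string."""
    total_count = 0
    a_count = c_count = g_count = t_count = 0

    for base in nucleotides:
        # Assumption: total_count counts every character seen.
        total_count += 1
        if base == "A":
            a_count += 1
        elif base == "C":
            c_count += 1
        elif base == "G":
            g_count += 1
        elif base == "T":
            t_count += 1

    sum_counts = a_count + c_count + g_count + t_count
    return sum_counts, total_count, len(nucleotides)


# Tiny hand-checkable input (hypothetical data, not from a .fastq file).
sum_counts, total_count, length = three_quantities("ACGTTGCA")
print("sum_counts:", sum_counts)          # 8
print("total_count:", total_count)        # 8
print("len(nucleotides):", length)        # 8
```

A string this short is easy to count by hand, which makes it a reasonable first check before running on the real files.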
As you modify your program, make sure you are producing output that matches the output format shown above.
Then run dna_analysis.py on each of the eleven .fastq files you have been given.
For at least one .fastq file, at least one of these quantities will be different from the other two. In your answers.txt file, state which .fastq file(s) and which quantities differ. (If all three quantities are equal for each .fastq file, then your code contains a mistake.) In your answers.txt file, write a short paragraph that explains why these differ.
Explaining why (or debugging your code if all three quantities were the same in all .fastq files) might require you to do some detective work.
This exercise is meant to expose you to a situation you might encounter when processing a data file of your own. When your program does not give the results you expect, there are two likely sources of the problem. One is that your program contains a bug! Check your code carefully to be sure you are calculating all values correctly. We will talk about testing in more detail later, but for now, try walking through your code with a very small data set and calculating values by hand. A second source of unexpected results, very common with data files, is an incorrect assumption about the contents of the files. Examples include assuming that each line contains a certain number of characters or words, that all characters are uppercase or lowercase, or that values fall only within a certain range. If you wrote your program assuming something about your data files that was not correct, your program may not give correct results.
To track down a wrong assumption about a data file, think about ways you can modify your program to help you determine what is happening. This could include having it print out values when they do not meet some assumption you are making about the file. You could also try just loading a data file into a text editor and examining it with your eyes to see if you see something you did not expect. (Although if you try this approach we strongly suggest that you start with the smallest data file for which the three quantities are not all the same.) Another approach would be to modify your program, or create a new program, to compute the three quantities for each line of a data file separately (as opposed to for the file as a whole as you have been doing): if the quantities differ for an entire file, then they must differ for at least one specific line in that file. Examining that/those line(s) will help you understand the problem.
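The per-line approach described above could be sketched roughly as follows. This is an illustration under stated assumptions: the function name mismatched_lines is hypothetical, and the code assumes the common four-line FASTQ record layout in which the second line of each record holds the sequence. Within a single line, a count of every character equals the line's length, so the sketch only compares the A/C/G/T sum against the length.

```python
def mismatched_lines(lines):
    """Return (line_number, sequence, sum_counts, length) for each
    sequence line whose A/C/G/T sum differs from its length."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        # Assumption: four-line FASTQ records, sequence on line 2 of each.
        if line_number % 4 != 2:
            continue
        sequence = line.strip()
        sum_counts = sum(sequence.count(base) for base in "ACGT")
        if sum_counts != len(sequence):
            problems.append((line_number, sequence, sum_counts, len(sequence)))
    return problems


# Hypothetical two-record sample; the "?" is a placeholder for any
# character the program did not expect.
sample = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "AC?T\n", "+\n", "IIII\n",
]
for problem in mismatched_lines(sample):
    print(problem)  # (6, 'AC?T', 3, 4)
```

To run it on a real file, the list could be replaced with an open file object, which yields one line at a time. Starting with the smallest file that shows a mismatch keeps the output short.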
If all of the three quantities that you measured in problem 6 are the same, then it would not matter which one you used in the denominator when computing the GC content. However, you saw that the three quantities are not all the same. In answers.txt, state which of these quantities should be used in the denominator and which should not, and why.
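As a rough illustration of why the denominator matters, the sketch below computes GC content using the definition this problem gives, (G+C)/(A+C+G+T), and contrasts it with dividing by the string length. The function name gc_content and the sample string are hypothetical; the "?" characters stand in for anything that is not A, C, G, or T.

```python
def gc_content(nucleotides):
    """GC content as (G+C)/(A+C+G+T), per the definition in this problem."""
    a_count = nucleotides.count("A")
    c_count = nucleotides.count("C")
    g_count = nucleotides.count("G")
    t_count = nucleotides.count("T")
    sum_counts = a_count + c_count + g_count + t_count
    if sum_counts == 0:
        raise ValueError("no A, C, G, or T characters found")
    return (g_count + c_count) / sum_counts


sample = "GCAT??"  # hypothetical data with two unexpected characters
print(gc_content(sample))  # 0.5

# Dividing by the full length instead gives a smaller value here,
# because the two "?" characters inflate the denominator.
print((sample.count("G") + sample.count("C")) / len(sample))
```

When every character is A, C, G, or T, both denominators agree and the two results are identical.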
If your program incorrectly computed the GC content (which should be equal to (G+C)/(A+C+G+T)), then state that fact in your answers.txt file. Then, go back and correct your program, and also update any incorrect answers elsewhere in your answers.txt file. It is fine to change the code we provided you if needed.
If you are unsure if you are calculating things correctly, now would be a good time to validate your dna_analysis.py program's output using the Diff Checker. (See "Tips" at the top of this page for info on cutting and pasting things from the VSCode Terminal window. In particular, when copying and pasting into Diff Checker, be sure that you select the entire line as it will show differences in trailing spaces as a difference.) You can compare your output to the files given in the expected_output directory of the homework2 files. You have not yet completed the assignment, so your output will not be identical. But things like GC-content, AT-content and individual counts should be identical. You will produce the last two lines of output in the expected_output files in Problem 7 and Problem 8 below.