Question
The genetic code of all living organisms is represented by a long sequence of simple molecules called nucleotides, or bases, which make up DNA. There are four nucleotides: A, C, G, and T. The genetic code of a human is a string of 3.2 billion letters drawn from A, C, G, and T. In this problem we search for a substring of length k that occurs most frequently in the human genome. DNA is written as a string of the letters 'a', 'g', 'c', and 't', e.g.,

atcaatgatcaacgtaagcttctaagcatg
atcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttg
tatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttct
tggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccata
ttgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgt
ttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaa
gccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacga
tttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgac
tcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaat
gatcaagctgctgctcttgatcatcgtttc

1- Write a function that takes a long string of letters and creates a list of all possible k-letter sequences. For example, the string 'gcacttgcatgcac' has the following 3-letter sequences: [gca, cac, act, ctt, ttg, tgc, gca, cat, atg, tgc, gca, cac]. The function should find which of these sequences occurs most often. In this example, 'gca' appears three times and every other sequence appears at most twice. The function returns the most frequent substring and its count.

2- Write a function that calls the above function. We will search for the sub-sequence of k letters that occurs most frequently, where k varies between min_length and max_length. For example, if min_length=4 and max_length=8, the program first finds which 4-letter sequence occurs most often and with what frequency. It does the same for 5-, 6-, 7-, and 8-letter sequences. Then it prints, for example, 'cttt', 52.
This means that the highest frequency of occurrence was found among the 4-letter sequences, and the highest frequencies for the other lengths were lower than 52.

Here is an example file with the first 6000 basepairs of the DNA of a virus. Calling mostCommonSubstring(dna,4,9) on that dataset should run in no more than a minute or two. That's way too slow to use on a whole DNA sequence of millions of basepairs, but reasonable given how much Python you know now. Here's what the output of the program would look like:

Data file: initial6000.txt
mink: 4
maxk: 9
mostCommonSubstring(dna,4,9) = 'cttt', 52

Data file: initial6000.txt
mink: 3
maxk: 6
mostCommonSubstring(dna,3,6) = 'ttt', 176

Data file: initial6000.txt
mink: 5
maxk: 10
mostCommonSubstring(dna,5,10) = 'gcttt', 20

Guidance

Here are the steps I recommend to finish this project. It is unlikely you can do this all in one day, so plan accordingly. Create a flowchart or pseudo-code algorithm for each of the functions described in the following steps.

1. Write a program that sets a variable, dna, to a short string and also sets the variable, k, to a small integer. This program runs a loop printing out each k-length substring in dna.
2. Once that is working, modify the program to do more inside the loop. Now store the substring in a variable, target. Then make a nested loop that counts how many times the target is seen in dna. Print out that count.
3. Once that is working, modify the program to keep track of which target string produces the highest count. (Hint: use another pair of variables, highestCount and targetWithHighestCount, to keep track of the best target seen so far.) Print out targetWithHighestCount at the end of your program.
4. Now turn your working program from the step above into a function. Instead of setting dna and k upfront, make them parameters to the function. Instead of printing targetWithHighestCount, return it. Name your function mostCommonK.
5. Next, write a program that sets dna to a short-ish string and then calls mostCommonK repeatedly with different values for the k argument, say 2, 3, 4, 5, and 6.
6. Now do something similar to what you did in step #3 and print out the value for the k argument that caused mostCommonK to return the largest value.
7. Now change the program in the step above into a function, mostCommonSubstring.
8. Next, write a program that does what runMostCommonSubstring needs to do. It asks the user for the name of the file and the values for mink and maxk, opens the file and reads its contents into the variable dna, then calls mostCommonSubstring(dna, mink, maxk) and prints the answer. Test this program on a file with a small dna string in it before attempting the big one. Once it works on a small file, then try the big one. (See the first bullet below for file reading instructions.)
9. Change the previous step into a function, runMostCommonSubstring. Test it. And hand it in!

- The following code is for reading in a file. It assumes that the data file is in the same folder as your program and that it has a shortened name, 'initial6000.txt':

    f = open('initial6000.txt')
    dna = f.read()

  You can also give the location of the file if it is not in the same folder as your program. For example:

    f = open('F:/my_homework/python_examples/initial6000.txt')
    dna = f.read()

  After the read, the variable dna will have the entire 6000 character string as its value.
- Make sure that you have enough comments and docstrings, including the program header and function descriptions.

Submissions:
(1) Submit a flowchart for the function mostCommonK
(2) Submit your Python source file PA2-.py

Late Submissions: Within 2 days of the due date: 20% reduction. Submissions more than 48 hours after the due date will not be accepted without prior arrangement.
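
The recommended steps can be sketched roughly as follows. This is a minimal, unofficial sketch under stated assumptions, not the expected submission. It assumes that mostCommonK returns both the substring and its count (the guidance only mentions returning the substring), that overlapping occurrences are counted separately as in the 'gca' example, and that ties go to the first substring seen and to the smaller k. readDna is a hypothetical helper that is not part of the assignment; it strips whitespace on the assumption that the data file may contain line breaks.

```python
def mostCommonK(dna, k):
    """Return (substring, count) for the most frequent k-letter substring.

    Assumption: overlapping occurrences are counted separately, and the
    first substring to reach the highest count wins ties.
    """
    highestCount = 0
    targetWithHighestCount = ""
    lastStart = len(dna) - k  # last index where a k-letter window still fits
    for start in range(lastStart + 1):
        target = dna[start:start + k]
        # Nested loop: count every window of dna that equals target.
        count = 0
        for other in range(lastStart + 1):
            if dna[other:other + k] == target:
                count += 1
        if count > highestCount:
            highestCount = count
            targetWithHighestCount = target
    return targetWithHighestCount, highestCount


def mostCommonSubstring(dna, mink, maxk):
    """Try every k from mink to maxk and return the best (substring, count)."""
    bestTarget = ""
    bestCount = 0
    for k in range(mink, maxk + 1):
        target, count = mostCommonK(dna, k)
        # Strict comparison: on a tie the smaller k found earlier is kept.
        if count > bestCount:
            bestTarget = target
            bestCount = count
    return bestTarget, bestCount


def readDna(filename):
    """Hypothetical helper: read a file and drop spaces and line breaks."""
    with open(filename) as f:
        return "".join(f.read().split())


if __name__ == "__main__":
    sample = "gcacttgcatgcac"
    print(mostCommonK(sample, 3))             # ('gca', 3)
    print(mostCommonSubstring(sample, 3, 5))  # ('gca', 3)
```

Counting with a sliding window rather than str.count is a deliberate choice here: str.count only counts non-overlapping occurrences, so in a run such as 'ttttt' it finds 'ttt' once, while the window count finds it three times. The nested loop is quadratic in the length of dna, which fits the assignment's remark that a 6000-character input is feasible but a full genome is not.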