Question

1 Approved Answer

Posted on Sep 12, 2024

Inputs are the oriC_nl.txt, and outputs are the command needed for each question. There are no more information to give, the assignment looks exactly like

Inputs are the oriC_nl.txt, and outputs are the command needed for each question.

There are no more information to give, the assignment looks exactly like the wording as shown.

write Python command for each question a,b,c,d

A typical kind of sequence question might be: In prokaryotes DnaA is a protein that activates initiation of DNA replication. There are multiple DnaA binding sites and they are typically 9-basepair repeats upstream of the oriC site. There is also a DNA Unwinding Element (DUE), which is a tandem array of three 13-basepair AT-rich sequences. Given the oriC sequence of a prokaryote, find potential DnaA boxes and the DUE.

The file oriC_nl.txt contains the 540 bases in the oriC region of Vibrio cholera broken up in 20bp chunks, with newline characters at the end of each line. Write programs to do the following:

a. Reverse Complement:

Input: The oriC_nl.txt file

Output: The complementary strand, written both 5' to 3' and 3' to 5'

Bonus: Use dictionaries and the get function.

b. Sequence Frequency: Find the most frequent k-mers in a string.

Input: The file oriC_nl.txt and an integer k

Output: The most frequent k-mers in the input file

c. Pattern Matching: Find all occurrences of a pattern in a string.

Input: The file oriC_nl.txt and a Pattern string

Output: All starting positions where Pattern appears as a substring in the file.

d. Sequence Frequency with Gaps: Find the most frequent k-mers in a string with one allowed mismatch.

Input: The file oriC_nl.txt and an integer k

Output: The most frequent k-mer consensus sequences in the input file

The oriC region of Vibrio cholera: -> oriC_nl.txt

atcaatgatc aacgtaagct tctaagcatg atcaaggtgc tcacacagtt tatccacaac ctgagtggat gacatcaaga taggtcgttg tatctccttc ctctcgtact ctcatgacca cggaaagatg atcaagagag gatgatttct tggccatatc gcaatgaata cttgtgactt gtgcttccaa ttgacatctt cagcgccata ttgcgctggc caaggtgacg gagcgggatt acgaaagcat gatcatggct gttgttctgt ttatcttgtt ttgactgaga cttgttagga tagacggttt ttcatcactg actagccaaa gccttactct gcctgacatc gaccgtaaat tgataatgaa tttacatgct tccgcgacga tttacctctt gatcatcgat ccgattgaag atcttcaatt gttaattctc ttgcctcgac tcatagccat gatgagctct tgatcatgtt tccttaaccc tctatttttt acggaagaat gatcaagctg ctgctcttga tcatcgtttc image text in transcribed

main.py oriC_nl.tx sequence-file open('oric.nl.txt','r') 2 int(raw_input ("Size k-mer to search for? ")) kmer num-mismatch of = int (raw_input("Number of mismatches tolerated in the kmer? ") = "" 6 sequence = 7-for line in sequence_file: sequence +-...join(line.split()).lower() 10 main.py oriC_nl.tx sequence-file open('oric.nl.txt','r') 2 int(raw_input ("Size k-mer to search for? ")) kmer num-mismatch of = int (raw_input("Number of mismatches tolerated in the kmer? ") = "" 6 sequence = 7-for line in sequence_file: sequence +-...join(line.split()).lower() 10