Question

Finding genes in DNA is a fundamental problem in biology. After all, only a few percent of human DNA actually contains protein-coding genes. Sifting through the more than 3 billion base pairs to find the parts that are most likely to be genes requires some computational craftiness. That's where we're headed!

In this problem you'll write Python functions to find the longest open reading frame (ORF) in a DNA sequence. This is closely related to the problem of finding all the genes in a sequence, which is the topic of the final problem for this part.

All of your code for this problem should be in a file called orf.py.

In this problem, you'll make use of some of the functions that you wrote in the last problem. To get those functions, put the dna.py homework file in the same directory as the file for this problem, orf.py. Then, include the following line at the top of orf.py:

from dna import *

This will allow you to access those functions in your current program.

Finding the first open reading frame

Recall that an open reading frame (ORF) is the stretch of sequence between a start codon (with the sequence "ATG") and the next in frame stop codon. By "in frame" we mean that it is in a position that is a multiple of 3 nucleotides away.

For example, take a look at the string:

ATGCATAATTAGCT

There is an "ATG" at the beginning. Then there are two codons, "CAT" and "AAT", and then a stop codon "TAG". So "ATGCATAAT" is an open reading frame in this string. You might think for a moment that "ATGCATAA" is also an answer, but it's not an open reading frame because the "TAA" at the end of it is not "in frame" with the start codon "ATG" - that is, it's not a multiple of three nucleotides away from the leading "ATG".

Your first challenge is to write a function that finds the open reading frame given a sequence that begins with an "ATG". In this problem you need only operate on the forward sequence, and not consider its reverse complement.

Write a function called restOfORF(DNA) that takes as input a DNA sequence written 5' to 3'. It assumes that this DNA sequence begins with a start codon "ATG". It then finds the next in frame stop codon, and returns the ORF from the start to that stop codon. The sequence that is returned should include the start codon but not the stop codon. If there is no in frame stop codon, restOfORF should assume that the reading frame extends through the end of the sequence and simply return the entire sequence.

To this end, you will need to determine whether a particular codon is a stop codon. Imagine that you have a string named codon and you wish to test whether it is a stop codon, that is, one of 'TAG', 'TAA', or 'TGA'. You could do this:

if codon == 'TAG' or codon == 'TAA' or codon == 'TGA':
    blah, blah, blah

Or, better yet, you could use the in operator this way:

if codon in ['TAG', 'TAA', 'TGA']:
    blah, blah, blah

Here are some examples of restOfORF:

>>> restOfORF("ATGTGAA")
'ATG'
>>> restOfORF("ATGAGATAAG")
'ATGAGA'
>>> restOfORF("ATGAGATAGG")
'ATGAGA'
>>> restOfORF("ATGAAATT")
'ATGAAATT'

Note that in the last example there is no in frame stop codon, so we got back the whole string.

Your restOfORF can be written with a for loop. It will be a nice short function.

Finding ORFs part 2

Next, you will write functions that can find open reading frames in sequences that don't begin with an ATG. These functions will search for ATGs and then call restOfORF to get the corresponding open reading frames.

Consider some sequence of interest. Imagine we start at the 0 position and count off in units of 3 nucleotides. This defines a particular reading frame for the sequence. Here is an illustration, where alternating +++ and --- are used to indicate the units of 3 nucleotides.

CAGCTCCAATGTTTTAACCCCCCCC
+++---+++---+++---+++---+

Considering just the given sequence (and not the reverse complement), we can define two other reading frames on this sequence, starting at either the 1 or the 2 positions.

CAGCTCCAATGTTTTAACCCCCCCC
-+++---+++---+++---+++---
CAGCTCCAATGTTTTAACCCCCCCC
--+++---+++---+++---+++--

Every open reading frame between an ATG and a stop must fall in one of these three reading frames. A useful way to look for genes involves searching each of these frames separately for open reading frames.
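The three frames can also be written out as lists of codons. The sketch below assumes a hypothetical helper, codonsInFrame, that is not part of the assignment. It splits the example sequence into complete codons starting at offset 0, 1 or 2 and reports whether a start codon appears in that frame.

```python
def codonsInFrame(DNA, frame):
    """Split DNA into complete codons, starting at offset frame (0, 1 or 2)."""
    return [DNA[i:i + 3] for i in range(frame, len(DNA) - 2, 3)]

DNA = "CAGCTCCAATGTTTTAACCCCCCCC"
for frame in range(3):
    codons = codonsInFrame(DNA, frame)
    # Print the frame number, whether it contains a start codon, and its codons.
    print(frame, "ATG" in codons, codons)
```

For this sequence only the frame starting at position 2 contains "ATG", and in that frame it is followed by "TTT" and then the stop codon "TAA".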

oneFrame

Write a function oneFrame(DNA) that starts at the 0 position of DNA and searches forward in units of three looking for start codons. When it finds a start codon, oneFrame should take the slice of DNA beginning with that "ATG" and ask restOfORF for the open reading frame that begins there. It should store that sequence in a list and then continue searching for the next "ATG" start codon and repeat this process. Ultimately, this function returns the list of all ORFs that it found.

Here are some examples of oneFrame in action:

>>> oneFrame("CCCATGTTTTGAAAAATGCCCGGGTAAA")
['ATGTTT', 'ATGCCCGGG']
 
>>> oneFrame("CCATGTAGAAATGCCC")
[]
 
>>> oneFrame("ATGCCCATGGGGAAATTTTGACCC")
['ATGCCCATGGGGAAATTT', 'ATGGGGAAATTT']

longestORF

Next, you will write a function longestORF(DNA) that takes a DNA string, with bases written 5' to 3', and returns the sequence of the longest open reading frame on it, in any of the three possible frames. This function will not consider the reverse complement of DNA.

It shouldn't take much work to write longestORF given that you've already written oneFrame.

Consider the example sequence from above:

>>> DNA="CAGCTCCAATGTTTTAACCCCCCCC"

We can look at the three frames of this sequence by slicing off 0, 1 or 2 nucleotides at the start:

>>> oneFrame(DNA)
[]
>>> oneFrame(DNA[1:])
[]
>>> oneFrame(DNA[2:])
['ATGTTT']

Each call to oneFrame will produce a list. You can then combine the lists from the three calls, identify the longest ORF and return it.

To combine two lists, you can do the following:

>>> aList=[1,4]
>>> bList=[5,6]
>>> combinedList=aList+bList
>>> combinedList
[1, 4, 5, 6]

Also, remember that if you have a string and want to know its length, you can use the built-in len function.

Here are some examples of longestORF:

>>> longestORF('ATGAAATAG')
'ATGAAA'
>>> longestORF('CATGAATAGGCCCA')
'ATGAATAGGCCCA'
>>> longestORF('CTGTAA')
''
>>> longestORF('ATGCCCTAACATGAAAATGACTTAGG')
'ATGAAAATGACT'
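As a rough check on the examples above, here is one way the three functions might fit together. This is a minimal sketch under the stated rules, not the reference solution. It does not import anything from dna.py, and the constant name STOP_CODONS is an assumption made for this illustration.

```python
STOP_CODONS = ['TAG', 'TAA', 'TGA']

def restOfORF(DNA):
    """Return DNA up to, but not including, the first in-frame stop codon."""
    for i in range(0, len(DNA), 3):
        if DNA[i:i + 3] in STOP_CODONS:
            return DNA[:i]
    return DNA  # no in-frame stop codon: the ORF runs to the end

def oneFrame(DNA):
    """Return every ORF that starts at an 'ATG' in frame 0 of DNA."""
    orfs = []
    for i in range(0, len(DNA), 3):
        if DNA[i:i + 3] == 'ATG':
            orfs.append(restOfORF(DNA[i:]))
    return orfs

def longestORF(DNA):
    """Return the longest ORF in the three forward frames, or '' if none."""
    allORFs = oneFrame(DNA) + oneFrame(DNA[1:]) + oneFrame(DNA[2:])
    longest = ''
    for orf in allORFs:
        if len(orf) > len(longest):
            longest = orf
    return longest

print(longestORF('ATGCCCTAACATGAAAATGACTTAGG'))  # prints ATGAAAATGACT
```

Starting from the empty string means a sequence with no start codon, such as 'CTGTAA', returns '' without a special case. The explicit loop could be replaced by max(allORFs, key=len), provided the empty-list case is handled separately.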

My code for each one:

#testrahelg Ch3findingThefirstORF 02/11/2020

def restOfORF(DNA):
    '''This function takes as input a DNA sequence written 5' to 3' that
    starts with the 'ATG' codon, finds the next in-frame stop codon, and
    returns the ORF from the start up to but not including the stop codon.
    Otherwise, it assumes that the reading frame extends through the end
    of the sequence and returns the entire sequence.'''
    # Advance one codon (3 nucleotides) at a time so that only in-frame
    # stop codons are considered.
    for i in range(3, len(DNA), 3):
        codon = DNA[i:i + 3]
        if codon in ['TGA', 'TAG', 'TAA']:
            return DNA[:i]
    return DNA

#Ch3findingThefirstORFtesting

print("The restOfORF for the sequence (ATGTGAA) is: " + restOfORF("ATGTGAA")) #this should return 'ATG'
print("The restOfORF for the sequence (ATGAGATAAG) is: " + restOfORF("ATGAGATAAG")) #this should return 'ATGAGA'
print("The restOfORF for the sequence (ATGAGATAGG) is: " + restOfORF("ATGAGATAGG")) #this should return 'ATGAGA'
print("The restOfORF for the sequence (ATGAAATT) is: " + restOfORF("ATGAAATT")) #this should return 'ATGAAATT'

#testrahelg Ch3findingORFsPart2 02/11/2020

def restOfORF(DNA):
    '''Returns the ORF that starts at the beginning of DNA, up to but not
    including the next in-frame stop codon, or the entire sequence if
    there is no in-frame stop codon.'''
    # Advance one codon (3 nucleotides) at a time to stay in frame.
    for i in range(3, len(DNA), 3):
        codon = DNA[i:i + 3]
        if codon in ['TGA', 'TAG', 'TAA']:
            return DNA[:i]
    return DNA

def openOfORF(DNA):
    '''This function takes a DNA sequence that need not begin with an ATG,
    searches it for ATGs, and calls restOfORF to print the ORF that starts
    at each one.'''
    # Every position is checked, so ATGs in all three frames are found.
    for i in range(len(DNA)):
        codon = DNA[i:i + 3]
        if codon == "ATG":
            print(restOfORF(DNA[i:]))

#Ch3findingORFsPart2testing

print("The openOfORF for the sequence (CAGCTCCAATGTTTTAACCCCCCCC) is:")
openOfORF("CAGCTCCAATGTTTTAACCCCCCCC") #this should print ATGTTT

#testrahelg Ch3oneFramepart2 02/11/2020

def restOfORF(DNA):
    # Advance one codon (3 nucleotides) at a time to stay in frame.
    for i in range(3, len(DNA), 3):
        codon = DNA[i:i + 3]
        if codon in ['TGA', 'TAG', 'TAA']:
            return DNA[:i]
    return DNA

def oneFrame(DNA):
    lis = []
    # Search frame 0 only: step through DNA one codon at a time.
    for i in range(0, len(DNA), 3):
        codon = DNA[i:i + 3]
        if codon == "ATG":
            lis.append(restOfORF(DNA[i:]))
    # Return the list (rather than printing it) so longestORF can use it.
    return lis

#Ch3oneFramepart2testing

print("The oneFrame for the sequence(CCCATGTTTTGAAAAATGCCCGGGTAAA) is:")
print(oneFrame("CCCATGTTTTGAAAAATGCCCGGGTAAA")) #This should print ['ATGTTT', 'ATGCCCGGG']

print("The oneFrame for the sequence (CCATGTAGAAATGCCC) is:")

oneFrame("CCATGTAG
