
Question


Prompt:

Given the genomic sequence of an organism, one of the first steps in identifying its genes is to identify the open reading frames (ORFs). An open reading frame is a maximal-length stretch of DNA that starts with the start codon ATG and ends with a stop codon (TAA, TAG or TGA). In prokaryotes, genes may occur within ORFs. In eukaryotes, the story is complicated by the presence of introns, which are spliced out of the mRNA before translation. Write a Python program that finds all the ORFs in a genomic sequence.
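The definition above can be sketched for a single reading frame as follows. This is a minimal sketch, not a reference solution: the names `STOP_CODONS` and `orfs_in_frame` are made up for illustration, the sequence is assumed to be upper case already, and an ATG with no stop codon after it is assumed not to count as an ORF.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orfs_in_frame(seq, offset):
    """Yield (start_index, orf) for each ORF in the frame that begins at offset.

    start_index is 0-based. An ORF opens at the first ATG seen after the
    previous stop codon (which makes it maximal) and closes at the next
    in-frame stop codon. An ATG that is never followed by an in-frame stop
    is not reported (an assumption, the prompt does not say).
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # first ATG since the last stop
        elif start is not None and codon in STOP_CODONS:
            yield start, seq[start:i + 3]  # include the stop codon
            start = None


if __name__ == "__main__":
    for start, orf in orfs_in_frame("ATGCTACCGTAGTGAG", 0):
        print(start + 1, len(orf), orf)
```

On sequence 2 of the worked example, read with `offset=1` (frame 2), this sketch reports ORFs starting at 0-based indices 16 and 37, which correspond to POS = 17 and POS = 38 in the example output.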

A genomic sequence has 6 reading frames, corresponding to the six possible ways of translating the sequence into three-letter codons. Frame 1 treats each group of three bases as a codon, starting from the first base. Frame 2 starts at the second base, and frame 3 starts at the third base. Frames 4, 5, and 6 are defined in a similar way, but refer to the opposite strand, which is the reverse complement of the first strand.
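One way to picture the six frames is to split both strands into codons at offsets 0, 1 and 2. The sketch below assumes an upper-case sequence containing only A, C, G and T; `six_frames` and `COMPLEMENT` are illustrative names, not part of the assignment.

```python
# Translation table for complementing an upper-case DNA strand
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return {frame_number: list_of_codons} for frames 1 to 6.

    Frames 1-3 read the given strand from offsets 0, 1, 2.
    Frames 4-6 read the reverse complement from offsets 0, 1, 2.
    Trailing bases that do not fill a whole codon are dropped.
    """
    frames = {}
    for strand_index, strand in enumerate((seq, reverse_complement(seq))):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[3 * strand_index + offset + 1] = codons
    return frames


if __name__ == "__main__":
    for number, codons in six_frames("ATGCTACCGTAGTGAG").items():
        print(number, " ".join(codons))
```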

Specifications:

Write a Python program called orfs to find all the open reading frames (ORFs) in the input sequence.

INPUT:

The program will take in as input a file, which will contain any number of DNA sequences in the FASTA format:

- A line beginning with a ">" is the header line for the next sequence

- All lines after the header contain sequence data.

- There will be any number of sequences per file.

- Sequences may be split over many lines.

- Sequence data may be upper or lower case.

- Sequence data may contain white space, which should be ignored.
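The input rules listed above could be handled along these lines. This is a sketch under assumptions: `read_fasta` is a hypothetical helper name, it takes any iterable of lines (an open file works), and it upper-cases the bases so later codon comparisons are case-insensitive.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) pairs from an iterable of lines.

    The header is stored without the leading ">". Sequence lines are
    joined, stripped of all white space, and upper-cased. Lines that
    appear before the first header are ignored.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append("".join(line.split()).upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    sample = ["> sequence 1\n", "atgcta ccg\n", "TAGTGAG\n"]
    print(read_fasta(sample))
```

With a file on disk, the call would look like `with open("Mus.txt") as handle: records = read_fasta(handle)`.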

Ask the user for the minimum ORF length to search for. The default is 50, which means your program should print out all ORFs with at least 50 bases.
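A small sketch of the prompt with its default follows. The helper name `parse_min_length` is made up, and falling back to 50 on blank or invalid input is an assumption about the desired behavior, since the prompt only states the default.

```python
def parse_min_length(text, default=50):
    """Return the minimum ORF length typed by the user.

    Blank input, non-numeric input, and values below 1 all fall back
    to the default (an assumption, the assignment does not say).
    """
    text = text.strip()
    if not text:
        return default
    try:
        value = int(text)
    except ValueError:
        return default
    return value if value > 0 else default


# In the program itself the text would come from the user, for example:
#     min_len = parse_min_length(input("Minimum ORF length [50]: "))
```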

OUTPUT:

Print your output in FASTA format, with one header line for each ORF, followed by the DNA in the ORF.

The header should be the same as the header in the input file, followed by a bar "|" followed by FRAME = N POS = P LEN = L, where:

- N is the frame number (1-6)

- P is the genomic position of the start of the ORF (left end is base 1)

- L is the length of the ORF (in bases)

If N = 4, 5 or 6, then P should be a negative number that indicates the position of the start of the ORF from the right end of the sequence. The DNA in the ORF should be printed out with a space between each codon, and no more than 15 codons per line.
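The output layout described above might be produced by a helper like the one below. `format_orf` is a hypothetical name, and the header is assumed to be passed in without its leading ">".

```python
def format_orf(header, frame, pos, orf, codons_per_line=15):
    """Return one FASTA record for an ORF as a single string.

    The first line is the input header plus the FRAME/POS/LEN fields.
    The following lines hold the ORF split into space-separated codons,
    with at most codons_per_line codons on each line.
    """
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    lines = [f">{header} | FRAME = {frame} POS = {pos} LEN = {len(orf)}"]
    for i in range(0, len(codons), codons_per_line):
        lines.append(" ".join(codons[i:i + codons_per_line]))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_orf("sequence 1", 1, 1, "ATGCTACCGTAG"))
```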

For example:

>gi|1786181| Escherichia coli K-12 | FRAME = 1 POS = 5215 LEN = 138
ATG ATA AAA GGA GTA ACC TGT GAA AAA GAT GCA ATC TAT CGT ACT
CGC ACT TTC CCT GGT TCT GGT CGC TCC CAT GGC AGC ACA GGC TGC
GGA AAT TAC GTT AGT CCC GTC AGT AAA ATT ACA GAT AGG CGA TCG
TGA

Worked Example:

Example Input:

> sequence 1
ATGCTACCGTAGTGAG

> sequence 2
AATTACTAATCAGCCCATGATCATAACATAA
CTGTGTATGTCTTAGAGGACCAAACCCCCCTCCTTCC

Example Output (looking for ORFs of any size - not actual results, just an illustration. You can use online tools, such as ORFFinder at NCBI to check your results):

> sequence 1 | FRAME = 1 POS = 1 LEN = 12
ATG CTA CCG TAG

> sequence 2 | FRAME = 2 POS = 17 LEN = 15
ATG ATC ATA ACA TAA

> sequence 2 | FRAME = 2 POS = 38 LEN = 9
ATG TCT TAG

> sequence 2 | FRAME = 4 POS = -40 LEN = 9
ATG TTA TGA

> sequence 2 | FRAME = 6 POS = -45 LEN = 15
ATG ATC ATG GGC TGA

I need help writing Python code that finds the ORFs on the reverse complement (frames 4, 5 and 6) and prints them in the output format shown above.

My input is the Mus musculus DNA sequence saved as Mus.txt:

>NC_000087.8:c2663658-2662471 Mus musculus strain C57BL/6J chromosome Y, GRCm39

ATGGAGGGCCATGTCAAGCGCCCCATGAATGCATTTATGGTGTGGTCCCGTGGTGAGAGGCACAAGTTGG

CCCAGCAGAATCCCAGCATGCAAAATACAGAGATCAGCAAGCAGCTGGGATGCAGGTGGAAAAGCCTTAC

AGAAGCCGAAAAAAGGCCCTTTTTCCAGGAGGCACAGAGATTGAAGATCCTACACAGAGAGAAATACCCA

AACTATAAATATCAGCCTCATCGGAGGGCTAAAGTGTCACAGAGGAGTGGCATTTTACAGCCTGCAGTTG

CCTCAACAAAACTGTACAACCTTCTGCAGTGGGACAGGAACCCACATGCCATCACATACAGGCAAGACTG

GAGTAGAGCTGCACACCTGTACTCCAAAAACCAGCAAAGCTTTTATTGGCAGCCTGTTGATATCCCCACT

GGGCACCTGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGTTCCATAACCACCACCAGCAGCAACAGC

AGTTCTATGACCACCACCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGTTCCATGACCACCACCA

GCAGAAGCAGCAGTTTCATGACCACCACCAGCAGCAACAGCAGTTCCATGACCACCACCACCACCACCAG

GAGCAGCAGTTCCATGACCACCACCAGCAGCAACAGCAGTTCCATGACCACCAGCAGCAGCAGCAGCAGC

AGCAGCAGCAGCAGTTCCATGACCACCACCAGCAGAAGCAGCAGTTCCATGACCACCACCACCACCAACA

GCAGCAGCAGTTCCATGACCACCAGCAGCAGCAGCAGCAGTTCCATGACCACCAGCAGCAGCAGCATCAG

TTCCATGACCACCCCCAGCAGAAGCAGCAGTTCCATGACCACCCCCAGCAGCAACAGCAGTTCCATGACC

ACCACCACCAGCAGCAGCAGAAGCAGCAGTTCCATGACCACCACCAGCAGAAGCAGCAGTTCCATGACCA

CCACCAGCAGAAGCAGCAGTTCCATGACCACCACCAGCAGCAACAGCAGTTCCATGACCACCACCAGCAG

CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGTTCCACGACCAGCAGCTTACCTACTTACTAACAGCTG

ACATCACTGGTGAGCATACACCATACCAGGAGCACCTCAGCACAGCCCTGTGGTTGGCAGTCTCATGA

# I just need help revising the Python code below so that it finds the ORFs in frames 4, 5 and 6
# (the reverse complement) and prints them in the format of the example output shown above.

import re

def reverse(seq):
    """Returns the sequence reversed"""
    return seq[::-1]

def complement(seq):
    """Returns a complement DNA sequence"""
    comp = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}
    compseq = ''
    for i in seq.upper():
        # Bases other than A, C, G, T (such as N) are kept unchanged
        compseq += comp.get(i, i)
    return compseq

def reverse_complement(seq):
    """Returns a reverse complement DNA sequence"""
    compseq = complement(seq)
    revcompseq = reverse(compseq)
    return revcompseq

def main():
    # Read the FASTA file into {header: sequence}; white space is
    # dropped and the bases are upper-cased
    p_sequences = {}
    ac = None
    with open('Mus.txt', 'r') as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                ac = line[1:].strip()
                p_sequences[ac] = ""
            elif ac is not None:
                p_sequences[ac] += "".join(line.split()).upper()

    for ac, sequence in p_sequences.items():
        print("Ac " + ac + " " + sequence)
        print('reverse():', reverse(sequence))
        print('complement():', complement(sequence))
        # Frames 4, 5 and 6 are read from the reverse complement
        print('ReverseComplement():', reverse_complement(sequence))

if __name__ == '__main__':
    main()

with open('Mus.txt', 'r') as handle:
    data = handle.read()

# Split into (header, sequence) pairs; white space inside the sequence is dropped
data = [x.split('\n', 1) for x in data.split('>')]
data = [(x[0].strip(), ''.join(x[1].split()).upper()) for x in data if len(x) == 2]

start, end = [re.compile(x) for x in 'ATG TAG|TGA|TAA'.split()]

# Translation table used to complement a strand
revtrans = str.maketrans('ATGC', 'TACG')

def get_longest(starts, ends):
    ''' Given sorted lists of start-codon and stop-codon offsets,
        returns (length, start offset) of the longest ORF
    '''
    results = {}
    # Use smallest end that is bigger than each start
    ends.reverse()
    for start in starts:
        for end in ends:
            if end > start and (end - start) % 3 == 0:
                results[start] = end + 3
    results = [(end - start, start) for
               start, end in results.items()]
    return max(results) if results else (0, 0)

def get_orfs(seq):
    ''' Returns length, header, forward/reverse indication,
        and longest match (corrected if reversed)
    '''
    header, seqf = seq
    seqr = seqf[::-1].translate(revtrans)

    def readgroup(seq, group):
        return list(x.start() for x in group.finditer(seq))

    f = get_longest(readgroup(seqf, start), readgroup(seqf, end))
    r = get_longest(readgroup(seqr, start), readgroup(seqr, end))
    (length, index), s, direction = max((f, seqf, 'forward'), (r, seqr, 'reverse'))
    return length, header, direction, s[index:index + length]

# Process entire file
all_orfs = [get_orfs(x) for x in data]

# Put in groups of 3 (records that do not fill a complete group are dropped)
all_orfs = list(zip(all_orfs[::3], all_orfs[1::3], all_orfs[2::3]))
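For the part the question actually asks about (frames 4, 5 and 6), one possible approach is to scan the reverse complement exactly like the forward strand and negate the start position. The sketch below is a hedged illustration, not the assignment's official solution: `find_orfs` is a made-up name, ORFs without a stop codon are skipped, and the negative POS is interpreted as the 1-based distance of the ORF start from the right end of the original sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq, min_len=50):
    """Return a list of (frame, pos, orf) for all six reading frames.

    Frames 1-3 scan the given strand and report pos as a positive
    1-based position. Frames 4-6 scan the reverse complement and report
    pos as a negative number counted from the right end of the original
    sequence (an interpretation of the assignment text).
    """
    seq = seq.upper()
    strands = (seq, seq.translate(COMPLEMENT)[::-1])
    found = []
    for strand_index, strand in enumerate(strands):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                    # first ATG after a stop
                elif start is not None and codon in STOPS:
                    orf = strand[start:i + 3]    # include the stop codon
                    if len(orf) >= min_len:
                        frame = 3 * strand_index + offset + 1
                        pos = start + 1 if strand_index == 0 else -(start + 1)
                        found.append((frame, pos, orf))
                    start = None
    return found


if __name__ == "__main__":
    for frame, pos, orf in find_orfs("GCTATTTCATG", min_len=0):
        print(frame, pos, len(orf), orf)
```

Because the reverse complement is scanned from its own left end, index 0 of that strand pairs with the last base of the original sequence, which is why the start index only needs to be shifted by one and negated.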
