Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

below code is giving error . please assist. To obtain the human protein sequences in multiple FASTA format, you can use the following script: I

below code is giving error . please assist.

To obtain the human protein sequences in multiple FASTA format, you can use the following script:

I have written the code in Python:

# Load necessary modules

from Bio import SeqIO

import gzip

# Read in human genome file

genome_file = 'hg38.fa.gz'

with gzip.open(genome_file, 'rt') as f:

genome = SeqIO.parse(f, 'fasta')

# Read in RefSeq table

refseq_file = '[path to RefSeq table file]'

with open(refseq_file, 'r') as f:

refseq = SeqIO.parse(f, 'tab')

# Create dictionary of gene sequences

gene_dict = {}

for record in genome:

gene_name = record.id.split()[0]

gene_dict[gene_name] = record.seq

# Create dictionary of protein sequences

protein_dict = {}

for record in refseq:

if record.features:

for feature in record.features:

if feature.type == 'CDS':

gene_name = feature.qualifiers['gene'][0]

gene_seq = gene_dict.get(gene_name, None)

if gene_seq is not None:

protein_seq = gene_seq[feature.location.start.position:feature.location.end.position].translate()

protein_name = f">{record.id}:{record.name}:{gene_name}:{feature.qualifiers['protein_id'][0]}"

protein_dict[protein_name] = protein_seq

# Write output file

output_file = '[output file name]'

with open(output_file, 'w') as f:

for protein_name, protein_seq in protein_dict.items():

f.write(f"{protein_name} {protein_seq} ")

Error .

Each line should have one tab separating the title and sequence, this line has 11 tabs: 'chr1\t67092164\t67109072\tXM_011541469.2\t0\t-\t67093004\t67103382\t0\t5\t1440,187,70,145,44,\t0,3070,4087,11073,16864, ' 

Requirement :

The ID field describes what the sequence is. You should use the concatenation (with colon : as the delimiter) of the RefSeq table name and name2 fields as the ID. For example, for the first record in the RefSeq table, the corresponding ID should be. >NM_001276352.2:Clorf141. The sequence field simply records the corresponding sequence, all in one line. For example: MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGS AQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHC LLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR.

Ref table data.

#"bin" name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2 cdsStartStat cdsEndStat exonFrames
0 XM_011541469.2 chr1 - 67092164 67109072 67093004 67103382 5 67092164,67095234,67096251,67103237,67109028, 67093604,67095421,67096321,67103382,67109072, 0 C1orf141 cmpl cmpl 0,2,1,0,-1,
0 XM_017001276.2 chr1 - 67092164 67131227 67093004 67127240 9 67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141, 67093604,67095421,67096321,67103382,67111644,67115464,67125909,67127257,67131227, 0 C1orf141 cmpl cmpl 0,2,1,0,1,2,0,0,-1,
0 XM_011541467.2 chr1 - 67092164 67131227 67093004 67127240 9 67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141, 67093604,67095421,67096321,67103343,67111644,67115464,67125909,67127257,67131227, 0 C1orf141 cmpl cmpl 0,2,1,0,1,2,0,0,-1,

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Medical Image Databases

Authors: Stephen T.C. Wong

1st Edition

1461375398, 978-1461375395

More Books

Students also viewed these Databases questions

Question

Show that one gram is one mole of unified atomic mass units.

Answered: 1 week ago

Question

explain the need for human resource strategies in organisations

Answered: 1 week ago

Question

describe the stages involved in human resource planning

Answered: 1 week ago