Question
The code below is giving an error. Please assist.
To obtain the human protein sequences in multiple FASTA format, you can use the following script:
I have written the code in Python:
# Load necessary modules
from Bio import SeqIO
import gzip

# Read in human genome file
genome_file = 'hg38.fa.gz'
with gzip.open(genome_file, 'rt') as f:
    genome = SeqIO.parse(f, 'fasta')

# Read in RefSeq table
refseq_file = '[path to RefSeq table file]'
with open(refseq_file, 'r') as f:
    refseq = SeqIO.parse(f, 'tab')

# Create dictionary of gene sequences
gene_dict = {}
for record in genome:
    gene_name = record.id.split()[0]
    gene_dict[gene_name] = record.seq

# Create dictionary of protein sequences
protein_dict = {}
for record in refseq:
    if record.features:
        for feature in record.features:
            if feature.type == 'CDS':
                gene_name = feature.qualifiers['gene'][0]
                gene_seq = gene_dict.get(gene_name, None)
                if gene_seq is not None:
                    protein_seq = gene_seq[feature.location.start.position:feature.location.end.position].translate()
                    protein_name = f">{record.id}:{record.name}:{gene_name}:{feature.qualifiers['protein_id'][0]}"
                    protein_dict[protein_name] = protein_seq

# Write output file
output_file = '[output file name]'
with open(output_file, 'w') as f:
    for protein_name, protein_seq in protein_dict.items():
        f.write(f"{protein_name}\n{protein_seq}\n")
Error:
Each line should have one tab separating the title and sequence, this line has 11 tabs: 'chr1\t67092164\t67109072\tXM_011541469.2\t0\t-\t67093004\t67103382\t0\t5\t1440,187,70,145,44,\t0,3070,4087,11073,16864,\n'
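A likely cause, judging from the message: the 'tab' format in Bio.SeqIO expects exactly two tab-separated columns per line (an identifier and a sequence), so it cannot read a multi-column annotation table. The line quoted in the error has 12 columns, which looks like the BED12 layout rather than the layout of the RefSeq table shown in the question. A minimal sketch of reading such a table with the standard csv module instead follows; the column names (standard BED12 names) and the in-memory sample are assumptions, not something confirmed by the question.

```python
import csv
import io

# Hypothetical in-memory sample standing in for the real table file;
# the single row is copied from the error message above.
SAMPLE = (
    "chr1\t67092164\t67109072\tXM_011541469.2\t0\t-\t67093004\t67103382"
    "\t0\t5\t1440,187,70,145,44,\t0,3070,4087,11073,16864,\n"
)

# Assumed column names: the standard BED12 layout.
BED12_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount",
    "blockSizes", "blockStarts",
]


def read_bed12(handle):
    """Yield one dict per non-empty, non-comment line of a tab-separated table."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        yield dict(zip(BED12_FIELDS, row))


rows = list(read_bed12(io.StringIO(SAMPLE)))
print(rows[0]["name"], rows[0]["strand"], rows[0]["blockCount"])
```

With a real file, the same function could be called on `open(refseq_file, newline="")` in place of the `io.StringIO` sample.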
Requirement:
The ID field describes what the sequence is. You should use the concatenation (with a colon : as the delimiter) of the RefSeq table name and name2 fields as the ID. For example, for the first record in the RefSeq table, the corresponding ID should be >NM_001276352.2:C1orf141. The sequence field simply records the corresponding sequence, all on one line. For example: MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
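The required record layout can be sketched as a small helper. This is only an illustration: `fasta_record` is a hypothetical name, and the protein string passed in below is a placeholder taken from the example text, not the real translation of that transcript.

```python
def fasta_record(name, name2, protein):
    """Return a two-line FASTA record: '>name:name2', then the sequence on one line."""
    # Remove any whitespace so the sequence stays on a single line.
    sequence = "".join(str(protein).split())
    return f">{name}:{name2}\n{sequence}\n"


# Placeholder protein fragment (hypothetical input, for illustration only).
record = fasta_record("XM_011541469.2", "C1orf141", "MVLSPADKTNV KAAWGKVGAHAGEYG")
print(record, end="")
```

Writing each record with a trailing newline keeps one header line and one sequence line per protein in the output file.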
RefSeq table data:
#bin | name | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts | exonEnds | score | name2 | cdsStartStat | cdsEndStat | exonFrames
0 | XM_011541469.2 | chr1 | - | 67092164 | 67109072 | 67093004 | 67103382 | 5 | 67092164,67095234,67096251,67103237,67109028, | 67093604,67095421,67096321,67103382,67109072, | 0 | C1orf141 | cmpl | cmpl | 0,2,1,0,-1,
0 | XM_017001276.2 | chr1 | - | 67092164 | 67131227 | 67093004 | 67127240 | 9 | 67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141, | 67093604,67095421,67096321,67103382,67111644,67115464,67125909,67127257,67131227, | 0 | C1orf141 | cmpl | cmpl | 0,2,1,0,1,2,0,0,-1,
0 | XM_011541467.2 | chr1 | - | 67092164 | 67131227 | 67093004 | 67127240 | 9 | 67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141, | 67093604,67095421,67096321,67103343,67111644,67115464,67125909,67127257,67131227, | 0 | C1orf141 | cmpl | cmpl | 0,2,1,0,1,2,0,0,-1,
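Two further points seem relevant to the table. First, the records in hg38.fa.gz are chromosomes (chr1, chr2, ...), so a lookup by gene name would not find anything; the chrom column is the natural key. Second, the coding sequence is not the plain cdsStart..cdsEnd slice: it is the concatenation of the exon pieces that fall inside that interval, reverse-complemented when strand is "-". The sketch below shows that splicing step on a made-up toy chromosome, assuming UCSC-style 0-based, half-open coordinates; the function names and the toy data are hypothetical.

```python
def parse_int_list(text):
    """Parse a comma-separated list such as '0,12,' (trailing comma allowed)."""
    return [int(x) for x in text.split(",") if x]


_COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def spliced_cds(chrom_seq, strand, cds_start, cds_end, exon_starts, exon_ends):
    """Join the coding part of each exon; flip to the minus strand if needed."""
    pieces = []
    for start, end in zip(exon_starts, exon_ends):
        lo = max(start, cds_start)  # clip the exon to the coding interval
        hi = min(end, cds_end)
        if lo < hi:  # exon overlaps the coding interval
            pieces.append(chrom_seq[lo:hi])
    cds = "".join(pieces)
    return reverse_complement(cds) if strand == "-" else cds


# Toy chromosome (made up): 5' UTR, exon 1, intron, exon 2, 3' UTR.
toy_chrom = "GG" + "ATGGCC" + "GTAG" + "AAATGA" + "CCC"
exon_starts = parse_int_list("0,12,")
exon_ends = parse_int_list("8,21,")

plus_cds = spliced_cds(toy_chrom, "+", 2, 18, exon_starts, exon_ends)
minus_cds = spliced_cds(toy_chrom, "-", 2, 18, exon_starts, exon_ends)
print(plus_cds)   # ATGGCCAAATGA
print(minus_cds)  # TCATTTGGCCAT
```

With Biopython available, the resulting string could then be translated with `Seq(cds).translate(to_stop=True)`. If the table is in BED12 layout instead, exon i would start at chromStart + blockStarts[i] and span blockSizes[i] bases, with thickStart and thickEnd playing the role of cdsStart and cdsEnd.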