Question
In last week's homework you wrote a script to read a FASTA file and report some basic statistics. Another important format is FASTQ, which stores both the sequence data and a quality score for each nucleotide in the file. Your assignment this week is to expand your script to support both FASTQ and FASTA files. It should detect the file type automatically, either from the file name or from the file content. FASTQ files typically end in either .fq or .fastq, along with the gzipped variants.
To test your script, run it on the FASTQ files you download from the Human Microbiome Project using the commands below (they will take some time; these are large files):
$ wget http://downloads.hmpdacc.org/data/Illumina/PHASEII/anterior_nares/SRS077085.tar.bz2
$ tar -xjf SRS077085.tar.bz2
For example, a sequence read in FASTQ format looks like:
@61JCNAAXX100503:5:100:10000:10232/1
CATGTAACATGTTCTATGTCCATAACTCCAGAATCATCAATACTTGATTTCTTCATTAGCATGTTCATAATAAATTCCCTTATTTTAAATGGTTTATAAGA
+61JCNAAXX100503:5:100:10000:10232/1
GGGGGGGGGGGGGGGGGGGGGGFGGGGGGFGGGGGGGGGFGFGGGGEGGGGGGGGGFGAGCGFDFEEGEFGGDFEFFEDEE@FFFCCBDFEBCF DEDCE5
Description:
Line 1: starts with an @ followed by the sequence read identifier and description
Line 2: sequence line
Line 3: starts with a + symbol followed by a repeat of the read identifier line
Line 4: quality line, which should have the same length as the sequence on line 2
If you had trouble with last week's script or would just like a fresh start, you can copy the 'official' solution here and modify it for this assignment: Course site on Canvas -> Modules -> Homework solutions -> M02 Sequence statistics
When you turn in your assignment you should include:
- Your script, attached as a file
- Instructions on how to run it
- Summary statistics for the downloaded FASTQ files: sequence count, nucleotide count, and average length of the sequence reads
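The format detection and four-line record layout described above could be sketched roughly as follows. This is only one possible approach, not the official solution: the function names `detect_format` and `fastq_stats` are hypothetical, the list of FASTA suffixes is an assumption (the assignment only names .fq and .fastq), and the sketch assumes every record occupies exactly four lines, with the repeated identifier after the + treated as optional.

```python
FASTQ_SUFFIXES = (".fq", ".fastq")
# Assumption: common FASTA suffixes; the assignment does not list them.
FASTA_SUFFIXES = (".fa", ".fasta", ".fna", ".faa")


def detect_format(path, first_line=""):
    """Guess 'fastq' or 'fasta' from the file name, then from the first line."""
    name = path.lower()
    if name.endswith(".gz"):
        name = name[:-3]  # look at the suffix underneath the gzip extension
    if name.endswith(FASTQ_SUFFIXES):
        return "fastq"
    if name.endswith(FASTA_SUFFIXES):
        return "fasta"
    # Fall back to content: FASTQ records start with '@', FASTA with '>'.
    if first_line.startswith("@"):
        return "fastq"
    if first_line.startswith(">"):
        return "fasta"
    raise ValueError(f"cannot tell whether {path!r} is FASTA or FASTQ")


def fastq_stats(handle):
    """Return (read count, nucleotide count, average read length)."""
    n_reads = 0
    n_bases = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near read {n_reads + 1}")
        if len(qual) != len(seq):
            raise ValueError(f"quality/sequence length mismatch in read {n_reads + 1}")
        n_reads += 1
        n_bases += len(seq)
    average = n_bases / n_reads if n_reads else 0.0
    return n_reads, n_bases, average
```

Reading records in groups of four, rather than scanning for lines that begin with @, matters because @ is also a legal quality character and can appear at the start of a quality line.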
Last week's script requirements: For this assignment, you should write a script that accepts the path to a FASTA file as an argument and reports a few basic statistics on the sequences found within the file. In this first HW assignment, this report should include the number of sequences found and the total number of residues (bases) that make them up.
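A minimal sketch of the FASTA counting described here, assuming the usual layout in which each record starts with a > header line and every other non-blank line holds residues. The name `fasta_stats` is hypothetical and this is not the course's official solution.

```python
def fasta_stats(handle):
    """Return (sequence count, residue count) for an open FASTA text handle."""
    n_seqs = 0
    n_residues = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            n_seqs += 1  # each header line starts a new sequence
        else:
            n_residues += len(line)  # sequence lines may wrap over many lines
    return n_seqs, n_residues
```

Because the function takes an already opened handle, the same code can be used for plain and compressed files once the file has been opened in text mode.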
For full credit, compress the FASTA file you download like this:
$ gzip CAM_PROJ_SargassoSea.read_pep.fa
Then, your script should detect whether a file is compressed (based on the existence of the '.gz' extension) and open the file appropriately.
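One way to handle the compressed case is a small helper that picks the opener from the extension. The sketch below uses Python's standard `gzip` module; the helper name `open_maybe_gzip` is hypothetical.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed file for reading as text."""
    if path.lower().endswith(".gz"):
        # "rt" makes gzip.open return decoded text lines instead of bytes.
        return gzip.open(path, "rt")
    return open(path, "r")
```

The returned object works as a context manager in both cases, so the rest of the script can use `with open_maybe_gzip(path) as handle:` without caring whether the input was compressed.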