[Solved] In previous weeks' homework you wrote scr


Posted on Sep 09, 2024

In previous weeks' homework you wrote scripts to parse FASTA and FASTQ files. The example files I gave were from the Human Microbiome Project (HMP). There are 690 such samples of Illumina reads organized by body site here:

http://hmpdacc.org/HMASM/

In total, they represent 35 billion reads taking up 2.3 TB in compressed form. For this week's homework, use the data table at that URL, look at the 'Reads' column, and choose two different samples:

Small sample - This should be 100MB or less.

Large sample - This should be 2GB or more.

Use the table, copy the http link for the reads you want to download and place both into a directory using the 'wget' utility like this:

 $ mkdir ~/hw04
 $ cd ~/hw04/
 $ wget http://downloads.hmpdacc.org/data/Illumina/anterior_nares/SRS022006.tar.bz2
 $ tar -xf SRS022006.tar.bz2
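Before starting an assembly, it can help to confirm that each download actually falls into the intended size class. The following is a minimal sketch, not part of the assignment: it assumes the 100MB and 2GB limits refer to the compressed archive and use decimal units, and the function names `classify_sample` and `classify_archive` are made up for illustration.

```python
import os

# Assumption: the limits apply to the compressed archive, in decimal units.
SMALL_MAX_BYTES = 100 * 10**6   # "100MB or less"
LARGE_MIN_BYTES = 2 * 10**9     # "2GB or more"


def classify_sample(num_bytes):
    """Return 'small', 'large', or 'neither' for a size given in bytes."""
    if num_bytes <= SMALL_MAX_BYTES:
        return "small"
    if num_bytes >= LARGE_MIN_BYTES:
        return "large"
    return "neither"


def classify_archive(path):
    """Classify a downloaded archive by its size on disk."""
    return classify_sample(os.path.getsize(path))
```

For example, `classify_archive(os.path.expanduser("~/hw04/SRS022006.tar.bz2"))` reports which class the sample from the commands above belongs to.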

You are going to perform genomic assembly and gene prediction on both of these. Do all of your testing on the smaller sample until you can run through both processes without errors, then process the larger sample. If you do not have enough resources on your local machine to complete the larger assembly, run it on an instance you create on Google Compute Engine.

Assembly

=========

The following commands illustrate how to create a Docker image, mount your local directory, then run an assembly using MEGAHIT. You will, of course, need to change the file names depending on which sample you download. MEGAHIT is an ultra-fast and memory-efficient NGS assembler; its documentation, source code and citation are available online.

 $ cd ~/hw04/
 [Download the attached Dockerfile into this directory]
 $ docker build -t megahit .
 $ docker run -v ~/hw04:/data -i -t megahit /bin/bash
 [The previous command puts you inside a container started from the megahit image, with the hw04 directory mounted as /data inside the container. The path in -v ~/hw04:/data represents outside:inside the container.]
 $ cd /data
 $ megahit -1 SRS022006/SRS022006.denovo_duplicates_marked.trimmed.1.fastq -2 SRS022006/SRS022006.denovo_duplicates_marked.trimmed.2.fastq -r SRS022006/SRS022006.denovo_duplicates_marked.trimmed.singleton.fastq -o megahit_out
 $ exit

Before you exit the Docker container, the summary statistics of the assembly are printed on the screen. Example:

 [STAT] 9528 contigs, total 15052462 bp, min 200 bp, max 148643 bp, avg 1580 bp, N50 3056 bp
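If you want these numbers in your report without retyping them, the summary line can be parsed programmatically. The sketch below assumes the line always follows the layout of the example above; the pattern and the function name `parse_stat_line` are illustrative and not part of MEGAHIT itself.

```python
import re

# Assumption: the summary line always matches the layout of the example.
STAT_PATTERN = re.compile(
    r"(?P<contigs>\d+) contigs, total (?P<total>\d+) bp, "
    r"min (?P<min>\d+) bp, max (?P<max>\d+) bp, "
    r"avg (?P<avg>\d+) bp, N50 (?P<n50>\d+) bp"
)


def parse_stat_line(line):
    """Return the assembly statistics as a dict of ints, or None if no match."""
    match = STAT_PATTERN.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


example = ("[STAT] 9528 contigs, total 15052462 bp, min 200 bp, "
           "max 148643 bp, avg 1580 bp, N50 3056 bp")
stats = parse_stat_line(example)
print(stats["contigs"], stats["avg"], stats["n50"])
```

As a sanity check, the reported average should equal the total length divided by the contig count, rounded to the nearest base pair.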

After the above steps, you should be back in your ~/hw04 directory and can find the results within the megahit_out directory you see there.
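The contig statistics can also be recomputed directly from the assembly output. This is a minimal sketch under two assumptions: that the contigs are in a FASTA file named final.contigs.fa inside megahit_out (MEGAHIT's default output name), and that N50 is defined in the usual way, as the length of the contig at which the cumulative length of the contigs, sorted longest first, reaches half of the total assembly length. The function names are made up for illustration.

```python
def read_fasta_lengths(lines):
    """Return the sequence length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Usage, assuming MEGAHIT's default output file name:
# with open("megahit_out/final.contigs.fa") as handle:
#     lengths = read_fasta_lengths(handle)
# print(len(lengths), sum(lengths) / len(lengths), n50(lengths))
```

The number of contigs is `len(lengths)` and the average contig length is `sum(lengths) / len(lengths)`, which can be compared with the values MEGAHIT printed.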

Gene prediction

=============

MetaGeneMark is relatively quick and easy to run. Using the instructions for MEGAHIT above as a template, create your own Dockerfile and image to run MetaGeneMark in the Docker container.

NOTE: Before you run it, you must obtain the license key and put it at /.gm_key within the Docker image you create.

Analysis and Report

===============

After you have processed both samples, create the following report and submit it when you turn in your assignment on Canvas.

1. A written description of both processes

2. A representative list of the commands that documents the execution of both tasks

3. The location of your output files

4. Descriptive statistics for both the small and large samples (your homework scripts from the last few weeks should be useful here):

Total read count and average read length of the input FASTQ files

Total contig sequences and average contig length after the assembly

Compare these statistics between the small and large samples. What differences do you see between them and how does the starting read count of each contribute to these differences?
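The read-level statistics requested above can be sketched as follows. This is an illustrative example rather than the expected solution: it assumes each FASTQ record occupies exactly four lines (header, sequence, '+', qualities), which is the common layout for Illumina reads, and the function name `fastq_stats` is hypothetical.

```python
import io
from itertools import islice


def fastq_stats(handle):
    """Return (read_count, average_read_length) for an open FASTQ file.

    Assumption: every record is exactly four lines long.
    """
    count = 0
    total_bases = 0
    while True:
        record = list(islice(handle, 4))
        # Stop at end of file, tolerating trailing blank lines.
        if not record or all(not line.strip() for line in record):
            break
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record near read %d" % (count + 1))
        total_bases += len(record[1].strip())
        count += 1
    average = total_bases / count if count else 0.0
    return count, average


# Tiny in-memory example; a real run would pass open("sample.fastq") instead.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
print(fastq_stats(demo))
```

Running the same function on the paired and singleton FASTQ files of each sample, then summing the counts, gives the starting read count to set against the number of contigs the assembly produced.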