Question

Objective: Design and write an OLC (Overlap Layout Consensus) program for DNA sequencing.

Data

[Undergraduates] Download the first two chromosome sequences of yeast (S. cerevisiae), NC_001133.fna and NC_001134.fna, from http://www.yeastgenome.org/strain/S288C/overview (pick the GenBank ID or RefSeq ID, and use FASTA format).

[Graduates] Download the first and the last human chromosome sequences (CHR01 and CHR22) from ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens.

https://www.ncbi.nlm.nih.gov/nuccore?Db=nuccore&Cmd=DetailsSearch&Term=%22Homo+sapiens%22%5BOrganism%5D+AND+srcdb_refseq%5Bprop%5D+AND+biomol_genomic%5Bprop%5D+AND+NC_000000%5BPACC%5D+%3A+NC_999999%5BPACC%5D

Delete all occurrences of N from the nucleotide sequences.
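As a minimal sketch of this preprocessing step, the following reads FASTA lines and drops every N. The function names `read_fasta_lines` and `remove_n` are hypothetical, and the code assumes one sequence record per file (all records would simply be concatenated otherwise).

```python
def read_fasta_lines(lines):
    """Join the sequence lines of a FASTA file, skipping headers and blanks.

    `lines` can be an open file handle or any iterable of strings.
    Assumption: the file holds a single record (one chromosome).
    """
    parts = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # header line or empty line
        parts.append(line.upper())
    return "".join(parts)


def remove_n(sequence):
    """Delete every N (unknown base) from an uppercase sequence."""
    return sequence.replace("N", "")
```

With a downloaded file, usage could look like `with open("NC_001133.fna") as fh: genome = remove_n(read_fasta_lines(fh))`.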

1. Simulate reads by randomly sampling segments of the genome with an average length of 400 bp. Read lengths may follow a Gaussian distribution with a mean of 400 bp, or you may use a uniform read length of 400 bp.

Generate enough reads that an arbitrary base position is covered by 10 reads on average, i.e. the coverage depth is 10. As in the slides, a coverage depth of 10 requires the total number of bases across all reads to be 10 times the genome size. For a genome of G bp and 400-bp reads, that is about G/40 reads.

Real reads include both 5' to 3' and 3' to 5' reads. For simplicity, assume that all simulated reads are 5' to 3'.
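The read simulation described above can be sketched as follows. This is one possible reading of the specification, not a reference implementation: the function name `simulate_reads` and its parameters (`length_sd`, `seed`) are assumptions, start positions are drawn uniformly, and all reads come from the forward strand.

```python
import math
import random


def simulate_reads(genome, read_len=400, coverage=10, length_sd=0.0, seed=None):
    """Sample forward-strand reads until total bases ~= coverage * genome size.

    length_sd == 0 gives uniform read lengths; length_sd > 0 draws each
    length from a Gaussian with mean read_len (clamped to [1, len(genome)]).
    """
    rng = random.Random(seed)
    # Number of reads needed so that n_reads * read_len >= coverage * G.
    n_reads = math.ceil(coverage * len(genome) / read_len)
    reads = []
    for _ in range(n_reads):
        if length_sd > 0:
            length = max(1, round(rng.gauss(read_len, length_sd)))
        else:
            length = read_len
        length = min(length, len(genome))
        start = rng.randint(0, len(genome) - length)  # inclusive bounds
        reads.append(genome[start:start + length])
    return reads
```

Because start positions are uniform, bases near the two ends of the chromosome are covered by fewer reads than bases in the middle, so the ends of the assembly may be missing. For a chromosome of about 230 kb, this yields roughly 5,750 reads of 400 bp.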

2. Write a program (or programs) to create the overlap graph from the reads. overlap(si, sj) is defined as the length of the longest match between a suffix of si and a prefix of sj. Computing all overlaps requires n*(n-1) comparisons, where n denotes the number of reads.

The following is an example with 7 reads. Each entry in the overlap column gives the number of the overlapping read, with the overlap value in parentheses.

read          overlap
1. TACCTTG    2(3) 4(1) 7(1)
2. TTGAT      3(3)
3. GATATGG    4(2) 7(1)
4. GGAG       3(1) 7(1)
5. CTCTA      1(2) 6(3)
6. CTAGT
7. GCTCT      2(1) 5(4) 6(2)
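The suffix-prefix definition in part 2 can be sketched as a brute-force pair of functions. The names `overlap` and `overlap_graph` and the `min_len` threshold are assumptions for illustration; the graph is stored as a dictionary keyed by 0-based read index pairs.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no match of at least `min_len` bases exists.
    """
    # Try the longest possible match first and shrink until one fits.
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len=1):
    """Compare all n*(n-1) ordered pairs; return {(i, j): overlap} for overlaps > 0."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue  # a read is not compared with itself
            k = overlap(a, b, min_len)
            if k > 0:
                graph[(i, j)] = k
    return graph


reads = ["TACCTTG", "TTGAT", "GATATGG", "GGAG", "CTCTA", "CTAGT", "GCTCT"]
graph = overlap_graph(reads)
print(graph[(0, 1)], graph[(6, 4)])  # overlaps of read 1 -> 2 and read 7 -> 5
```

With `min_len=1` every single-base match counts, so the output can contain short overlaps beyond those listed in the example table. For 400 bp reads, a larger threshold (for example 20 or more) is a common way to suppress chance matches, at the cost of missing short true overlaps.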

3. Each read from part 2 becomes a node in a graph, and each directed edge between two nodes is weighted by their overlap value. The sequencing problem then becomes a variant of the traveling salesman problem: visit every node along the path with the largest sum of overlap values. You can try a greedy algorithm: start with the edge that has the largest overlap value, then keep following the largest remaining overlaps.
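One way to read the greedy strategy is sketched below: edges are taken in order of decreasing overlap, and an edge is accepted only if its source read has no successor yet, its target read has no predecessor yet, and it does not close a cycle. This is an assumed interpretation, not the only valid one. The function name `greedy_assemble` is hypothetical, and the edge weights are copied from the example table in part 2 (converted to 0-based indices).

```python
def greedy_assemble(reads, graph):
    """Greedy layout + consensus over an overlap graph {(i, j): overlap}.

    Returns a list of contigs (merged read strings).
    """
    parent = list(range(len(reads)))  # union-find to detect cycles

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    succ = {}          # read index -> (next read index, overlap)
    has_pred = set()   # reads that already follow another read
    # Largest overlap first; ties broken by index pair for determinism.
    for (i, j), k in sorted(graph.items(), key=lambda e: (-e[1], e[0])):
        if i in succ or j in has_pred or find(i) == find(j):
            continue
        succ[i] = (j, k)
        has_pred.add(j)
        parent[find(i)] = find(j)

    contigs = []
    for start in range(len(reads)):
        if start in has_pred:
            continue  # not the head of a chain
        contig, node = reads[start], start
        while node in succ:
            node, k = succ[node]
            contig += reads[node][k:]  # append only the non-overlapping tail
        contigs.append(contig)
    return contigs


reads = ["TACCTTG", "TTGAT", "GATATGG", "GGAG", "CTCTA", "CTAGT", "GCTCT"]
table_edges = {
    (0, 1): 3, (0, 3): 1, (0, 6): 1, (1, 2): 3, (2, 3): 2, (2, 6): 1,
    (3, 2): 1, (3, 6): 1, (4, 0): 2, (4, 5): 3, (6, 1): 1, (6, 4): 4, (6, 5): 2,
}
print(greedy_assemble(reads, table_edges))
```

On the example edges, this sketch chains the reads in the order 1, 2, 3, 4, 7, 5, 6 into a single contig. Greedy selection is a heuristic: it does not guarantee the path with the largest total overlap, and repeats in a real genome can lead it to join reads that are not adjacent.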

4. Run the assembly program for only one chromosome. Compare the resulting assembly with the original sequence from NCBI.
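For the comparison in part 4, a simple starting point is to check each contig for an exact match in the original sequence and measure how much of the original is covered. The sketch below assumes error-free simulated reads, so a correctly assembled contig should be an exact substring. The function name `compare_to_reference` and the reported metrics are assumptions, and only the first occurrence of each contig is marked.

```python
def compare_to_reference(contigs, reference):
    """Summarize how well contigs match a reference sequence.

    A contig counts as exact if it occurs verbatim in the reference;
    covered_fraction is the share of reference bases under exact contigs.
    """
    covered = bytearray(len(reference))  # 1 where an exact contig lies
    exact = 0
    for contig in contigs:
        pos = reference.find(contig)
        if pos == -1:
            continue  # likely a misassembly (e.g. a wrong join at a repeat)
        exact += 1
        covered[pos:pos + len(contig)] = b"\x01" * len(contig)
    return {
        "contigs": len(contigs),
        "exact_contigs": exact,
        "longest_contig": max((len(c) for c in contigs), default=0),
        "covered_fraction": sum(covered) / len(reference) if reference else 0.0,
    }
```

Contigs that do not match exactly would need an alignment-based comparison to locate the misjoin, which is outside this sketch.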

5. Run the assembly program on the pooled reads from both chromosomes, and determine whether the contigs from the two chromosomes are kept distinct.
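One hedged way to test whether the contigs from the two chromosomes stay separate is to assign each contig to the chromosome that contains it exactly. In the sketch below, the function name `assign_contigs` and the labels "ambiguous" and "unplaced" are assumptions; an unplaced contig matches neither chromosome and may be a chimeric join of reads from both.

```python
from collections import Counter


def assign_contigs(contigs, chromosomes):
    """Count contigs per chromosome by exact substring match.

    chromosomes: dict mapping a name to its sequence.
    'ambiguous' = found in more than one chromosome (shared sequence);
    'unplaced'  = found in none (possible chimeric contig).
    """
    counts = Counter()
    for contig in contigs:
        hits = [name for name, seq in chromosomes.items() if contig in seq]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            counts["ambiguous"] += 1
        else:
            counts["unplaced"] += 1
    return counts
```

A high unplaced count would suggest that sequence shared between the chromosomes caused the greedy step to merge reads across them.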

What to submit:

Submit a report including the following:

The data structure that you used to store the simulated read fragments, with a discussion of how it would extend to genomes the size of a mammal's.

A discussion on the comparison of your assembled sequence with the original sequence.

Any references, if used.

Your program in an appendix.
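For the data-structure item in the report, one option worth discussing is an index of read prefixes, which avoids comparing all n*(n-1) pairs. The sketch below is an assumed design, not a required one, and the names `build_prefix_index` and `candidate_pairs` are hypothetical. It relies on one fact: if a suffix of read i overlaps a prefix of read j by at least k bases, then the first k bases of read j occur somewhere in read i.

```python
from collections import defaultdict


def build_prefix_index(reads, k):
    """Map each k-base prefix to the list of reads that start with it."""
    index = defaultdict(list)
    for j, read in enumerate(reads):
        if len(read) >= k:
            index[read[:k]].append(j)
    return index


def candidate_pairs(reads, k):
    """Yield ordered pairs (i, j) that could overlap by at least k bases.

    Only these pairs need a full suffix-prefix comparison.
    """
    index = build_prefix_index(reads, k)
    seen = set()
    for i, read in enumerate(reads):
        for p in range(len(read) - k + 1):
            # Look up every k-mer of read i among the read prefixes.
            for j in index.get(read[p:p + k], ()):
                if i != j and (i, j) not in seen:
                    seen.add((i, j))
                    yield (i, j)


reads = ["TACCTTG", "TTGAT", "GATATGG", "GGAG", "CTCTA", "CTAGT", "GCTCT"]
pairs = set(candidate_pairs(reads, 2))
```

The index grows linearly with the number of reads, but a plain Python dictionary of strings would still be too large for mammalian-scale read sets held in memory, which is a natural point for the scaling discussion.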
