Question
Objective: Design and write an OLC (Overlap Layout Consensus) program for DNA sequencing. Data [ Undergraduates ] Download the first two chromosome sequences of Yeast
Objective: Design and write an OLC (Overlap Layout Consensus) program for DNA sequencing.
Data
[Undergraduates] Download the first two chromosome sequences of Yeast (S. cerevisiae) (NC_001133.fna and NC_001134.fna) from
http://www.yeastgenome.org/strain/S288C/overview (pick Genebank ID or RefSeq ID, and use FASTA format)
[Graduates] Download the first and the last chromosome (CHR01 and CHR22) sequences of Human from ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens).
https://www.ncbi.nlm.nih.gov/nuccore?Db=nuccore&Cmd=DetailsSearch&Term=%22Homo+sapiens%22%5BOrganism%5D+AND+srcdb_refseq%5Bprop%5D+AND+biomol_genomic%5Bprop%5D+AND+NC_000000%5BPACC%5D+%3A+NC_999999%5BPACC%5D
Delete all occurrences of N in nucleotides.
Simulate reads by randomly segmenting the genome in chunks of average 400 bp. Each read can be randomly (Gaussian) distributed with the average of 400 bp, but you may use uniform lengths of 400 bp in reads.
You have to generate enough reads so that an arbitrary base position is found in 10 reads, on the average, namely the coverage depth is 10. As in the slides, in order to have the coverage depth of 10, the total number of bases in fragments has to be 10 times the genome size.
Real reads include both 5 to 3 and 3 to 5 reads. For simplicity, you assume that the simulated reads are only for 5 to 3 reads.
2. Write a program(s) to create the overlap graph from reads. An overlap(si , sj) is defined as the length of the longest matches between the suffix of si and the prefix of sj .In order to compute overlaps, you need to perform n*(n-1) comparisons, where n denotes the number of reads.
The following is an example with 7 reads, and the overlap value is in the parentheses.
read overlap
1. TACCTTG 2(3) 4(1) 7(1)
2. TTGAT 3(3)
3. GATATGG 4(2) 7(1)
4. GGAG 3(1) 7(1)
5. CTCTA 1(2) 6(3)
6. CTAGT
7. GCTCT 2(1) 5(4) 6(2)
3. Each read in part 2 above becomes a node in a graph. Each link (edge) between nodes represents the overlap value. The sequencing problem becomes the traveling salesman problem, visiting every graph node with the largest sum of overlap values. You can try a greedy algorithm. Start with the largest overlap value. Follow the path with largest overlap values.
4. Run the assembly program for only one chromosome. Compare the resulting assembly with the original sequence from NCBI.
5. Run the assembly program for read from two chromosomes, and determine if contigs in two chromosomes are distinguished.
What to submit:
Submit a report including the following:
The data structure that you used to store the simulated read fragments with a discussion on an extension to sequence sizes of mammals.
A discussion on the comparison of your assembled sequence with the original sequence.
Any references, if used.
Your program in an appendix.
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started