Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Project 1 a , 1 b - Read Mapping In this project, the goal is to identify the variants in a donor genome using reads

Project 1a,1b - Read Mapping
In this project, the goal is to identify the variants in a donor genome using reads from that genome and a
reference genome sequence.
Genomic Simulator:
The reference sequence is a randomly generated sequence. A donor genome sequence is simulated by
applying a mutation process on the reference sequence. In the mutation process, each position has a chance
of a substitution, insertion or deletion. The mutations are logged as they are being generated resulting in a
donor genome true mutation list relative to the reference genome. Reads are then simulated from the donor
genome by sampling uniformly from locations in the genome. Read errors are inserted into the reads with
each position in the read having a chance of substitution (read error), insertion or deletion. Reads are
generated in pairs with the insert size between the pairs uniformly generated between a minimum and
maximum insert size.
Project Goal:
The goal of the project is to predict the donor mutations list from the reads and reference genome.
Data formats:
The reference genome and reads format are in fasta format. Each sequence in a fasta file contains an
identifier line starting with a > and then the sequence on following lines.
An example of the reference sequence format is:
>genome_1000
AAGTGGGTCTCGGCGGAACTGGCTACGAGAATATGCAGTTGGCAATGGTACCACTTTTGTAAGTACATAGTTCATGAGTC
CGTTTTGACGTGGTGGCCATCTTTGTCACACCTCGATCCACGCCCTATAATACTTAGTTAACGCCTTCTATGTCGTGTAA
TCCACAAATTAATTCGAGAACATCCTGCCCGTAGGTTTCAGATGGATTCATAGTGCCCCATTTGGTGACGAGCGCTTGAG
GCAACTATTTAGCTTTGCGGCGTGACCCGCACTACCGTATCCGTTGGGCCTGTTTTAAGGAAAAATATAGGGCAAGACTG
ACTTGGCCCAGTGCAATCGCGCACCCCGCCTCGCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGTTAAGGTTGTACTAGG
TTCGGTTAGTACACTTTGACTACACCATCAGTAACTATTAAGATCAGATTCGTTCGTGTTAGTAGGATCCATGGATTCCG
ACATCGGCCCGAAGCCCCCTGGGTCACAATGAGGCAGGCGATCGGAGCGACATACGACCCCACTCCACATTAATGCGATG
Reads which are part of a pair are denoted with the same identifier ending with a /1 or /2. An example of a
reads format is:
>read_0/1
TGACTACACCATCAGTAACATTAAGATCAGATTCGTTCGTGTTACTAGG
>read_0/2
CCACTCCACATTCATGCGATGGATATGATCCCACGGCAAGTCGCCTTTGA
>read_1/1
ACTACGTATCCGTTGGACCTGTTTTAAGGAAAAATATAGGGCAAGACTGAC
>read_1/2
AGGAAAAATATAGGGCAAGACTGATTTGGCCCAGTGCAATCGCGCACCCC
>read_2/1
AGCGCGCACCCCGCCTCTCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGT
>read_2/2
ATTCGTTCGTGTCACTAGGATCCTGGATTACCGACATCGGCCCTATGCCCC
>read_3/1
CAGGCGAACGGAGCGACATACGACCCCACTCCACATTAATGCGATGGATA
>read_3/2
CAGTAATTAGATGGGATAATTTCGTTCGGGGTCCAACCACCTATAGGTAG
>read_4/1
GTTGCATGTCCAAGTAGAGAAGAGCCAGTCCCCCGGACACGCTCCAAAACG
>read_4/2
GAGATACATCTCGAGGATGGGCCTGCGCGTCAGCTAATACATTAAATTCA
The mutation list format has a line for each mutation between the reference and the donor genome. The
format starts with a > symbol followed by a S,I, or D for a substitution, insertion and deletion followed by
the position in the reference followed by a space. For substitutions, the format continues with the reference
and donor bases are provided. For insertions, the new sequence that is added is included. For deletions, the
sequence that is removed is included. An example of the format is:
>S732 G A
>S851 A G
>I260 T
>I274 T
>D96 C
>D471 T
Evaluation:
The predicted mutation list is evaluated by comparing entries between the predicted mutation list and the true
mutation list. An entry is considered correct only if it matches exactly between the two lists. We evaluate the
predictions by computing the F-score for overall mutations, substitutions only, insertions only and deletions
only. Need an f-score of at least 0.5
Sample Genome:
A sample genome of length 1000 is provided with the true mutation list. A set of reads with errors and without
errors is provided. This genome can be used to develop and evaluate a solution to the project before applying
it to the larger problems.
Project 1a:
A reference genome of 10,000 is provided with paired reads containing errors is provided.
Output needs to be predictions.csv in a zip file
"I used the trivial algorithm - I got my score up after realizing a lot of the errors I had in my predictions csv file were read errors rather than actual mutations (so I had a lot of extra errors, that might be whats dragging your score down). I would recommend using some sort of consensus algorithm to filter out which errors actually appear frequently enough in your errors list to count as actual mutations, and use the sample_no_repeats_mutations.txt file to verify your predictions match as closely as possible"
Your code should be able to run from standard command line interface as described in the programming assignment. There should be a shell command called runproject.sh which will run with the project input files located in the same directory.

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Database Processing Fundamentals, Design, and Implementation

Authors: David M. Kroenke, David J. Auer

14th edition

133876705, 9781292107639, 1292107634, 978-0133876703

More Books

Students also viewed these Databases questions