Project 1a, 1b - Read Mapping
In this project, the goal is to identify the variants in a donor genome using reads from that genome and a
reference genome sequence.
Genomic Simulator:
The reference sequence is a randomly generated sequence. A donor genome sequence is simulated by
applying a mutation process to the reference sequence. In the mutation process, each position has a chance
of a substitution, insertion, or deletion. The mutations are logged as they are generated, resulting in a
donor genome true mutation list relative to the reference genome. Reads are then simulated from the donor
genome by sampling uniformly from locations in the genome. Read errors are inserted into the reads, with
each position in the read having a chance of a substitution, insertion, or deletion read error. Reads are
generated in pairs, with the insert size between the pairs drawn uniformly between a minimum and
maximum insert size.
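The mutation process described above can be sketched roughly as follows. This is only an illustration, not the course's actual simulator: the function name `simulate_donor`, the rate parameters, the 0-based positions, the single-base events, and the choice to place an inserted base after the current position are all assumptions made for the example.

```python
import random

BASES = "ACGT"

def simulate_donor(reference, sub_rate, ins_rate, del_rate, seed=0):
    """Return (donor, log) after applying random single-base mutations.

    Hypothetical sketch: positions are 0-based indices into the reference,
    and every event affects exactly one base.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    donor = []
    log = []
    for pos, base in enumerate(reference):
        r = rng.random()
        if r < sub_rate:
            # Substitution: replace the base with a different one.
            new = rng.choice([b for b in BASES if b != base])
            donor.append(new)
            log.append(("S", pos, base, new))
        elif r < sub_rate + ins_rate:
            # Insertion: keep the base and add a random base after it.
            new = rng.choice(BASES)
            donor.append(base)
            donor.append(new)
            log.append(("I", pos, new))
        elif r < sub_rate + ins_rate + del_rate:
            # Deletion: drop the base from the donor.
            log.append(("D", pos, base))
        else:
            donor.append(base)
    return "".join(donor), log
```

Logging each event at the moment it is generated is what makes the true mutation list available for evaluation later.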
Project Goal:
The goal of the project is to predict the donor mutation list from the reads and the reference genome.
Data formats:
The reference genome and the reads are in FASTA format. Each sequence in a FASTA file contains an
identifier line starting with a ">" and then the sequence on the following lines.
An example of the reference sequence format is:
>genome
AAGTGGGTCTCGGCGGAACTGGCTACGAGAATATGCAGTTGGCAATGGTACCACTTTTGTAAGTACATAGTTCATGAGTC
CGTTTTGACGTGGTGGCCATCTTTGTCACACCTCGATCCACGCCCTATAATACTTAGTTAACGCCTTCTATGTCGTGTAA
TCCACAAATTAATTCGAGAACATCCTGCCCGTAGGTTTCAGATGGATTCATAGTGCCCCATTTGGTGACGAGCGCTTGAG
GCAACTATTTAGCTTTGCGGCGTGACCCGCACTACCGTATCCGTTGGGCCTGTTTTAAGGAAAAATATAGGGCAAGACTG
ACTTGGCCCAGTGCAATCGCGCACCCCGCCTCGCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGTTAAGGTTGTACTAGG
TTCGGTTAGTACACTTTGACTACACCATCAGTAACTATTAAGATCAGATTCGTTCGTGTTAGTAGGATCCATGGATTCCG
ACATCGGCCCGAAGCCCCCTGGGTCACAATGAGGCAGGCGATCGGAGCGACATACGACCCCACTCCACATTAATGCGATG
Reads that are part of a pair are denoted with the same identifier followed by one of two distinguishing
endings. An example of the reads format is:
read
TGACTACACCATCAGTAACATTAAGATCAGATTCGTTCGTGTTACTAGG
read
CCACTCCACATTCATGCGATGGATATGATCCCACGGCAAGTCGCCTTTGA
read
ACTACGTATCCGTTGGACCTGTTTTAAGGAAAAATATAGGGCAAGACTGAC
read
AGGAAAAATATAGGGCAAGACTGATTTGGCCCAGTGCAATCGCGCACCCC
read
AGCGCGCACCCCGCCTCTCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGT
read
ATTCGTTCGTGTCACTAGGATCCTGGATTACCGACATCGGCCCTATGCCCC
read
CAGGCGAACGGAGCGACATACGACCCCACTCCACATTAATGCGATGGATA
read
CAGTAATTAGATGGGATAATTTCGTTCGGGGTCCAACCACCTATAGGTAG
read
GTTGCATGTCCAAGTAGAGAAGAGCCAGTCCCCCGGACACGCTCCAAAACG
read
GAGATACATCTCGAGGATGGGCCTGCGCGTCAGCTAATACATTAAATTCA
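Reading either file comes down to grouping sequence lines under the most recent identifier line. A minimal sketch is shown below; the function name `parse_fasta` is hypothetical, and it assumes identifier lines begin with ">" as in standard FASTA.

```python
from typing import Dict, Iterable

def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each identifier (the text after '>') to its full sequence."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: store the one collected so far.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            # Sequence lines may wrap, so collect and join them.
            chunks.append(line.upper())
    if name is not None:
        records[name] = "".join(chunks)
    return records
```

Because the result is a dictionary, records that share an identifier overwrite each other, so the two reads of a pair must carry distinct identifiers for both to survive.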
The mutation list format has one line for each mutation between the reference and the donor genome. Each
line starts with a symbol, followed by an S, I, or D for a substitution, insertion, or deletion, followed by
the position in the reference, followed by a space. For substitutions, the line continues with the reference
and donor bases. For insertions, the new sequence that is added is included. For deletions, the
sequence that is removed is included. An example of the format is:
S G A
S A G
I T
I T
D C
D T
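Writing entries in this layout could look roughly like the sketch below. The helper name `format_mutation` is hypothetical, and the leading symbol is passed in by the caller because the description above does not preserve it; the exact spacing is also an assumption and should be checked against the provided true mutation list.

```python
def format_mutation(symbol, kind, position, *bases):
    """Build one mutation-list line: symbol, type, position, space, bases.

    Hypothetical helper: `symbol` is whatever leading character the
    assignment specifies, and the layout is an assumption.
    """
    if kind == "S":
        expected = 2  # reference base and donor base
    elif kind in ("I", "D"):
        expected = 1  # inserted or deleted sequence
    else:
        raise ValueError(f"unknown mutation type: {kind!r}")
    if len(bases) != expected:
        raise ValueError(f"{kind} expects {expected} sequence field(s)")
    return f"{symbol}{kind}{position} {' '.join(bases)}"
```

Since an entry only counts as correct when it matches exactly, it is worth producing every line through a single function like this so that the formatting stays consistent.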
Evaluation:
The predicted mutation list is evaluated by comparing entries between the predicted mutation list and the true
mutation list. An entry is considered correct only if it matches exactly between the two lists. We evaluate the
predictions by computing the F-score for overall mutations, substitutions only, insertions only, and deletions
only. A minimum F-score is required.
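The exact-match scoring can be sketched as below. It assumes the F-score is the usual F1 (the harmonic mean of precision and recall), which the description does not state explicitly, and it represents each entry as a tuple whose first element is the type letter; the function names are hypothetical.

```python
def f_score(predicted, truth):
    """F1 over exact-match entries (assumed definition of the F-score)."""
    pred, true = set(predicted), set(truth)
    tp = len(pred & true)  # entries that match exactly in both lists
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

def f_scores_by_type(predicted, truth):
    """Score all mutations together, then each type on its own."""
    scores = {"overall": f_score(predicted, truth)}
    for kind in ("S", "I", "D"):
        scores[kind] = f_score(
            [m for m in predicted if m[0] == kind],
            [m for m in truth if m[0] == kind],
        )
    return scores
```

Under this definition, every read error reported as a mutation lowers precision, and every missed mutation lowers recall, so over-reporting and under-reporting are both penalized.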
Sample Genome:
A sample genome is provided with its true mutation list. Sets of reads with errors and without
errors are provided. This genome can be used to develop and evaluate a solution to the project before applying
it to the larger problems.
Project 1a:
A reference genome is provided along with paired reads containing errors.
The output must be a predictions.csv file inside a zip file.
"I used the trivial algorithm. I got my score up after realizing that a lot of the errors in my predictions csv file were read errors rather than actual mutations, so I had a lot of extra errors. That might be what's dragging your score down. I would recommend using some sort of consensus algorithm to filter out which errors actually appear frequently enough in your errors list to count as actual mutations, and using the samplenorepeatsmutations.txt file to verify that your predictions match as closely as possible."
Your code should be able to run from a standard command-line interface as described in the programming assignment. There should be a shell command called runproject.sh which will run with the project input files located in the same directory.
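One simple way to read the consensus suggestion is to count how many reads support each candidate mutation and keep only those seen often enough. The sketch below assumes candidates are tuples of the form (type, position, bases...), with one tuple per supporting read; the name `consensus_filter` and the default threshold of 3 are assumptions that would need tuning against the sample genome.

```python
from collections import Counter

def consensus_filter(candidates, min_support=3):
    """Keep candidate mutations observed in at least `min_support` reads.

    `candidates` holds one tuple per read that showed the difference,
    so repeated tuples mean repeated, independent observations.
    """
    counts = Counter(candidates)
    kept = [m for m, c in counts.items() if c >= min_support]
    # Order by reference position, then by mutation type.
    return sorted(kept, key=lambda m: (m[1], m[0]))
```

A random read error is unlikely to recur at the same position in several reads, while a true mutation should appear in most reads covering that position, so a reasonable threshold depends on the read coverage.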