Project 1a, 1b - Read Mapping

In this project, the goal is to identify the variants in a donor genome using reads from that genome and a reference genome sequence.

Genomic Simulator

The reference sequence is a randomly generated sequence. A donor genome sequence is simulated by applying a mutation process to the reference sequence. In the mutation process, each position has a chance of a substitution, insertion or deletion. The mutations are logged as they are generated, resulting in a true mutation list for the donor genome relative to the reference genome. Reads are then simulated from the donor genome by sampling uniformly from locations in the genome. Read errors are inserted into the reads, with each position in the read having a chance of substitution (read error), insertion or deletion. Reads are generated in pairs, with the insert size between the pairs drawn uniformly between a minimum and maximum insert size.

Project Goal

The goal of the project is to predict the donor mutation list from the reads and the reference genome.

Data formats

The reference genome and the reads are in FASTA format. Each sequence in a FASTA file consists of an identifier line starting with a ">" and then the sequence on the following lines. An example of the reference sequence format is:

    >genome_1000
    AAGTGGGTCTCGGCGGAACTGGCTACGAGAATATGCAGTTGGCAATGGTACCACTTTTGTAAGTACATAGTTCATGAGTC
    CGTTTTGACGTGGTGGCCATCTTTGTCACACCTCGATCCACGCCCTATAATACTTAGTTAACGCCTTCTATGTCGTGTAA
    TCCACAAATTAATTCGAGAACATCCTGCCCGTAGGTTTCAGATGGATTCATAGTGCCCCATTTGGTGACGAGCGCTTGAG
    GCAACTATTTAGCTTTGCGGCGTGACCCGCACTACCGTATCCGTTGGGCCTGTTTTAAGGAAAAATATAGGGCAAGACTG
    ACTTGGCCCAGTGCAATCGCGCACCCCGCCTCGCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGTTAAGGTTGTACTAGG
    TTCGGTTAGTACACTTTGACTACACCATCAGTAACTATTAAGATCAGATTCGTTCGTGTTAGTAGGATCCATGGATTCCG
    ACATCGGCCCGAAGCCCCCTGGGTCACAATGAGGCAGGCGATCGGAGCGACATACGACCCCACTCCACATTAATGCGATG

Reads which are part of a pair are denoted with the same identifier ending with a /1 or /2. An example of the reads format is:

    >read_0/1
    TGACTACACCATCAGTAACATTAAGATCAGATTCGTTCGTGTTACTAGG
    >read_0/2
    CCACTCCACATTCATGCGATGGATATGATCCCACGGCAAGTCGCCTTTGA
    >read_1/1
    ACTACGTATCCGTTGGACCTGTTTTAAGGAAAAATATAGGGCAAGACTGAC
    >read_1/2
    AGGAAAAATATAGGGCAAGACTGATTTGGCCCAGTGCAATCGCGCACCCC
    >read_2/1
    AGCGCGCACCCCGCCTCTCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGT
    >read_2/2
    ATTCGTTCGTGTCACTAGGATCCTGGATTACCGACATCGGCCCTATGCCCC
    >read_3/1
    CAGGCGAACGGAGCGACATACGACCCCACTCCACATTAATGCGATGGATA
    >read_3/2
    CAGTAATTAGATGGGATAATTTCGTTCGGGGTCCAACCACCTATAGGTAG
    >read_4/1
    GTTGCATGTCCAAGTAGAGAAGAGCCAGTCCCCCGGACACGCTCCAAAACG
    >read_4/2
    GAGATACATCTCGAGGATGGGCCTGCGCGTCAGCTAATACATTAAATTCA

The mutation list format has one line for each mutation between the reference and the donor genome. Each line starts with a ">" symbol, followed by an S, I, or D for a substitution, insertion or deletion, followed by the position in the reference, followed by a space. For substitutions, the line continues with the reference base and the donor base. For insertions, the new sequence that is added is included. For deletions, the sequence that is removed is included. An example of the format is:

    >S732 G A
    >S851 A G
    >I260 T
    >I274 T
    >D96 C
    >D471 T

Evaluation

The predicted mutation list is evaluated by comparing entries between the predicted mutation list and the true mutation list. An entry is considered correct only if it matches exactly between the two lists. We evaluate the predictions by computing the F-score for overall mutations, substitutions only, insertions only and deletions only. An F-score of at least 0.5 is needed.

Sample Genome

A sample genome of length 1000 is provided with the true mutation list. Two sets of reads are provided, one with errors and one without. This genome can be used to develop and evaluate a solution to the project before applying it to the larger problems.

Project 1a

A reference genome of length 10,000 is provided, along with paired reads containing errors. The output needs to be predictions.csv in a zip file.

I used the trivial algorithm. I got my score up after realizing that a lot of the entries in my predictions.csv file were read errors rather than actual mutations (so I had a lot of extra entries; that might be what's dragging your score down). I would recommend using some sort of consensus algorithm to keep only the mismatches that appear frequently enough across reads to count as actual mutations, and using the sample_no_repeats_mutations.txt file to verify that your predictions match as closely as possible.

Your code should be able to run from a standard command-line interface as described in the programming assignment. There should be a shell command called runproject.sh which will run with the project input files located in the same directory.
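
The consensus filtering and the evaluation described above can be sketched roughly as follows. This is a minimal illustration and not the course's reference implementation. The function names, the tuple representation of a mutation, and the `min_support` threshold are all assumptions made for the example. The assignment text only says "F-score", so the sketch assumes the usual F1 (harmonic mean of precision and recall).

```python
from collections import Counter


def consensus_filter(candidates, min_support=3):
    """Keep candidate mutations reported by at least `min_support` reads.

    `candidates` holds one tuple per mismatch observed in an aligned read,
    e.g. ("S", 732, "G", "A") or ("I", 260, "T"). This tuple layout is a
    hypothetical choice for the sketch. A read error tends to show up in a
    single read, while a true donor mutation shows up in most reads that
    cover the position.
    """
    counts = Counter(candidates)
    return sorted(m for m, n in counts.items() if n >= min_support)


def f_score(predicted, truth):
    """F1 over exact-match entries (assumed to be the metric used)."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: two mutations from the format example, each seen
# in several reads, plus one made-up mismatch seen in a single read.
candidates = (
    [("S", 732, "G", "A")] * 4
    + [("I", 260, "T")] * 3
    + [("S", 100, "A", "C")]
)
truth = [("S", 732, "G", "A"), ("I", 260, "T"), ("D", 96, "C")]

kept = consensus_filter(candidates, min_support=3)
print(kept)
print(f_score(set(candidates), truth), f_score(kept, truth))
```

In this toy case the filter drops the mismatch seen only once, which raises precision and lifts the score from about 0.67 to about 0.8. A sensible threshold depends on read coverage and the error rate, so it is worth tuning on the sample genome, where the true mutation list is available.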

