Question: Project 1a, 1b - Read Mapping

Project 1a, 1b - Read Mapping

In this project, the goal is to identify the variants in a donor genome using reads from that genome and a reference genome sequence.

Genomic Simulator:

The reference sequence is a randomly generated sequence. A donor genome sequence is simulated by applying a mutation process on the reference sequence. In the mutation process, each position has a chance of a substitution, insertion or deletion. The mutations are logged as they are being generated, resulting in a donor genome true mutation list relative to the reference genome. Reads are then simulated from the donor genome by sampling uniformly from locations in the genome. Read errors are inserted into the reads, with each position in the read having a chance of substitution (read error), insertion or deletion. Reads are generated in pairs, with the insert size between the pairs uniformly generated between a minimum and maximum insert size.
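The simulation described above can be sketched roughly as follows. This is only an illustration, not the course's actual simulator: the function names, the default rates, the 0-based positions, the tuple layout of the logged mutations, and the choice to place an inserted base after the reference base are all assumptions, and read errors are left out for brevity.

```python
import random

BASES = "ACGT"


def simulate_donor(reference, sub_rate=0.01, ins_rate=0.005, del_rate=0.005, seed=0):
    """Apply a per-position mutation process and log every mutation.

    Returns (donor, mutations). Each logged mutation is a tuple whose first
    element is the kind and whose second is the 0-based reference position
    (an assumed layout, not the project's file format).
    """
    rng = random.Random(seed)
    donor = []
    mutations = []
    for pos, base in enumerate(reference):
        r = rng.random()
        if r < sub_rate:
            # Substitution: replace the base with a different one.
            new_base = rng.choice([b for b in BASES if b != base])
            donor.append(new_base)
            mutations.append(("substitution", pos, base, new_base))
        elif r < sub_rate + ins_rate:
            # Insertion: keep the base and add one random base after it.
            inserted = rng.choice(BASES)
            donor.append(base)
            donor.append(inserted)
            mutations.append(("insertion", pos, inserted))
        elif r < sub_rate + ins_rate + del_rate:
            # Deletion: drop the base from the donor.
            mutations.append(("deletion", pos, base))
        else:
            donor.append(base)
    return "".join(donor), mutations


def sample_read_pair(genome, read_len=50, min_insert=90, max_insert=110, rng=None):
    """Sample one read pair uniformly, with a uniformly drawn insert size."""
    if rng is None:
        rng = random.Random(0)
    insert = rng.randint(min_insert, max_insert)
    span = 2 * read_len + insert
    if span > len(genome):
        raise ValueError("genome is too short for this read length and insert size")
    start = rng.randrange(len(genome) - span + 1)
    first = genome[start:start + read_len]
    second = genome[start + read_len + insert:start + span]
    return first, second
```

Because every mutation is logged at the moment it is applied, the log serves directly as the true mutation list against which predictions are later compared.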

Project Goal:

The goal of the project is to predict the donor mutations list from the reads and reference genome.

Data formats:

The reference genome and reads are in fasta format. Each sequence in a fasta file contains an identifier line starting with a > and then the sequence on the following lines.

An example of the reference sequence format is:

>genome_1000

AAGTGGGTCTCGGCGGAACTGGCTACGAGAATATGCAGTTGGCAATGGTACCACTTTTGTAAGTACATAGTTCATGAGTC

CGTTTTGACGTGGTGGCCATCTTTGTCACACCTCGATCCACGCCCTATAATACTTAGTTAACGCCTTCTATGTCGTGTAA

TCCACAAATTAATTCGAGAACATCCTGCCCGTAGGTTTCAGATGGATTCATAGTGCCCCATTTGGTGACGAGCGCTTGAG

GCAACTATTTAGCTTTGCGGCGTGACCCGCACTACCGTATCCGTTGGGCCTGTTTTAAGGAAAAATATAGGGCAAGACTG

ACTTGGCCCAGTGCAATCGCGCACCCCGCCTCGCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGTTAAGGTTGTACTAGG

TTCGGTTAGTACACTTTGACTACACCATCAGTAACTATTAAGATCAGATTCGTTCGTGTTAGTAGGATCCATGGATTCCG

ACATCGGCCCGAAGCCCCCTGGGTCACAATGAGGCAGGCGATCGGAGCGACATACGACCCCACTCCACATTAATGCGATG
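A file in this format can be read with a few lines of code. The sketch below is one possible way to do it; the function name `parse_fasta` is a made-up helper, not something supplied with the project.

```python
def parse_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of fasta lines.

    Sequence lines that belong to the same record are concatenated.
    Blank lines are skipped.
    """
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Since an open file is an iterable of lines, it can be passed in directly, for example `dict(parse_fasta(open(path)))` where `path` is whatever reference file you were given.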

Reads which are part of a pair are denoted with the same identifier ending with a /1 or /2.

An example of the reads format is:

>read_0/1

TGACTACACCATCAGTAACATTAAGATCAGATTCGTTCGTGTTACTAGG

>read_0/2

CCACTCCACATTCATGCGATGGATATGATCCCACGGCAAGTCGCCTTTGA

>read_1/1

ACTACGTATCCGTTGGACCTGTTTTAAGGAAAAATATAGGGCAAGACTGAC

>read_1/2

AGGAAAAATATAGGGCAAGACTGATTTGGCCCAGTGCAATCGCGCACCCC

>read_2/1

AGCGCGCACCCCGCCTCTCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGT

>read_2/2

ATTCGTTCGTGTCACTAGGATCCTGGATTACCGACATCGGCCCTATGCCCC

>read_3/1

CAGGCGAACGGAGCGACATACGACCCCACTCCACATTAATGCGATGGATA

>read_3/2

CAGTAATTAGATGGGATAATTTCGTTCGGGGTCCAACCACCTATAGGTAG

>read_4/1

GTTGCATGTCCAAGTAGAGAAGAGCCAGTCCCCCGGACACGCTCCAAAACG

>read_4/2

GAGATACATCTCGAGGATGGGCCTGCGCGTCAGCTAATACATTAAATTCA
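Because the two reads of a pair share an identifier up to the /1 or /2 suffix, they can be grouped by that shared prefix. A minimal sketch, assuming the records are already available as (identifier, sequence) tuples; the helper name `pair_reads` is hypothetical:

```python
def pair_reads(records):
    """Group (identifier, sequence) records into pairs.

    Returns a dict mapping the shared identifier prefix to a two-element
    list [first_read, second_read]; a missing mate is left as None.
    """
    pairs = {}
    for name, seq in records:
        base, sep, mate = name.rpartition("/")
        base, mate = base.strip(), mate.strip()
        if not sep or mate not in ("1", "2"):
            raise ValueError("identifier does not end in /1 or /2: " + name)
        slot = pairs.setdefault(base, [None, None])
        slot[int(mate) - 1] = seq
    return pairs
```

For instance, the records `("read_0/1", ...)` and `("read_0/2", ...)` end up together under the key `"read_0"`.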

The mutation list format has a line for each mutation between the reference and the donor genome. Each line starts with a > symbol followed by a type code for a substitution, insertion or deletion, followed by the position in the reference, followed by a space. For substitutions, the line continues with the reference and donor bases. For insertions, the new sequence that is added is included. For deletions, the sequence that is removed is included. An example of the format is:

>732 G A

>851 A G

>260

>274

>96

>471

Evaluation:

The predicted mutation list is evaluated by comparing entries between the predicted mutation list and the true

mutation list. An entry is considered correct only if it matches exactly between the two lists. We evaluate the

predictions by computing the F

-

score for overall mutations, substitutions only, insertions only and deletions

only. Need an f

-

score of at least

0.5
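The exact-match comparison can be tried locally before submitting. The sketch below assumes that "F-score" means the usual F1 score (the harmonic mean of precision and recall), which the assignment text does not spell out, and it represents each mutation as a tuple whose first element is the kind. Both the tuple layout and the function names are illustrative.

```python
def f_score(predicted, truth):
    """F1 score over exact-match entries (assumed meaning of 'F-score')."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def f_scores_by_type(predicted, truth):
    """Overall score plus one score per mutation kind (first tuple element)."""
    predicted, truth = list(predicted), list(truth)
    scores = {"overall": f_score(predicted, truth)}
    kinds = {entry[0] for entry in predicted} | {entry[0] for entry in truth}
    for kind in sorted(kinds):
        scores[kind] = f_score(
            [e for e in predicted if e[0] == kind],
            [e for e in truth if e[0] == kind],
        )
    return scores
```

Note that a prediction with the right position but the wrong bases counts as both a false positive and a missed true entry, so near misses cost twice.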

Sample Genome:

A sample genome of length 1000 is provided with the true mutation list. A set of reads with errors and without errors is provided. This genome can be used to develop and evaluate a solution to the project before applying it to the larger problems.

Project 1

A reference genome of length 10,000 is provided with paired reads containing errors.

The output needs to be a predictions.csv file in a zip file.

"I used the trivial algorithm - I got my score up after realizing a lot of the errors I had in my predictions csv file were read errors rather than actual mutations (so I had a lot of extra errors, that might be what's dragging your score down). I would recommend using some sort of consensus algorithm to filter out which errors actually appear frequently enough in your errors list to count as actual mutations, and use the sample_ _repeats_mutations.txt file to verify your predictions match as closely as possible"
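One simple reading of the consensus idea in the quote is to count how often each candidate difference is observed across the mapped reads and keep only those seen often enough. The sketch below assumes you already have a list with one entry per observation; the name `consensus_filter`, the tuple layout, and the default threshold of 3 are illustrative guesses that would need tuning against the sample genome.

```python
from collections import Counter


def consensus_filter(observed, min_support=3):
    """Keep candidate mutations observed in at least `min_support` reads.

    `observed` holds one tuple per read-level observation, laid out as
    (kind, position, ...). Isolated observations are treated as likely
    read errors and dropped.
    """
    counts = Counter(observed)
    kept = [m for m, count in counts.items() if count >= min_support]
    # Sort by position, then kind, so the output order is stable.
    return sorted(kept, key=lambda m: (m[1], m[0], m[2:]))
```

A fixed count works when coverage is roughly even. Where coverage varies a lot, comparing the count to the number of reads covering that position may be a better test.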

Your code should be able to run from a standard command-line interface, as described in the programming assignment. There should be a shell script called runproject.sh which will run with the project input files located in the same directory.
