Question

Posted on Jul 26, 2024

Project 1a, 1b - Read Mapping

In this project, the goal is to identify the variants in a donor genome using reads from that genome and a reference genome sequence.

Genomic Simulator:

The reference sequence is a randomly generated sequence. A donor genome sequence is simulated by applying a mutation process to the reference sequence. In the mutation process, each position has a chance of a substitution, insertion or deletion. The mutations are logged as they are generated, resulting in a true mutation list for the donor genome relative to the reference genome. Reads are then simulated from the donor genome by sampling uniformly from locations in the genome. Read errors are inserted into the reads, with each position in the read having a chance of substitution (read error), insertion or deletion. Reads are generated in pairs, with the insert size between the pairs uniformly generated between a minimum and maximum insert size.
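The mutation step described above can be sketched roughly as follows. This is only an illustration, not the course's actual simulator: the function name `simulate_donor`, the single shared `rate`, the 0-based positions, the single-base insertions and deletions, and the `"sub"`/`"ins"`/`"del"` labels are all assumptions made for the example.

```python
import random

BASES = "ACGT"

def simulate_donor(reference, rate=0.01, seed=0):
    """Return (donor, mutations); mutations are (kind, ref_pos, detail) tuples.

    Hypothetical conventions: positions are 0-based, and an insertion is
    placed immediately after the reference base at ref_pos.
    """
    rng = random.Random(seed)
    donor = []
    mutations = []
    for pos, base in enumerate(reference):
        roll = rng.random()
        if roll < rate:
            # Substitution: replace the base with a different one.
            new_base = rng.choice([b for b in BASES if b != base])
            donor.append(new_base)
            mutations.append(("sub", pos, (base, new_base)))
        elif roll < 2 * rate:
            # Insertion: keep the reference base, then add one new base.
            inserted = rng.choice(BASES)
            donor.append(base)
            donor.append(inserted)
            mutations.append(("ins", pos, inserted))
        elif roll < 3 * rate:
            # Deletion: drop the reference base from the donor.
            mutations.append(("del", pos, base))
        else:
            donor.append(base)
    return "".join(donor), mutations
```

Logging each mutation at the moment it is applied is what makes the true mutation list exact, so the donor length always equals the reference length plus insertions minus deletions.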

Project Goal:

The goal of the project is to predict the donor's mutation list from the reads and the reference genome.

Data formats:

The reference genome and the reads are in fasta format. Each sequence in a fasta file has an identifier line starting with a > and then the sequence on the following lines.

An example of the reference sequence format is:

>genome_1000
AAGTGGGTCTCGGCGGAACTGGCTACGAGAATATGCAGTTGGCAATGGTACCACTTTTGTAAGTACATAGTTCATGAGTC
CGTTTTGACGTGGTGGCCATCTTTGTCACACCTCGATCCACGCCCTATAATACTTAGTTAACGCCTTCTATGTCGTGTAA
TCCACAAATTAATTCGAGAACATCCTGCCCGTAGGTTTCAGATGGATTCATAGTGCCCCATTTGGTGACGAGCGCTTGAG
GCAACTATTTAGCTTTGCGGCGTGACCCGCACTACCGTATCCGTTGGGCCTGTTTTAAGGAAAAATATAGGGCAAGACTG
ACTTGGCCCAGTGCAATCGCGCACCCCGCCTCGCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGTTAAGGTTGTACTAGG
TTCGGTTAGTACACTTTGACTACACCATCAGTAACTATTAAGATCAGATTCGTTCGTGTTAGTAGGATCCATGGATTCCG
ACATCGGCCCGAAGCCCCCTGGGTCACAATGAGGCAGGCGATCGGAGCGACATACGACCCCACTCCACATTAATGCGATG
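Reading this format takes only a few lines. The sketch below is one possible parser, not something supplied with the project; the name `parse_fasta` is made up for the example, and it assumes that any line starting with > begins a new record and that blank lines can be ignored.

```python
def parse_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of fasta lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new identifier line closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence lines are concatenated until the next identifier.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

Because it accepts any iterable of lines, it can be fed an open file handle directly, for example `list(parse_fasta(open(path)))` with `path` pointing at the reference or reads file.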

Reads which are part of a pair are denoted with the same identifier ending with /1 or /2. An example of the reads format is:

>read_0/1
TGACTACACCATCAGTAACATTAAGATCAGATTCGTTCGTGTTACTAGG
>read_0/2
CCACTCCACATTCATGCGATGGATATGATCCCACGGCAAGTCGCCTTTGA
>read_1/1
ACTACGTATCCGTTGGACCTGTTTTAAGGAAAAATATAGGGCAAGACTGAC
>read_1/2
AGGAAAAATATAGGGCAAGACTGATTTGGCCCAGTGCAATCGCGCACCCC
>read_2/1
AGCGCGCACCCCGCCTCTCAGCAGGCCTCTAGAAGCAGCAGGTCTCGTGT
>read_2/2
ATTCGTTCGTGTCACTAGGATCCTGGATTACCGACATCGGCCCTATGCCCC
>read_3/1
CAGGCGAACGGAGCGACATACGACCCCACTCCACATTAATGCGATGGATA
>read_3/2
CAGTAATTAGATGGGATAATTTCGTTCGGGGTCCAACCACCTATAGGTAG
>read_4/1
GTTGCATGTCCAAGTAGAGAAGAGCCAGTCCCCCGGACACGCTCCAAAACG
>read_4/2
GAGATACATCTCGAGGATGGGCCTGCGCGTCAGCTAATACATTAAATTCA
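Since the two mates of a pair share an identifier up to the /1 or /2 suffix, they can be grouped by splitting on the last slash. The helper below is a hypothetical sketch (the name `group_pairs` and the choice to file suffix-less reads under mate "1" are assumptions); it takes (identifier, sequence) tuples such as a fasta parser would produce.

```python
from collections import defaultdict

def group_pairs(records):
    """Map each base identifier to a {mate_number: sequence} dict."""
    pairs = defaultdict(dict)
    for name, seq in records:
        base, sep, mate = name.rpartition("/")
        if not sep:
            # No "/" suffix: treat the read as unpaired (assumption).
            base, mate = name, "1"
        pairs[base][mate] = seq
    return dict(pairs)
```

Keeping mates together makes it possible to use the known insert-size range later, for instance to reject a placement of one mate that lies implausibly far from the other.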

The mutation list format has one line for each mutation between the reference and the donor genome. Each line starts with a > symbol, followed by a code for the mutation type (substitution, insertion, or deletion), followed by the position in the reference, followed by a space. For substitutions, the line continues with the reference and donor bases. For insertions, it includes the new sequence that is added. For deletions, it includes the sequence that is removed. An example of the format is:

>732 G A
>851 A G
>260
>274
>96
>471

Evaluation:

The predicted mutation list is evaluated by comparing its entries against the true mutation list. An entry is considered correct only if it matches exactly between the two lists. We evaluate the predictions by computing the F-score for overall mutations, substitutions only, insertions only, and deletions only. An F-score of at least 0.5 is needed.
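For checking predictions locally against the sample genome, the exact-match scoring can be approximated as below. The assignment only says "F-score", so treating it as F1 (the harmonic mean of precision and recall) is an assumption, as are the function name `f_score` and the tuple representation of entries used in the usage note.

```python
def f_score(predicted, truth):
    """F1 over exact-match entries; both arguments are iterables of hashable entries."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)  # entries identical in both lists
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)
```

With three predictions of which two are among four true mutations, precision is 2/3 and recall is 1/2, so the score is 4/7, or about 0.57. The per-type scores follow by filtering both lists to one mutation type before calling the function.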

Sample Genome:

A sample genome of length 1000 is provided with the true mutation list. A set of reads with errors and a set without errors are provided. This genome can be used to develop and evaluate a solution to the project before applying it to the larger problems.

Project 1

A reference genome of length 10,000 is provided, along with paired reads containing errors.

The output needs to be a predictions.csv file in a zip file.

"I used the trivial algorithm - I got my score up after realizing a lot of the errors I had in my predictions csv file were read errors rather than actual mutations (so I had a lot of extra errors, that might be what's dragging your score down). I would recommend using some sort of consensus algorithm to filter out which errors actually appear frequently enough in your errors list to count as actual mutations, and use the sample__repeats_mutations.txt file to verify your predictions match as closely as possible"
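The consensus idea in this advice could look something like the sketch below. It is one possible reading, not the poster's actual code: reads are placed by brute-force ungapped comparison against the reference, every aligned base votes at its reference position, and a substitution is reported only when the winning base differs from the reference and has at least `min_support` votes. The names `best_offset` and `call_substitutions`, the thresholds, and the 0-based positions are all assumptions, and the ungapped placement means insertions and deletions are not handled here.

```python
from collections import Counter, defaultdict

def best_offset(read, reference, max_mismatches=3):
    """Brute-force ungapped placement: offset with the fewest mismatches, or None."""
    best, best_mm = None, max_mismatches + 1
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best, best_mm = start, mm
    return best

def call_substitutions(reads, reference, min_support=3, max_mismatches=3):
    """Return sorted (position, ref_base, donor_base) calls backed by enough reads."""
    votes = defaultdict(Counter)
    for read in reads:
        start = best_offset(read, reference, max_mismatches)
        if start is None:
            continue  # read could not be placed within the mismatch budget
        for i, base in enumerate(read):
            votes[start + i][base] += 1
    calls = []
    for pos in sorted(votes):
        base, count = votes[pos].most_common(1)[0]
        # A lone mismatching read is treated as a read error, not a mutation.
        if base != reference[pos] and count >= min_support:
            calls.append((pos, reference[pos], base))
    return calls
```

Raising `min_support` trades recall for precision: a mismatch seen in one read is dropped as a likely read error, while one that recurs across overlapping reads is kept as a mutation.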

Your code should be able to run from a standard command-line interface as described in the programming assignment. There should be a shell command called runproject.sh which will run with the project input files located in the same directory.