Question
PLEASE COMPLETE THIS USING MATLAB!!! THANK YOU IN ADVANCE!!! HERE IS THE h2q2.fa DNA code >ident_60_seq_1 GAATAACGAGGCACGAATCTCTTCTCACATCCCTTTTTCGCTATGAATCAGAAAGCTGTCCCTCCGTCTA GTTCAAGCATGGGTTGTCAGCCCGGCCACGTCTCAAAACAGCGTTGTCTGAGACACTTGCTCTCTCACGC GCGGTGCTTCGCCGTGGTGCCGAGATTCTTTGGTCGTTCGAGTACCGGGTCGAGGCTATTCAAGTTCGGA GCCAACAGACCTCGATGGCATAACTATGGGGGCTGCTTTATGTCACTAAACTTAAACATTTCATCGCAGC TCATCTAGGACCCCACGGTGTGCACTAGCCATATCGTTGTGCAAGGCTGCGACGTCTGTTGTGGCATCAT AGGCACCATATACGGGAACCTAAGTAGACTTTTATGCTCGAGCGGTTCCCGGACCCGCCGTTACTTGATT CGTTGAATAATATGGAGGTACCTGCCTTTTCATCTGTTCGTCGAAAGGATAGGTTGAATGGCTGTCTCGC TCGTATACAG >ident_60_seq_2
PLEASE COMPLETE THIS USING MATLAB!!! THANK YOU IN ADVANCE!!!
HERE IS THE h2q2.fa DNA code
>ident_60_seq_1
GAATAACGAGGCACGAATCTCTTCTCACATCCCTTTTTCGCTATGAATCAGAAAGCTGTCCCTCCGTCTA
GTTCAAGCATGGGTTGTCAGCCCGGCCACGTCTCAAAACAGCGTTGTCTGAGACACTTGCTCTCTCACGC
GCGGTGCTTCGCCGTGGTGCCGAGATTCTTTGGTCGTTCGAGTACCGGGTCGAGGCTATTCAAGTTCGGA
GCCAACAGACCTCGATGGCATAACTATGGGGGCTGCTTTATGTCACTAAACTTAAACATTTCATCGCAGC
TCATCTAGGACCCCACGGTGTGCACTAGCCATATCGTTGTGCAAGGCTGCGACGTCTGTTGTGGCATCAT
AGGCACCATATACGGGAACCTAAGTAGACTTTTATGCTCGAGCGGTTCCCGGACCCGCCGTTACTTGATT
CGTTGAATAATATGGAGGTACCTGCCTTTTCATCTGTTCGTCGAAAGGATAGGTTGAATGGCTGTCTCGC
TCGTATACAG
>ident_60_seq_2
TCAATCCGAACAAGTATCGTACCGGTTATCTAAAACTTGGGGCATTGTTCCCTTACTGCACATTAAAAAG
CTTGGTGACGCTTCGAGTCGATGCCCAGTGCCACGGCAAAGCTGAGCTCGCTAGCAACCGAAAGATACTC
GGATCTAAAACATCGAGTCCCTGTCTCGGGTTCACGGCGGCGCGGACATGAGTCACCATAAAAATAATAG
GATCAGTGGGCACCAGGATGTTGCATTTAAACTTATTCATCCGTTATGCTGCTGCGCAAGTTCGACTCTC
CTAACTCAAGGTTAAATTGAGGATTGTTCCTCATCCCTAGTATTGCATCGCTGTGACAGGGGGAATGTAC
AACTGAGCGATTCTAGGACGCAAGCAGATTAGTGTAATGGAGAGAGTTCACACGCAATTAGTCTCAGGCA
TCTCGTGACGACATCTTTACTCTATGAGTGAACTCTAGAAAAGTTATAATCGCGAGCGTTTCCTCATCCC
GGGCGCCGCC
>ident_70_seq_1
TACCCGGAAGGCCGTCCTGCTCCCCAGACCCGTTCTAAGTAGCTTGAGCAGCAAGACGCGAAGTAACTAC
CTGTCGTACAGATAGTCGACGGCCAGCGTCTGTTAGTAAGGCCTCGGCTATAAACATTTTAGAACCACTA
CGTCGAATTAACGAAGGTCTCGACGATTTTTGTGGACGGTGGGCAATACAACAACGGGACAAGGCGAACT
CGTCAGACTCAGGTGCACTCGGGACCAACCAACCAGTCCTGTAGAAGGAGACGAGATTTACGCCAATCAG
TATTAAGGTTCTTCGTTTGGTCCGCTTTCCTAGTGTCAAGTATGGCCTGGACGTAATGAGCACCCCATGG
GGACTCCACCCCTCGTCTAATGCGTCTTTGTGTGGCATACATACGGTCCAGCGACACGTATACAGCACTG
CCGATAGGTAGGCTTTCGAATCGTTTACGAAGTACGGGCGCGACTAAGACGTGTGACAGGATACATATCC
TAAGGCCTAT
>ident_70_seq_2
TGTTGGACATACATCTCAAGGCGCCGTTCCGGGCATAGTTTTGAGCCACCAGATAGAGTCTAGCCGTGAT
AATGACATGTCGCTGTTTCCGAGGTAGAGAACACAATTCGGTCGTACTGCGGGCAATTCTAGACTTCCTT
AGAGCTCTTTTTATACACAACCATACGGTGCCAATCGACCAGCAGTCCAGACGGTTTTGGCGGAGGTTGT
GCAATACTAGATCGGAACAACACGTACGTACGCTTTAACCTGAGGTTACCCCTCCACCCTAGACTAACGC
GGTTCATGGCATCACCACACCATGATTCCCTGAGTCAGCATCGTCTTCTTGAAAGACGTACGCTCAGTCT
CGTATCCAATACGATAGGTGTCCTGCTAGAGCCGTGGTTGTCCCGATCAGTTAACTCTTCCCAATAGTAG
TCGTCCACATCAGTTGGGCCGGCTTGAAAATTTATGCCTAAGTCGCGTAGGTTACAGGGTCCGCCTGTGC
ACTGTCTCGG
>ident_80_seq_1
TAGAACGCAGGCGCCTTACAATATTGCCGGCGCCCGACTCAAGCAGATCCCCATGTCGAATTTTCGAATT
GAATCTAGTCCTTGCTTTTCGACCGAATATGGTGGGGAACATGTGCCCGGAGCTCTGATACGACATTGCG
GCCATACGAACTCATCGCTAGTCTTACGCGCAGGGGCGGTAGTACAGTCGTTGCATCGAAGTAGCATATG
AGCAGTAGCTCTCGGATGTATTTGATCAGAATAACGTCGTGCCAGTCGGCTACGACTAGCCCCCGAACCA
CTTGTTCGGTCCATGGGTAACAGAGGCTCTGTACAAGAGCGGAACAGAACATCCTTGCAGGTCGGGTTCC
GACCCTGTTTACCGGCGGCGCATGCCCAATCCATTCCCCAGAAACGAAAATCTGGACAGCTCCTGAGCTC
ACTGTGTTGTCTCATCGAGTATTGAAGTCAACTATGCTTTCTGTATAATACGCAAATCCGTAAGAATACC
AGCGTGGCTG
>ident_80_seq_2
GCGTACACCAATCAGGCGCGTTTCAATATTGTCGGAGAGCGACTCAAGCACAACGCGATGTCGAACTACC
GGCCAGCTCGTGCTAGAGCGTGGATCAATGTTGTATTTGCCAGGAACCCACGCCGATCCGCGCTCGGTCG
ACGCACTGACTCGCCGGATATTCTTATGCATATTGCTTGTAAATTATCTCAATGATTAGGGAATCCTTAC
CCCAGATTCTTCCACGGACTGTACATCTCGTACGGTTGTAGCACTAGTCTCAATACTTGCGCTAAAGGCC
GGCACGCTTCATCACCACTAACACCCTTGGTAGAGTTGAGATGTTTGGTGTGGATAGTGCCAGTTACACG
ATCGGATTATAAATCACAATAAGATCTAGTCCGACACACCCCCTCTAGGCGAACTAAAGGGTAGCGGCGG
CATGCTGGCTTTTCTACCACTTTTGCATCGTGGCTACCCCGCGGAGTGGCAAGCTAGGTGATACGCGGTT
TGATTGCTAT
>ident_90_seq_1
GCCTTCCGGGTGTCGTTCTTTCACCTGGTTAATAGATTCGAGTTGACAAGAGTCACTGATAGTCCTCAGC
CTTCCAATCCTCCCAACTACGTAGGCGTTCCAGTAGTACCTGGTTAAACGAGTAGACACTATCCTACATA
GTCGAGCAACACGACGCTGTCGGTTGAGAACCACATCTTCTAGCCGATGTAGCTCGCGTGGCGCGTGCAG
CAGTGCACTAAGTCCGGACGAGCTTCCTAGTTCTCACGCTAGACTTGCCTACTAGAGATTTCATCAATAA
TCTTTCCCGCACCGTGGTTGATATACCGGAATTGGCGCTCGAGGCTCTATGGGCAGAGTAGGTCACGCTC
GGCCACATGTCGGTAGTCAGATTAACTCGAATGCTAGTCTCCGTTGGTTAAACTAAACGTCCGAGGCAAT
CGGACGTCCAGATGCGACTCAATAGTTGCCCACTGGACCACGCTTGTGATCGACATGAGCTCGTAGAACG
CCAAGGAGCA
>ident_90_seq_2
TATGAAGGGCTCCTAATTTGGACTCCCCCCCTAGTGCCTGTTGCTTGGTAGAGAATAGAAAAGTTTTACA
TCGGGAAGATTCCGCTAATTGCGCCACCGCGTCTGATGCACGTCAATTAAGAATAGATTTGCCCAACGTT
ATGTTAACGTTATGTTCTAACGATCCTTGGATAGGAACTGTAATTACCTGGGTAGCGGTGTGTAGATAAC
GAGTTGACCAGAGTCACTCATAGTCTTCAGCCTTCCAATCTTGCCAACTGCTCCAGGCCAGCGACATGAG
AAAATCGGTGCACCTTTTCGGGGACCGGCGGGTCGAGCCCCTGTATCTCGGAAAAGCGGGCCAGAAACCC
TGTACGGTTACCAAAGGACATATCGAGTGCATTGTGCCTATTCTTAATACGGCTCCGTCACGTTCCGAAG
TTGATACGAACATGCACGCCACACCAGTTGATTCTGGCCTGGGGTCTGGATTCGGAGTGTCCTTTGTCAG
TTATCTAGTA
Problem 1 (40 points): Alignment Statistics A). (10 points) Write a MATLAB program to calculate a series of scoring matrices for aligning DNA sequences with 60, 70, 80, and 90% of identity, assuming A, C, G and T have equal probabilities. Do not scale the scores to integers (so l is equal to 1 in all the matrices). Output should be a 4x4 matrix and the row/column order of the nucleotides are A, C, G, T. (For example, the first row of the matrix shows the score of aligning an A again an A, C, G, or T respectively.) B). (5 points) Assuming that you are using one of the matrices above to perform ungapped local alignment (i.e., gap penalty = minus infinity) between two 100bp-long DNA sequences, and got a score of 20. Using the extreme value distribution, calculate the E-value of the alignment score (assuming K = 0.1). Is this alignment significant? C). (5 points) Redo (B) for score = 5. D). (5 points) Redo (B), but you are comparing a sequence of length 100bp to a database of 1 million sequences, each of which is 100bp long, and the most similar sequence has an alignment score 20. E). (15 points) Take the MATLAB swalign function (type in 'help swalign' for details), or revise your implementation of Smith-Waterman algorithm so that it can align two DNA sequences with different substitution matrix, and only outputs scores (no traceback needed). Use a fixed gap penalty of 5 per gap. (This gap penalty is large enough so you probably would not see any gap in your optimal local alignment). No trace back is necessary. Using the matrices you computed in (A) to align each pair of sequences in the fasta files (e.g., ident_60_seq-1 and ident_60_seq_2 are a pair.) For your information, ident_60 means the two sequences are expected to share something that are 60% similar to each other. Each sequence in the files is 500 bases long. You can use the function fastaread to read the sequences: seqs = fastaread('hw2q2.fa'). The variable seqs is a struct array and you can get the i-th sequence with seqs(i).Sequence, and the name of the sequence with seqs(i).Header. There are 4 pairs of sequences, and 4 scoring matrices, so you will need to calculate 4 x 4 = 16 alignment scores. Plot the alignment scores (y-axis) against the percent identities (x-axis) of the substitution matrices used to obtain the alignments. (The alignment scores for each pair of sequences should be plotted as one line; you will have a total of 4 lines.) In addition, compute the p-value of each alignment score and plot the p-values (y-axis) against the percent identities (x-axis). It is best to plot the p-values in logarithm scale. (Use semilogy instead of plot.) Problem 1 (40 points): Alignment Statistics A). (10 points) Write a MATLAB program to calculate a series of scoring matrices for aligning DNA sequences with 60, 70, 80, and 90% of identity, assuming A, C, G and T have equal probabilities. Do not scale the scores to integers (so l is equal to 1 in all the matrices). Output should be a 4x4 matrix and the row/column order of the nucleotides are A, C, G, T. (For example, the first row of the matrix shows the score of aligning an A again an A, C, G, or T respectively.) B). (5 points) Assuming that you are using one of the matrices above to perform ungapped local alignment (i.e., gap penalty = minus infinity) between two 100bp-long DNA sequences, and got a score of 20. Using the extreme value distribution, calculate the E-value of the alignment score (assuming K = 0.1). Is this alignment significant? C). (5 points) Redo (B) for score = 5. D). (5 points) Redo (B), but you are comparing a sequence of length 100bp to a database of 1 million sequences, each of which is 100bp long, and the most similar sequence has an alignment score 20. E). (15 points) Take the MATLAB swalign function (type in 'help swalign' for details), or revise your implementation of Smith-Waterman algorithm so that it can align two DNA sequences with different substitution matrix, and only outputs scores (no traceback needed). Use a fixed gap penalty of 5 per gap. (This gap penalty is large enough so you probably would not see any gap in your optimal local alignment). No trace back is necessary. Using the matrices you computed in (A) to align each pair of sequences in the fasta files (e.g., ident_60_seq-1 and ident_60_seq_2 are a pair.) For your information, ident_60 means the two sequences are expected to share something that are 60% similar to each other. Each sequence in the files is 500 bases long. You can use the function fastaread to read the sequences: seqs = fastaread('hw2q2.fa'). The variable seqs is a struct array and you can get the i-th sequence with seqs(i).Sequence, and the name of the sequence with seqs(i).Header. There are 4 pairs of sequences, and 4 scoring matrices, so you will need to calculate 4 x 4 = 16 alignment scores. Plot the alignment scores (y-axis) against the percent identities (x-axis) of the substitution matrices used to obtain the alignments. (The alignment scores for each pair of sequences should be plotted as one line; you will have a total of 4 lines.) In addition, compute the p-value of each alignment score and plot the p-values (y-axis) against the percent identities (x-axis). It is best to plot the p-values in logarithm scale. (Use semilogy instead of plot.)Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started