Question


Posted on Sep 24, 2024

In C++, how would I implement two algorithms: (i) the Needleman-Wunsch algorithm for computing an OPTIMAL GLOBAL ALIGNMENT, and (ii) the Smith-Waterman algorithm for computing an OPTIMAL LOCAL ALIGNMENT, both using an affine gap penalty function, between two input DNA sequences, s₁ and s₂, of lengths m and n respectively?

Input:

Input FASTA file containing two sequences, s1 and s2.

Alignment parameters: {m_a: match score, m_i: mismatch penalty, h: gap opening penalty, g: gap extension penalty} (you need to have a parameters.config file with these parameters).

Alignment flag: {0: global, 1: local}

Output:

If the alignment flag is 0: report statistics for the optimal global alignment.

If the alignment flag is 1: report statistics for the optimal local alignment.

Alignment statistics needed for output: optimal score, optimal path, number of matches, number of mismatches, number of opening gaps, number of gap extensions, percent identity (= number of matches / alignment length).

Design elements:

You are expected to follow the Needleman-Wunsch algorithm for global alignment computation and Smith-Waterman algorithm for local alignment computation. And your code should be for the affine gap penalty function version of these algorithms (i.e., allowing for both gap opening and gap extension penalties).

Each cell of your Dynamic Programming table ("DP table") should have the following structure:

struct DP_cell {
    int Sscore; // Substitution (S) score
    int Dscore; // Deletion (D) score
    int Iscore; // Insertion (I) score
    ... // add any other field(s) that you may need for the implementation
};
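As a rough sketch of how a table of such cells could be filled, the function below computes only the optimal score, for either the global or the local case. Several things here are assumptions and not part of the assignment text: the names ScoreParams, max3 and optimalScore are made up for illustration, D is taken to mean "a character of s1 aligned to a gap" and I "a character of s2 aligned to a gap", h and g are taken to be negative numbers, and a gap of length k is taken to cost h + k*g.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

struct DP_cell {
    int Sscore;  // best score of an alignment ending in a substitution
    int Dscore;  // best score ending in a deletion (s1 character vs. gap)
    int Iscore;  // best score ending in an insertion (gap vs. s2 character)
};

// Hypothetical parameter bundle; h and g are assumed to be negative.
struct ScoreParams { int match, mismatch, h, g; };

// "Unreachable" marker, halved so that adding penalties cannot overflow.
const int NEG_INF = INT_MIN / 2;

int max3(int a, int b, int c) { return std::max(a, std::max(b, c)); }

int optimalScore(const std::string& s1, const std::string& s2,
                 const ScoreParams& p, bool local) {
    const std::size_t m = s1.size(), n = s2.size();
    const int open = p.h + p.g;  // assumed cost of the first position of a gap
    const int base = local ? 0 : NEG_INF;
    const DP_cell init{base, base, base};
    std::vector<std::vector<DP_cell>> T(m + 1, std::vector<DP_cell>(n + 1, init));

    T[0][0].Sscore = 0;
    if (!local) {
        // Global case: the borders are leading gaps of length i or j.
        for (std::size_t i = 1; i <= m; ++i)
            T[i][0].Dscore = p.h + static_cast<int>(i) * p.g;
        for (std::size_t j = 1; j <= n; ++j)
            T[0][j].Iscore = p.h + static_cast<int>(j) * p.g;
    }

    int best = 0;  // only used in the local case
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const DP_cell& diag = T[i - 1][j - 1];
            const DP_cell& up = T[i - 1][j];
            const DP_cell& left = T[i][j - 1];
            DP_cell& c = T[i][j];
            const int sub = (s1[i - 1] == s2[j - 1]) ? p.match : p.mismatch;

            c.Sscore = max3(diag.Sscore, diag.Dscore, diag.Iscore) + sub;
            c.Dscore = max3(up.Dscore + p.g, up.Sscore + open, up.Iscore + open);
            c.Iscore = max3(left.Iscore + p.g, left.Sscore + open, left.Dscore + open);

            if (local) {
                // Local case: no score may drop below zero.
                c.Sscore = std::max(0, c.Sscore);
                c.Dscore = std::max(0, c.Dscore);
                c.Iscore = std::max(0, c.Iscore);
                best = std::max(best, max3(c.Sscore, c.Dscore, c.Iscore));
            }
        }
    }
    const DP_cell& last = T[m][n];
    return local ? best : max3(last.Sscore, last.Dscore, last.Iscore);
}
```

With match 1, mismatch -2, h -5 and g -1, aligning "ACGT" against "AGT" under these assumptions should give -3 globally (three matches plus one gap of length one) and 2 locally (the shared "GT").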

When you retrace the optimal path to recover an optimal alignment, bear in mind that you may have to look up all three values (S, D, I) and keep switching between them as you move from one cell to the next. The path itself, however, uses exactly one of the three entries (S, D, or I) in each cell it visits.
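The retrace described above could look roughly like the sketch below, shown for the global case only. It tracks which of the three entries the path currently sits in and, at each step, picks the predecessor entry that produced it. The names State, bestState, bestValue and globalAlign are hypothetical, and the same assumptions apply as for the fill step: D consumes a character of s1, I consumes a character of s2, and a gap of length k costs h + k*g with negative h and g.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct DP_cell { int Sscore, Dscore, Iscore; };
struct ScoreParams { int match, mismatch, h, g; };  // hypothetical bundle
enum State { S_STATE, D_STATE, I_STATE };
const int NEG_INF = INT_MIN / 2;  // "unreachable", safe against overflow

// Which entry of cell c is best once a per-entry transition cost is added.
State bestState(const DP_cell& c, int addS, int addD, int addI) {
    const int s = c.Sscore + addS, d = c.Dscore + addD, i = c.Iscore + addI;
    if (s >= d && s >= i) return S_STATE;
    return d >= i ? D_STATE : I_STATE;
}

int bestValue(const DP_cell& c, int addS, int addD, int addI) {
    return std::max(c.Sscore + addS, std::max(c.Dscore + addD, c.Iscore + addI));
}

// Returns the two gapped rows of one optimal global alignment.
std::pair<std::string, std::string> globalAlign(const std::string& s1,
                                                const std::string& s2,
                                                const ScoreParams& p) {
    const std::size_t m = s1.size(), n = s2.size();
    const int open = p.h + p.g;  // assumed cost of the first gap position
    const DP_cell unreachable{NEG_INF, NEG_INF, NEG_INF};
    std::vector<std::vector<DP_cell>> T(m + 1,
                                        std::vector<DP_cell>(n + 1, unreachable));
    T[0][0].Sscore = 0;
    for (std::size_t i = 1; i <= m; ++i)
        T[i][0].Dscore = p.h + static_cast<int>(i) * p.g;
    for (std::size_t j = 1; j <= n; ++j)
        T[0][j].Iscore = p.h + static_cast<int>(j) * p.g;

    // Forward fill of all three entries per cell.
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const int sub = (s1[i - 1] == s2[j - 1]) ? p.match : p.mismatch;
            T[i][j].Sscore = bestValue(T[i - 1][j - 1], 0, 0, 0) + sub;
            T[i][j].Dscore = bestValue(T[i - 1][j], open, p.g, open);
            T[i][j].Iscore = bestValue(T[i][j - 1], open, open, p.g);
        }
    }

    // Retrace from (m, n), switching between S, D and I as needed.
    std::string a1, a2;
    std::size_t i = m, j = n;
    State st = bestState(T[m][n], 0, 0, 0);
    while (i > 0 || j > 0) {
        if (st == S_STATE) {         // diagonal move
            a1 += s1[i - 1]; a2 += s2[j - 1];
            st = bestState(T[i - 1][j - 1], 0, 0, 0);
            --i; --j;
        } else if (st == D_STATE) {  // move up: s1 character against a gap
            a1 += s1[i - 1]; a2 += '-';
            st = bestState(T[i - 1][j], open, p.g, open);
            --i;
        } else {                     // move left: gap against s2 character
            a1 += '-'; a2 += s2[j - 1];
            st = bestState(T[i][j - 1], open, open, p.g);
            --j;
        }
    }
    std::reverse(a1.begin(), a1.end());
    std::reverse(a2.begin(), a2.end());
    return {a1, a2};
}
```

For the local case the same idea should carry over, except that the retrace would start at the highest-scoring cell in the table and stop as soon as the score on the path reaches 0, as in the usual Smith-Waterman formulation. When several entries tie, this sketch simply prefers S, then D, then I; other tie-breaking rules give equally optimal alignments.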

(Note that the first gap position of each run of gaps is counted as both an opening gap and a gap extension for output purposes.)
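To make the counting rule concrete, here is a small sketch that derives the statistics from the two gapped rows of a finished alignment. AlignStats and computeStats are hypothetical names, '-' is assumed to be the gap character, and a gap in one row directly followed by a gap in the other row is assumed to count as a new opening. Identity is returned as a fraction (matches / alignment length), so multiply by 100 for a percentage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result record for the statistics listed in the assignment.
struct AlignStats {
    int matches = 0;
    int mismatches = 0;
    int openingGaps = 0;
    int gapExtensions = 0;
    double identity = 0.0;  // matches / alignment length, as a fraction
};

AlignStats computeStats(const std::string& row1, const std::string& row2) {
    assert(row1.size() == row2.size());  // both rows span the same columns
    AlignStats st;
    int prevGapRow = 0;  // 0: previous column had no gap; 1 or 2: row holding the gap
    for (std::size_t k = 0; k < row1.size(); ++k) {
        const bool gap1 = row1[k] == '-';
        const bool gap2 = row2[k] == '-';
        if (gap1 || gap2) {
            const int gapRow = gap1 ? 1 : 2;
            if (gapRow != prevGapRow) ++st.openingGaps;  // a new run starts here
            ++st.gapExtensions;  // every gap column, including the first one
            prevGapRow = gapRow;
        } else {
            if (row1[k] == row2[k]) ++st.matches;
            else ++st.mismatches;
            prevGapRow = 0;
        }
    }
    if (!row1.empty())
        st.identity = static_cast<double>(st.matches) / static_cast<double>(row1.size());
    return st;
}
```

For the rows "ACGT" and "A-GT" this gives 3 matches, 0 mismatches, 1 opening gap, 1 gap extension and an identity of 0.75.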

Other specs:

At the start of the program, you should read the alignment score parameters from an input file. Specifying the file is optional: if the user does not name one, the program should default to "parameters.config" in the present working directory. The parameters.config file should let the user specify one scoring parameter per line, with the name and value separated by a space or a tab. For example:

match 1
mismatch -2
h -5
g -1
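A minimal sketch of reading such a file follows. The names ScoreParams and parseParams are hypothetical, and using the example values as defaults for missing keys is an assumption, since the assignment does not say what to do when a key is absent. The function takes any std::istream so it can be tried without a file on disk.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical parameter bundle; the defaults mirror the example values
// and are an assumption, not something the assignment specifies.
struct ScoreParams {
    int match = 1;
    int mismatch = -2;
    int h = -5;  // gap opening penalty
    int g = -1;  // gap extension penalty
};

ScoreParams parseParams(std::istream& in) {
    ScoreParams p;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);  // operator>> splits on spaces and tabs
        std::string key;
        int value = 0;
        if (!(fields >> key >> value)) continue;  // skip blank or malformed lines
        if (key == "match") p.match = value;
        else if (key == "mismatch") p.mismatch = value;
        else if (key == "h") p.h = value;
        else if (key == "g") p.g = value;
        // Unknown keys are silently ignored in this sketch.
    }
    return p;
}
```

To read from disk, one could open a std::ifstream (from the <fstream> header) on the user-supplied path, or on "parameters.config" when none is given, and pass it to parseParams.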

The command prompt usage for your program should look as follows:

$ <0: global, 1: local>

Input File Formats:

The two input sequences should be given as input in one text file. The text file should be in what is called the "multi-sequence FASTA format", which is as follows:

The format allows the file to contain any number of sequences, although in this program project you will have only two sequences as input.

Each sequence starts with a HEADER line, which contains the sequence name. The header line always starts with the ">" symbol, immediately followed (without any whitespace character) by a word that serves as the unique identifier (or name) for that sequence. Whatever follows the first whitespace character after the identifier is a don't-care and can be ignored in your program.

The header line is followed by the actual DNA sequence, which is a string over the alphabet {a,c,g,t}. The sequence can span multiple lines, and each line can have a variable number of characters (but no whitespace or any other special characters).
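The format described above can be parsed with a short loop. The sketch below is one way to do it; FastaRecord and readFasta are hypothetical names. It keeps only the identifier from each header line and joins the sequence lines that follow. Stripping stray whitespace (for example a trailing carriage return) is a defensive assumption, since the format itself promises none.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type: one identifier plus its joined sequence.
struct FastaRecord {
    std::string id;
    std::string seq;
};

std::vector<FastaRecord> readFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            // Header: the identifier is the first word right after '>';
            // everything after the first whitespace is ignored.
            std::istringstream header(line.substr(1));
            FastaRecord rec;
            header >> rec.id;
            records.push_back(rec);
        } else if (!records.empty()) {
            // Sequence line: append it to the most recent record,
            // dropping any whitespace characters.
            for (char ch : line) {
                if (!std::isspace(static_cast<unsigned char>(ch)))
                    records.back().seq += ch;
            }
        }
    }
    return records;
}
```

For this assignment the caller would expect exactly two records and could report an error otherwise; s1 would be records[0].seq and s2 would be records[1].seq.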