Question

1 Approved Answer

Posted on Aug 04, 2024



Course: Advanced Analysis of Algorithms. Urgently required: I need a critical review of this research paper.

Proceedings of the Twenty-First International Joint Conference on Artificial Intelligence (IJCAI-09)

Efficient Dominant Point Algorithms for the Multiple Longest Common Subsequence (MLCS) Problem

Qingguo Wang, Dmitry Korkin and Yi Shang
Department of Computer Science, University of Missouri
qwp4b@mizzou.edu, korkin@korkinlab.org, shangy@missouri.edu

Abstract

Finding the longest common subsequence of multiple strings is a classical computer science problem and has many applications in the areas of bioinformatics and computational genomics. In this paper, we present a new sequential algorithm for the general case of the MLCS problem, and its parallel realization. The algorithm is based on the dominant point approach and employs a fast divide-and-conquer technique to compute the dominant points. When applied to find an MLCS of 3 strings, our general algorithm is shown to exhibit the same performance as the best existing MLCS algorithm by Hakata and Imai, designed specifically for the case of 3 strings. Moreover, we show that for the general case of more than 3 strings, the algorithm is significantly faster than the best existing sequential approaches, reaching up to 2-3 orders of magnitude faster on the large-size problems. Finally, we propose a parallel implementation of the algorithm. Evaluating the parallel algorithm on a benchmark set of both random and biological sequences reveals a near-linear speed-up with respect to the sequential algorithm.

Keywords: Search, Dynamic Programming, Computational Biology

1 Introduction

The multiple longest common subsequence problem (MLCS) is to find the longest subsequence shared between two or more sequences. It is a classical computer science problem with important applications in many fields such as information retrieval and computational biology [Masek and Paterson, 1980; Smith and Waterman, 1981]. For over 30 years, significant efforts have been made to find an efficient algorithm for the MLCS problem.
The most significant contributions have been made for the simplest cases of MLCS, with two or three sequences [Hirschberg, 1977; Hakata and Imai, 1998]. However, while there have been several attempts towards finding an efficient algorithm for the general case of more than 3 sequences [Hakata and Imai, 1998; Chen et al., 2006], such an algorithm is yet to be developed. The general case of MLCS is of tremendous value to computational biology and computational genomics, which deal with biological sequences [Korkin et al., 2008]. With the increasing volume of biological data and the prevalent usage of computational sequence analysis tools, an efficient MLCS algorithm applicable to many sequences will have a significant impact on computational biology and its applications.

In this paper, we present an efficient algorithm for the MLCS problem of three and more sequences. The new method is based on the dominant point approach. Dominant points are minimal points in a multidimensional search space. Knowing those points allows one to reduce the size of the search space by orders of magnitude, and hence to significantly reduce the computation time. Our algorithm uses a new divide-and-conquer technique to construct dominant point sets efficiently. Unlike FAST-LCS [Chen et al., 2006], an MLCS algorithm that works with the whole dominant point set, our method takes advantage of the structural relationships among dominant points and partitions them into independent subsets, to which the divide-and-conquer technique is applied. Compared to existing state-of-the-art MLCS algorithms, our dominant-point algorithm is significantly faster on multiple sequences longer than 1000. We have also developed an efficient parallel version of the algorithm. By dividing the problem into smaller sub-problems and solving the sub-problems in parallel, we have achieved a near-linear speedup.

The paper is organized as follows. In the next section, we briefly review state-of-the-art methods for MLCS.
Then, we present the basics of the dominant point method in Section 3 and the new sequential algorithm in Section 4. The new parallel algorithm is presented in Section 5. In Section 6, we show the experimental results. Finally, in Section 7, we summarize the paper.

2 Related work

Classical methods for the MLCS problem are based on dynamic programming [Sankoff, 1972; Smith and Waterman, 1981]. In its simplest case, given two sequences $a_1$ and $a_2$ of length $n_1$ and $n_2$ respectively, a dynamic programming algorithm iteratively builds an $n_1 \times n_2$ score matrix $L$, in which $L[i, j]$, $0 \le i \le n_1$, $0 \le j \le n_2$, is the length of an LCS between the two prefixes $a_1[1, \ldots, i]$ and $a_2[1, \ldots, j]$:

$$
L[i, j] =
\begin{cases}
0, & \text{if } i \text{ or } j = 0 \\
L[i-1, j-1] + 1, & \text{if } a_1[i] = a_2[j] \\
\max(L[i, j-1],\, L[i-1, j]), & \text{if } a_1[i] \ne a_2[j]
\end{cases}
\qquad (1)
$$

[...]

A point $p = [p_1, p_2, \ldots, p_d]$ in $L$ is called a match if $a_1[p_1] = a_2[p_2] = \cdots = a_d[p_d]$. If a match $p$ corresponds to character $s \in \Sigma$, i.e., $a_1[p_1] = s$, it is denoted as $p(s)$. For two points $p = [p_1, p_2, \ldots, p_d]$ and $q = [q_1, q_2, \ldots, q_d]$, we say that $p$ dominates $q$ if $p$ [...]

[...] 3, the minima of $N$ dominant points in the $d$-dimensional space can be computed in $O(dN \log^{d-2} N)$ time by a divide-and-conquer algorithm. The computation time is $O(dN \log^{d-2} n)$ if the sequence length $n$ [...]

    [...] 0 do {
        current LCS position = a1[p1];
        pick a point q such that p ∈ Par(q, Σ);
        p = q;
        k = k - 1;
    }

Quick-DP consists of two parts. In the first part, the set of all dominants is calculated iteratively, starting from the 0-dominant set (containing one element). The set of $(k+1)$-dominants, $D^{k+1}$, is obtained based on the set of $k$-dominants, $D^k$. In the second part, an MLCS-optimal path, corresponding to an MLCS, is calculated by tracing back through the sets of dominant points obtained in the first part of the algorithm, starting with an element from the last dominant set. All MLCSs can be enumerated systematically as well.
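For reference, the two-sequence dynamic programming baseline of Equation (1) can be sketched in a few lines of Python. This is a minimal illustration and not code from the paper: the function name `lcs_two` is hypothetical, and the table is allocated with one extra row and column (an assumption of this sketch) so that the zero boundary case `i = 0` or `j = 0` has a place to live.

```python
def lcs_two(a1: str, a2: str) -> str:
    """Return one longest common subsequence of a1 and a2 (hypothetical helper)."""
    n1, n2 = len(a1), len(a2)
    # L[i][j] is the LCS length of the prefixes a1[:i] and a2[:j].
    # Row 0 and column 0 stay 0, which is the first case of Equation (1).
    L = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            if a1[i - 1] == a2[j - 1]:
                # Matching characters extend the LCS of the shorter prefixes.
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last character of one of the prefixes.
                L[i][j] = max(L[i][j - 1], L[i - 1][j])

    # Trace back from the bottom-right corner to recover one LCS.
    out = []
    i, j = n1, n2
    while i > 0 and j > 0:
        if a1[i - 1] == a2[j - 1]:
            out.append(a1[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

Calling `lcs_two("AGGTAB", "GXTXAYB")` returns `"GTAB"`. The table has (n1 + 1)(n2 + 1) entries, and the same construction for d sequences needs a d-dimensional table, which is the cost that dominant point methods try to avoid.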
To efficiently enumerate all parents of each dominant point, we calculate a preprocessing matrix $T = \{T[s, j, i]\}$, $s \in \Sigma$, $0 \le j \le \max_{1 \le k \le d}\{|a_k|\}$, $1 \le i \le d$, where each element $T[s, j, i]$ specifies the position of the first occurrence of character $s$ in the $i$-th sequence, starting from the $(j+1)$-st position in that sequence. If $s$ does not occur any more in the $i$-th sequence, the value of $T[s, j, i]$ is equal to $1 + \max_{1 \le k}$ [...]
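The preprocessing table and its use can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Quick-DP: the names `build_successor_table`, `parents`, `dominates` and `mlcs_naive` are hypothetical; the "no further occurrence" marker is assumed to be one more than the longest sequence length; dominance is assumed to mean "every coordinate is less than or equal"; and the minima are filtered with a plain quadratic scan over all candidates instead of the divide-and-conquer technique.

```python
def build_successor_table(seqs, alphabet):
    """T[s][i][j]: 1-based position of the first s in seqs[i] after position j."""
    # ASSUMPTION: marker for "s does not occur any more" (larger than any position).
    sentinel = 1 + max(len(a) for a in seqs)
    T = {}
    for s in alphabet:
        T[s] = []
        for a in seqs:
            row = [sentinel] * (len(a) + 1)
            nxt = sentinel
            # Scan right to left so each entry is filled in constant time.
            for j in range(len(a) - 1, -1, -1):
                if a[j] == s:
                    nxt = j + 1  # convert to a 1-based position
                row[j] = nxt
            T[s].append(row)
    return T, sentinel


def parents(p, T, sentinel, alphabet):
    """Yield (s, q), where q is the closest match of character s after point p."""
    for s in alphabet:
        q = tuple(T[s][i][p_i] for i, p_i in enumerate(p))
        if sentinel not in q:  # s must still occur in every sequence
            yield s, q


def dominates(p, q):
    # ASSUMPTION: p dominates q if p != q and p is <= q in every coordinate.
    return p != q and all(x <= y for x, y in zip(p, q))


def mlcs_naive(seqs):
    """Level-by-level dominant point search; returns one MLCS of seqs."""
    alphabet = sorted(set.intersection(*(set(a) for a in seqs)))
    T, sentinel = build_successor_table(seqs, alphabet)
    # Each level maps a point to one common subsequence ending at that point.
    level = {tuple(0 for _ in seqs): ""}
    while True:
        cand = {}
        for p, prefix in level.items():
            for s, q in parents(p, T, sentinel, alphabet):
                cand.setdefault(q, prefix + s)
        if not cand:
            return next(iter(level.values()))
        # Keep only the non-dominated candidates (quadratic filter).
        level = {q: w for q, w in cand.items()
                 if not any(dominates(r, q) for r in cand)}
```

For example, `mlcs_naive(["abcd", "acbd", "abd"])` returns `"abd"`. Carrying the subsequence along with each point replaces the traceback step of the text, which keeps the sketch short at the cost of extra memory.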