Exercise 3.5.7 from Mining of Massive Datasets: Find the edit distances (using only insertions and deletions) between the following pairs of strings. abccdabc and acbdcab.
Exercise 3.5.8 : There are a number of other notions of edit distance available. For instance, we can allow, in addition to insertions and deletions, the following operations:
i. Mutation, where one symbol is replaced by another symbol. Note that a mutation can always be performed by an insertion followed by a deletion, but if we allow mutations, then this change counts for only 1, not 2, when computing the edit distance.
ii. Transposition, where two adjacent symbols have their positions swapped. Like a mutation, we can simulate a transposition by one insertion followed by one deletion, but here we count only 1 for these two steps.
Repeat Exercise 3.5.7 if edit distance is defined to be the number of insertions, deletions, mutations, and transpositions needed to transform one string into another.
please answer 3.5.7 and 3.5.8
100 CHAPTER 3. FINDING SIMILAR ITEMS 3.5.5 Edit Distance This distance makes sense when points are strings. The distance between two strings 112 and y=Vi Y m is the smallest number of insertions and deletions of single characters that will convert to y. Example 3.15: The edit distance between the strings I = abcde and y acfdeg is 3. To convert to y: 1. Delete b. 2. Insert f after c. 3. Insert g after e. to y. No sequence of fewer than three insertions and/or deletions will convert Thus, d(x,y) = 3. 0 Another way to define and calculate the edit distance dx,y) is to compute a longest common subsequence (LCS) of r and y. An LCS of u and y is a string that is constructed by deleting positions from x and y, and that is as long as any string that can be constructed that way. The edit distance dz. y) can be calculated as the length of r plus the length of y minus twice the length of their LCS. Example 3.16: The strings r = abcde and y = acfdeg from Example 3.15 have a unique LCS, which is acde. We can be sure it is the longest possible. because it contains every symbol appearing in both x and y. Fortunately, these common symbols appear in the same order in both strings, so we are able to use them all in an LCS. Note that the length of ris 5. the length of y is 6, and the length of their LCS is 4. The edit distance is thus 5+6 - 2 x 4 = 3. which agrees with the direct calculation in Example 3.15. For another example, consider a ba and bab. Their edit distance is 2. For example, we can convert to y by deleting the first a and then inserting b at the end. There are two LCS's ab and ba. Each can be obtained by deleting one symbol from each string. As must be the case for multiple L.CSS of the same pair of strings. both LCS's live the same length. Therefore, we may compute the exit distance is 33-222 Edit distance is a distance measure Surely no edit distance can be negative 100 CHAPTER 3. FINDING SIMILAR ITEMS 3.5.5 Edit Distance This distance makes sense when points are strings. The distance between two strings 112 and y=Vi Y m is the smallest number of insertions and deletions of single characters that will convert to y. Example 3.15: The edit distance between the strings I = abcde and y acfdeg is 3. To convert to y: 1. Delete b. 2. Insert f after c. 3. Insert g after e. to y. No sequence of fewer than three insertions and/or deletions will convert Thus, d(x,y) = 3. 0 Another way to define and calculate the edit distance dx,y) is to compute a longest common subsequence (LCS) of r and y. An LCS of u and y is a string that is constructed by deleting positions from x and y, and that is as long as any string that can be constructed that way. The edit distance dz. y) can be calculated as the length of r plus the length of y minus twice the length of their LCS. Example 3.16: The strings r = abcde and y = acfdeg from Example 3.15 have a unique LCS, which is acde. We can be sure it is the longest possible. because it contains every symbol appearing in both x and y. Fortunately, these common symbols appear in the same order in both strings, so we are able to use them all in an LCS. Note that the length of ris 5. the length of y is 6, and the length of their LCS is 4. The edit distance is thus 5+6 - 2 x 4 = 3. which agrees with the direct calculation in Example 3.15. For another example, consider a ba and bab. Their edit distance is 2. For example, we can convert to y by deleting the first a and then inserting b at the end. There are two LCS's ab and ba. Each can be obtained by deleting one symbol from each string. As must be the case for multiple L.CSS of the same pair of strings. both LCS's live the same length. Therefore, we may compute the exit distance is 33-222 Edit distance is a distance measure Surely no edit distance can be negative