Question
As you may suspect, the length-4 common-substring heuristic is only a simple, approximate technique for finding words with common roots. It is liable to fail in many situations. One failure case is unrelated words that happen to share a substring. For example, consider the following words:
Ionization, Ionic, Actualization, Actual
A string is composed of words, defined as continuous runs of alphanumeric characters delimited by separators (spaces, commas, periods, semicolons, exclamation marks, or any other punctuation symbol except apostrophes).
So a string might look like this:
The hungry scanner keeps a suspicious watch on doctors and their unsuspecting patients
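The word definition above can be sketched with a regular expression. This is a minimal sketch, not the assignment's reference solution: the helper name `split_words` is hypothetical, and it assumes that an apostrophe between alphanumeric characters stays inside the word, since apostrophes are the one punctuation symbol not treated as a separator.

```python
import re

# Assumption: a word is a run of ASCII letters/digits, optionally containing
# apostrophes that sit between alphanumeric characters (e.g. "don't").
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")

def split_words(text):
    """Return the words of text; every other character acts as a separator."""
    return WORD_PATTERN.findall(text)

sentence = ("The hungry scanner keeps a suspicious watch on doctors "
            "and their unsuspecting patients")
words = split_words(sentence)
```

Calling `split_words` on the sentence above yields its 13 words in order, with no punctuation attached.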
The scanner counts words, grouping together those that share a common substring of length 4 or greater.
For example, in the given sentence, suspicious and unsuspecting share the length-4 substring "susp", so the scanner would group these two words together in its output.
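The length-4 test described above can be sketched with a standard dynamic-programming search for the longest common substring. The function names `longest_common_substring` and `share_root` are hypothetical, and the comparison is assumed to be case-insensitive, which the question does not state.

```python
def longest_common_substring(a, b):
    """Return the longest substring shared by a and b (case-insensitive)."""
    a, b = a.lower(), b.lower()
    best_len, best_end = 0, 0
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

def share_root(a, b, min_len=4):
    """Heuristic from the question: a common substring of length >= min_len."""
    return len(longest_common_substring(a, b)) >= min_len
```

With this sketch, `longest_common_substring("suspicious", "unsuspecting")` returns `"susp"`, so `share_root` groups the two words, while a pair such as hungry and scanner is not grouped.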
In the list Ionization, Ionic, Actualization, Actual, the first two words clearly have a common root, as do the last two. Unfortunately, the simple approach also identifies Ionization and Actualization as having a common root due to the presence of the string "tion". We have a situation where Actualization could be counted in two slots.
Find an approach to "break ties" in these cases. What logic can you apply to declare that [Actual, Actualization] is a better match than [Actualization, Ionization]?
Describe and implement your logic as a separate subroutine called from the function implemented in Q1.
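One possible tie-breaking rule, offered as a sketch rather than the expected answer: roots usually sit at the start of a word, while endings such as "-ization" are shared suffixes, so a longer common prefix is treated as stronger evidence than a long common substring elsewhere. Note that substring length alone would pick the wrong pair here, because Actualization and Ionization share the 7-letter run "ization" while Actual and Actualization share only 6 letters. The names `common_prefix_len`, `match_score`, and `break_tie` are hypothetical, and the scoring order (prefix length first, then the fraction of the shorter word covered by the longest common substring) is an assumption.

```python
from difflib import SequenceMatcher

def common_prefix_len(a, b):
    """Number of leading characters a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_score(a, b):
    """Score a pair as (common prefix length, coverage of the shorter word)."""
    a, b = a.lower(), b.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    longest = matcher.find_longest_match(0, len(a), 0, len(b))
    coverage = longest.size / min(len(a), len(b))
    return (common_prefix_len(a, b), coverage)

def break_tie(word, candidates):
    """Pick the candidate that best matches word under match_score."""
    return max(candidates, key=lambda c: match_score(word, c))

best = break_tie("Actualization", ["Ionization", "Actual"])
```

Here `best` is `"Actual"`: the pair [Actual, Actualization] has a common prefix of length 6, whereas [Actualization, Ionization] has none. For pairs with no common prefix at all, such as suspicious and unsuspecting, the coverage term decides. The grouping function from Q1 could call `break_tie` whenever a word qualifies for more than one group.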