Question
Project Description
This project involves creating a spell-checker program that accepts a word from the user, looks up that word in an available corpus, and performs spell-correction on the word if it is not present in the corpus.
The word corpus has been loaded and is available in a string named word corpus.
You will need to do the following:
- Download both this program file and the associated google-10000-english.txt file to your computer.
- Write a program using the WHILE loop that continuously asks the user to enter a word. If the user enters QUIT, then quit from the while loop and terminate the program. (20 points)
- Once the user has entered the word, you will
** Compare the word with the word corpus. If there is a match, then you will let the user know that the word is valid. Note that the comparison must be case insensitive. (20 points)
** If there is no match, then you will need to search the corpus for the word that best matches the word that the user entered and display that word to the user. (40 points)
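The loop and lookup requirements above can be sketched as follows. This is only one possible shape, not the assignment's reference solution: the function name `run_spell_checker` and the `suggest`, `read`, and `write` parameters are hypothetical, introduced so the loop can be exercised without a keyboard. It assumes QUIT must be typed exactly in upper case, so that the lowercase word "quit" can still be checked.

```python
def run_spell_checker(corpus, suggest, read=input, write=print):
    """Prompt for words until the user types QUIT (hypothetical helper)."""
    # Lower-case the corpus once so every lookup is case insensitive.
    known_words = {word.lower() for word in corpus}
    while True:
        entry = read("Enter a word (QUIT to exit): ").strip()
        if entry == "QUIT":  # assumption: QUIT is matched exactly
            write("Goodbye.")
            break
        if not entry:
            write("Please enter a word.")
            continue
        if entry.lower() in known_words:
            write(f"'{entry}' is a valid word.")
        else:
            # suggest() stands in for whatever best-match routine is used.
            write(f"'{entry}' was not found. Best match: '{suggest(entry.lower())}'")


# Scripted demo so the sketch runs without waiting for keyboard input.
scripted_input = iter(["The", "wen", "QUIT"])
run_spell_checker(
    corpus=["the", "of", "and", "went"],
    suggest=lambda word: "went",  # placeholder for a real best-match function
    read=lambda prompt: next(scripted_input),
)
```

In a real session the `read` and `write` defaults (`input` and `print`) would be used, and `suggest` would be replaced by the best-match function.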
Extra credit (20 points)
- Allow the user to enter a paragraph and perform an automated spell correction of the paragraph. For example, if the user enters "Jack and Jill wen up the hills", your program would return something like "Jack and Jill went up the hill"
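A minimal sketch of paragraph correction, under stated assumptions: it swaps in the standard library's `difflib.get_close_matches` as the matcher rather than the point-based scoring described in the hints, and the function name `correct_paragraph` and the tiny corpus are hypothetical. Words already in the corpus are left untouched, and the capitalization of a corrected word follows the original.

```python
import difflib
import re


def correct_paragraph(text, corpus):
    """Replace each unknown word with its closest corpus word (hypothetical helper)."""
    known_words = {word.lower() for word in corpus}
    vocabulary = sorted(known_words)  # sorted for a stable candidate order

    def fix(match):
        word = match.group(0)
        lowered = word.lower()
        if lowered in known_words:
            return word  # already valid, keep as typed
        suggestions = difflib.get_close_matches(lowered, vocabulary, n=1)
        if not suggestions:
            return word  # nothing similar enough, leave unchanged
        best = suggestions[0]
        return best.capitalize() if word[0].isupper() else best

    # Only alphabetic runs are treated as words; punctuation passes through.
    return re.sub(r"[A-Za-z]+", fix, text)


tiny_corpus = ["jack", "and", "jill", "went", "up", "the", "hill"]
print(correct_paragraph("Jack and Jill wen up the hills", tiny_corpus))
```

With the full corpus the result depends on which words are present, so a different suggestion may come back than with this small example list.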
Other Points
- 10 points will be awarded for the overall quality of the user interaction.
- 10 points will be awarded for the proper use of Python, including making sure that the code is optimal.
Hints
Typically, this is implemented by looking at each word in the list and determining the number of adds, updates, and deletes that are needed in order to get from the candidate word to the input word. Each operation has a score associated with it, for example:
- Update - 2 points
- Add - 1 point
- Delete - 2 points
For example,
input word: wen
candidate word: win
- To get from wen to win requires 1 update
- Total score for win is 2 points
candidate word: went
- To get from wen to went requires 1 add
- Total score for went is 1 point
candidate word: hello
- To get from hello to wen requires 2 updates and 2 deletes
- Total score for hello is 8 points
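The scoring in the hints can be sketched as a weighted edit distance. This is an illustrative implementation, not the assignment's official one: the function name `edit_score` and its parameters are hypothetical, and it computes the cheapest way to turn `source` into `target` using the costs listed above (add 1, update 2, delete 2).

```python
def edit_score(source, target, add_cost=1, update_cost=2, delete_cost=2):
    """Lowest total cost of edits that turn source into target (hypothetical helper)."""
    # previous[j] = cost of building target[:j] from an empty prefix (adds only).
    previous = [j * add_cost for j in range(len(target) + 1)]
    for i, source_char in enumerate(source, start=1):
        # Turning source[:i] into an empty string takes i deletes.
        current = [i * delete_cost]
        for j, target_char in enumerate(target, start=1):
            keep_or_update = previous[j - 1] + (0 if source_char == target_char else update_cost)
            delete = previous[j] + delete_cost
            add = current[j - 1] + add_cost
            current.append(min(keep_or_update, delete, add))
        previous = current
    return previous[-1]


print(edit_score("wen", "win"))   # one update
print(edit_score("wen", "went"))  # one add
```

Because adds and deletes have different costs, the score depends on direction: `edit_score("wen", "went")` and `edit_score("went", "wen")` differ, so the direction should be chosen once and used consistently.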
At the end, after looking at all words in the list, you would pick the word with the lowest score as the match. If you arrive at a good match sooner, you might want to stop early for performance reasons and display the result.
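Picking the lowest score with an early exit might look like the sketch below. The names `best_match` and `good_enough` are hypothetical, and the threshold of 1 point is an assumption about what counts as a "good match". The scoring function is passed in, and the demo uses a small lookup table with the win and went scores from the example above plus "when", which is added here purely for illustration (it also needs 1 add).

```python
def best_match(input_word, candidates, score, good_enough=1):
    """Return (word, score) for the lowest-scoring candidate (hypothetical helper)."""
    best_word, best_score = None, None
    for candidate in candidates:
        current = score(input_word, candidate)
        if best_score is None or current < best_score:
            best_word, best_score = candidate, current
            if best_score <= good_enough:
                break  # assumption: a score this low is good enough to stop
    return best_word, best_score


# Stand-in scorer backed by a lookup table; records which candidates were scored.
example_scores = {"win": 2, "went": 1, "when": 1}
scored = []


def lookup_score(input_word, candidate):
    scored.append(candidate)
    return example_scores[candidate]


print(best_match("wen", ["win", "went", "when"], lookup_score))
print(scored)  # "when" is never scored because the loop stops at "went"
```

Stopping early trades completeness for speed: with ties, the first candidate that reaches the threshold wins, so the corpus order affects the answer.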
Imports
In[1]:
import string
Run this cell to load the word corpus. The variable dictionary has the list of all words in your corpus
In[2]:
corpus_file = open('google-10000-english.txt',encoding='utf-8')
dictionary = corpus_file.read().split(' ')
dictionary[:5]
Out[2]:
['the', 'of', 'and', 'to', 'a']
Spellchecker
In[6]:
def similarity(inputWord,wordCorpus):
    # get list of words that share the relatively same size +/- 1 letter
    if len(inputWord) == 0:
        print("Please provide a valid word")
        return "INVALID_ENTRY"
    lowerWordLen = len(inputWord) - 1
    upperWordLen = len(inputWord) + 1
    # get the list of candidate words
    candidateWords = []
    for entry in wordCorpus:
        # determine the set of words within one character distance of the input word
        # and place it in the list candidateWords
        if len(entry) >= lowerWordLen and len(entry) <= upperWordLen:
            candidateWords.append(entry)
    # perform similarity comparison
    # You will need to look for words in the candidateWords list that best match
    # the input word. For example, if the user input was "wen", a possible match is "went"
    # or if the input word is "rabbi", a possible match is "rabbit"
    # All candidate words are from the text "Alice in Wonderland"
    bestMatchWord = None
    ########################################
    ### Write your code here
    ########################################
    # display the best match
    print("Best Match Is:",bestMatchWord)
    return bestMatchWord
In[7]:
import time
startTime = time.time()
# take all words from alice and store them in memory
wordCorpusFile = open('google-10000-english.txt',encoding='utf-8')
wordCorpus = []
for line in wordCorpusFile:
    # remove newlines
    line = line.strip().lower()
    # get words
    words = line.split(" ")
    for word in words:
        if word.isalnum():
            if word not in wordCorpus:
                wordCorpus.append(word)
similarity("wen",wordCorpus)
elapsedTime = time.time() - startTime
print("Time taken:%s seconds" % elapsedTime)
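The loading cell above checks `word not in wordCorpus` against a list, which rescans the list for every word. One possible alternative, sketched here under the assumption that only the order of first appearance matters, removes duplicates with `dict.fromkeys` and closes the file with a `with` block. The function name `load_corpus` is hypothetical.

```python
def load_corpus(path):
    """Read a corpus file into a list of unique lower-case words (hypothetical helper)."""
    with open(path, encoding="utf-8") as corpus_file:
        # split() with no argument handles spaces and newlines alike.
        words = (word.lower() for word in corpus_file.read().split())
        # dict.fromkeys keeps the first occurrence of each word, in order.
        return list(dict.fromkeys(word for word in words if word.isalnum()))
```

It would be called as `wordCorpus = load_corpus('google-10000-english.txt')` in place of the nested loops, producing the same kind of list that `similarity` expects.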