Question
Introduction Your task for this assignment is to build a command-line spell-checking program. You may be more familiar with spell checking as a feature inside
Introduction
Your task for this assignment is to build a command-line spell-checking program.
You may be more familiar with spell checking as a feature inside an editor, but
what you will build is a separate program that looks through a text file and
reports any misspelled words in the file. Additionally, when a word is misspelled,
the program should check whether the word is similar to a correct word, and if
so suggest the correction. To know what words are correct, the program will
take, as another argument, a file containing a list of the legal English words.
Well call this a dictionary file, though its just the words and not the definitions.
To get you started, weve written the parts of the code that read the input files
and identify individual words. Your job is to write the rest. In particular, any
places where you see TODO in a comment are places that you should add code.
There is one unusual feature that the program should have, according to a request
we are passing on from the marketing department. In a (perhaps misguided) plan
to differentiate the spell checker from its competition and show that it is prepared
for the Internet age, this will be the only spell checker that supports the @ (at)
sign in English words. Specifically the @ sign should be treated as a variation of
the lowercase letter a. For instance words like em@il and @tmosphere should
be counted as correct, just like the outdated 20th-century spellings email and
atmosphere.
This is the sort of program that you could write in any general-purpose language,
and it doesnt depend on any 2021-specific concepts; in fact it might seem like it
could have come from CSci 1933. Thats on purpose: weve doing this project
first to give you some practice writing programs in C in our Linux environment,
to help you prepare for later projects where youll have to work with C while
also exploring new concepts.
We ask that you write your program to store the word-list in memory using a
hash table data structure. Specifically, use what is called a separate-chaining
hash table, one in which all the values that hash to the same index are stored in
a linked list. This will require an efficient hash function to index each word
in the dictionary. You will populate the hashtable with words found in the
dictionary. Then, you will write a function to check if the words in an arbitrary
input file are found in the dictionary. Finally, write a rec
spellcheck [-d
The angle brackets arent part of the syntax; they indicate that while spellcheck
and -d are written as is, the arguments described as dictionary_file and
input_file need to be replaced by appropriate file names. The square brackets
mean that the -d option and its flags argument are optional; but if -d is given,
the next argument needs to be an integer representing debugging flags.
The program should produce output to the standard output (e.g., using printf)
with the following format:
...
In other words, there should be one line per misspelled word. Each incorrect
word should appear only once. If your program can find a correct word that is
similar to the incorrect word, print it after the incorrect word, separated by a
space and a colon. If your program cant find a suggestion, just print a question
mark after the incorrect word.
The dictionary_file has this specific format:
There is one word per line, each followed by a newline.
Words consist only of English letters A-Z or a-z, or apostrophes, and are
less than 100 letters.
Your program should be able to efficiently handle a dictionary on the order
of 100,000 lines.
Words that are capitalized or all-capital-letters in the dictionary represent
words that need to be capitalized or all-caps to be correct. For instance
America not america, and USA not Usa or usa. If multiple
capitalizations of a word appear in the dictionary, the ones that require
more capital letters will come first (e.g., Baker the last name before
baker the occupation).
The input_file should be a file containing ASCII text; beyond that it doesnt
have any particular format. Your program should also be able to deal efficiently
with a large input file. For instance checking a 10,000 word input file against
a 100,000 line dictionary to find 100 misspelled words should require only a
fraction of a second.
On the course web site near where you found this handout, youll be able
to download a compressed tar file ha1.tar.gz containing the files for this
assignment, such as a sample of a dictionary file and some input text files, and a
framework spellcheck.c containing the code weve already written for you.
Copy the tar file to your home directory and then expand it with this command:
2
tar -xzvf ha1.tar.gz
Part A - Data Structure
Your first step in building this program is to construct a data structure that will
be used to represent a dictionary.
The dictionary will begin as a text file containing a list of correctly spelled words.
Your program should read this text file and insert each word into the hashtable.
If youve taken 1913, 1933, or a similar class, you should already be familiar
with how hashtables work. If not, you can look at an old textbook or an web
resource like Wikipedia for a review, but remember that such outside sources
should only be helping you with concepts: you need to write your own code.
The file I/O is already written and taken care of, so you will only be responsible
for:
1. Creating the hashtable.
2. Creating the hash function.
3. Inserting each word in the dictionary into the hashtable.
Your hashtable should:
1. Have an appropriate number of bins to hold around 100,000 entries.
2. Have a well-written hash function that minimizes collisions.
3. Manage collisions properly when they happen.
4. Insert in O(1) time.
5. Find in O(1) expected time. (The time to find an item will likely depend
on whether there are collisions. But the time should be O(c) if there are c
collisions, and the average number of collisions should be small.)
Avoid having too many collisions in your hashtable. This will slow down the
performance of your program and defeats the purpose of using a hashtable. As
a rough estimate, if you insert as many entries into the hashtable as it hash
buckets, more than half of the buckets should have at least one entry. If only
a small fraction or even only one bucket is used, thats a good sign you have
a poor hash function. Because the number of words in English is not growing
much, it is OK for the number of buckets in your hashtable to be fixed, as long
as you choose a large enough fixed value.
While avoiding collisions, it is also necessary that your hash function be compatible
with the lookup operations you want to do: words can occur capitalized even
if they are not capitalized in the dictionary. So think about what this implies
for your hash function.
Part B - Spell Check
The second part of this program is the actual spell-check functionality. Given
an input text file, your program will need to read every word in the input file
(this part is done for you) and look it up in the dictionary. If the word does not
exist in the dictionary, either because it is misspelled, or the dictionary is not
comprehensive enough, the word should be printed to the console in the format
specified above.
Here are some simple rules to follow:
1. If a word doesnt appear in the dictionary, it is spelled wrong.
2. If a word is capitalized in the dictionary, it must be capitalized in the
same way to be spelled correctly. To be more specific, any letter that
is capitalized in the dictionary must be capitalized. It is OK if even
more letters are capitalized, like if part of a document is written IN ALL
CAPITAL LETTERS.
3. Otherwise (if the word is lowercase in the dictionary), capitalization does
not matter.
One specific corner case you will need to add some code for has to do with
the ' character, which can represent either a single quote or an apostrophe,
depending on where it appears. We want to count an apostrophe as part of a
word, since the correct placement of apostrophes is an important part of the
spelling of contractions, for instance. However single quotes shouldnt be part of
a word. The word recognition code that we have written for you always treats
the ' character as part of a word; but if a word appears in single quotes without
whitespace like 'this', the word your program should look up in the dictionary
is this, not 'this'.
It is recommended to test as many edge cases as you can think of. This program
should be robust and handle many different type of spelling mistakes.
Use the provided dictionary as a reference for your tests. This will be the main
list of words used in grading, and it is indicative of the size and format of
dictionary your program should support well.
Part C - Word Suggestion
The final part of any good spell check software is to suggest possible correct
words. You are asked to implement a simple version of this feature that outputs
one reasonable suggestion. The suggestion must be a word that already exists
in the dictionary. We are defining a reasonable suggestion as a new word that
differs from the misspelled word by one character. This includes:
1. Having one extra character somewhere in the word.
2. Having one fewer character somewhere in the word.
3. Having one substituted character somewhere in the word.
For example, suppose a word in the input text was fumd. This is obviously
misspelled and should be caught by the spell check. One reasonable suggestion
would be the word fund since there is one substituted character. However, an
unreasonable suggestion would be the word stop since it has nothing in common
with the original word.
Your reasonable suggestion should be printed next to the misspelled word in
the output. See the output structure above for more detail about the expected
output of your program.
In many cases, there may exist more than one reasonable suggestion. Depending
on how this portion of your program is written, it may find one replacement
before another. You are only required to output one such suggestion.
If no suggestions are found, print a question mark (?) after the misspelled word
as specified above.
---------------
Spellcheck.c Code
---------------
#include
#include
#include
#include
#include
#include
#define MAX_WORD_SIZE 100
#define DEBUG_INPUT_IO 0x1
#define DEBUG_DICTIONARY_IO 0x2
int debug;
const char *usage = "Usage: spellcheck [-d debug_level] dictionary input ";
/* Is this a character that could be part of a word? */
int is_word_char(int c) {
return isalpha(c) || c == '\'' || c == '@';
}
/* IO: Read in dictionary by word */
void read_dictionary(char *filename)
{
FILE *fh;
int i = 0;
int line_num = 1;
int c;
char word[MAX_WORD_SIZE];
fh = fopen(filename, "r");
if (!fh) {
fprintf(stderr, "Error: could not open wordlist %s: %s ",
filename, strerror(errno));
exit(1);
}
while ((c = fgetc(fh)) != EOF) {
if (c == ' ') {
line_num++;
if (i == 0)
continue;
word[i] = '\0';
i = 0;
if (debug & DEBUG_DICTIONARY_IO)
printf("%s ", word);
/* TODO Process word here, e.g. add_to_dictionary(word) */
} else if (i == MAX_WORD_SIZE - 1) {
word[i] = 0;
fprintf(stderr, "Word too long at wordlist line %d: %s... ",
line_num, word);
exit(2);
} else {
if (!is_word_char(c)) {
fprintf(stderr, "Bad character '%c' in word at line %d ",
c, line_num);
exit(2);
}
word[i] = c;
i++;
}
}
fclose(fh);
}
/* "word" should be a character buffer, where *i_ref is the number of
valid characters in the buffer (or equivalently, the index where
the next character would be added. If the buffer is not empty, then
terminate the buffer, pass it to check_word, and then empty the
buffer. */
void maybe_process_word(char *word, int *i_ref) {
if (*i_ref != 0) {
word[*i_ref] = '\0';
if (debug & DEBUG_INPUT_IO)
printf("DBG: i=%-2d word=\"%s\" ", *i_ref, word);
/* TODO: process word here, e.g. check_word(word) */
*i_ref = 0;
}
}
/* IO: Read in input by word */
void process_document(char *filename)
{
FILE *fh;
int i = 0;
int c;
int line_num = 1;
int check_hyphen_mode = 0;
int seen_hyphen_newline = 0;
char word[MAX_WORD_SIZE];
fh = fopen(filename, "r");
if (!fh) {
fprintf(stderr, "Error: could not open input %s: %s ", filename,
strerror(errno));
exit(1);
}
while ((c = fgetc(fh)) != EOF) {
if (c == ' ')
line_num++;
if (check_hyphen_mode) {
if (isspace(c)) {
if (c == ' ')
seen_hyphen_newline = 1;
continue;
} else {
check_hyphen_mode = 0;
if (!seen_hyphen_newline) {
/* If no newline, treat as two words */
maybe_process_word(word, &i);
} else {
/* If there was a newline, ignore all the
whitespace */
}
seen_hyphen_newline = 0;
}
}
if (c == '-') {
/* Can't decide yet whether this is the end of a word, it
depends what comes next. */
check_hyphen_mode = 1;
} else if (!is_word_char(c)) {
maybe_process_word(word, &i);
} else if (i == MAX_WORD_SIZE - 1) {
word[i] = 0;
fprintf(stderr, "Word too long in document at line %d: %s... ",
line_num, word);
exit(2);
} else {
assert(is_word_char(c));
word[i] = c;
i++;
}
}
maybe_process_word(word, &i);
fclose(fh);
}
int main(int argc, char **argv)
{
char *dict_fname;
char *doc_fname;
if (argc == 3) {
/* No options */
dict_fname = argv[1];
doc_fname = argv[2];
} else if (argc == 5) {
/* Maybe one option and its argument, which must come first. */
if (argv[1][0] == '-') {
if (argv[1][1] == 'd' && argv[1][2] == '\0') {
char *endptr;
debug = strtol(argv[2], &endptr, 0);
if (debug < 0 || endptr == argv[2] || *endptr != '\0') {
fprintf(stderr, "%s", usage);
fprintf(stderr, "Argument to -d should be a "
"non-negative integer ");
return 1;
}
dict_fname = argv[3];
doc_fname = argv[4];
} else {
fprintf(stderr, "%s", usage);
fprintf(stderr, "Unrecognized option '%s' ", argv[1]);
return 1;
}
} else {
fprintf(stderr, "%s", usage);
fprintf(stderr, "Too many non-option arguments ");
return 1;
}
} else {
fprintf(stderr, "%s", usage);
fprintf(stderr, "Wrong number of arguments ");
return 1;
}
read_dictionary(dict_fname);
process_document(doc_fname);
}
--------------
Input1.txt
---------------
Remember, misspelled words should only be lsted once in your ouput. Lsted words should have only one recommendation.
---------------
Dictionary
---------------
list of words, feel free to create dictionary for testing
Step by Step Solution
There are 3 Steps involved in it
Step: 1
Get Instant Access to Expert-Tailored Solutions
See step-by-step solutions with expert insights and AI powered tools for academic success
Step: 2
Step: 3
Ace Your Homework with AI
Get the answers you need in no time with our AI-driven, step-by-step assistance
Get Started