Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

Introduction Your task for this assignment is to build a command-line spell-checking program. You may be more familiar with spell checking as a feature inside

Introduction

Your task for this assignment is to build a command-line spell-checking program.

You may be more familiar with spell checking as a feature inside an editor, but

what you will build is a separate program that looks through a text file and

reports any misspelled words in the file. Additionally, when a word is misspelled,

the program should check whether the word is similar to a correct word, and if

so suggest the correction. To know what words are correct, the program will

take, as another argument, a file containing a list of the legal English words.

Well call this a dictionary file, though its just the words and not the definitions.

To get you started, weve written the parts of the code that read the input files

and identify individual words. Your job is to write the rest. In particular, any

places where you see TODO in a comment are places that you should add code.

There is one unusual feature that the program should have, according to a request

we are passing on from the marketing department. In a (perhaps misguided) plan

to differentiate the spell checker from its competition and show that it is prepared

for the Internet age, this will be the only spell checker that supports the @ (at)

sign in English words. Specifically the @ sign should be treated as a variation of

the lowercase letter a. For instance words like em@il and @tmosphere should

be counted as correct, just like the outdated 20th-century spellings email and

atmosphere.

This is the sort of program that you could write in any general-purpose language,

and it doesnt depend on any 2021-specific concepts; in fact it might seem like it

could have come from CSci 1933. Thats on purpose: weve doing this project

first to give you some practice writing programs in C in our Linux environment,

to help you prepare for later projects where youll have to work with C while

also exploring new concepts.

We ask that you write your program to store the word-list in memory using a

hash table data structure. Specifically, use what is called a separate-chaining

hash table, one in which all the values that hash to the same index are stored in

a linked list. This will require an efficient hash function to index each word

in the dictionary. You will populate the hashtable with words found in the

dictionary. Then, you will write a function to check if the words in an arbitrary

input file are found in the dictionary. Finally, write a rec

spellcheck [-d ] .

The angle brackets arent part of the syntax; they indicate that while spellcheck

and -d are written as is, the arguments described as dictionary_file and

input_file need to be replaced by appropriate file names. The square brackets

mean that the -d option and its flags argument are optional; but if -d is given,

the next argument needs to be an integer representing debugging flags.

The program should produce output to the standard output (e.g., using printf)

with the following format:

:

:

?

...

:

In other words, there should be one line per misspelled word. Each incorrect

word should appear only once. If your program can find a correct word that is

similar to the incorrect word, print it after the incorrect word, separated by a

space and a colon. If your program cant find a suggestion, just print a question

mark after the incorrect word.

The dictionary_file has this specific format:

There is one word per line, each followed by a newline.

Words consist only of English letters A-Z or a-z, or apostrophes, and are

less than 100 letters.

Your program should be able to efficiently handle a dictionary on the order

of 100,000 lines.

Words that are capitalized or all-capital-letters in the dictionary represent

words that need to be capitalized or all-caps to be correct. For instance

America not america, and USA not Usa or usa. If multiple

capitalizations of a word appear in the dictionary, the ones that require

more capital letters will come first (e.g., Baker the last name before

baker the occupation).

The input_file should be a file containing ASCII text; beyond that it doesnt

have any particular format. Your program should also be able to deal efficiently

with a large input file. For instance checking a 10,000 word input file against

a 100,000 line dictionary to find 100 misspelled words should require only a

fraction of a second.

On the course web site near where you found this handout, youll be able

to download a compressed tar file ha1.tar.gz containing the files for this

assignment, such as a sample of a dictionary file and some input text files, and a

framework spellcheck.c containing the code weve already written for you.

Copy the tar file to your home directory and then expand it with this command:

2

tar -xzvf ha1.tar.gz

Part A - Data Structure

Your first step in building this program is to construct a data structure that will

be used to represent a dictionary.

The dictionary will begin as a text file containing a list of correctly spelled words.

Your program should read this text file and insert each word into the hashtable.

If youve taken 1913, 1933, or a similar class, you should already be familiar

with how hashtables work. If not, you can look at an old textbook or an web

resource like Wikipedia for a review, but remember that such outside sources

should only be helping you with concepts: you need to write your own code.

The file I/O is already written and taken care of, so you will only be responsible

for:

1. Creating the hashtable.

2. Creating the hash function.

3. Inserting each word in the dictionary into the hashtable.

Your hashtable should:

1. Have an appropriate number of bins to hold around 100,000 entries.

2. Have a well-written hash function that minimizes collisions.

3. Manage collisions properly when they happen.

4. Insert in O(1) time.

5. Find in O(1) expected time. (The time to find an item will likely depend

on whether there are collisions. But the time should be O(c) if there are c

collisions, and the average number of collisions should be small.)

Avoid having too many collisions in your hashtable. This will slow down the

performance of your program and defeats the purpose of using a hashtable. As

a rough estimate, if you insert as many entries into the hashtable as it hash

buckets, more than half of the buckets should have at least one entry. If only

a small fraction or even only one bucket is used, thats a good sign you have

a poor hash function. Because the number of words in English is not growing

much, it is OK for the number of buckets in your hashtable to be fixed, as long

as you choose a large enough fixed value.

While avoiding collisions, it is also necessary that your hash function be compatible

with the lookup operations you want to do: words can occur capitalized even

if they are not capitalized in the dictionary. So think about what this implies

for your hash function.

Part B - Spell Check

The second part of this program is the actual spell-check functionality. Given

an input text file, your program will need to read every word in the input file

(this part is done for you) and look it up in the dictionary. If the word does not

exist in the dictionary, either because it is misspelled, or the dictionary is not

comprehensive enough, the word should be printed to the console in the format

specified above.

Here are some simple rules to follow:

1. If a word doesnt appear in the dictionary, it is spelled wrong.

2. If a word is capitalized in the dictionary, it must be capitalized in the

same way to be spelled correctly. To be more specific, any letter that

is capitalized in the dictionary must be capitalized. It is OK if even

more letters are capitalized, like if part of a document is written IN ALL

CAPITAL LETTERS.

3. Otherwise (if the word is lowercase in the dictionary), capitalization does

not matter.

One specific corner case you will need to add some code for has to do with

the ' character, which can represent either a single quote or an apostrophe,

depending on where it appears. We want to count an apostrophe as part of a

word, since the correct placement of apostrophes is an important part of the

spelling of contractions, for instance. However single quotes shouldnt be part of

a word. The word recognition code that we have written for you always treats

the ' character as part of a word; but if a word appears in single quotes without

whitespace like 'this', the word your program should look up in the dictionary

is this, not 'this'.

It is recommended to test as many edge cases as you can think of. This program

should be robust and handle many different type of spelling mistakes.

Use the provided dictionary as a reference for your tests. This will be the main

list of words used in grading, and it is indicative of the size and format of

dictionary your program should support well.

Part C - Word Suggestion

The final part of any good spell check software is to suggest possible correct

words. You are asked to implement a simple version of this feature that outputs

one reasonable suggestion. The suggestion must be a word that already exists

in the dictionary. We are defining a reasonable suggestion as a new word that

differs from the misspelled word by one character. This includes:

1. Having one extra character somewhere in the word.

2. Having one fewer character somewhere in the word.

3. Having one substituted character somewhere in the word.

For example, suppose a word in the input text was fumd. This is obviously

misspelled and should be caught by the spell check. One reasonable suggestion

would be the word fund since there is one substituted character. However, an

unreasonable suggestion would be the word stop since it has nothing in common

with the original word.

Your reasonable suggestion should be printed next to the misspelled word in

the output. See the output structure above for more detail about the expected

output of your program.

In many cases, there may exist more than one reasonable suggestion. Depending

on how this portion of your program is written, it may find one replacement

before another. You are only required to output one such suggestion.

If no suggestions are found, print a question mark (?) after the misspelled word

as specified above.

---------------

Spellcheck.c Code

---------------

#include

#include

#include

#include

#include

#include

#define MAX_WORD_SIZE 100

#define DEBUG_INPUT_IO 0x1

#define DEBUG_DICTIONARY_IO 0x2

int debug;

const char *usage = "Usage: spellcheck [-d debug_level] dictionary input ";

/* Is this a character that could be part of a word? */

int is_word_char(int c) {

return isalpha(c) || c == '\'' || c == '@';

}

/* IO: Read in dictionary by word */

void read_dictionary(char *filename)

{

FILE *fh;

int i = 0;

int line_num = 1;

int c;

char word[MAX_WORD_SIZE];

fh = fopen(filename, "r");

if (!fh) {

fprintf(stderr, "Error: could not open wordlist %s: %s ",

filename, strerror(errno));

exit(1);

}

while ((c = fgetc(fh)) != EOF) {

if (c == ' ') {

line_num++;

if (i == 0)

continue;

word[i] = '\0';

i = 0;

if (debug & DEBUG_DICTIONARY_IO)

printf("%s ", word);

/* TODO Process word here, e.g. add_to_dictionary(word) */

} else if (i == MAX_WORD_SIZE - 1) {

word[i] = 0;

fprintf(stderr, "Word too long at wordlist line %d: %s... ",

line_num, word);

exit(2);

} else {

if (!is_word_char(c)) {

fprintf(stderr, "Bad character '%c' in word at line %d ",

c, line_num);

exit(2);

}

word[i] = c;

i++;

}

}

fclose(fh);

}

/* "word" should be a character buffer, where *i_ref is the number of

valid characters in the buffer (or equivalently, the index where

the next character would be added. If the buffer is not empty, then

terminate the buffer, pass it to check_word, and then empty the

buffer. */

void maybe_process_word(char *word, int *i_ref) {

if (*i_ref != 0) {

word[*i_ref] = '\0';

if (debug & DEBUG_INPUT_IO)

printf("DBG: i=%-2d word=\"%s\" ", *i_ref, word);

/* TODO: process word here, e.g. check_word(word) */

*i_ref = 0;

}

}

/* IO: Read in input by word */

void process_document(char *filename)

{

FILE *fh;

int i = 0;

int c;

int line_num = 1;

int check_hyphen_mode = 0;

int seen_hyphen_newline = 0;

char word[MAX_WORD_SIZE];

fh = fopen(filename, "r");

if (!fh) {

fprintf(stderr, "Error: could not open input %s: %s ", filename,

strerror(errno));

exit(1);

}

while ((c = fgetc(fh)) != EOF) {

if (c == ' ')

line_num++;

if (check_hyphen_mode) {

if (isspace(c)) {

if (c == ' ')

seen_hyphen_newline = 1;

continue;

} else {

check_hyphen_mode = 0;

if (!seen_hyphen_newline) {

/* If no newline, treat as two words */

maybe_process_word(word, &i);

} else {

/* If there was a newline, ignore all the

whitespace */

}

seen_hyphen_newline = 0;

}

}

if (c == '-') {

/* Can't decide yet whether this is the end of a word, it

depends what comes next. */

check_hyphen_mode = 1;

} else if (!is_word_char(c)) {

maybe_process_word(word, &i);

} else if (i == MAX_WORD_SIZE - 1) {

word[i] = 0;

fprintf(stderr, "Word too long in document at line %d: %s... ",

line_num, word);

exit(2);

} else {

assert(is_word_char(c));

word[i] = c;

i++;

}

}

maybe_process_word(word, &i);

fclose(fh);

}

int main(int argc, char **argv)

{

char *dict_fname;

char *doc_fname;

if (argc == 3) {

/* No options */

dict_fname = argv[1];

doc_fname = argv[2];

} else if (argc == 5) {

/* Maybe one option and its argument, which must come first. */

if (argv[1][0] == '-') {

if (argv[1][1] == 'd' && argv[1][2] == '\0') {

char *endptr;

debug = strtol(argv[2], &endptr, 0);

if (debug < 0 || endptr == argv[2] || *endptr != '\0') {

fprintf(stderr, "%s", usage);

fprintf(stderr, "Argument to -d should be a "

"non-negative integer ");

return 1;

}

dict_fname = argv[3];

doc_fname = argv[4];

} else {

fprintf(stderr, "%s", usage);

fprintf(stderr, "Unrecognized option '%s' ", argv[1]);

return 1;

}

} else {

fprintf(stderr, "%s", usage);

fprintf(stderr, "Too many non-option arguments ");

return 1;

}

} else {

fprintf(stderr, "%s", usage);

fprintf(stderr, "Wrong number of arguments ");

return 1;

}

read_dictionary(dict_fname);

process_document(doc_fname);

}

--------------

Input1.txt

---------------

Remember, misspelled words should only be lsted once in your ouput. Lsted words should have only one recommendation.

---------------

Dictionary

---------------

list of words, feel free to create dictionary for testing

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image_2

Step: 3

blur-text-image_3

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Hands-On Database

Authors: Steve Conger

2nd Edition

0133024415, 978-0133024418

More Books

Students also viewed these Databases questions

Question

What does Processing of an OLAP Cube accomplish?

Answered: 1 week ago