Question

1 Approved Answer

Posted on Sep 09, 2024

This assignment: kwoc1.c In this assignment you will direct your learning of C by solving a problem involving a tool used by many research and

image text in transcribed

This assignment: kwoc1.c In this assignment you will direct your learning of C by solving a problem involving a tool used by many research and scholars that is called a concordance. Specifically we will consider one kind of concordance known as keyword-out-of-context or KWOC The Oxford English Dictionary defines a concordance is an "alphabetical arrangement of principal words contained in a book, with citations of the passages in which they occur". Someone using a concordance is therefore able to look up a word of interest to them for that book, and find all the locations (page numbers and line numbers) where the word is used. The concordances your program will create will be for texts much, much smaller than a complete book! Instead the concordance will relate words to the lines in which they occur in some text file being processed. As an example, consider the following small text file. Line numbers are given for your reference. (Among the test files for this first assignment, this corresponds to the contents of in04.txt.) 1 the fish a dog cat dog rabbit 2 the fish and cat 3 a rabbit or elephant The concordance for this file is to be generated using the following command (where I assume all of the needed text files are contained in the same directory as the executable). | $ ./kwoc1 -e english.txt in04.txt The program produces this output to the console: CAT CAT DOG ELEPHANT FISH FISH RABBIT RABBIT the fish a dog cat dog rabbit (1) the fish and cat (2) the fish a dog cat dog rabbit (14) a rabbit or elephant (3) the fish a dog cat dog rabbit (1) the fish and cat (2) the fish a dog cat dog rabbit (1) a rabbit or elephant (3) Notice that each line is made up of a keyword, followed by an input line within which that word appeared, and ends with a number reference. The keywords are printed in alphabetical order. The word cat (printed as CAT when shown as a keyword) appears on lines 1 and 2 of the input; the word dog appears in only one input line - line 1 - but does so more than once, hence the asterisk. And so on and so forth. You will also notice that some words contained within the input file do not appear as keywords in the concordance. For example, and, or, and the are not really of interest to anyone using a concordance - that is, they are excluded. A file with exclusion words was given to kwoc1 on the command line via the -e argument. How must kwoci determine the keywords? It may do so in effect by (a) reading the contents of the input file for which an index is needed, then (b) determining the unique words in that file, and finally (C) reading the words in the exclusion file and removing those words from the set discovered in the previous step. With respect to output, keywords must always appear at the start of the line. (If the keyword itself had been capitalized in the output sentence, then that style of concordance would have been called keyword-in-context or KWIC). To aid with formatting, the longest keyword determines how input lines themselves are indented, i.e., length of longest keyword + two spaces. The testfiles for this assignment can provide you with more examples of input and their expected output; see below for more details on the location of test files. Some simplifying assumptions In order to reduce the complexity of the program, I have provided the following simplifications and size limits. 1. All input consists of lower-case letters, and all words on a line are separated by a single space. There is no other punctuation such as commas, colons, quotations marks, periods, etc. That is, for the first assignment you need not worry about cases such as that keyword computer appearing as computer or "Computer or computer-aided or computer? or computer', etc. etc. 2. All provided exclusion-word files will have one word per line, and lines will be in alphabetical order. 3. Each input file (i.e., those for which a KWOC index is to be created): (a) will have at most 100 lines; (b) has input lines at most 80 characters long (including spaces and newline character); (c) has words no longer than 20 characters; (d) has no more than 500 unique keywords, although a specific keyword may appear many times in the input file. Also: files with exception words will have no more than 100 lines (i.e., no more than 100 words). I also have several more restrictions on coding which are actually helpful simplifications: 4. Your solution must not make use of the dynamic-memory functions malloc, calloc, valloc, realloc, etc. That is, all memory needed for your solution can be statically allocated i.e., program-scope variables) given the size limits described above in item 3. 5. All code must appear in a single source-code file named kwoc1.c. This assignment: kwoc1.c In this assignment you will direct your learning of C by solving a problem involving a tool used by many research and scholars that is called a concordance. Specifically we will consider one kind of concordance known as keyword-out-of-context or KWOC The Oxford English Dictionary defines a concordance is an "alphabetical arrangement of principal words contained in a book, with citations of the passages in which they occur". Someone using a concordance is therefore able to look up a word of interest to them for that book, and find all the locations (page numbers and line numbers) where the word is used. The concordances your program will create will be for texts much, much smaller than a complete book! Instead the concordance will relate words to the lines in which they occur in some text file being processed. As an example, consider the following small text file. Line numbers are given for your reference. (Among the test files for this first assignment, this corresponds to the contents of in04.txt.) 1 the fish a dog cat dog rabbit 2 the fish and cat 3 a rabbit or elephant The concordance for this file is to be generated using the following command (where I assume all of the needed text files are contained in the same directory as the executable). | $ ./kwoc1 -e english.txt in04.txt The program produces this output to the console: CAT CAT DOG ELEPHANT FISH FISH RABBIT RABBIT the fish a dog cat dog rabbit (1) the fish and cat (2) the fish a dog cat dog rabbit (14) a rabbit or elephant (3) the fish a dog cat dog rabbit (1) the fish and cat (2) the fish a dog cat dog rabbit (1) a rabbit or elephant (3) Notice that each line is made up of a keyword, followed by an input line within which that word appeared, and ends with a number reference. The keywords are printed in alphabetical order. The word cat (printed as CAT when shown as a keyword) appears on lines 1 and 2 of the input; the word dog appears in only one input line - line 1 - but does so more than once, hence the asterisk. And so on and so forth. You will also notice that some words contained within the input file do not appear as keywords in the concordance. For example, and, or, and the are not really of interest to anyone using a concordance - that is, they are excluded. A file with exclusion words was given to kwoc1 on the command line via the -e argument. How must kwoci determine the keywords? It may do so in effect by (a) reading the contents of the input file for which an index is needed, then (b) determining the unique words in that file, and finally (C) reading the words in the exclusion file and removing those words from the set discovered in the previous step. With respect to output, keywords must always appear at the start of the line. (If the keyword itself had been capitalized in the output sentence, then that style of concordance would have been called keyword-in-context or KWIC). To aid with formatting, the longest keyword determines how input lines themselves are indented, i.e., length of longest keyword + two spaces. The testfiles for this assignment can provide you with more examples of input and their expected output; see below for more details on the location of test files. Some simplifying assumptions In order to reduce the complexity of the program, I have provided the following simplifications and size limits. 1. All input consists of lower-case letters, and all words on a line are separated by a single space. There is no other punctuation such as commas, colons, quotations marks, periods, etc. That is, for the first assignment you need not worry about cases such as that keyword computer appearing as computer or "Computer or computer-aided or computer? or computer', etc. etc. 2. All provided exclusion-word files will have one word per line, and lines will be in alphabetical order. 3. Each input file (i.e., those for which a KWOC index is to be created): (a) will have at most 100 lines; (b) has input lines at most 80 characters long (including spaces and newline character); (c) has words no longer than 20 characters; (d) has no more than 500 unique keywords, although a specific keyword may appear many times in the input file. Also: files with exception words will have no more than 100 lines (i.e., no more than 100 words). I also have several more restrictions on coding which are actually helpful simplifications: 4. Your solution must not make use of the dynamic-memory functions malloc, calloc, valloc, realloc, etc. That is, all memory needed for your solution can be statically allocated i.e., program-scope variables) given the size limits described above in item 3. 5. All code must appear in a single source-code file named kwoc1.c