Question
I am trying to write a word-concordance program that will do the following:
1) Read the stop_words.txt file into a dictionary (use the same type of dictionary that you're timing) containing only stop words, called stopWordDict. (WARNING: Strip the newline (\n) character from the end of the stop word before adding it to stopWordDict)
2) Process the WarAndPeace.txt file one line at a time to build the word-concordance dictionary (called wordConcordanceDict) containing main words for the keys, with a list of their associated line numbers as their values.
3) Traverse the wordConcordanceDict alphabetically by key to generate a text file containing the concordance words printed out in alphabetical order along with their corresponding line numbers.
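The data structure that step 2 asks for can be sketched as a plain dict mapping each word to a list of line numbers. This is a minimal sketch with hypothetical names (`build_concordance`, `lines`, `stop_words` are not part of the assignment), assuming the lines have already been tokenized into word lists:

```python
def build_concordance(lines, stop_words):
    # lines: iterable of word lists, one list per text line
    # stop_words: a set (or dict) of words to skip
    concordance = {}
    for line_num, words in enumerate(lines, start=1):
        # enumerate still advances on empty lists, so blank
        # lines are counted in the numbering
        for word in words:
            if word and word not in stop_words:
                # setdefault creates the list the first time a word is seen
                concordance.setdefault(word, []).append(line_num)
    return concordance
```

Using `setdefault` avoids the explicit "already in the dict?" branch that the code below spells out by hand.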
I tested my program on a small file with a short list of stop words and it worked correctly (an example of this is provided below). The outcome was what I expected: a list of the main words with their line numbers, not including words from the stop_words_small.txt file. The only differences between the small file I tested and the main file I am actually trying to process are that the main file is much longer and contains punctuation. So the problem I am running into is that when I run my program on the main file, I get far more results than expected, because the punctuation is not being removed from the file.
For example, below is a section of the output where my code counted the word Dmitri as four separate words because of differences in capitalization and the punctuation that follows the word. If my code removed the punctuation correctly, the word Dmitri would be counted as one word followed by all the locations where it was found. My output also separates upper- and lower-case words, so my code is not lower-casing the file either.
What my code currently displays:
Dmitri : [2528, 3674, 3687, 3694, 4641, 41131]
Dmitri! : [16671, 16672]
Dmitri, : [2530, 3676, 3685, 13160, 16247]
dmitri : [2000]
What my code should display:
dmitri : [2000, 2528, 2530, 3674, 3676, 3685, 3687, 3694, 4641, 13160, 16671, 16672, 41131]
Words are defined to be sequences of letters delimited by any non-letter. There should also be no distinction made between upper- and lower-case letters, but my program splits those up as well. Note that blank lines are still to be counted in the line numbering.
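For reference, a tokenizer that matches that definition exactly (maximal runs of letters, case-folded) can be sketched with `re.findall`; this is a hypothetical helper, not part of the program in the question:

```python
import re

def tokenize(line):
    # Lower-case first, then take maximal runs of letters; every
    # non-letter (digits, punctuation, whitespace) acts as a delimiter.
    return re.findall(r"[a-z]+", line.lower())
```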
Below is my code and I would appreciate it if anyone could take a look at it and give me any feedback on what I am doing wrong. Thank you in advance.
import re

def main():
    stopFile = open("stop_words.txt", "r")
    stopWordDict = dict()
    for line in stopFile:
        stopWordDict[line.lower().strip("\n")] = []

    hwFile = open("WarAndPeace.txt", "r")
    wordConcordanceDict = dict()
    lineNum = 1
    for line in hwFile:
        wordList = re.split(" |\n|\.|\"|\)|\(", line)
        for word in wordList:
            word.strip("\n")
            if (len(word) != 0) and word.lower() not in stopWordDict:
                if word in wordConcordanceDict:
                    wordConcordanceDict[word].append(lineNum)
                else:
                    wordConcordanceDict[word] = [lineNum]
        lineNum = lineNum + 1

    for word in sorted(wordConcordanceDict):
        print(word, " : ", wordConcordanceDict[word])

if __name__ == "__main__":
    main()
Just as another example and reference here is the small file I test with the small list of stop words that worked perfectly.
stop_words_small.txt file
a, about, be, by, can, do, i, in, is, it, of, on, the, this, to, was
small_file.txt
This is a sample data (text) file
to be processed by your word-concordance program.

The real data file is much bigger.
correct output
bigger: 4
concordance: 2
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
word: 2
your: 2
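There are three problems in the code above: the `re.split` pattern only removes a fixed handful of delimiters, so any other punctuation (commas, exclamation marks, etc.) stays attached to the word; `word.strip(...)` discards its return value, since strings are immutable; and the words are never lower-cased before being used as dictionary keys, so `Dmitri` and `dmitri` are stored separately. One way to fix all three at once is to replace the splitting with `re.findall(r"[a-z]+", line.lower())`, which directly implements "sequences of letters delimited by any non-letter." The following is a sketch of that approach (same file names as in the question), not the only possible fix:

```python
import re

def main():
    # Read stop words, one per line, stripping the trailing newline.
    with open("stop_words.txt", "r") as stop_file:
        stop_words = {line.strip().lower() for line in stop_file}

    concordance = {}
    with open("WarAndPeace.txt", "r") as text_file:
        # enumerate keeps counting blank lines, as required.
        for line_num, line in enumerate(text_file, start=1):
            # Words are maximal runs of letters, case-folded.
            for word in re.findall(r"[a-z]+", line.lower()):
                if word not in stop_words:
                    concordance.setdefault(word, []).append(line_num)

    # Traverse alphabetically by key.
    for word in sorted(concordance):
        print(word, ":", concordance[word])

if __name__ == "__main__":
    main()
```

A `set` is used for the stop words instead of a dict with empty-list values; membership testing (`word not in stop_words`) works the same way and says more clearly that only the keys matter.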