Question

Posted on Sep 25, 2024

I am trying to write a word-concordance program that will do the following:

1) Read the stop_words.txt file into a dictionary (use the same type of dictionary that you're timing) containing only stop words, called stopWordDict. (WARNING: Strip the newline (\n) character from the end of each stop word before adding it to stopWordDict.)

2) Process the WarAndPeace.txt file one line at a time to build the word-concordance dictionary (called wordConcordanceDict), containing main words as the keys with a list of their associated line numbers as their values.

3) Traverse the wordConcordanceDict alphabetically by key to generate a text file containing the concordance words printed out in alphabetical order along with their corresponding line numbers.
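Steps 1 and 3 could be sketched roughly as follows, assuming the stop-word file holds one stop word per line. The function names `load_stop_words` and `write_concordance`, and the `word: n1 n2 ...` output layout, are made up for illustration and are not part of the assignment.

```python
def load_stop_words(path):
    """Return a dict whose keys are the stop words in the file at `path`."""
    stop_word_dict = {}
    with open(path, "r") as stop_file:
        for line in stop_file:
            word = line.strip().lower()  # strip() with no argument drops the trailing "\n"
            if word:                     # ignore empty lines
                stop_word_dict[word] = None  # only the keys matter
    return stop_word_dict


def write_concordance(concordance, path):
    """Write one `word: n1 n2 ...` line per key, in alphabetical key order."""
    with open(path, "w") as out_file:
        for word in sorted(concordance):
            line_numbers = " ".join(str(n) for n in concordance[word])
            out_file.write(f"{word}: {line_numbers}\n")
```

Using `with` closes each file automatically, and `sorted(concordance)` iterates over the keys alphabetically without changing the dictionary.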

I tested my program on a small file with a short list of stop words and it worked correctly (an example is provided below). The outcome was what I expected: a list of the main words with their line numbers, not including words from the stop_words_small.txt file. The only difference between the small test file and the main file is that the main file is much longer and contains punctuation. When I run my program on the main file, I get far more results than expected, because the punctuation is not being removed from the words.

For example, below is a section of the outcome where my code counted the word Dmitri as four separate words because of the different capitalization and punctuation that follows the word. If my code were to remove the punctuation correctly, the word Dmitri would be counted as one word followed by all the locations found. My output is also separating upper and lower case words, so my code is not making the file lower case either.

What my code currently displays:

Dmitri : [2528, 3674, 3687, 3694, 4641, 41131]
Dmitri! : [16671, 16672]
Dmitri, : [2530, 3676, 3685, 13160, 16247]
dmitri : [2000]

What my code should display:

dmitri : [2000, 2528, 2530, 3674, 3676, 3685, 3687, 3694, 4641, 13160, 16671, 16672, 41131]

Words are defined to be sequences of letters delimited by any non-letter. No distinction should be made between upper- and lower-case letters, but my program currently treats them as different words. Blank lines are to be counted in the line numbering.
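That definition of a word maps fairly directly onto a regular expression. The sketch below is one possible approach, not necessarily what the assignment intends: it lower-cases the line and then collects every run of the letters a to z with `re.findall`. The helper name `main_words` is hypothetical, and the pattern assumes only unaccented ASCII letters count as letters.

```python
import re


def main_words(line):
    """Return the lower-cased runs of letters found in `line`."""
    return re.findall(r"[a-z]+", line.lower())


print(main_words('Dmitri! said "Dmitri," (and dmitri).'))
# ['dmitri', 'said', 'dmitri', 'and', 'dmitri']
```

Because anything that is not a letter acts as a delimiter, a hyphenated word such as word-concordance comes out as two words, and a contraction such as don't is split at the apostrophe.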

Below is my code and I would appreciate it if anyone could take a look at it and give me any feedback on what I am doing wrong. Thank you in advance.

import re

def main():
    stopFile = open("stop_words.txt","r")
    stopWordDict = dict()
    for line in stopFile:
        stopWordDict[line.lower().strip("\n")] = []

    hwFile = open("WarAndPeace.txt","r")
    wordConcordanceDict = dict()
    lineNum = 1
    for line in hwFile:
        wordList = re.split(" |\n|\.|\"|\)|\(", line)
        for word in wordList:
            word.strip(' ')
            if (len(word) != 0) and word.lower() not in stopWordDict:
                if word in wordConcordanceDict:
                    wordConcordanceDict[word].append(lineNum)
                else:
                    wordConcordanceDict[word] = [lineNum]
        lineNum = lineNum + 1

    for word in sorted(wordConcordanceDict):
        print (word," : ",wordConcordanceDict[word])

if __name__ == "__main__":
    main()
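One detail in the code above that may be contributing to the problem: Python strings are immutable, so a call like `word.strip(' ')` returns a new string and leaves `word` unchanged unless the result is assigned back. Likewise, `word.lower()` is only used for the stop-word check, while the dictionary key is still the original `word`. A small illustration (the sample value is made up):

```python
word = "Dmitri!"

word.strip("!")         # the result is discarded; word is unchanged
print(word)             # Dmitri!

word = word.strip("!")  # assign the result back to keep it
print(word)             # Dmitri

word = word.lower()     # same idea for lower-casing the key
print(word)             # dmitri
```

Note also that `strip` only removes characters from the two ends of a string, so on its own it would not split apart words joined by punctuation in the middle.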

As another example and for reference, here is the small file I tested with, along with the small list of stop words; this worked perfectly.

stop_words_small.txt file

a, about, be, by, can, do, i, in, is, it, of, on, the, this, to, was

small_file.txt

This is a sample data (text) file to
be processed by your word-concordance program.

The real data file is much bigger.

correct output

bigger: 4
concordance: 2
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
word: 2
your: 2
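For comparison, here is a minimal sketch of the whole pipeline run against the small example. It is one way to get this output, not a definitive solution. It assumes the sample text is laid out on four lines with the third one blank (the exact line breaks are inferred from the line numbers in the correct output), and it uses `io.StringIO` in place of a real file so the example is self-contained. The function name `build_concordance` is hypothetical.

```python
import io
import re


def build_concordance(lines, stop_words):
    """Map each non-stop word to the list of line numbers it appears on."""
    concordance = {}
    # enumerate() numbers every line, so blank lines still count.
    for line_num, line in enumerate(lines, start=1):
        # Lower-case first, then take every run of letters as a word.
        for word in re.findall(r"[a-z]+", line.lower()):
            if word not in stop_words:
                concordance.setdefault(word, []).append(line_num)
    return concordance


stop_words = dict.fromkeys(
    "a about be by can do i in is it of on the this to was".split()
)
# Assumed line layout of small_file.txt; line 3 is blank.
sample = io.StringIO(
    "This is a sample data (text) file to\n"
    "be processed by your word-concordance program.\n"
    "\n"
    "The real data file is much bigger.\n"
)

concordance = build_concordance(sample, stop_words)
for word in sorted(concordance):
    print(f"{word}: {' '.join(map(str, concordance[word]))}")
```

Passing an open file object for WarAndPeace.txt instead of the `StringIO` object should work the same way, since both yield one line per iteration.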
