Question
This assignment deals with loading a simple text file into Python structures: lists, arrays, and data-frames.
a. Locate a movie script, play script, poem, or book of your choice in .txt format*. Project Gutenberg is a great resource for this if you're not sure where to start.
b. Load the words of this text, one by one, into a one-dimensional, sequential Python list (i.e. the first word should be the first element in the list, while the last word should be the last element). It's up to you how to deal with special characters -- you can remove them manually, ignore them during the loading process, or even count them as words, for example.
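One possible way to approach part b is sketched below. It is only an illustration: the helper name `load_words`, the file name `script.txt`, and the choice to treat a word as a run of letters, digits, or apostrophes (dropping all other special characters) are assumptions, not requirements of the assignment.

```python
import re

def load_words(text):
    """Return the words of `text` in reading order, lowercased."""
    # Assumption: a "word" is a run of letters, digits, or apostrophes;
    # every other character (punctuation, dashes) acts as a separator.
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical usage with a file named "script.txt":
# with open("script.txt", encoding="utf-8") as f:
#     words = load_words(f.read())

sample = "The balcony? The balcony -- it's the east!"
words = load_words(sample)
print(words)
```

Lowercasing at load time is a design choice that makes the later case-independent counting simpler; keep the original casing instead if you need it for display.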
c. Use your list to create and print a two-column pandas data-frame with the following properties:
   i. Each index should mark the first occurrence of a unique word (independent of case) in the text.
   ii. The first column for each index should represent the word in question at that index.
   iii. The second column should represent the number of times that particular word appears in the text.
Ex: if the first word in your text is "the" which occurs 500 times and the second is "balcony" which only appears twice, your data-frame should begin like the following:
| | Word | Count |
|---|---|---|
| 1 | "the" | 500 |
| 2 | "balcony" | 2 |
| ... | ... | ... |
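A minimal sketch of part c follows. It assumes that "index" means the 1-based position in the word list at which each unique word first appears; if your instructor intends a simple running number instead, the default pandas index would do. The function name `word_table` and the sample list are hypothetical.

```python
import pandas as pd

def word_table(words):
    """Build a Word/Count data-frame indexed by first-occurrence position."""
    first_seen = {}  # word -> 1-based position of its first occurrence
    counts = {}      # word -> number of occurrences
    for position, word in enumerate(words, start=1):
        key = word.lower()  # counting is independent of case
        if key not in first_seen:
            first_seen[key] = position
            counts[key] = 0
        counts[key] += 1
    # Dicts keep insertion order, so rows appear in order of first occurrence.
    return pd.DataFrame(
        {"Word": list(first_seen), "Count": [counts[w] for w in first_seen]},
        index=list(first_seen.values()),
    )

words = ["the", "balcony", "The", "door", "the", "balcony"]
df = word_table(words)
print(df)
```

In this small sample, "door" first appears as the fourth word, so its row carries index 4 under the stated assumption.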
d. The co-occurrence of two events represents the likelihood of the two occurring together. A simple example of co-occurrence in texts is a predecessor-successor relationship -- that is, the frequency with which one word immediately follows another. The word "cellar," for example, is commonly followed by "door."
For this task, you are to construct a 2-dimensional predecessor-successor co-occurrence array as follows**:
   i. The row index corresponds to the word at the same index in part c.'s data-frame.
   ii. The column index likewise corresponds to the word at the same index in the data-frame.
   iii. The value in each array location represents the number of times the word corresponding to the row index immediately precedes the word corresponding to the column index in the text.
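The construction in part d could be sketched as follows. This assumes that row and column `k` refer to the `k`-th unique word in order of first occurrence (that is, the row order of the part c data-frame), and it uses a dense NumPy array, which is only practical for short texts. The name `cooccurrence` and the sample words are hypothetical.

```python
import numpy as np

def cooccurrence(words):
    """Return (vocab, matrix) where matrix[i, j] counts vocab[i] -> vocab[j]."""
    lowered = [w.lower() for w in words]
    # Unique words in order of first occurrence.
    vocab = list(dict.fromkeys(lowered))
    row_of = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
    # Pair every word with its immediate successor.
    for prev, nxt in zip(lowered, lowered[1:]):
        matrix[row_of[prev], row_of[nxt]] += 1
    return vocab, matrix

words = ["the", "cellar", "door", "the", "cellar", "door", "the", "balcony"]
vocab, matrix = cooccurrence(words)
print(vocab)
print(matrix)
```

Every adjacent pair is counted exactly once, so the entries of the array sum to the number of words minus one, which makes a handy sanity check.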
e. Based on the data-frame derived in part c. and the array derived in part d., determine and print the following information:
   i. The first occurring word in the text.
   ii. The unique word whose first occurrence comes last in the text.
   iii. The most common word.
   iv. The least common word.
   v. Words A and B such that B follows A more than any other combination of words.
   vi. The word that most commonly follows the least common word.
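The six lookups in part e might be sketched like this. The block rebuilds a small data-frame and array from a hypothetical sample so that it runs on its own; here the data-frame uses the default 0-based index in order of first occurrence, which is an assumption. Ties are broken in favor of the earliest word, because `idxmax`, `idxmin`, and `argmax` all return the first match.

```python
from collections import Counter

import numpy as np
import pandas as pd

words = ["the", "cellar", "door", "and", "the", "cellar", "door",
         "or", "the", "balcony", "the", "cellar"]

# Rebuild the part c table and part d array for this sample.
vocab = list(dict.fromkeys(words))  # unique words, first-occurrence order
tally = Counter(words)
df = pd.DataFrame({"Word": vocab, "Count": [tally[w] for w in vocab]})
row_of = {w: i for i, w in enumerate(vocab)}
matrix = np.zeros((len(vocab), len(vocab)), dtype=int)
for prev, nxt in zip(words, words[1:]):
    matrix[row_of[prev], row_of[nxt]] += 1

# i. First occurring word, ii. unique word whose first occurrence is last.
first_word = df["Word"].iloc[0]
last_new_word = df["Word"].iloc[-1]

# iii./iv. Most and least common words (first one wins on ties).
most_common = df.loc[df["Count"].idxmax(), "Word"]
least_common = df.loc[df["Count"].idxmin(), "Word"]

# v. The pair (A, B) with the largest entry in the array.
a, b = np.unravel_index(matrix.argmax(), matrix.shape)
top_pair = (vocab[a], vocab[b])

# vi. Most frequent successor of the least common word; None if that
# word only ever appears at the very end of the text.
least_row = matrix[row_of[least_common]]
follower = vocab[least_row.argmax()] if least_row.any() else None

print(first_word, last_new_word, most_common, least_common, top_pair, follower)
```

In real texts many words occur exactly once, so the "least common word" is rarely unique; reporting all tied words would be a reasonable alternative to taking the first.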
* If you have experience with and prefer another format, feel free to use it. Also, I recommend sticking to relatively short documents (avoid extremely long novels).
Use Python.
Reference code:
    file_name = input("Enter file name: ")
    d = dict()
    print("File contents are:")
    # The with-block closes the file even if an error occurs.
    with open(file_name, "r") as file1:
        for line in file1:
            print(line, end='')
            # Lowercase the line and split on any run of whitespace, so blank
            # lines and repeated spaces do not produce empty "words".
            words = line.lower().split()
            for word in words:
                if word in d:
                    d[word] = d[word] + 1
                else:
                    d[word] = 1
    print("Number of occurrences of each word in given text file is:")
    print("===============")
    for key in d:
        print(key, ":", d[key])