
Use Python:

One of the goals of the Cliff Note Generator was to generate a list of characters in a novel. We can actually use our current skill set and include the techniques discussed in the nGrams lesson to extract (with a good level of accuracy) the main characters of a novel. We will also make some improvements with some of the parsing, cleaning, and preparation of the data. It would be best to read this entire lesson before doing any coding. Also note that this lesson is a bit different in that you will be responsible for more of the code writing. What is being specified is a minimum. We highly recommend that you decompose any complex processes into multiple functions.

Preparation

Before doing anything, read through the entire set of directions first. You will get a sense of the restrictions and overall goals.

Step 1: Start a new Colab notebook and name it INFO490-FindingCharacters (see the previous coding challenge for details if needed).

Step 2: Create a new Code cell.

The tab starter.py has all the functions you will need. You can copy the contents and fill in the functions from previous lessons (ngrams, split_text_into_tokens). Note that normalize_token, read_remote and load_stop_words are already done. You should use these in your solution.
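For reference, here is a minimal sketch of what those previously written helpers might look like. This is an assumption rather than the official starter code; your own versions from the ngrams lesson may differ in detail, as long as they behave the same way.

import re
from collections import Counter

# Hedged sketches of the helpers this lesson assumes you already have.
def split_text_into_tokens(text):
    # one plausible tokenization: runs of letters, keeping apostrophes
    return re.findall(r"[A-Za-z']+", text)

def bi_grams(tokens):
    # adjacent (tokens[i], tokens[i+1]) pairs, built without zip
    # (the submission guidelines forbid zip)
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

def top_n(items, n):
    # count the items and return the n most common (item, count) pairs
    return Counter(items).most_common(n)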

Update the function remove_stop_words so that a token is removed if the token (regardless of case) is a stop word.
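Here is a minimal sketch of one way to write it, assuming the list returned by load_stop_words is already lowercase (check your starter.py):

def remove_stop_words(tokens, stop_words):
    # a set makes membership tests fast on a whole novel
    stop_set = set(stop_words)
    # drop a token if its lowercase form is a stop word
    return [t for t in tokens if t.lower() not in stop_set]

Converting the stop list to a set is a design choice, not a requirement; list membership checks would also work, just more slowly.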

Step 3: Test your code.

The following should now work:

def demo_test():
    text = read_remote(HUCK_URL)
    stop = load_stop_words()
    tokens = split_text_into_tokens(text)
    cleaned = remove_stop_words(tokens, stop)
    grams = bi_grams(cleaned)
    print(top_n(grams, 10))

demo_test()

You should see the following:

[(('old', 'man'), 49), (('Mary', 'Jane'), 41), (('Tom', 'Sawyer'), 40), (('Aunt', 'Sally'), 39), (('pretty', 'soon'), 37), (('never', 'see'), 33), (('ever', 'see'), 29), (('Jim', 'said'), 28), (('every', 'time'), 25), (('come', 'along'), 24)]

Note how this compares to the output when we didn't account for the case of the word.

Once that is working, comment out the call to demo_test.

Finding the Characters

With this machinery in place, we are ready to find characters in a novel (I hope you are reading this with great anticipation) using different strategies. Each strategy has a corresponding function for you to implement.

Method #1

One attribute (or feature) of the text we are analyzing is that proper nouns are capitalized. Let's capitalize on this and find all single words in the text whose first character is an uppercase letter and that are NOT stop words.

Create and define the function find_characters_v1(text, stoplist, top):

  • Tokenize and clean the text using the function split_text_into_tokens

  • Filter the tokens so the list has no stop words in it (regardless of case). The parameter stoplist is the array returned from load_stop_words

  • Create a new list of tokens (keep the order) of words that are capitalized. You can test the first character of the token.

  • Return the top words as a list of tuples (the first element is the word, the second is the count)
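Here is one possible shape for find_characters_v1, assuming the helper sketches shown earlier; treat it as a sketch to check your understanding, not the required implementation:

def find_characters_v1(text, stoplist, top):
    tokens = split_text_into_tokens(text)
    cleaned = remove_stop_words(tokens, stoplist)
    # keep tokens whose first character is an uppercase letter;
    # t[:1] avoids an IndexError on any empty token
    capitalized = [t for t in cleaned if t[:1].isupper()]
    return top_n(capitalized, top)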

For Huck Finn, you should get the following (the output is formatted for clarity):

text = read_remote(HUCK_URL)
stop = load_stop_words()
v1 = find_characters_v1(text, stop, 15)
print(v1)

You should see:

[('Jim', 341),
 ('Well', 318),
 ('Tom', 217),
 ('Huck', 70),
 ('Yes', 68),
 ('Oh', 65),
 ('Miss', 63),
 ('Mary', 60),
 ('Aunt', 53),
 ('Now', 53),
 ('Sally', 46),
 ('CHAPTER', 43),
 ('Sawyer', 43),
 ('Jane', 43),
 ('Buck', 38)]

Notice that with this very simple method we found 8 characters in the top 15. We also found an Aunt and a Miss. You might be inclined to start fiddling with the stop words. The ones you could add are 'CHAPTER' and 'Well' -- the interjection -- since we know those words do not provide much content in this context. But as we mentioned in the stop words lesson, that's a dangerous game, since other novels might rely on some of these words.

Method #2

Another feature of characters in a novel is that many of them have two names (Tom Sawyer, Aunt Polly, etc).

Create and define the following function:

find_characters_v2(text, stoplist, top):

  • Tokenize and clean the text using the function split_text_into_tokens

  • Convert the list of tokens into a list of bigrams (using your bi_grams method)

  • Keep a bigram only if both words are capitalized (just check the first character of each)

  • Neither word (in either lower or upper case) should be in stoplist

  • Remember stoplist could be the empty list

  • Return the top bigrams as a list of tuples: The first element is the bigram tuple, the second is the count

Note that we are NOT removing the stop words from the text. We are now using the stop words to make decisions about which bigrams to keep. The stop words lesson has more details on this as well.
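Under the same assumptions as the earlier sketches, find_characters_v2 might look like this; note the stop list filters bigrams rather than being removed from the text first:

def find_characters_v2(text, stoplist, top):
    tokens = split_text_into_tokens(text)
    stop_set = set(w.lower() for w in stoplist)   # works for [] too
    keep = []
    for w1, w2 in bi_grams(tokens):
        both_capitalized = w1[:1].isupper() and w2[:1].isupper()
        neither_stop = (w1.lower() not in stop_set and
                        w2.lower() not in stop_set)
        if both_capitalized and neither_stop:
            keep.append((w1, w2))
    return top_n(keep, top)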

With the text of Huckleberry Finn, the following is the output with stopwords being the empty list:

v2 = find_characters_v2(text, [], 15)
print(v2)

[(('Mary', 'Jane'), 41),
 (('Tom', 'Sawyer'), 40),
 (('Aunt', 'Sally'), 39),
 (('Miss', 'Watson'), 20),
 (('Miss', 'Mary'), 19),
 (('Mars', 'Tom'), 16),
 (('Huck', 'Finn'), 15),
 (('Uncle', 'Silas'), 15),
 (('Aunt', 'Polly'), 11),
 (('Judge', 'Thatcher'), 10),
 (('But', 'Tom'), 9),
 (('Ben', 'Rogers'), 8),
 (('So', 'Tom'), 8),
 (('St', 'Louis'), 7),
 (('Miss', 'Sophia'), 7)]

That found 11 characters in the top 15 entries of the bigram frequency table. This method is pretty good, and it didn't even consider stop words. What happens if you do consider stop words?

Note: in order to match these outputs, use the collections.Counter class. Otherwise, it's possible that your version of sorting will handle those tuples with equal counts differently (unstable sorting).
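As a toy illustration (not part of the assignment), Counter.most_common breaks ties by first appearance in the input, which is why the expected outputs above are reproducible:

from collections import Counter

# 'b' and 'a' both appear twice; ties come out in first-seen order
counts = Counter(['b', 'a', 'b', 'a', 'c'])
print(counts.most_common(3))   # [('b', 2), ('a', 2), ('c', 1)]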

Titles

Another feature of characters is that many of them have a title (also called an honorific) preceding them (Dr., Mr., Mrs., Miss, Ms., Rev., Prof., Sir, etc.). We will look for bigrams that have these titles. However, we will NOT hard-code the titles. We will let the data tell us what the 'titles' are.

Here's the process for self-discovering titles:

  • Let's define a title as a capital letter followed by 1 to 3 lower case letters followed by a period. This is not perfect, but it captures a good majority of them.

  • Create a list named title_tokens of all the tokens in the text that match the above criteria (hint: use regular expressions), for example:

title_tokens = regex1.findall(text)

  • Now we need to remove words that might have ended a sentence and thus share those same title characteristics (e.g. Tom. Bill. Pat. etc.). These names could have appeared in a sentence like "Please go Tom." Tom is NOT a title, but it would have been found by our definition.

  • Use the same definition for titles (above), but instead of ending with a period, the token must end with whitespace. The idea is that hopefully somewhere in the text the same name will appear without a period. It's very likely that you would encounter 'Tom' somewhere in the text without a period, but it's unlikely that Mr., Mrs., Dr., etc. would appear without one. Let's call this list pseudo_titles.

pseudo_titles = regex2.findall(text)

  • The set of titles is essentially the first list of tokens, title_tokens, with all the tokens in the second set (pseudo_titles) removed. For example, the first list might have 'Dr.', 'Tom.' and 'Mr.' in it, and the second set might have 'Tom' and 'Ted' in it. The final title list would include 'Dr' and 'Mr'.

  • Write a function named get_titles that encapsulates the above logic; it should return a list of titles (start from the stub below; a sketch follows it)

def get_titles(txt):
    return []  # see process above
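One hedged way to fill in that stub is sketched below. The two regular expressions are assumptions that follow the definition above (a capital letter, 1 to 3 lowercase letters, then a period or whitespace); you may need to tune them until you get the expected seven titles for Huckleberry Finn.

import re

def get_titles(txt):
    # candidate titles: capital letter + 1-3 lowercase letters + period
    title_tokens = re.findall(r'\b[A-Z][a-z]{1,3}\.', txt)
    # the same shape followed by whitespace instead of a period; words
    # that also appear un-punctuated are probably names, not titles
    pseudo_titles = set(re.findall(r'\b([A-Z][a-z]{1,3})\s', txt))
    titles = set()
    for tok in title_tokens:
        word = tok[:-1]              # strip the trailing period
        if word not in pseudo_titles:
            titles.add(word)
    return sorted(titles)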

Once you have get_titles working, the following should work:

titles = get_titles(text)
print(titles)

You should get 7 computed titles in Huckleberry Finn:

['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']

Method #3

Create and define the following function:

find_characters_v3(text, stoplist, top):

  • Tokenize and clean the text

  • Convert the list of tokens into a list of bigrams

  • Keep only the bigrams where the first word in the bigram is a title and the second word is capitalized (hint: use the output of get_titles)

  • The second word (in either lower or upper case) should not be in stoplist

  • Return the top bigrams as a list of tuples: The first element is the bigram tuple, the second is the count
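A sketch of find_characters_v3 under the same assumptions as the earlier sketches, reusing get_titles to decide which bigrams to keep:

def find_characters_v3(text, stoplist, top):
    titles = set(get_titles(text))
    stop_set = set(w.lower() for w in stoplist)
    keep = []
    for w1, w2 in bi_grams(split_text_into_tokens(text)):
        # first word must be a discovered title; the second must be a
        # capitalized word that is not in the stop list
        if (w1 in titles and w2[:1].isupper()
                and w2.lower() not in stop_set):
            keep.append((w1, w2))
    return top_n(keep, top)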

 
 

v3 = find_characters_v3(text, stop, 15)
print(v3)

For Huck Finn, you should get the following:

[(('St', 'Louis'), 7),
 (('Mr', "Lothrop's"), 6),
 (('Mrs', 'Phelps'), 4),
 (('St', 'Petersburg'), 3),
 (('Dr', 'Robinson'), 3),
 (('Mr', 'Garrick'), 2),
 (('Mr', 'Kean'), 2),
 (('Mr', 'Wilks'), 2),
 (('Mr', 'Mark'), 1),
 (('Mrs', 'Judith'), 1),
 (('Mr', 'Parker'), 1),
 (('Dr', "Gunn's"), 1),
 (('Col', 'Grangerford'), 1),
 (('Dr', 'Armand'), 1),
 (('St', 'Jacques'), 1)]

Clearly, that yields a lot of good information, although looking at the counts, none of the entries are that prominent. We also found a few places as well as people.

Machine Learning?

You may have heard of the NLTK Python library, which is a popular choice for processing text. Libraries like NLTK and spaCy include models that were built by processing large amounts of text. We will use both of these NLP libraries to do something similar in another lesson. These libraries have models, built from large data sets, that extract entities (called NER, for named entity recognition). These entities include organizations, people, places, and money.

The models that were built essentially learned which features (like capitalization or title words) were important when analyzing text, yielding a model that attempts to do the same thing we did here. However, we hard-coded the rules (use bigrams, remove stop words, look for capital letters, etc.). This is sometimes referred to as a rule-based system: the analysis is built on manually crafted rules.

In machine learning (sometimes referred to as an automatic system), some algorithms essentially learn which features are important (or how much weight to apply to each feature) to build a model, and then use the model to classify tokens as named entities. The biggest issue is that these models could be built from a very different text source (e.g. journal articles or Twitter feeds) than the one you are processing. Also, the models themselves can require a large set of resources (memory, CPU) that you may not have available. What you built in this lesson is efficient, fast, and fairly accurate.

Submission Guidelines:

You will upload your notebook to Gradescope.com for grading.

  • submit your code to the 'Finding Characters' assignment on Gradescope

  • do NOT use any external Python library other than requests, collections and re (nothing else).

  • do NOT use the zip function (we will use it soon, though)

  • try to solve all of these problems by yourself with your own brain and a piece of paper. Surely there are solutions available, but copying will not make you a better programmer. This is not the time to copy or share code.

  • You should test the code you are writing against sample sentences instead of the full text; once you have it working, then try the full data set.

  • you are free to write as many helper functions as you need. The following functions will be tested:

get_titles

find_characters_v[1-3]

  • each of the find_characters_v functions should use your top_n function

  • the output of find_characters_v[1-3] should always be a list of tuples AND match the example output before you 'run tests'
