Question

As mentioned in the introduction, this is a somewhat simplified form of this program. While we will have all the essential parts of a working piece of software, I've reduced those parts to their list, string, and dictionary cores. The following functions are required. Full descriptions will be given later.

my_split(text): This function takes a text sample and returns a list of the words in that text sample (splitting the one string into a list of strings).

count(words): This function takes a list of words and returns a dictionary whose keys are the words in the list, and whose values are the number of times each word appears.

word_count_similarity(count1, count2): This function takes two dictionaries, each a count of words, and computes the "cosine similarity" equation over the two dictionaries. This returns a number from 0 to 1, with higher numbers indicating a more similar word distribution.

best_guess(known_author_counts, unknown_count): This function takes two dictionaries. The first contains word counts with known authors: the keys are author names, and the values are word count dictionaries for those authors. The second is a single word count dictionary with an unknown author. This function will return the most likely author's name, based on the word count similarity with the known authors' writing samples.

5.1 my_split

The my_split function has one parameter: text, a large string containing a long text sample. In real text, this string may contain many numbers and all the full complexities of English text, punctuation, and formatting. For this lab, however, you can assume that the text string contains only letters (uppercase and lowercase, English alphabet), punctuation (,.?!), and spaces.

The my_split function should return a list of strings containing the words from the text sample, in the order they appear. You can assume that words are separated with spaces. Words may or may not contain punctuation (,.?!) after the last letter, which should be removed.
So given the input "a dog, a goat, and an apple!", my_split should return ["a", "dog", "a", "goat", "and", "an", "apple"].

Note: The Python split function does much of this behavior and is therefore NOT ALLOWED in this lab. I want you to practice splitting this string yourself.

Hints:
- Try to figure out how to isolate just the first word into a separate string. Test this code on its own before trying to put it into a loop.
- The slicing operator can be used with strings to get a substring containing only one word (so long as you know where the word starts and stops).
- This problem is easier if you "remove" words from the text string as you add them to a list of results; trying to do this without updating the text string is a fair deal harder. (I.e., go from "a happy dog" to "a" in one string and "happy dog" in another string; the next loop can then start with "happy dog" and break it down further.)
- Strings, like most sequences, have many useful functions, including those that will find indexes of certain letters.
- You may find it helpful to write a function to "clean" a word, removing punctuation, etc., rather than trying to include that logic directly in my_split.

5.2 count(words)

The count function has one parameter: a list of strings (words). This list of words represents a text sample, and would be the return value of one or more calls to my_split. This function should return a dictionary with strings as keys and integers as values. The keys should be the unique words in the input list, and the value for each key should be the number of times that word appears in the list. As an example, if given the input ["a", "dog", "a", "goat", "and", "an", "apple"], count should return {'a': 2, 'dog': 1, 'goat': 1, 'and': 1, 'an': 1, 'apple': 1}.

It's quite possible that there is a built-in Python function to do this. Python has a fantastic library of built-in functions.
You are, of course, not allowed to use that; I expect you to manually loop over the words list, counting the words.

A note on tests: The test file has two types of tests for this. The first test simply compares the output of your function with the expected output, printing True (equal) or False (unequal). While this is sufficient for most purposes, it is not helpful for debugging, as it doesn't help you discover what is counted wrong! There is a boolean variable at the top of the test file that can be set to True to print all words, in order, from your dictionary, which can then be compared with the output from the reference solutions. This can help you debug, but is quite long.

5.3 word_count_similarity(count1, count2)

The word_count_similarity function takes two parameters, count1 and count2, each a word count dictionary as would be returned by count. This function should return a double value from 0 to 1 representing the cosine similarity metric computed between these two counts. The cosine similarity metric is only tenuously connected to the trigonometric function cos(θ); instead it comes from machine learning metrics and the study of high-dimensional vectors. Mathematically, the equation to compute is:

$$\frac{\sum_{\text{word} \in \text{count1} \cap \text{count2}} \text{count1}[\text{word}] \cdot \text{count2}[\text{word}]}{\left(\sum_{\text{word} \in \text{count1}} \text{count1}[\text{word}]^2\right)\left(\sum_{\text{word} \in \text{count2}} \text{count2}[\text{word}]^2\right)}$$

However, since that equation is hard to read without prior exposure to the mathematical syntax, I will also summarize it in pseudocode:

1. Compute S1 as the sum of the square of each word count in count1.
2. Compute S2 as the sum of the square of each word count in count2.
3. Compute S3 as the sum, over each word that is in both count1 and count2, of (count1[word] * count2[word]).
4. Return S3 / (S1 * S2).

For example, given count1 = {'a': 2, 'b': 3, 'c': 1} and count2 = {'a': 1, 'b': 3, 'd': 4}, we would have:

S1 = 2^2 + 3^2 + 1^2 = 4 + 9 + 1 = 14
S2 = 1^2 + 3^2 + 4^2 = 1 + 9 + 16 = 26
S3 = 2*1 + 3*3 = 2 + 9 = 11
result = 11 / (14 * 26) = 11 / 364 = 0.03021978021

5.4 best_guess

The best_guess function takes two parameters:

known_author_counts: This is a dictionary containing word counts by known authors.
The keys of this dictionary would be strings denoting an author's name, and the values would be word count dictionaries, as would be returned by count.

unknown_count: This is a single word count dictionary, as would be returned by count, representing the word counts of a sample of text written by an unknown author.

This function should return a string: one of the known authors' names. The known author name that should be returned is the one whose word count dictionary is most similar to the unknown sample (as measured by the word_count_similarity function). In this way we guess the author whose word choice is most similar to the word choices made in the unknown sample.
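
The requirements for my_split and count can be sketched roughly as follows. This is one possible approach, not a reference solution: it follows the hint about repeatedly peeling the first word off the front of the text, and it uses str.find and slicing instead of the disallowed split. The helper name clean_word and the constant PUNCTUATION are hypothetical names chosen for illustration, and skipping empty pieces caused by repeated spaces is an assumption the lab text does not spell out.

```python
PUNCTUATION = ",.?!"  # assumption: the only punctuation that can appear


def clean_word(word):
    """Hypothetical helper: strip punctuation from the end of one word."""
    end = len(word)
    while end > 0 and word[end - 1] in PUNCTUATION:
        end -= 1
    return word[:end]


def my_split(text):
    """Return the words of text, in order, without using str.split."""
    words = []
    while text != "":
        space = text.find(" ")  # index of the next space, or -1 if none
        if space == -1:
            first, text = text, ""  # last word: nothing remains afterwards
        else:
            first, text = text[:space], text[space + 1:]
        first = clean_word(first)
        if first != "":  # assumption: ignore empty pieces (repeated spaces)
            words.append(first)
    return words


def count(words):
    """Return a dictionary mapping each word to the number of times it appears."""
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts


sample = my_split("a dog, a goat, and an apple!")
print(sample)         # ['a', 'dog', 'a', 'goat', 'and', 'an', 'apple']
print(count(sample))  # {'a': 2, 'dog': 1, 'goat': 1, 'and': 1, 'an': 1, 'apple': 1}
```

Updating text on every pass keeps the loop simple: each iteration only has to find the first space in whatever is left, so no separate start index needs to be tracked.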

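The similarity pseudocode and the best_guess description can be sketched in the same spirit. This sketch assumes that step 4 of the pseudocode returns S3 / (S1 * S2), which is the reading that agrees with the worked example (11 / 364). The textbook cosine similarity divides by the square root of S1 * S2 instead, so check the result against the lab's reference tests before relying on it. The author labels "Author A" and "Author B" are made up for illustration, and two further choices are assumptions the lab text does not state: an empty count yields 0.0, and ties go to the author seen first.

```python
def word_count_similarity(count1, count2):
    """Similarity of two word count dictionaries, following the lab's four steps."""
    s1 = 0
    for word in count1:
        s1 += count1[word] ** 2  # step 1: sum of squared counts in count1
    s2 = 0
    for word in count2:
        s2 += count2[word] ** 2  # step 2: sum of squared counts in count2
    s3 = 0
    for word in count1:
        if word in count2:       # step 3: only words present in both
            s3 += count1[word] * count2[word]
    if s1 == 0 or s2 == 0:
        return 0.0               # assumption: an empty count matches nothing
    return s3 / (s1 * s2)        # step 4, as assumed in the lead-in


def best_guess(known_author_counts, unknown_count):
    """Return the known author whose counts are most similar to unknown_count."""
    best_author = None
    best_score = -1.0            # below any possible similarity
    for author in known_author_counts:
        score = word_count_similarity(known_author_counts[author], unknown_count)
        if score > best_score:   # strict comparison: first author wins ties
            best_author = author
            best_score = score
    return best_author


count1 = {'a': 2, 'b': 3, 'c': 1}
count2 = {'a': 1, 'b': 3, 'd': 4}
print(word_count_similarity(count1, count2))  # 11 / 364, about 0.0302

known = {"Author A": count1, "Author B": {'x': 5, 'y': 1}}  # hypothetical authors
print(best_guess(known, count2))  # Author A
```

Author B shares no words with the unknown sample, so its S3 is 0 and Author A is returned.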