
Question


Question 2: Natural Language Processing [40 points, 5 points each]

This question will require you to do some minimum coding or scripting to get the answers, but it is not a programming question. Do not turn in any code. Instead, simply put your answers in the same pdf file as the other questions. You may use any programming language or tools.

Download the data (zip) file provided with the homework assignment. Unzip it on your computer; you should see 363 text files. These are essays from homework 1. We call the whole collection the corpus.

1. Briefly describe, in English, the most convenient way you can think of for a computer program to break the corpus into "computer words." For example, if you use Java you may consider Scanner; if you use the unix command line you may use wc. We want you to use the simplest method that you can think of. It is OK if these computer words do not fully agree with our natural language definition of words. For example, you may have a computer word like "However," with that comma attached. Your method is a crude "natural language tokenizer." When we mention "word" we mean your computer word.
2. Under your method, how many computer word tokens (occurrences) are there in the corpus? How many computer word types (distinct words) are there in the corpus?
3. Sort the word types by their counts in the corpus, from large to small. List the top 20 word types and their counts.
4. Pick 20 bottom word types (they should all have count 1) and list them. You may want to include some strange ones, if any.
5. Let the word type with the largest count be rank 1, the word type with the second largest count be rank 2, and so on. If multiple word types have the same count, you may break the rank tie arbitrarily. This produces (r_1, c_1), ..., (r_n, c_n), where r_i is the rank of the ith word type, c_i is the corresponding count of that word type in the corpus, and n is the number of word types. Plot r on the x-axis and c on the y-axis, namely each (r_i, c_i) is a point in that 2D space (you can choose to connect the points or not).
6. Plot log(r) on the x-axis and log(c) on the y-axis. You may choose the base.
7. Briefly explain what the shapes of the two curves mean.
8. Discuss two potential major issues with your computer words, if one wants to use them for natural
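The counting steps in parts 1 through 6 can be sketched in Python as follows. This is a minimal illustration, not the expected answer: it assumes whitespace splitting as the "simplest method", and the function names (`tokenize`, `count_corpus`, `rank_counts`, `read_corpus`) and the directory name `corpus` are hypothetical. The tiny two-sentence corpus stands in for the 363 essay files.

```python
import math
from collections import Counter
from pathlib import Path


def tokenize(text):
    """Split on runs of whitespace; punctuation stays attached ("However,")."""
    return text.split()


def count_corpus(texts):
    """Return a Counter mapping each word type to its number of tokens."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def rank_counts(counts):
    """Return (rank, count) pairs; rank 1 has the largest count, ties arbitrary."""
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered, start=1))


def read_corpus(directory):
    """Read every .txt file under a directory (the path is an assumption)."""
    return [p.read_text(encoding="utf-8", errors="replace")
            for p in sorted(Path(directory).glob("*.txt"))]


if __name__ == "__main__":
    # Stand-in corpus; swap in read_corpus("corpus") for the real essay files.
    texts = ["However, the cat saw the dog.", "The dog saw the cat"]
    counts = count_corpus(texts)
    print("tokens:", sum(counts.values()))   # total occurrences
    print("types:", len(counts))             # distinct computer words
    print("top:", counts.most_common(3))     # largest counts first
    for r, c in rank_counts(counts):
        # (r, c) feeds the plot in part 5; (log r, log c) feeds part 6.
        print(r, c, math.log(r), math.log(c))
```

Note that "The" and "the", and "dog" and "dog.", are counted as separate word types under this scheme, which is the kind of issue part 8 asks about. The (rank, count) pairs can be passed to any plotting tool.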
