
Question


Problem 3. NLP problem setup

Encyclopedia Britannica's 3rd edition contains approximately ten thousand articles. After it was scanned and converted to text using optical character recognition software, you are given a segment of it in a single text file. The file contains 100,000 text lines / 900,000 words / 300 articles, and has been manually marked up for article start and article finish.

For your reference, excerpts from the raw text and the marked text are given in the files brit3-excerpt.txt and brit3-excerpt-marked.txt, respectively. Feel free to open these files in your favorite text editor and have a look.

Instructions: For each of the questions in each of the two problems below, give a 1-2 sentence answer. You don't have to write any code for this problem. Please fill out your answers in the cells below (as markdown text). Note that this problem does not have a single best answer. Use your imagination and be creative!

3.1 Imagine that you need to build a system that would split the given text into articles. Describe how you can cast this task as a classification problem:

  1. What are the instances that you will need to classify?
  2. What are the labels for the instances that your classification function will need to assign?
  3. Assuming you use 2/3 of your marked-up data for training, how many instances will you have in your training set?
  4. Give at least 5 examples of boolean features you might wish to include when building such a classifier.
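
The problem asks for no code, but one possible framing of the four questions can be sketched to make them concrete. The sketch below assumes that the instances are text lines, that the labels are "article start" versus "other", and that headwords tend to appear in capitals followed by a comma. The function name `line_features` and every feature in it are hypothetical illustrations, not taken from the excerpt files, and other framings (for example, classifying words instead of lines) are equally valid.

```python
def line_features(prev_line: str, line: str, next_line: str) -> dict[str, bool]:
    """Boolean features for one line; all features are illustrative guesses."""
    stripped = line.strip()
    first_word = stripped.split()[0] if stripped else ""
    bare_first_word = first_word.strip(",.;:")
    return {
        # A blank line before this one may signal a break between articles.
        "prev_line_blank": prev_line.strip() == "",
        # Assumed convention: a headword printed in capitals opens an article.
        "first_word_all_caps": len(bare_first_word) > 1 and bare_first_word.isupper(),
        # Assumed convention: the headword is followed by a comma.
        "first_word_ends_with_comma": first_word.endswith(","),
        # The previous line ending a sentence makes a new article more plausible.
        "prev_line_ends_with_period": prev_line.rstrip().endswith("."),
        # A lowercase start suggests the line continues a running sentence.
        "starts_with_lowercase": stripped[:1].islower(),
        # A blank next line is unlikely right after an article's first line.
        "next_line_blank": next_line.strip() == "",
    }


# Training-set size if instances are lines: 2/3 of the 100,000 marked-up lines.
TOTAL_LINES = 100_000
train_instances = TOTAL_LINES * 2 // 3

features = line_features("", "ABACUS, a counting instrument used", "by the ancients.")
```

Under the line-as-instance assumption, the training set holds roughly 66,667 instances, of which only about 200 (2/3 of the 300 articles) carry the "article start" label, so the classes would be heavily imbalanced.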

YOUR ANSWER HERE

3.2 Now imagine that you need to build a system that would both split the text into articles and identify article titles. Again, assume that the titles have been marked in your training set. How can you cast this task as a classification problem?

Please specify answers to (1), (2), (3), and (4) above for this new task.
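
For this second task, one possible framing (again an assumption, not the only answer) is to classify individual words rather than lines, with a label that marks whether a word begins a title, continues a title, or lies outside any title. The sketch below assumes each article opens with its title, so a title-begin label also marks an article boundary. The function `label_tokens`, the label names, and the `(title, body)` input shape are hypothetical and do not reflect the actual markup in brit3-excerpt-marked.txt.

```python
def label_tokens(articles: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Turn (title, body) pairs into (word, label) training instances.

    Labels (hypothetical scheme):
      B-TITLE  first word of a title, which also marks an article start
      I-TITLE  any later word of the same title
      O        any word outside a title
    """
    labelled = []
    for title, body in articles:
        for i, word in enumerate(title.split()):
            labelled.append((word, "B-TITLE" if i == 0 else "I-TITLE"))
        labelled.extend((word, "O") for word in body.split())
    return labelled


# Training-set size if instances are words: 2/3 of the 900,000 marked-up words.
TOTAL_WORDS = 900_000
train_instances = TOTAL_WORDS * 2 // 3

example = label_tokens([
    ("ABACUS", "a counting instrument"),
    ("ABBEY LANDS", "were estates"),
])
```

With words as instances, the training set would hold 600,000 instances. Boolean features could then describe the word itself (all capitals, followed by a comma) and its neighbours (previous word ends a sentence, word is first on its line).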

