
Question

Training Data
We'll be using the English news corpus (2018) as our training data. There are around 10,000 sentences. File: eng_news_2018_10K-sentences.txt
Problem: N-gram language model
You need to build two language models, a unigram model and a bigram model, both using Laplace smoothing. With each model, you will do the following tasks:
Display 10 generated sentences from this model.
Score the probabilities of the test sentences and display the average and standard deviation of these scores:
once for the provided test set
once for the test set that you curate
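The average and standard deviation asked for above can be computed with Python's standard library. This is only a sketch: the function name summarize_scores and the example values are hypothetical, and it assumes the population standard deviation is wanted (use statistics.stdev instead for the sample version).

```python
import statistics


def summarize_scores(scores):
    """Return (mean, population standard deviation) of sentence scores."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # assumption: population, not sample, std dev
    return mean, std


# Hypothetical probabilities for three test sentences (not real model output).
example_scores = [0.002, 0.004, 0.006]
mean, std = summarize_scores(example_scores)
print(f"average={mean:.6f} std={std:.6f}")
```

Sentence probabilities for real text are usually tiny, so it may be more practical to summarize log-probabilities and report those.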
Part 1: Build an n-gram language model (6 marks)
Pre-processing of the data.
Split the data for training and testing.
You need to develop an n-gram model that can handle n-grams of any order; we'll use it to look at unigrams and bigrams. Specifically, you'll write code that builds this language model from the training data and provides functions that take a sentence (formatted the same as in the training data) and return the probability assigned to that sentence by your model.
Handling of unknown words and smoothing.
Evaluating the language model.
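One possible shape for the Part 1 model is sketched below. Everything here is an assumption rather than the required design: the class name NGramLM, the min_count threshold for mapping rare words to <UNK>, and the choice to take already-tokenized sentences and add the <s> and </s> padding inside the model. Smoothing is add-one (Laplace), with the vocabulary size V counting every word that can be predicted (so </s> and <UNK>, but not <s>).

```python
import math
from collections import Counter

START, END, UNK = "<s>", "</s>", "<UNK>"


class NGramLM:
    """Add-one (Laplace) smoothed n-gram model over pre-tokenized sentences."""

    def __init__(self, n, min_count=2):
        self.n = n
        self.min_count = min_count       # words rarer than this become <UNK>
        self.vocab = set()
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-word contexts

    def _prepare(self, tokens):
        # Replace out-of-vocabulary words, then pad with boundary tokens.
        mapped = [w if w in self.vocab else UNK for w in tokens]
        k = max(self.n - 1, 1)
        return [START] * k + mapped + [END] * k

    def _ngrams(self, padded):
        # Start at index max(n-1, 1) so <s> is only context, never predicted.
        for i in range(max(self.n - 1, 1), len(padded)):
            yield tuple(padded[i - self.n + 1:i]), padded[i]

    def train(self, sentences):
        word_counts = Counter(w for s in sentences for w in s)
        self.vocab = {w for w, c in word_counts.items() if c >= self.min_count}
        self.vocab |= {UNK, END}         # <s> is excluded from the vocabulary
        for s in sentences:
            for context, word in self._ngrams(self._prepare(s)):
                self.ngram_counts[context + (word,)] += 1
                self.context_counts[context] += 1

    def prob(self, word, context=()):
        # Laplace estimate: (count(context, word) + 1) / (count(context) + V)
        v = len(self.vocab)
        return (self.ngram_counts[context + (word,)] + 1) / (self.context_counts[context] + v)

    def sentence_logprob(self, tokens):
        pairs = self._ngrams(self._prepare(tokens))
        return sum(math.log(self.prob(w, ctx)) for ctx, w in pairs)


# Toy corpus for illustration only (not the news corpus).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
unigram = NGramLM(1)
unigram.train(corpus)
bigram = NGramLM(2)
bigram.train(corpus)
print(math.exp(bigram.sentence_logprob(["the", "zebra", "sat"])))
```

Summing log-probabilities and exponentiating at the end avoids underflow on long sentences. Splitting the data into training and test sets would happen before calling train.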
Part 2: Implement Sentence Generation (4 points)
In this part, you'll implement sentence generation for your language model. Start by generating the <s> token, then sample from the n-grams beginning with <s>. Stop generating words when you hit an </s> token.
Notes:
When generating sentences for unigrams, do not count the pseudo-word <s> as part of the unigram probability mass after you've chosen it as the beginning token in a sentence.
All unigram sentences that you generate should start with one <s> and end with one </s>.
For n-grams larger than 1, the sentences you generate should start with n-1 <s> tokens. They should end with n-1 </s> tokens.
Justification of the output obtained for all the above tasks is mandatory
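The generation rules above could be sketched as follows. The function name generate_sentence, the max_len safety cap, and the toy counts are assumptions made for illustration. It samples each next word in proportion to its add-one smoothed count, never offers <s> as a candidate (so it takes no probability mass), and pads the result with the required number of boundary tokens.

```python
import random
from collections import Counter

START, END = "<s>", "</s>"


def generate_sentence(ngram_counts, vocab, n, rng, max_len=30):
    """Sample one sentence from add-one smoothed n-gram counts."""
    k = max(n - 1, 1)                # one <s> for unigrams, n-1 otherwise
    sentence = [START] * k
    # <s> is never a candidate, so it cannot be generated mid-sentence.
    candidates = sorted(w for w in vocab if w != START)
    while len(sentence) < max_len:   # assumption: cap length as a safety net
        context = tuple(sentence[-(n - 1):]) if n > 1 else ()
        weights = [ngram_counts[context + (w,)] + 1 for w in candidates]
        word = rng.choices(candidates, weights=weights, k=1)[0]
        if word == END:
            break
        sentence.append(word)
    return sentence + [END] * k      # close with one or n-1 </s> tokens


# Toy bigram counts for illustration only (not the news corpus).
corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
counts = Counter()
for s in corpus:
    padded = [START] + s + [END]
    counts.update(zip(padded, padded[1:]))
vocab = {w for s in corpus for w in s} | {END}

rng = random.Random(0)               # fixed seed so runs are repeatable
samples = [generate_sentence(counts, vocab, 2, rng) for _ in range(10)]
for tokens in samples:
    print(" ".join(tokens))
```

Because add-one smoothing gives every word a nonzero weight in every context, generated sentences will often contain unseen word pairs, which is worth discussing in the justification.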
