Question
The learning goals of this assignment are to:
Understand how to compute language model probabilities using maximum likelihood estimation.
Implement backoff.
Have fun using a language model to probabilistically generate texts.
Compare word-level language models and character-level language models.
import random
from collections import defaultdict, Counter
import numpy as np
We'll start by loading the data. The WikiText language modeling dataset is a collection of tokens extracted from the set of verified Good and Featured articles on Wikipedia.
data = {'train': [], 'valid': [], 'test': []}
for data_split in data:
    fname = "wiki.{}.tokens".format(data_split)
    with open(fname, 'r') as f_wiki:
        data[data_split] = f_wiki.read().lower().split()

vocab = list(set(data['train']))
Now have a look at the data by running this cell.
# preview a few items from each split (the preview sizes were lost in
# extraction; 10 is illustrative)
print('train : %s ...' % data['train'][:10])
print('dev   : %s ...' % data['valid'][:10])
print('test  : %s ...' % data['test'][:10])
print('first 10 words in vocab: %s' % vocab[:10])
Q: Train an n-gram language model (… pts)
Complete the following train_ngram_lm function based on the following input/output specifications. If you've done it right, you should pass the tests in the cell below.
Input:
data: the data object created in the cell above that holds the tokenized WikiText data
order: the order of the model (i.e., the "n" in "n-gram" model). If order=3, we compute p(w_i | w_{i-2}, w_{i-1}).
Output:
lm: A dictionary where the key is the history and the value is a probability distribution over the next word, computed using the maximum likelihood estimate from the training data. Importantly, this dictionary should include backoff probabilities as well; e.g., for order=3 we want to store p(w_i | w_{i-2}, w_{i-1}) as well as p(w_i | w_{i-1}) and p(w_i).
Each key should be a single string where the words that form the history have been concatenated using spaces. Given a key, its corresponding value should be a dictionary where each word type in the vocabulary is associated with its probability of appearing after the key. For example, the entry for the history 'w1 w2' should look like:

lm['w1 w2'] = {'w0': p(w0 | w1 w2), 'w1': p(w1 | w1 w2), ..., 'wV': p(wV | w1 w2)}

In this example, we also want to store lm['w2'] and lm[''], which contain the bigram and unigram distributions respectively.
Hint: You might find the defaultdict and Counter classes in the collections module to be helpful.
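To make the hint concrete, here is a small illustration (not part of the original notebook) of how defaultdict(Counter) can accumulate next-word counts per history, and how a Counter can then be normalized into an MLE distribution:

from collections import defaultdict, Counter

counts = defaultdict(Counter)          # maps history -> Counter of next words
counts['the'].update(['first', 'first', 'same'])
print(counts['the'])                   # Counter({'first': 2, 'same': 1})

# normalize the counts into a maximum likelihood probability distribution
total = sum(counts['the'].values())
dist = {w: c / total for w, c in counts['the'].items()}
print(dist)                            # {'first': 0.666..., 'same': 0.333...}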
def train_ngram_lm(data, order=3):
    """Train an n-gram language model."""
    # pad (order - 1) special tokens to the left so that
    # the first tokens in the text also have a full history
    order -= 1
    data = ['<S>'] * order + data
    lm = defaultdict(Counter)
    for i in range(len(data) - order):
        # IMPLEMENT ME
        pass
    return lm
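Since the loop body is left as an exercise, the following is a minimal sketch of one possible way to fill it in (an illustration, not the course's reference solution): count the next word under every suffix of the history, including the empty string, then normalize each Counter into an MLE distribution. The '<S>' padding token follows the skeleton above.

def train_ngram_lm_sketch(data, order=3):
    order -= 1
    data = ['<S>'] * order + data
    counts = defaultdict(Counter)
    for i in range(len(data) - order):
        history, word = data[i:i + order], data[i + order]
        # record the word under every backoff history:
        # '', the last word, the last two words, ..., the full history
        for h in range(order + 1):
            counts[' '.join(history[h:])][word] += 1
    # normalize each Counter into a maximum likelihood distribution
    lm = {}
    for history, counter in counts.items():
        total = sum(counter.values())
        lm[history] = {w: c / total for w, c in counter.items()}
    return lm

Storing every backoff history while scanning the data once is what makes the lm[''] (unigram) and lm['w2'] (bigram) entries from the spec fall out for free.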
def test_ngram_lm():
    print('checking empty history ...')
    lm = train_ngram_lm(data['train'], order=1)
    assert '' in lm, "empty history should be in the language model!"

    print('checking probability distributions ...')
    lm = train_ngram_lm(data['train'], order=2)
    # sample size and tolerance reconstructed; the original values were
    # lost in extraction
    sample = [sum(lm[k].values()) for k in random.sample(list(lm), 10)]
    assert all(a > 0.999 and a < 1.001 for a in sample), \
        "lm[history][word] should sum to 1!"

    print('checking lengths of histories ...')
    lm = train_ngram_lm(data['train'], order=3)
    # for order=3 the model should store histories of lengths 0, 1, and 2
    assert len(set(len(k.split()) for k in list(lm))) == 3, \
        "lm object should store histories of all sizes!"

    print('checking word distribution values ...')
    # the original compared these probabilities against narrow numeric
    # bounds that were lost in extraction; a weaker stand-in check is
    # used here
    assert lm['']['the'] > 0 and \
        lm['the']['first'] > 0 and \
        lm['the first']['time'] > 0, \
        "values do not match!"

    print("Congratulations, you passed the ngram check!")

test_ngram_lm()
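The learning goals also mention using the model to probabilistically generate text. As a rough illustration (the generate_text helper below is hypothetical, not part of the original notebook), one could sample from the trained distributions, backing off to a shorter history whenever the current one is unseen:

def generate_text(lm, order=3, num_words=20):
    history = ['<S>'] * (order - 1)    # same padding token as in training
    out = []
    for _ in range(num_words):
        key = ' '.join(history)
        while key not in lm:           # back off to a shorter history
            key = ' '.join(key.split()[1:])
        dist = lm[key]
        words = list(dist)
        word = random.choices(words, weights=[dist[w] for w in words], k=1)[0]
        out.append(word)
        if order > 1:
            history = history[1:] + [word]
    return ' '.join(out)

# assumes train_ngram_lm has been completed as specified above
print(generate_text(train_ngram_lm(data['train'], order=3)))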