Question

The learning goals of this assignment are to:
Understand how to compute language model probabilities using maximum likelihood estimation.
Implement back-off.
Have fun using a language model to probabilistically generate texts.
Compare word-level language models and character-level language models.
import random
from collections import defaultdict, Counter

import numpy as np
We'll start by loading the data. The WikiText language modeling dataset is a collection of tokens extracted from the set of verified Good and Featured articles on Wikipedia.
data = {'test': '', 'train': '', 'valid': ''}
for data_split in data:
    fname = "wiki.{}.tokens".format(data_split)
    with open(fname, 'r') as f_wiki:
        data[data_split] = f_wiki.read().lower().split()

vocab = list(set(data['train']))
Now have a look at the data by running this cell.
print('train : %s ...' % data['train'][:10])
print('dev   : %s ...' % data['valid'][:10])
print('test  : %s ...' % data['test'][:10])
print('first 10 words in vocab: %s' % vocab[:10])
Q1.1: Train N-gram language model (25 pts)
Complete the following train_ngram_lm function based on the following input/output specifications. If you've done it right, you should pass the tests in the cell below.
Input:
data: the data object created in the cell above that holds the tokenized WikiText data
order: the order of the model (i.e., the "n" in "n-gram" model). If order=3, we compute $P(w_3 \mid w_1, w_2)$.
Output:
lm: A dictionary where the key is the history and the value is a probability distribution over the next word computed using the maximum likelihood estimate from the training data. Importantly, this dictionary should include back-off probabilities as well; e.g., for order=4, we want to store $P(w_4 \mid w_1, w_2, w_3)$ as well as $P(w_4 \mid w_2, w_3)$ and $P(w_4 \mid w_3)$.
Each key should be a single string where the words that form the history have been concatenated using spaces. Given a key, its corresponding value should be a dictionary where each word type in the vocabulary is associated with its probability of appearing after the key. For example, the entry for the history 'w1 w2' should look like:
lm['w1 w2'] = {'w0': 0.001, 'w1': 1e-6, 'w2': 1e-6, 'w3': 0.003, ...}
In this example, we also want to store lm['w2'] and lm[''], which contain the bigram and unigram distributions respectively.
Hint: You might find the defaultdict and Counter classes in the collections module to be helpful.
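As a toy illustration of that hint, you can nest Counter inside defaultdict to accumulate next-word counts per history, then normalize each counter into a probability distribution. The 'w...' strings below are placeholder tokens, not real vocabulary:
counts = defaultdict(Counter)           # history string -> Counter of next words
counts['w1 w2']['w0'] += 3              # pretend 'w0' followed 'w1 w2' three times
counts['w1 w2']['w3'] += 1              # ... and 'w3' once
total = sum(counts['w1 w2'].values())
probs = {w: c / total for w, c in counts['w1 w2'].items()}
print(probs)                            # {'w0': 0.75, 'w3': 0.25}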
def train_ngram_lm(data, order=3):
    """
    Train n-gram language model
    """
    # pad (order-1) special tokens to the left
    # for the first token in the text
    order -= 1
    data = [''] * order + data
    lm = defaultdict(Counter)
    for i in range(len(data) - order):
        """
        IMPLEMENT ME!
        """
        pass
def test_ngram_lm():
    print('checking empty history ...')
    lm1 = train_ngram_lm(data['train'], order=1)
    assert '' in lm1, "empty history should be in the language model!"

    print('checking probability distributions ...')
    lm2 = train_ngram_lm(data['train'], order=2)
    sample = [sum(lm2[k].values()) for k in random.sample(list(lm2), 10)]
    assert all([0.999 < a < 1.001 for a in sample]), "lm[history][word] should sum to 1!"

    print('checking lengths of histories ...')
    lm3 = train_ngram_lm(data['train'], order=3)
    assert len(set([len(k.split()) for k in list(lm3)])) == 3, "lm object should store histories of all sizes!"

    print('checking word distribution values ...')
    assert 0.062 < lm1['']['the'] < 0.064 and \
        0.016 < lm2['the']['first'] < 0.017 and \
        0.105 < lm3['the first']['time'] < 0.106, \
        "values do not match!"

    print("Congratulations, you passed the ngram check!")

test_ngram_lm()
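For reference, below is one minimal sketch of a completed train_ngram_lm. It assumes that "back-off" in this assignment means storing a separate maximum likelihood distribution for every history length from 0 up to order-1 (so generation can fall back to a shorter history when a longer one is unseen); the padding scheme and per-history normalization are this sketch's assumptions, not a published solution.
def train_ngram_lm(data, order=3):
    """
    Train an n-gram language model: one MLE distribution per history,
    for every history length from 0 up to order-1.
    """
    # pad (order-1) special tokens to the left
    # for the first token in the text
    order -= 1
    data = [''] * order + data

    # counts[history][word] = how often `word` followed `history`
    counts = defaultdict(Counter)
    for i in range(len(data) - order):
        word = data[i + order]
        # record the next word under every suffix of its context, from the
        # empty history (unigram) up to the full history of length `order`
        # (i.e., the original order minus one, after the decrement above)
        for h in range(order + 1):
            history = ' '.join(data[i + order - h : i + order])
            counts[history][word] += 1

    # normalize each Counter into a probability distribution (MLE)
    lm = {}
    for history, counter in counts.items():
        total = sum(counter.values())
        lm[history] = {w: c / total for w, c in counter.items()}
    return lm
With this version, test_ngram_lm() above should pass all four checks; the exact probability ranges in the test depend on the lowercased WikiText tokenization produced by the loading cell.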
