Question

The learning goals of this assignment are to:
Understand how to compute language model probabilities using maximum likelihood estimation.
Implement back-off.
Have fun using a language model to probabilistically generate texts.
Compare word-level language models and character-level language models.
import random
from collections import defaultdict, Counter

import numpy as np
We'll start by loading the data. The WikiText language modeling dataset is a collection of tokens extracted from the set of verified Good and Featured articles on Wikipedia.
data = {'test': '', 'train': '', 'valid': ''}
for data_split in data:
    fname = "wiki.{}.tokens".format(data_split)
    with open(fname, 'r') as f_wiki:
        data[data_split] = f_wiki.read().lower().split()

vocab = list(set(data['train']))
Now have a look at the data by running this cell.
print('train : %s ...' % data['train'][:10])
print('dev   : %s ...' % data['valid'][:10])
print('test  : %s ...' % data['test'][:10])
print('first 10 words in vocab: %s' % vocab[:10])
Q1.1: Train N-gram language model (25 pts)
Complete the following train_ngram_lm function based on the following input/output specifications. If you've done it right, you should pass the tests in the cell below.
Input:
data: the data object created in the cell above that holds the tokenized WikiText data
order: the order of the model (i.e., the "n" in "n-gram" model). If order=3, we compute $P(w_3 \mid w_1, w_2)$.
Output:
lm: A dictionary where the key is the history and the value is a probability distribution over the next word computed using the maximum likelihood estimate from the training data. Importantly, this dictionary should include back-off probabilities as well; e.g., for order=4, we want to store $P(w_4 \mid w_1, w_2, w_3)$ as well as $P(w_4 \mid w_2, w_3)$ and $P(w_4 \mid w_3)$.
Each key should be a single string where the words that form the history have been concatenated using spaces. Given a key, its corresponding value should be a dictionary where each word type in the vocabulary is associated with its probability of appearing after the key. For example, the entry for the history 'w1 w2' should look like:
lm['w1 w2'] = {'w0': 0.001, 'w1': 1e-6, 'w2': 1e-6, 'w3': 0.003, ...}
In this example, we also want to store lm['w2'] and lm[''], which contain the bigram and unigram distributions respectively.
Hint: You might find the defaultdict and Counter classes in the collections module to be helpful.
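As a toy illustration of that hint, you can nest Counter inside defaultdict to accumulate next-word counts per history, then normalize each counter into a probability distribution. The 'w...' strings below are placeholder tokens, not real vocabulary:
counts = defaultdict(Counter)           # history string -> Counter of next words
counts['w1 w2']['w0'] += 3              # pretend 'w0' followed 'w1 w2' three times
counts['w1 w2']['w3'] += 1              # ... and 'w3' once
total = sum(counts['w1 w2'].values())
probs = {w: c / total for w, c in counts['w1 w2'].items()}
print(probs)                            # {'w0': 0.75, 'w3': 0.25}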
def train_ngram_lm(data, order=3):
    """
    Train n-gram language model
    """
    # pad (order-1) special tokens to the left
    # for the first token in the text
    order -= 1
    data = [''] * order + data
    lm = defaultdict(Counter)
    for i in range(len(data) - order):
        """
        IMPLEMENT ME!
        """
        pass
def test_ngram_lm():
    print('checking empty history ...')
    lm1 = train_ngram_lm(data['train'], order=1)
    assert '' in lm1, "empty history should be in the language model!"

    print('checking probability distributions ...')
    lm2 = train_ngram_lm(data['train'], order=2)
    sample = [sum(lm2[k].values()) for k in random.sample(list(lm2), 10)]
    assert all([0.999 < a < 1.001 for a in sample]), "lm[history][word] should sum to 1!"

    print('checking lengths of histories ...')
    lm3 = train_ngram_lm(data['train'], order=3)
    assert len(set([len(k.split()) for k in list(lm3)])) == 3, "lm object should store histories of all sizes!"

    print('checking word distribution values ...')
    assert 0.062 < lm1['']['the'] < 0.064 and \
        0.016 < lm2['the']['first'] < 0.017 and \
        0.105 < lm3['the first']['time'] < 0.106, \
        "values do not match!"

    print("Congratulations, you passed the ngram check!")

test_ngram_lm()
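For reference, below is one minimal sketch of a completed train_ngram_lm. It assumes that "back-off" in this assignment means storing a separate maximum likelihood distribution for every history length from 0 up to order-1 (so generation can fall back to a shorter history when a longer one is unseen); the padding scheme and per-history normalization are this sketch's assumptions, not a published solution.
def train_ngram_lm(data, order=3):
    """
    Train an n-gram language model: one MLE distribution per history,
    for every history length from 0 up to order-1.
    """
    # pad (order-1) special tokens to the left
    # for the first token in the text
    order -= 1
    data = [''] * order + data

    # counts[history][word] = how often `word` followed `history`
    counts = defaultdict(Counter)
    for i in range(len(data) - order):
        word = data[i + order]
        # record the next word under every suffix of its context, from the
        # empty history (unigram) up to the full history of length `order`
        # (i.e., the original order minus one, after the decrement above)
        for h in range(order + 1):
            history = ' '.join(data[i + order - h : i + order])
            counts[history][word] += 1

    # normalize each Counter into a probability distribution (MLE)
    lm = {}
    for history, counter in counts.items():
        total = sum(counter.values())
        lm[history] = {w: c / total for w, c in counter.items()}
    return lm
With this version, test_ngram_lm() above should pass all four checks; the exact probability ranges in the test depend on the lowercased WikiText tokenization produced by the loading cell.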
