Question

Part I
Sentence completion using N-gram: (3 Marks)
Recommend the top 3 words to complete the given sentence using an N-gram language model. The goal is to demonstrate the relevance of the recommended words based on the occurrence of bigrams within the corpus. Use all the instances in the dataset as the training corpus.
Test Sentence: "how could ________________."

1 Approved Answer
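Under a bigram model, the next-word probability is the maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1), so the top 3 recommendations are the three words w with the highest P(w | "could"). The code below builds this model from the movie-dialog corpus: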
# Load necessary packages
import pandas as pd
import nltk
nltk.download('punkt')
from nltk import bigrams
from collections import defaultdict
import random

# Load the movie dialog dataset (path assumes the Kaggle movie-dialog-corpus mounted in Colab)
df = pd.read_csv(
    "/content/movie-dialog-corpus/movie_lines.tsv",
    encoding='utf-8-sig',
    sep='\t',
    on_bad_lines="skip",
    header=None,
    names=['lineID', 'charID', 'movieID', 'charName', 'text'],
    index_col=['lineID']
)
# Create a placeholder for the bigram model
bigram_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence for bigrams
for sentence in df['text'].dropna():
    if isinstance(sentence, str):
        tokens = nltk.word_tokenize(sentence)
        for w1, w2 in bigrams(tokens, pad_right=True, pad_left=True):
            bigram_model[w1][w2] += 1

# Transform the counts into probabilities: P(w2 | w1) = count(w1, w2) / count(w1)
for w1 in bigram_model:
    total_count = float(sum(bigram_model[w1].values()))
    for w2 in bigram_model[w1]:
        bigram_model[w1][w2] /= total_count
# Generate three candidate completions by sampling from the bigram model
for _ in range(3):
    # Seed with the test sentence prefix
    text = ["how", "could"]
    sentence_finished = False
    while not sentence_finished:
        # Select a random probability threshold
        r = random.random()
        accumulator = 0.0
        # Walk the next-word distribution of the last token until the
        # cumulative probability crosses the threshold, then pick that word
        for word in bigram_model[text[-1]].keys():
            accumulator += bigram_model[text[-1]][word]
            if accumulator >= r:
                text.append(word)
                break
        # Stop at a sentence-final period or the end-of-sentence padding token (None)
        if text[-1] in (None, '.'):
            sentence_finished = True
    generated_sentence = ' '.join([t for t in text if t])
    print(generated_sentence)
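The sampling loop above generates whole sentences. To answer the question directly, the top 3 recommended words are simply the three highest-probability entries in the bigram distribution following "could". A minimal sketch, reusing the bigram_model built above:

# Rank next-word candidates after "could" by bigram probability
candidates = sorted(bigram_model["could"].items(), key=lambda kv: kv[1], reverse=True)
# Exclude the sentence-padding token and keep the top 3
top3 = [(w, p) for w, p in candidates if w is not None][:3]
print("Top 3 completions for 'how could ...':", top3)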
Problem Statement:
The goal of Part I is to use raw textual data in a language model for a recommendation-based application.
The goal of Part II is to implement comprehensive preprocessing steps for the given dataset, enhancing the quality and relevance of the textual information. The preprocessed text is then transformed into a feature-rich representation using a chosen vectorization method, which is used to perform similarity analysis.
Part II
Perform the below sequential tasks on the given dataset.
i) Text Preprocessing: (2 Marks). A code sketch follows the list below.
Tokenization
Lowercasing
Stop Words Removal
Stemming
Lemmatization
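A minimal sketch of these five steps, assuming NLTK's stopword list, PorterStemmer, and WordNetLemmatizer, and reusing the df loaded in Part I. Both stemming and lemmatization are applied only because the task lists both; in practice one of the two is usually chosen:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    # Tokenize and lowercase
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    # Keep alphabetic tokens that are not stop words
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Stem, then lemmatize each token
    return [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]

processed = df['text'].dropna().astype(str).map(preprocess)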
ii) Feature Extraction: (2 Marks)
Use the pre-processed data from the previous step and implement the vectorization method below to extract features.
Word Embedding using TF-IDF
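A minimal sketch using scikit-learn's TfidfVectorizer, assuming the processed Series from the preprocessing sketch above:

from sklearn.feature_extraction.text import TfidfVectorizer

# Re-join the token lists so the vectorizer receives plain strings
docs = processed.map(' '.join)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # shape: (n_documents, n_terms)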
iii) Similarity Analysis: (3 Marks)
Use the vectorized representation from the previous step and implement a method to identify and print the names of the top two most similar documents. Justify your choice of similarity metric and feature design. Visualize a subset of the vector embeddings in a 2D semantic space suitable for this use case. HINT: Use PCA for dimensionality reduction.
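A minimal sketch of the similarity analysis, assuming the docs and tfidf_matrix from the previous sketch. Cosine similarity is a natural choice for TF-IDF vectors because it compares term-weight direction while normalizing away document length; PCA then projects a subset of the high-dimensional vectors into 2D for visualization:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA

# Pairwise cosine similarity; mask the diagonal (self-similarity)
sim = cosine_similarity(tfidf_matrix)
np.fill_diagonal(sim, -1.0)
i, j = np.unravel_index(np.argmax(sim), sim.shape)
# The lineID index serves as the document name here
print(f"Most similar pair: documents {docs.index[i]} and {docs.index[j]} "
      f"(cosine similarity {sim[i, j]:.3f})")

# Project a subset of the TF-IDF vectors into 2D with PCA for visualization
subset = tfidf_matrix[:200].toarray()
points = PCA(n_components=2).fit_transform(subset)
plt.scatter(points[:, 0], points[:, 1], s=5)
plt.title("TF-IDF vectors in 2D (PCA)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()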