
Question


import nltk
from nltk.util import ngrams
from collections import Counter
!pip install kaggle
!mkdir ~/.kaggle
!echo '{"username":"","key":"a092dce5f877da31e5aa0f3314e33333"}'> ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d Cornell-University/movie-dialog-corpus
!unzip -q movie-dialog-corpus.zip -d movie-dialog-corpus
# Replace with the path to your downloaded dataset.
# The corpus is not clean UTF-8 throughout, so ignore undecodable bytes
with open("movie-dialog-corpus/movie_lines.tsv", "r", encoding="utf-8", errors="ignore") as f:
    data = f.readlines()
nltk.download('punkt')
# Each TSV line is tab-separated metadata followed by the utterance text,
# so keep only the last field before tokenizing
utterances = [line.rstrip("\n").split("\t")[-1] for line in data]
# Tokenize the dataset
tokenized_data = [nltk.word_tokenize(sentence.lower()) for sentence in utterances]
# Flatten the list of tokenized sentences
flat_tokenized_data = [word for sentence in tokenized_data for word in sentence]
# Create Bigrams
bigrams = list(ngrams(flat_tokenized_data, 2))
# Count occurrences of each bigram
bigram_counts = Counter(bigrams)
# Test sentence with a blank to fill
test_sentence = "how could ________________."
# Keep only word tokens so the blank and the period don't become the context
test_tokens = [t for t in nltk.word_tokenize(test_sentence.lower()) if t.isalpha()]
context_word = test_tokens[-1]  # "could"
# Find candidate next words from bigrams that start with the context word
next_word_candidates = [word2 for (word1, word2), count in bigram_counts.items() if word1 == context_word]
# Rank the candidates by bigram frequency and keep the top 3
ranked_candidates = sorted(next_word_candidates, key=lambda w: bigram_counts[(context_word, w)], reverse=True)[:3]
# Print the top 3 recommended words
print("Top 3 recommended words to complete the sentence:")
for word in ranked_candidates:
    print(f"- {word}")
# Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')  # needed by WordNetLemmatizer on recent NLTK versions
# Text Preprocessing
def preprocess_text(text):
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Lowercasing
    tokens = [token.lower() for token in tokens]
    # Stop Words Removal
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens
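As a quick sanity check, here is an illustrative call (the sentence is made up, not from the dataset); the exact output depends on the NLTK version:

example = preprocess_text("The cats are running quickly!")
print(example)  # expected roughly: ['cat', 'run', 'quickli', '!']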
# Apply preprocessing to the dataset (the movie lines loaded earlier)
preprocessed_data = [preprocess_text(sentence) for sentence in utterances]
from sklearn.feature_extraction.text import TfidfVectorizer
# Join tokens back into whitespace-separated strings for the vectorizer
flattened_data = [' '.join(tokens) for tokens in preprocessed_data]
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(flattened_data)
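To sanity-check the representation, one can inspect the matrix shape and a few learned vocabulary terms (get_feature_names_out is available in scikit-learn 1.0+):

print(tfidf_matrix.shape)  # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])  # a few vocabulary entries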
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Calculate cosine similarity
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
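The snippet stops at the similarity matrix, so the rest of Part iii is missing. A sketch of the remaining steps could look like the following; since the movie lines have no document names, the line indices stand in as names here (an assumption), and for the full corpus you would restrict to a subset of lines first, as the full similarity matrix is very large. Cosine similarity suits TF-IDF vectors because it compares direction rather than magnitude, so document length does not dominate the score.

import numpy as np

# Ignore self-similarity on the diagonal, then find the most similar pair
sim = cosine_sim_matrix.copy()
np.fill_diagonal(sim, -1.0)
i, j = np.unravel_index(np.argmax(sim), sim.shape)
print(f"Top two similar documents: line {i} and line {j} (cosine similarity = {sim[i, j]:.3f})")

# Reduce a subset of the TF-IDF vectors to 2D with PCA and plot them
subset = tfidf_matrix[:200].toarray()  # a subset keeps the plot readable
coords = PCA(n_components=2).fit_transform(subset)
plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("TF-IDF embeddings in 2D (PCA)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()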
Is this right? Please change it accordingly.
Problem Statement:
The goal of Part I of the task is to use raw textual data in language models for a recommendation-based application.
The goal of Part II of the task is to implement comprehensive preprocessing steps for a given dataset, enhancing the quality and relevance of the textual information. The preprocessed text is then transformed into a feature-rich representation using a chosen vectorization method, for further use in the application to perform similarity analysis.
Part I
Sentence completion using N-grams:
Recommend the top 3 words to complete the given sentence using an N-gram language model. The goal is to demonstrate the relevance of the recommended words based on the occurrence of bigrams within the corpus. Use all the instances in the dataset as the training corpus.
Test Sentence: "how could ________________."
Part II
Perform the below sequential tasks on the given dataset.
i) Text Preprocessing:
Tokenization
Lowercasing
Stop Words Removal
Stemming
Lemmatization
ii) Feature Extraction:
Use the preprocessed data from the previous step and implement the vectorization method below to extract features.
Word Embedding using TF-IDF
iii) Similarity Analysis:
Use the vectorized representation from the previous step and implement a method to identify and print the names of the top two documents that exhibit significant similarity. Justify your choice of similarity metric and feature design. Visualize a subset of the vector embeddings in a 2D semantic space suitable for this use case. HINT: use PCA for dimensionality reduction.
