Question

Part I
Sentence completion using N-gram: (3 Marks)
Recommend the top 3 words to complete the given sentence using an N-gram language model. The goal is to demonstrate the relevance of the recommended words based on the occurrence of bigrams within the corpus. Use all the instances in the dataset as the training corpus.
Test Sentence: "how could ________________."

1 Approved Answer
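Under a bigram model, the next-word probability is the maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1), so the top 3 recommendations are the three words w with the highest P(w | "could"). The code below builds this model from the movie-dialog corpus: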
# Load necessary packages
import pandas as pd
import nltk
nltk.download('punkt')
from nltk import bigrams
from collections import defaultdict
import random

# Load the movie dialog dataset (path assumes the Kaggle movie-dialog-corpus mounted in Colab)
df = pd.read_csv(
    "/content/movie-dialog-corpus/movie_lines.tsv",
    encoding='utf-8-sig',
    sep='\t',
    on_bad_lines="skip",
    header=None,
    names=['lineID', 'charID', 'movieID', 'charName', 'text'],
    index_col=['lineID']
)
# Create a placeholder for the bigram model
bigram_model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence for bigrams
for sentence in df['text'].dropna():
    if isinstance(sentence, str):
        tokens = nltk.word_tokenize(sentence)
        for w1, w2 in bigrams(tokens, pad_right=True, pad_left=True):
            bigram_model[w1][w2] += 1

# Transform the counts into probabilities: P(w2 | w1) = count(w1, w2) / count(w1)
for w1 in bigram_model:
    total_count = float(sum(bigram_model[w1].values()))
    for w2 in bigram_model[w1]:
        bigram_model[w1][w2] /= total_count
# Generate three candidate completions by sampling from the bigram model
for _ in range(3):
    # Seed with the test sentence prefix
    text = ["how", "could"]
    sentence_finished = False
    while not sentence_finished:
        # Select a random probability threshold
        r = random.random()
        accumulator = 0.0
        # Walk the next-word distribution of the last token until the
        # cumulative probability crosses the threshold, then pick that word
        for word in bigram_model[text[-1]].keys():
            accumulator += bigram_model[text[-1]][word]
            if accumulator >= r:
                text.append(word)
                break
        # Stop at a sentence-final period or the end-of-sentence padding token (None)
        if text[-1] in (None, '.'):
            sentence_finished = True
    generated_sentence = ' '.join([t for t in text if t])
    print(generated_sentence)
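The sampling loop above generates whole sentences. To answer the question directly, the top 3 recommended words are simply the three highest-probability entries in the bigram distribution following "could". A minimal sketch, reusing the bigram_model built above:

# Rank next-word candidates after "could" by bigram probability
candidates = sorted(bigram_model["could"].items(), key=lambda kv: kv[1], reverse=True)
# Exclude the sentence-padding token and keep the top 3
top3 = [(w, p) for w, p in candidates if w is not None][:3]
print("Top 3 completions for 'how could ...':", top3)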
Problem Statement:
The goal of Part I is to use raw textual data in a language model for a recommendation-based application.
The goal of Part II is to implement comprehensive preprocessing steps for the given dataset, enhancing the quality and relevance of the textual information. The preprocessed text is then transformed into a feature-rich representation using a chosen vectorization method, which is used to perform similarity analysis.
Part II
Perform the below sequential tasks on the given dataset.
i) Text Preprocessing: (2 Marks). A code sketch follows the list below.
Tokenization
Lowercasing
Stop Words Removal
Stemming
Lemmatization
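A minimal sketch of these five steps, assuming NLTK's stopword list, PorterStemmer, and WordNetLemmatizer, and reusing the df loaded in Part I. Both stemming and lemmatization are applied only because the task lists both; in practice one of the two is usually chosen:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    # Tokenize and lowercase
    tokens = [t.lower() for t in nltk.word_tokenize(sentence)]
    # Keep alphabetic tokens that are not stop words
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    # Stem, then lemmatize each token
    return [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]

processed = df['text'].dropna().astype(str).map(preprocess)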
ii) Feature Extraction: (2 Marks)
Use the pre-processed data from the previous step and implement the vectorization method below to extract features.
Word Embedding using TF-IDF
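A minimal sketch using scikit-learn's TfidfVectorizer, assuming the processed Series from the preprocessing sketch above:

from sklearn.feature_extraction.text import TfidfVectorizer

# Re-join the token lists so the vectorizer receives plain strings
docs = processed.map(' '.join)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # shape: (n_documents, n_terms)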
iii) Similarity Analysis: (3 Marks)
Use the vectorized representation from the previous step and implement a method to identify and print the names of the top two most similar documents. Justify your choice of similarity metric and feature design. Visualize a subset of the vector embeddings in a 2D semantic space suitable for this use case. HINT: Use PCA for dimensionality reduction.
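A minimal sketch of the similarity analysis, assuming the docs and tfidf_matrix from the previous sketch. Cosine similarity is a natural choice for TF-IDF vectors because it compares term-weight direction while normalizing away document length; PCA then projects a subset of the high-dimensional vectors into 2D for visualization:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA

# Pairwise cosine similarity; mask the diagonal (self-similarity)
sim = cosine_similarity(tfidf_matrix)
np.fill_diagonal(sim, -1.0)
i, j = np.unravel_index(np.argmax(sim), sim.shape)
# The lineID index serves as the document name here
print(f"Most similar pair: documents {docs.index[i]} and {docs.index[j]} "
      f"(cosine similarity {sim[i, j]:.3f})")

# Project a subset of the TF-IDF vectors into 2D with PCA for visualization
subset = tfidf_matrix[:200].toarray()
points = PCA(n_components=2).fit_transform(subset)
plt.scatter(points[:, 0], points[:, 1], s=5)
plt.title("TF-IDF vectors in 2D (PCA)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()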