
Question


import nltk
from nltk.util import ngrams
from collections import Counter
!pip install kaggle
!mkdir ~/.kaggle
!echo '{"username":"","key":"a092dce5f877da31e5aa0f3314e33333"}'> ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d Cornell-University/movie-dialog-corpus
!unzip -q movie-dialog-corpus.zip -d movie-dialog-corpus
# Replace with the path to your downloaded dataset.
# The corpus is not clean UTF-8 throughout, so ignore undecodable bytes
with open("movie-dialog-corpus/movie_lines.tsv", "r", encoding="utf-8", errors="ignore") as f:
    data = f.readlines()
nltk.download('punkt')
# Each TSV line is tab-separated metadata followed by the utterance text,
# so keep only the last field before tokenizing
utterances = [line.rstrip("\n").split("\t")[-1] for line in data]
# Tokenize the dataset
tokenized_data = [nltk.word_tokenize(sentence.lower()) for sentence in utterances]
# Flatten the list of tokenized sentences
flat_tokenized_data = [word for sentence in tokenized_data for word in sentence]
# Create Bigrams
bigrams = list(ngrams(flat_tokenized_data, 2))
# Count occurrences of each bigram
bigram_counts = Counter(bigrams)
# Test sentence with a blank to fill
test_sentence = "how could ________________."
# Keep only word tokens so the blank and the period don't become the context
test_tokens = [t for t in nltk.word_tokenize(test_sentence.lower()) if t.isalpha()]
context_word = test_tokens[-1]  # "could"
# Find candidate next words from bigrams that start with the context word
next_word_candidates = [word2 for (word1, word2), count in bigram_counts.items() if word1 == context_word]
# Rank the candidates by bigram frequency and keep the top 3
ranked_candidates = sorted(next_word_candidates, key=lambda w: bigram_counts[(context_word, w)], reverse=True)[:3]
# Print the top 3 recommended words
print("Top 3 recommended words to complete the sentence:")
for word in ranked_candidates:
    print(f"- {word}")
# Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')  # needed by WordNetLemmatizer on recent NLTK versions
# Text Preprocessing
def preprocess_text(text):
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Lowercasing
    tokens = [token.lower() for token in tokens]
    # Stop Words Removal
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens
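As a quick sanity check, here is an illustrative call (the sentence is made up, not from the dataset); the exact output depends on the NLTK version:

example = preprocess_text("The cats are running quickly!")
print(example)  # expected roughly: ['cat', 'run', 'quickli', '!']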
# Apply preprocessing to the dataset (the movie lines loaded earlier)
preprocessed_data = [preprocess_text(sentence) for sentence in utterances]
from sklearn.feature_extraction.text import TfidfVectorizer
# Join tokens back into whitespace-separated strings for the vectorizer
flattened_data = [' '.join(tokens) for tokens in preprocessed_data]
# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(flattened_data)
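To sanity-check the representation, one can inspect the matrix shape and a few learned vocabulary terms (get_feature_names_out is available in scikit-learn 1.0+):

print(tfidf_matrix.shape)  # (number of documents, vocabulary size)
print(vectorizer.get_feature_names_out()[:10])  # a few vocabulary entries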
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Calculate cosine similarity
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
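The snippet stops at the similarity matrix, so the rest of Part iii is missing. A sketch of the remaining steps could look like the following; since the movie lines have no document names, the line indices stand in as names here (an assumption), and for the full corpus you would restrict to a subset of lines first, as the full similarity matrix is very large. Cosine similarity suits TF-IDF vectors because it compares direction rather than magnitude, so document length does not dominate the score.

import numpy as np

# Ignore self-similarity on the diagonal, then find the most similar pair
sim = cosine_sim_matrix.copy()
np.fill_diagonal(sim, -1.0)
i, j = np.unravel_index(np.argmax(sim), sim.shape)
print(f"Top two similar documents: line {i} and line {j} (cosine similarity = {sim[i, j]:.3f})")

# Reduce a subset of the TF-IDF vectors to 2D with PCA and plot them
subset = tfidf_matrix[:200].toarray()  # a subset keeps the plot readable
coords = PCA(n_components=2).fit_transform(subset)
plt.scatter(coords[:, 0], coords[:, 1], s=10)
plt.title("TF-IDF embeddings in 2D (PCA)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()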
Is this right? Please change it accordingly.
Problem Statement:
The goal of Part I of the task is to use raw textual data in language models for a recommendation-based application.
The goal of Part II of the task is to implement comprehensive preprocessing steps for a given dataset, enhancing the quality and relevance of the textual information. The preprocessed text is then transformed into a feature-rich representation using a chosen vectorization method, for further use in the application to perform similarity analysis.
Part I
Sentence completion using N-grams:
Recommend the top 3 words to complete the given sentence using an N-gram language model. The goal is to demonstrate the relevance of the recommended words based on the occurrence of bigrams within the corpus. Use all the instances in the dataset as the training corpus.
Test Sentence: "how could ________________."
Part II
Perform the below sequential tasks on the given dataset.
i) Text Preprocessing:
Tokenization
Lowercasing
Stop Words Removal
Stemming
Lemmatization
ii) Feature Extraction:
Use the preprocessed data from the previous step and implement the vectorization method below to extract features.
Word Embedding using TF-IDF
iii) Similarity Analysis:
Use the vectorized representation from the previous step and implement a method to identify and print the names of the top two documents that exhibit significant similarity. Justify your choice of similarity metric and feature design. Visualize a subset of the vector embeddings in a 2D semantic space suitable for this use case. HINT: use PCA for dimensionality reduction.
