Question
import nltk
from nltk.util import ngrams
from collections import Counter

!pip install kaggle
!mkdir ~/.kaggle
!echo '{"username":"<your-username>","key":"adcefdaeaafe"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d Cornell-University/movie-dialog-corpus
!unzip -q movie-dialog-corpus.zip -d movie_dialog_corpus
# Replace with the path to your downloaded dataset
with open("movie_dialog_corpus/movie_lines.tsv", "r", errors="ignore") as f:
    # Keep only the dialogue text, assuming it is the last tab-separated field
    data = [line.strip().split("\t")[-1] for line in f]
nltk.download("punkt")
# Tokenize the dataset
tokenized_data = [nltk.word_tokenize(sentence.lower()) for sentence in data]

# Flatten the list of tokenized sentences
flat_tokenized_data = [word for sentence in tokenized_data for word in sentence]

# Create bigrams
bigrams = list(ngrams(flat_tokenized_data, 2))

# Count occurrences of each bigram
bigram_counts = Counter(bigrams)
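As a quick sanity check (an addition, not in the original post), the most frequent bigrams in the training corpus can be inspected directly:

# Peek at the five most frequent bigrams
print(bigram_counts.most_common(5))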
# Test sentence
test_sentence = "how could"
test_tokens = nltk.word_tokenize(test_sentence.lower())

# Find candidate next words: bigrams whose first word is the last test token
next_word_candidates = [w2 for (w1, w2), count in bigram_counts.items() if w1 == test_tokens[-1]]

# Rank the candidates by bigram frequency
ranked_candidates = sorted(next_word_candidates, key=lambda x: bigram_counts[(test_tokens[-1], x)], reverse=True)

# Print the top recommended words (top five here)
print("Top recommended words to complete the sentence:")
for word in ranked_candidates[:5]:
    print(word)
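The ranking above uses raw bigram counts. To express relevance as a probability, the counts can be normalized by the frequency of the context word. This is a minimal sketch building on the variables defined above; it was not in the original post.

# Estimate P(next_word | context) = count(context, next_word) / count(context)
context = test_tokens[-1]
context_count = sum(c for (w1, _), c in bigram_counts.items() if w1 == context)
for w in ranked_candidates[:5]:
    prob = bigram_counts[(context, w)] / context_count
    print(f"P({w!r} | {context!r}) = {prob:.4f}")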
# Import necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download NLTK resources
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
# Text preprocessing
def preprocess_text(text):
    # Tokenization
    tokens = nltk.word_tokenize(text)
    # Lowercasing
    tokens = [token.lower() for token in tokens]
    # Stop words removal (named stop_words to avoid shadowing the imported module)
    stop_words = set(stopwords.words("english"))
    tokens = [token for token in tokens if token not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens
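A quick check on a made-up sentence (not from the dataset) shows what the pipeline produces. Note that applying both stemming and lemmatization is largely redundant: stemmed forms such as "quickli" are rarely valid lemmatizer inputs, so one of the two normalizations usually suffices.

# Illustrative input, not from the dataset
print(preprocess_text("The cats were running quickly across the streets"))
# Expected output along the lines of: ['cat', 'run', 'quickli', 'across', 'street']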
# Apply preprocessing to the dataset (the lines loaded earlier as `data`)
preprocessed_data = [preprocess_text(sentence) for sentence in data]
from sklearn.feature_extraction.text import TfidfVectorizer

# Join each token list back into a whitespace-separated string
flattened_data = [" ".join(tokens) for tokens in preprocessed_data]

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(flattened_data)
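To confirm the vectorizer produced something sensible (an addition, not in the original post), the matrix shape and a slice of the learned vocabulary can be printed:

# One row per document, one column per vocabulary term
print(tfidf_matrix.shape)
print(vectorizer.get_feature_names_out()[:10])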
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Calculate pairwise cosine similarity between all documents
cosine_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
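The posted code stops at the similarity matrix, but task (iii) also asks for the top two similar documents and a 2-D visualization. Below is a minimal sketch of those missing steps, assuming the variables defined above; cosine similarity is a sensible metric here because it is insensitive to document length and works well on sparse TF-IDF vectors.

import numpy as np

# Ignore the diagonal: every document is trivially similar to itself
np.fill_diagonal(cosine_sim_matrix, 0.0)
i, j = np.unravel_index(np.argmax(cosine_sim_matrix), cosine_sim_matrix.shape)
print(f"Most similar pair: documents {i} and {j} "
      f"(cosine similarity {cosine_sim_matrix[i, j]:.3f})")

# Visualize a subset of the TF-IDF vectors in 2-D via PCA
subset = tfidf_matrix[:200].toarray()  # densify only a small subset
points = PCA(n_components=2).fit_transform(subset)
plt.scatter(points[:, 0], points[:, 1], s=10)
plt.title("TF-IDF embeddings projected to 2-D (PCA)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()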
Is this right? Please change it accordingly.
Problem Statement:
The goal of Part I of the task is to use raw textual data to build a language model for a recommendation-based application.
The goal of Part II is to implement comprehensive preprocessing steps for a given dataset, enhancing the quality and relevance of the textual information. The preprocessed text is then transformed into a feature-rich representation using a chosen vectorization method, for further use in the application to perform similarity analysis.
Part I
Sentence completion using N-grams:
Recommend the top words to complete the given sentence using an N-gram language model. The goal is to demonstrate the relevance of the recommended words based on the occurrence of bigrams within the corpus. Use all the instances in the dataset as a training corpus.
Test sentence: "how could"
Part II
Perform the following tasks sequentially on the given dataset.
(i) Text Preprocessing:
Tokenization
Lowercasing
Stop Words Removal
Stemming
Lemmatization
(ii) Feature Extraction:
Use the preprocessed data from the previous step and implement the vectorization method below to extract features.
Word embedding using TF-IDF
(iii) Similarity Analysis:
Use the vectorized representation from the previous step and implement a method to identify and print the names of the top two similar documents that exhibit significant similarity. Justify your choice of similarity metric and feature design. Visualize a subset of the vector embeddings in a 2-D semantic space suitable for this use case. Hint: use PCA for dimensionality reduction.