
Question


In this part, you will be working with the reviews.csv file providing reviews for the listings, and more specifically, the 'comments' column.

Question 3a - Pointwise Mutual Information

In this question, you will implement and apply the pointwise mutual information (PMI) metric, a word association metric introduced in 1992, to the Airbnb reviews. The purpose of PMI is to extract, from free text, pairs of words that tend to co-occur more often than expected by chance. For example, PMI('new', 'york') would give a higher score than PMI('new', 'car') because the chance of finding 'new' and 'york' together in text is higher than that of finding 'new' and 'car' together, despite 'new' being a more frequent word than 'york'. By extracting word pairs with a high PMI score from our reviews, we will be able to better understand how people feel and talk about certain items of interest (e.g., 'windows' or 'location').

The formula for PMI (where x and y are two words) is:

$$\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$$

Watch this video to understand how to estimate these probabilities.

Your solution will involve the following steps:

1. (4 marks) Processing the raw reviews, applying the following steps in this specific order:

a. Tokenize all reviews. Use nltk's word_tokenize method.

b. Part-of-speech (PoS) tagging: to be able to differentiate nouns from adjectives or verbs. Use nltk's pos_tag method.

c. Lower case: to reduce the size of the vocabulary.

What to implement: A function process_reviews(df) that will take as input the original dataframe and will return it with three additional columns: tokenized, tagged and lower_tagged, which correspond to steps a, b and c described above.
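Step 1 could be sketched as below. This is one possible reading of the task, not a reference solution. The `tokenize` and `tag` keyword arguments are hypothetical hooks that are not part of the required signature; they default to nltk's `word_tokenize` and `pos_tag` and only exist so the function can be tried without the nltk data files. Treating missing comments as empty strings is also an assumption.

```python
import pandas as pd


def process_reviews(df, tokenize=None, tag=None):
    """Return a copy of df with 'tokenized', 'tagged' and 'lower_tagged' columns.

    tokenize / tag are hypothetical hooks (not in the assignment's signature).
    When omitted, nltk.word_tokenize and nltk.pos_tag are used.
    """
    if tokenize is None or tag is None:
        import nltk  # requires the tokenizer and tagger resources to be downloaded
        tokenize = tokenize or nltk.word_tokenize
        tag = tag or nltk.pos_tag

    out = df.copy()
    # Assumption: a missing review (NaN/None) is treated as an empty string.
    comments = out["comments"].fillna("").astype(str)

    # a. Tokenize each review into a list of tokens.
    out["tokenized"] = comments.apply(tokenize)
    # b. PoS-tag before lower-casing, because the tagger can use capitalisation.
    out["tagged"] = out["tokenized"].apply(tag)
    # c. Lower-case the tokens only; the tags keep their original case.
    out["lower_tagged"] = out["tagged"].apply(
        lambda pairs: [(token.lower(), pos) for token, pos in pairs]
    )
    return out
```

With nltk and its data installed, `process_reviews(pd.read_csv("reviews.csv"))` would run the three steps in the required order.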

2. (5 marks) Starting from the output of step 1.c (tokenized, PoS-tagged and lower cased reviews), create a vocabulary of 'center' (the x in the PMI equation) and 'context' (the y in the PMI equation) words. Your vocabulary of center words will be the 1,000 most frequent NOUNS (words with a PoS tag starting with 'N'), and the context words will be the 1,000 most frequent words tagged as either VERB or ADJECTIVE (words with any PoS tag starting with either 'J' or 'V').

What to implement: A function get_vocab(df) which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives).
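A minimal sketch of the vocabulary step, assuming the `lower_tagged` column holds lists of `(token, tag)` pairs with Penn Treebank style tags. The `n` parameter is a hypothetical addition (the assignment fixes the size at 1,000) that makes the function easy to try on toy data.

```python
from collections import Counter

import pandas as pd


def get_vocab(df, n=1000):
    """Return (center_vocab, context_vocab) as two lists of at most n words."""
    center_counts = Counter()   # nouns: tags starting with 'N'
    context_counts = Counter()  # adjectives and verbs: tags starting with 'J' or 'V'

    for tagged in df["lower_tagged"]:
        for token, pos in tagged:
            if pos.startswith("N"):
                center_counts[token] += 1
            elif pos.startswith(("J", "V")):
                context_counts[token] += 1

    # most_common sorts by frequency, highest first.
    center_vocab = [word for word, _ in center_counts.most_common(n)]
    context_vocab = [word for word, _ in context_counts.most_common(n)]
    return center_vocab, context_vocab


# Toy data standing in for the output of step 1.
toy = pd.DataFrame({"lower_tagged": [
    [("big", "JJ"), ("room", "NN"), ("was", "VBD"), ("clean", "JJ")],
    [("room", "NN"), ("had", "VBD"), ("big", "JJ"), ("windows", "NNS")],
]})
center, context = get_vocab(toy, n=2)
```

Note that a word can land in both lists if it is frequent under both kinds of tag, which is one of the exceptional cases step 3 asks you to think about.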

3. (8 marks) With these two 1,000-word vocabularies, create a co-occurrence matrix where, for each center word, you keep track of how many times each context word co-occurs with it. Consider this short review with only one sentence as an example, where we want to get co-occurrences of verbs and adjectives for the center word 'restaurant':

'A big restaurant served delicious food in big dishes'

>>> {'restaurant': {'big': 2, 'served':1, 'delicious':1}}

What to implement: A function get_coocs(df, center_vocab, context_vocab) which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach.
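One way to approach this is sketched below; the design choices are assumptions, not the required answer. Context is taken to be the full review, which matches the single-sentence example where 'big' is counted twice. Every (center occurrence, context occurrence) pair in a review is counted, and a token only plays the role its PoS tag allows, so a word that sits in both vocabularies is a center word when tagged as a noun and a context word when tagged as a verb or adjective.

```python
from collections import Counter, defaultdict

import pandas as pd


def get_coocs(df, center_vocab, context_vocab):
    """Return {center_word: {context_word: count}} using the full review as context."""
    center_set, context_set = set(center_vocab), set(context_vocab)
    coocs = defaultdict(Counter)

    for tagged in df["lower_tagged"]:
        # Role is decided by the tag, not only by vocabulary membership.
        centers = Counter(
            tok for tok, pos in tagged
            if pos.startswith("N") and tok in center_set
        )
        contexts = Counter(
            tok for tok, pos in tagged
            if pos.startswith(("J", "V")) and tok in context_set
        )
        # Each occurrence of a center word pairs with each occurrence of a context word.
        for center, n_center in centers.items():
            for context, n_context in contexts.items():
                coocs[center][context] += n_center * n_context

    return {center: dict(counter) for center, counter in coocs.items()}


# The example review from the question, tagged by hand for illustration.
example = pd.DataFrame({"lower_tagged": [[
    ("a", "DT"), ("big", "JJ"), ("restaurant", "NN"), ("served", "VBD"),
    ("delicious", "JJ"), ("food", "NN"), ("in", "IN"), ("big", "JJ"),
    ("dishes", "NNS"),
]]})
result = get_coocs(example, ["restaurant"], ["big", "served", "delicious"])
```

A per-sentence or sliding-window context would give sparser, more local associations; the full-review choice is simply the easiest to justify against the example.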

4. (3 marks) After you have computed co-occurrences from all the reviews, you should convert the co-occurrence dictionary into a pandas DataFrame. The DataFrame should have 1,000 rows and 1,000 columns.

What to implement: A function called cooc_dict2df(cooc_dict), which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, each column corresponds to one context word, and each cell holds the corresponding co-occurrence count. Some (x, y) pairs will never co-occur; those cells should have a value of 0.
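The conversion could look like the sketch below. The optional `center_vocab` and `context_vocab` arguments are hypothetical extras outside the required signature: a word that never co-occurs with anything is absent from the dictionary, so reindexing against the vocabularies is one way to guarantee the full 1,000 by 1,000 shape.

```python
import pandas as pd


def cooc_dict2df(cooc_dict, center_vocab=None, context_vocab=None):
    """Turn {center: {context: count}} into a DataFrame of integer counts."""
    # orient="index" makes the outer keys rows and the inner keys columns.
    df = pd.DataFrame.from_dict(cooc_dict, orient="index")

    # Hypothetical extension: force the row/column sets to match the vocabularies.
    if center_vocab is not None or context_vocab is not None:
        df = df.reindex(index=center_vocab, columns=context_vocab)

    # Pairs that never co-occur come out as NaN; the task asks for 0.
    return df.fillna(0).astype(int)


coocs = {"restaurant": {"big": 2, "served": 1}, "room": {"clean": 3}}
matrix = cooc_dict2df(coocs)
padded = cooc_dict2df(
    coocs,
    center_vocab=["room", "restaurant", "towels"],
    context_vocab=["big", "clean"],
)
```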

5. (5 marks) Then, convert the co-occurrence values to PMI scores.

What to implement: A function cooc2pmi(df) that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts.
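A hedged sketch of the conversion, estimating the probabilities from the matrix itself: P(x, y) is a cell divided by the grand total, P(x) is a row sum divided by the total, and P(y) is a column sum divided by the total. The base-2 logarithm and the choice to give never-co-occurring pairs a score of negative infinity are assumptions; clipping negative scores to 0 (positive PMI) is a common alternative.

```python
import numpy as np
import pandas as pd


def cooc2pmi(df):
    """Return a DataFrame of PMI scores with the same index and columns as df."""
    counts = df.to_numpy(dtype=float)

    # Zero counts produce log(0) or 0/0; silence the warnings and fix them below.
    with np.errstate(divide="ignore", invalid="ignore"):
        total = counts.sum()
        p_xy = counts / total                              # joint probability
        p_x = counts.sum(axis=1, keepdims=True) / total    # center marginals (rows)
        p_y = counts.sum(axis=0, keepdims=True) / total    # context marginals (columns)
        pmi = np.log2(p_xy / (p_x * p_y))

    # Assumption: pairs that never co-occur get -inf so they always rank last.
    pmi[counts == 0] = -np.inf
    return pd.DataFrame(pmi, index=df.index, columns=df.columns)


toy_counts = pd.DataFrame([[2, 0], [0, 2]], index=["a", "b"], columns=["x", "y"])
toy_pmi = cooc2pmi(toy_counts)
```

In the toy matrix, P(a, x) = 0.5 while P(a) P(x) = 0.25, so the score for that cell is log2(2) = 1.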

6. (5 marks) Finally, implement a method to retrieve the context words with the highest PMI score for a given center word.

What to implement: A function topk(df, center_word, N=10) that takes as input: (1) the DataFrame generated in step 5, (2) a center word (a string like 'towels'), and (3) an optional named argument called N with default value of 10; and returns a list of N strings, in descending order of their PMI score with the center_word. You do not need to handle cases for which the word center_word is not found in df.
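The retrieval step can be sketched as a sort over one row of the PMI DataFrame. The toy scores below are made up purely for illustration.

```python
import pandas as pd


def topk(df, center_word, N=10):
    """Return the N context words with the highest PMI for center_word."""
    row = df.loc[center_word]  # one row: PMI of every context word with center_word
    # Sort descending; a stable sort keeps column order among tied scores.
    ranked = row.sort_values(ascending=False, kind="stable")
    return ranked.head(N).index.tolist()


# Hypothetical PMI scores, not taken from the real reviews.
toy_scores = pd.DataFrame(
    [[0.5, 2.0, -1.0], [1.0, 0.0, 3.0]],
    index=["towels", "room"],
    columns=["clean", "fresh", "dirty"],
)
best_for_towels = topk(toy_scores, "towels", N=2)
```

If fewer than N context words exist, the function simply returns all of them in ranked order.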

YOU ARE ONLY ALLOWED TO USE THE PYTHON LIBRARIES/PACKAGES BELOW

collections
itertools
matplotlib
nltk
numpy
pandas
random
re
scipy
seaborn
statsmodels
string
time
# the point of both is to act as a logger, not needed for actual solutions
datetime
tqdm
