Question

I need help with the parts marked IMPLEMENT ME, and with the question that asks for an answer in one or two lines.

# Run this cell! It sets some things up for you.

from __future__ import division # must be the first statement in the cell; avoids unexpected behavior from division

import matplotlib.pyplot as plt

import os

import zipfile

import math

import time

import operator

from collections import defaultdict, Counter

%matplotlib inline

plt.rcParams['figure.figsize'] = (5, 4) # set default size of plots

if not os.path.isdir('data'):
    os.mkdir('data') # make the data directory

if not os.path.isdir('./checkpoints'):
    os.mkdir('./checkpoints') # directory to save checkpoints

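# NOTE: `drive` is not defined in this cell. It is assumed to be an authenticated
# Google Drive client (for example a PyDrive GoogleDrive object) created in an earlier cell.
f_dataset = drive.CreateFile({'id': '1_v9YxmDmlGRNpUEcbcSXcTIZtCmS620l'})

f_dataset.GetContentFile('large_movie_review_dataset.zip')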

# Extract the data from the zipfile and put it into the data directory

with zipfile.ZipFile('large_movie_review_dataset.zip', 'r') as zip_ref:
    zip_ref.extractall('data')

print("IMDb data downloaded!")

PATH_TO_DATA = 'data/large_movie_review_dataset' # path to the data directory

POS_LABEL = 'pos'

NEG_LABEL = 'neg'

TRAIN_DIR = os.path.join(PATH_TO_DATA, "train")

TEST_DIR = os.path.join(PATH_TO_DATA, "test")

for label in [POS_LABEL, NEG_LABEL]:
    if len(os.listdir(TRAIN_DIR + "/" + label)) == 12500:
        print("Great! You have 12500 {} reviews in {}".format(label, TRAIN_DIR + "/" + label))
    else:
        print("Oh no! Something is wrong. Check your code which loads the reviews")

The following cell contains code that will be referred to as the Preprocessing Block from now on. It contains a function that tokenizes the document passed to it, and functions that return counts of word types and tokens.

###### PREPROCESSING BLOCK ######

###### DO NOT MODIFY THIS FUNCTION #####

def tokenize_doc(doc):
    """
    Tokenize a document and return its bag-of-words representation.
    doc - a string representing a document.
    returns a dictionary mapping each word to the number of times it appears in doc.
    """
    bow = defaultdict(float)
    tokens = doc.split()
    lowered_tokens = map(lambda t: t.lower(), tokens)
    for token in lowered_tokens:
        bow[token] += 1.0
    return dict(bow)

###### END FUNCTION #####

def n_word_types(word_counts):
    '''
    Implement Me!
    return a count of all word types in the corpus
    using information from word_counts
    '''
    pass

def n_word_tokens(word_counts):
    '''
    Implement Me!
    return a count of all word tokens in the corpus
    using information from word_counts
    '''
    pass
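One way the two stubs above could be filled in, as a minimal sketch: it assumes word_counts maps each word type to its total count across the corpus, as the Counter built in Question 1.1 does. Under that assumption the number of types is the number of distinct keys, and the number of tokens is the sum of the counts. The toy `demo_counts` below is made up for illustration and is not taken from the IMDb data.

```python
from collections import Counter

def n_word_types(word_counts):
    """Number of distinct words (types): one per key in word_counts."""
    return len(word_counts)

def n_word_tokens(word_counts):
    """Number of running words (tokens): the sum of all per-type counts."""
    return sum(word_counts.values())

# Toy counts, made up for illustration (not from the IMDb data).
demo_counts = Counter({"the": 3, "movie": 2, "great": 1})
demo_types = n_word_types(demo_counts)    # 3 distinct words
demo_tokens = n_word_tokens(demo_counts)  # 3 + 2 + 1 = 6 tokens
```

Because tokenize_doc stores counts as floats, n_word_tokens will likely return a float on the real corpus; wrapping the sum in int() is an option if an integer is preferred.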

Question 1.1 (5 points)

Complete the cell below to fill out the word_counts dictionary variable. word_counts keeps track of how many times a word type appears across the corpus. For instance, word_counts["movie"] should store the number 61492 -- the count of how many times the word movie appears in the corpus.

import glob

import codecs

word_counts = Counter() # Counters are often useful for NLP in python

for label in [POS_LABEL, NEG_LABEL]:
    for directory in [TRAIN_DIR, TEST_DIR]:
        for fn in glob.glob(directory + "/" + label + "/*txt"):
            doc = codecs.open(fn, 'r', 'utf8') # Open the file with UTF-8 encoding
            #IMPLEMENT ME
            pass
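A possible way to fill in the loop body, sketched under the assumption that each file is read in full, tokenized with tokenize_doc, and its bag of words added into the running Counter. The helper name `count_files` and the tiny temporary corpus are hypothetical, introduced only so the example runs without the IMDb data; tokenize_doc is repeated here with the same behavior as in the Preprocessing Block.

```python
import codecs
import glob
import os
import tempfile
from collections import Counter, defaultdict

def tokenize_doc(doc):
    # Same behavior as the Preprocessing Block: lowercase, whitespace-split bag of words.
    bow = defaultdict(float)
    for token in doc.split():
        bow[token.lower()] += 1.0
    return dict(bow)

def count_files(file_names):
    # Hypothetical helper: add each document's bag of words into one Counter.
    counts = Counter()
    for fn in file_names:
        with codecs.open(fn, 'r', 'utf8') as doc:  # file is closed automatically
            counts.update(tokenize_doc(doc.read()))
    return counts

# Tiny stand-in corpus written to a temporary directory (not the IMDb data).
demo_dir = tempfile.mkdtemp()
for name, text in [("a.txt", "A great movie"), ("b.txt", "great GREAT fun")]:
    with codecs.open(os.path.join(demo_dir, name), 'w', 'utf8') as out:
        out.write(text)

demo_word_counts = count_files(glob.glob(demo_dir + "/*txt"))
```

In the notebook cell itself, the equivalent loop body would be `word_counts.update(tokenize_doc(doc.read()))` followed by `doc.close()`. Counter.update adds counts rather than replacing them, which is what accumulating across files needs.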

Question 1.3 (5 points)

Using the word_counts dictionary you just created, make a new dictionary called sorted_dict where the words are sorted according to their counts, in descending order:

# IMPLEMENT ME

Now print the first 30 values from sorted_dict.

# IMPLEMENT ME
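The two steps above (sorting by count, then printing the first 30 entries) might look like the following sketch. It assumes Python 3.7 or later, where a plain dict keeps insertion order; the toy `demo_counts` stands in for word_counts and is made up for illustration.

```python
import operator
from collections import Counter

# Toy counts, made up for illustration.
demo_counts = Counter({"movie": 2, "the": 5, "great": 1, "a": 3})

# Sort (word, count) pairs by count, largest first, then rebuild a dict.
# Relies on dicts preserving insertion order (guaranteed from Python 3.7).
sorted_dict = dict(sorted(demo_counts.items(), key=operator.itemgetter(1), reverse=True))

# First 30 (word, count) pairs; slicing is safe even when there are fewer than 30 entries.
first_30 = list(sorted_dict.items())[:30]
for word, count in first_30:
    print(word, count)
```

On an interpreter where dict order is not guaranteed (the `from __future__ import division` line suggests the notebook may have targeted Python 2), collections.OrderedDict can be used in place of dict to keep the sorted order.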

Zipf's Law

Question 1.4 (10 points)

In this section, you will verify a key statistical property of text: Zipf's Law.

Zipf's Law describes the relations between the frequency rank of words and frequency value of words. For a word w, its frequency is inversely proportional to its rank:

$$\mathrm{count}_w = K \cdot \frac{1}{\mathrm{rank}_w}$$

K is a constant, specific to the corpus and how words are being defined.

What would this look like if you took the log of both sides of the equation?

  • Write your answer in one or two lines here.
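A sketch of the algebra, for reference: taking the logarithm of both sides of the Zipf relation turns the product into a difference (any fixed log base works).

```latex
\log(\mathrm{count}_w) = \log K - \log(\mathrm{rank}_w)
```

Read as a line in the variables log(rank) and log(count), this has slope -1 and intercept log K.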

Therefore, if Zipf's Law holds, after sorting the words descending on frequency, word frequency decreases in an approximately linear fashion under a log-log scale.

Now, please make such a log-log plot of rank versus frequency.

Hint: Make use of the sorted dictionary you just created. Use a scatter plot where the x-axis is the log(rank), and y-axis is log(frequency). You should get this information from word_counts; for example, you can take the individual word counts and sort them. dict methods .items() and/or values() may be useful. (Note that it doesn't really matter whether ranks start at 1 or 0 in terms of how the plot comes out.) You can check your results by comparing your plots to ones on Wikipedia; they should look qualitatively similar.

Please remember to label the meaning of the x-axis and y-axis.

import math

import operator

x = []

y = []

X_LABEL = "log(rank)"

Y_LABEL = "log(frequency)"

# implement me! you should fill the x and y arrays. Add your code here

# running this cell should produce your plot below

plt.scatter(x, y)

plt.xlabel(X_LABEL)

plt.ylabel(Y_LABEL)
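One way the x and y arrays could be filled, as a minimal sketch: sort the counts in descending order, assign ranks starting at 1, and take the natural log of each. The helper name `zipf_points` and the toy counts are hypothetical. Ranks start at 1 here because math.log(0) raises a ValueError.

```python
import math
from collections import Counter

def zipf_points(word_counts):
    # Hypothetical helper: returns (x, y) = (log rank, log frequency), ranks starting at 1.
    freqs = sorted(word_counts.values(), reverse=True)
    x = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    y = [math.log(freq) for freq in freqs]
    return x, y

# Toy counts, made up for illustration (not from the IMDb data).
demo_counts = Counter({"the": 8, "a": 4, "movie": 2, "great": 1})
x, y = zipf_points(demo_counts)
```

In the notebook, calling `x, y = zipf_points(word_counts)` before the plt.scatter line would supply the points for the plot.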
