Question
I need help with the parts that say IMPLEMENT ME and answer in one or two lines.
# Run this cell! It sets some things up for you.
from __future__ import division # must be the first statement in the cell; avoids unexpected behavior from division
import matplotlib.pyplot as plt
import os
import zipfile
import math
import time
import operator
from collections import defaultdict, Counter
%matplotlib inline
plt.rcParams['figure.figsize'] = (5, 4) # set default size of plots
if not os.path.isdir('data'):
    os.mkdir('data') # make the data directory
if not os.path.isdir('./checkpoints'):
    os.mkdir('./checkpoints') # directory to save checkpoints
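# Note: `drive` is not defined in this cell; it is presumably an authenticated Google Drive client set up earlier in the notebook.
f_dataset = drive.CreateFile({'id': '1_v9YxmDmlGRNpUEcbcSXcTIZtCmS620l'})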
f_dataset.GetContentFile('large_movie_review_dataset.zip')
# Extract the data from the zipfile and put it into the data directory
with zipfile.ZipFile('large_movie_review_dataset.zip', 'r') as zip_ref:
    zip_ref.extractall('data')
print("IMDb data downloaded!")
PATH_TO_DATA = 'data/large_movie_review_dataset' # path to the data directory
POS_LABEL = 'pos'
NEG_LABEL = 'neg'
TRAIN_DIR = os.path.join(PATH_TO_DATA, "train")
TEST_DIR = os.path.join(PATH_TO_DATA, "test")
for label in [POS_LABEL, NEG_LABEL]:
    if len(os.listdir(TRAIN_DIR + "/" + label)) == 12500:
        print("Great! You have 12500 {} reviews in {}".format(label, TRAIN_DIR + "/" + label))
    else:
        print("Oh no! Something is wrong. Check your code which loads the reviews")
The following cell contains code that will be referred to as the Preprocessing Block from now on. It contains a function that tokenizes the document passed to it, and functions that return counts of word types and tokens.
###### PREPROCESSING BLOCK ######
###### DO NOT MODIFY THIS FUNCTION #####
def tokenize_doc(doc):
    """
    Tokenize a document and return its bag-of-words representation.
    doc - a string representing a document.
    returns a dictionary mapping each word to the number of times it appears in doc.
    """
    bow = defaultdict(float)
    tokens = doc.split()
    lowered_tokens = map(lambda t: t.lower(), tokens)
    for token in lowered_tokens:
        bow[token] += 1.0
    return dict(bow)
###### END FUNCTION #####
def n_word_types(word_counts):
    '''
    Implement Me!
    return a count of all word types in the corpus
    using information from word_counts
    '''
    pass
def n_word_tokens(word_counts):
    '''
    Implement Me!
    return a count of all word tokens in the corpus
    using information from word_counts
    '''
    pass
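As a minimal sketch of how the two stubs above could be filled in, assuming word_counts maps each word type to its corpus-wide count: the number of types is the number of keys, and the number of tokens is the sum of the values. The toy_counts data is made up for illustration and is not from the IMDb corpus.

```python
from collections import Counter

def n_word_types(word_counts):
    """Count distinct words (types): one per key in word_counts."""
    return len(word_counts)

def n_word_tokens(word_counts):
    """Count word occurrences (tokens): the sum of all per-word counts."""
    return sum(word_counts.values())

# Toy counts, invented for illustration only.
toy_counts = Counter({"the": 3, "movie": 2, "great": 1})
print(n_word_types(toy_counts))   # number of distinct words
print(n_word_tokens(toy_counts))  # total number of word occurrences
```

Both functions read only from the dictionary, so they work unchanged on a plain dict or a Counter.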
Question 1.1 (5 points)
Complete the cell below to fill out the word_counts dictionary variable. word_counts keeps track of how many times a word type appears across the corpus. For instance, word_counts["movie"] should store the number 61492 -- the count of how many times the word movie appears in the corpus.
import glob
import codecs
word_counts = Counter() # Counters are often useful for NLP in python
for label in [POS_LABEL, NEG_LABEL]:
    for directory in [TRAIN_DIR, TEST_DIR]:
        for fn in glob.glob(directory + "/" + label + "/*txt"):
            doc = codecs.open(fn, 'r', 'utf8') # Open the file with UTF-8 encoding
            #IMPLEMENT ME
            pass
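One possible way to fill in the loop body is to read each file, tokenize it, and add the per-document counts to word_counts. The sketch below runs on a hypothetical temporary directory with two made-up review files instead of the IMDb data, and it re-declares a tokenizer with the same behavior as the Preprocessing Block so that it is self-contained.

```python
import codecs
import glob
import os
import tempfile
from collections import Counter, defaultdict

def tokenize_doc(doc):
    # Same behavior as the Preprocessing Block: split on whitespace, lowercase, count.
    bow = defaultdict(float)
    for token in doc.split():
        bow[token.lower()] += 1.0
    return dict(bow)

# Hypothetical stand-in for one label directory: two tiny, invented review files.
demo_dir = tempfile.mkdtemp()
reviews = {"0_9.txt": "A great movie", "1_2.txt": "Not a great MOVIE movie"}
for name, text in reviews.items():
    with codecs.open(os.path.join(demo_dir, name), "w", "utf8") as f:
        f.write(text)

word_counts = Counter()
for fn in glob.glob(os.path.join(demo_dir, "*txt")):
    with codecs.open(fn, "r", "utf8") as doc:
        bow = tokenize_doc(doc.read())  # word -> count within this one file
    word_counts.update(bow)             # Counter.update adds counts, it does not overwrite

print(word_counts.most_common(3))
```

In the assignment's loop, the last two statements inside the `for fn` loop are the part that would replace `#IMPLEMENT ME`.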
Question 1.3 (5 points)
Using the word_counts dictionary you just created, make a new dictionary called sorted_dict where the words are sorted according to their counts, in descending order:
# IMPLEMENT ME
Now print the first 30 values from sorted_dict.
# IMPLEMENT ME
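A sketch of one way to do both steps, assuming Python 3.7 or later, where a plain dict keeps insertion order (on older interpreters, collections.OrderedDict would be needed instead). The word_counts values here are toy numbers, not corpus results.

```python
import operator
from collections import Counter

# Toy counts, invented for illustration only.
word_counts = Counter({"plot": 1, "the": 5, "great": 2, "movie": 3})

# Sort (word, count) pairs by count, largest first, then rebuild a dict.
# The dict remembers the sorted order because insertion order is preserved.
sorted_dict = dict(
    sorted(word_counts.items(), key=operator.itemgetter(1), reverse=True)
)

# Print the first 30 entries (fewer if the dictionary is smaller).
for word, count in list(sorted_dict.items())[:30]:
    print(word, count)
```

Since word_counts is a Counter, `word_counts.most_common(30)` would give the same top entries as a list of pairs without building a second dictionary.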
Zipf's Law
Question 1.4 (10 points)
In this section, you will verify a key statistical property of text: Zipf's Law.
Zipf's Law describes the relationship between a word's frequency rank and its frequency. For a word w, its frequency is inversely proportional to its rank:
$$\text{count}_w = K \cdot \frac{1}{\text{rank}_w}$$
K is a constant, specific to the corpus and how words are being defined.
What would this look like if you took the log of both sides of the equation?
- Write your answer in one or two lines here.
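One way to write out the log transform, offered as a sketch rather than as the expected graded answer, using the same symbols as the equation for Zipf's Law:

```latex
\log(\text{count}_w) = \log\!\left(K \cdot \frac{1}{\text{rank}_w}\right)
                     = \log K - \log(\text{rank}_w)
```

Read as a function of log(rank), this is a straight line with slope -1 and intercept log K.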
Therefore, if Zipf's Law holds, then once the words are sorted in descending order of frequency, word frequency decreases approximately linearly on a log-log scale.
Now, please make such a log-log plot by plotting rank against frequency.
Hint: Make use of the sorted dictionary you just created. Use a scatter plot where the x-axis is the log(rank), and y-axis is log(frequency). You should get this information from word_counts; for example, you can take the individual word counts and sort them. dict methods .items() and/or values() may be useful. (Note that it doesn't really matter whether ranks start at 1 or 0 in terms of how the plot comes out.) You can check your results by comparing your plots to ones on Wikipedia; they should look qualitatively similar.
Please remember to label the meaning of the x-axis and y-axis.
import math
import operator
x = []
y = []
X_LABEL = "log(rank)"
Y_LABEL = "log(frequency)"
# implement me! you should fill the x and y arrays. Add your code here
# running this cell should produce your plot below
plt.scatter(x, y)
plt.xlabel(X_LABEL)
plt.ylabel(Y_LABEL)
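A minimal sketch of filling x and y, assuming ranks start at 1 so that log(rank) is defined for the most frequent word. The counts are toy values chosen to be roughly Zipfian and are not taken from the corpus; the block computes only the two lists, which would then be passed to the existing plt.scatter(x, y) call.

```python
import math
from collections import Counter

# Toy counts, invented for illustration only.
word_counts = Counter({"the": 100, "a": 50, "movie": 33, "plot": 25})

x = []
y = []
# Sort the counts from most to least frequent; the position in that order is the rank.
for rank, count in enumerate(sorted(word_counts.values(), reverse=True), start=1):
    x.append(math.log(rank))   # log(rank)
    y.append(math.log(count))  # log(frequency)

print(list(zip(x, y)))
```

Starting ranks at 1 avoids taking log(0); as the hint notes, the shape of the plot is essentially the same either way.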