
Question

(2 PARTS)

USING THE FOLLOWING PYTHON VERSIONS AS EXAMPLES, write two term-frequency programs in JAVASCRIPT with NODE.JS, one for each of the two STYLE constraints below. Both programs must:

- run on the command line and take an input file of text called pride-and-prejudice.txt;
- output only the TOP 25 most frequent words with their counts, ordered most frequent at the top;
- write the results to a new text file called output.txt, NOT to the command line;
- FILTER out the STOP WORDS listed below, taking the stop_words.txt file as input (not a string of words hardcoded);
- filter out words under 2 characters, and apply any other filters needed so the output matches exactly what I have included below.

stop_words.txt:

a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your

*****Correct output will look like this if written correctly, so ENSURE THE STOP WORDS ARE PROPERLY REMOVED and verify you get the following output BEFORE POSTING A SOLUTION, OR IT WILL BE DOWNVOTED!*****

output.txt:

mr - 786
elizabeth - 635
very - 488
darcy - 418
such - 395
mrs - 343
much - 329
more - 327
bennet - 323
bingley - 306
jane - 295
miss - 283
one - 275
know - 239
before - 229
herself - 227
though - 226
well - 224
never - 220
sister - 218
soon - 216
think - 211
now - 209
time - 203
good - 201

PART 1: STYLE 1 CONSTRAINTS:

- Existence of one or more units that execute concurrently
- Existence of one or more data spaces where concurrent units store and retrieve data
- No direct data exchanges between the concurrent units, other than via the data spaces

Possible style names: Dataspaces, Linda

PYTHON CODE FOR STYLE 1:

    import re, sys, operator, queue, threading

    # Two data spaces
    word_space = queue.Queue()
    freq_space = queue.Queue()

    stopwords = set(open('../stop_words.txt').read().split(','))

    # Worker function that consumes words from the word space
    # and sends partial results to the frequency space
    def process_words():
        word_freqs = {}
        while True:
            try:
                word = word_space.get(timeout=1)
            except queue.Empty:
                break
            if not word in stopwords:
                if word in word_freqs:
                    word_freqs[word] += 1
                else:
                    word_freqs[word] = 1
        freq_space.put(word_freqs)

    # Let's have this thread populate the word space
    for word in re.findall('[a-z]{2,}', open(sys.argv[1]).read().lower()):
        word_space.put(word)

    # Let's create the workers and launch them at their jobs
    workers = []
    for i in range(5):
        workers.append(threading.Thread(target=process_words))
    [t.start() for t in workers]

    # Let's wait for the workers to finish
    [t.join() for t in workers]

    # Let's merge the partial frequency results by consuming
    # frequency data from the frequency space
    word_freqs = {}
    while not freq_space.empty():
        freqs = freq_space.get()
        for (k, v) in freqs.items():
            if k in word_freqs:
                count = sum(item[k] for item in [freqs, word_freqs])
            else:
                count = freqs[k]
            word_freqs[k] = count

    for (w, c) in sorted(word_freqs.items(), key=operator.itemgetter(1), reverse=True)[:25]:
        print(w, '-', c)
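This is not the finished Part 1 answer (it still needs the file I/O from the requirements), but one way the dataspaces style could look in Node.js. Assumptions to note: async workers that yield via `setImmediate` stand in for the Python threads (a full solution could use `worker_threads` for real concurrency), and `wordSpace`, `freqSpace`, and `topWords` are names of my own choosing:

```javascript
// Dataspaces-style sketch: two shared data spaces, and workers that
// communicate only through them, never with each other directly.
const wordSpace = [];   // data space 1: words waiting to be counted
const freqSpace = [];   // data space 2: partial frequency maps

// Worker: drains words from the word space and publishes its partial
// counts to the frequency space when the word space is empty.
function processWords(stopWords) {
  const freqs = {};
  return new Promise(resolve => {
    const step = () => {
      const word = wordSpace.shift();
      if (word === undefined) {            // word space drained: done
        freqSpace.push(freqs);
        return resolve();
      }
      if (!stopWords.has(word)) freqs[word] = (freqs[word] || 0) + 1;
      setImmediate(step);                  // yield so workers interleave
    };
    step();
  });
}

async function topWords(text, stopWords, n) {
  // Populate the word space (words of 2+ letters, lowercased).
  for (const w of text.toLowerCase().match(/[a-z]{2,}/g) || []) wordSpace.push(w);
  // Launch the workers and wait for all of them to finish.
  await Promise.all([0, 1, 2].map(() => processWords(stopWords)));
  // Merge the partial results by consuming the frequency space.
  const merged = {};
  for (const partial of freqSpace)
    for (const [w, c] of Object.entries(partial)) merged[w] = (merged[w] || 0) + c;
  return Object.entries(merged).sort((a, b) => b[1] - a[1]).slice(0, n);
}
```

In the final program, `text` and `stopWords` would come from pride-and-prejudice.txt and stop_words.txt, and the returned pairs would be written to output.txt as `word - count` lines.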

PART 2: STYLE 2 CONSTRAINTS: Very similar to style 1, but with an additional twist:

- Input data is divided in chunks, similar to what an inverse multiplexer does to input signals
- A map function applies a given worker function to each chunk of data, potentially in parallel
- The results of the many worker functions are reshuffled in a way that allows the reduce step to also be parallelized
- The reshuffled chunks of data are given as input to a second map function that takes a reducible function as input

Possible style names: Map-reduce, Hadoop style, Double inverse multiplexer

PYTHON CODE FOR STYLE 2:

    import sys, re, operator, string
    from functools import reduce

    #
    # Functions for map reduce
    #
    def partition(data_str, nlines):
        """
        Partitions the input data_str (a big string)
        into chunks of nlines.
        """
        lines = data_str.split('\n')
        for i in range(0, len(lines), nlines):
            yield '\n'.join(lines[i:i+nlines])

    def split_words(data_str):
        """
        Takes a string, returns a list of pairs (word, 1),
        one for each word in the input, so
        [(w1, 1), (w2, 1), ..., (wn, 1)]
        """
        def _scan(str_data):
            pattern = re.compile('[\W_]+')
            return pattern.sub(' ', str_data).lower().split()

        def _remove_stop_words(word_list):
            with open('../stop_words.txt') as f:
                stop_words = f.read().split(',')
            stop_words.extend(list(string.ascii_lowercase))
            return [w for w in word_list if not w in stop_words]

        # The actual work of the mapper
        result = []
        words = _remove_stop_words(_scan(data_str))
        for w in words:
            result.append((w, 1))
        return result

    def regroup(pairs_list):
        """
        Takes a list of lists of pairs of the form
        [[(w1, 1), (w2, 1), ..., (wn, 1)],
         [(w1, 1), (w2, 1), ..., (wn, 1)],
         ...]
        and returns a dictionary mapping each unique word to the
        corresponding list of pairs, so
        { w1 : [(w1, 1), (w1, 1)...],
          w2 : [(w2, 1), (w2, 1)...],
          ...}
        """
        mapping = {}
        for pairs in pairs_list:
            for p in pairs:
                if p[0] in mapping:
                    mapping[p[0]].append(p)
                else:
                    mapping[p[0]] = [p]
        return mapping

    def count_words(mapping):
        """
        Takes a mapping of the form (word, [(word, 1), (word, 1)...])
        and returns a pair (word, frequency), where frequency is the
        sum of all the reported occurrences.
        """
        def add(x, y):
            return x + y

        return (mapping[0], reduce(add, (pair[1] for pair in mapping[1])))

    #
    # Auxiliary functions
    #
    def read_file(path_to_file):
        with open(path_to_file) as f:
            data = f.read()
        return data

    def sort(word_freq):
        return sorted(word_freq, key=operator.itemgetter(1), reverse=True)

    #
    # The main function
    #
    splits = map(split_words, partition(read_file(sys.argv[1]), 200))
    splits_per_word = regroup(splits)
    word_freqs = sort(map(count_words, splits_per_word.items()))

    for (w, c) in word_freqs[0:25]:
        print(w, '-', c)
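Again not the finished Part 2 answer, but a sketch of the map-reduce style in Node.js that mirrors the Python structure above. Assumptions: it works on an in-memory string and leaves the required file I/O to the final program; `partition` chunks by word count here rather than by lines; and the function names simply echo the Python version:

```javascript
// Map-reduce-style sketch: partition -> map -> reshuffle -> reduce.

// Split the input into chunks of nWords words (the inverse multiplexer).
function* partition(text, nWords) {
  const words = text.split(/\s+/);
  for (let i = 0; i < words.length; i += nWords)
    yield words.slice(i, i + nWords).join(' ');
}

// Mapper: one chunk in, a list of [word, 1] pairs out.
function splitWords(chunk, stopWords) {
  return (chunk.toLowerCase().match(/[a-z]{2,}/g) || [])
    .filter(w => !stopWords.has(w))
    .map(w => [w, 1]);
}

// Reshuffle: group all [word, 1] pairs by word, ready for parallel reduce.
function regroup(pairsLists) {
  const mapping = {};
  for (const pairs of pairsLists)
    for (const [w] of pairs)
      (mapping[w] = mapping[w] || []).push([w, 1]);
  return mapping;
}

// Reducer: sum the reported 1s for a single word.
function countWords([word, pairs]) {
  return [word, pairs.reduce((sum, [, c]) => sum + c, 0)];
}

function termFrequency(text, stopWords, n) {
  const splits = [...partition(text, 200)].map(c => splitWords(c, stopWords));
  const grouped = regroup(splits);
  return Object.entries(grouped)
    .map(countWords)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}
```

The second map (`Object.entries(grouped).map(countWords)`) is what makes the reduce step independently parallelizable per word, which is the "double inverse multiplexer" twist the constraints describe.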

ENSURE BOTH SOLUTIONS ARE WORKING AND FOLLOW THE CORRESPONDING STYLES BEFORE YOU SUBMIT THEM, OR I WILL HAVE TO DOWNVOTE! You can test them locally first by looking up the pride-and-prejudice.txt file and using it; I can't paste it here, but it is widely available.
