Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

The trie data structure: To complete the program, you will create a data structure called a trie (pronounced as try) to store both the words

The trie data structure:

To complete the program, you will create a data structure called a trie (pronounced as try) to store both the words that appear on a web page and how many times each word occurred. A trie is similar to a tree: each data structure consists of nodes. The starting point in the trie (or tree) is called the root node, which is typically drawn at the top when drawing a trie (or tree). Each node can have zero or more children nodes. If a node has no children, it is called a leaf; otherwise, it is an interior node.

In a tree, data items are associated with different nodes in the tree, so each node (typically) contains a data item. (See the binary tree example in Section 6.5 of K&R for a particular example of a tree.) In contrast, in a trie, the data associated with a node is stored along the entire path from the root to a node. For example, consider the following trie of word counts:

image text in transcribed

Each link in the trie is labeled with a letter. The letters along the path from the root to a node compose the word associated with that node. For example, the bottom right node is for the word me. The count there indicates that the word me has been seen 7 times. Note that some words have prefixes that are also words (e.g., the prefix a of as is a word, which has been seen twice), but some prefixes will have 0 counts (e.g., m). Each node can have up to 26 children (one for each letter), but children that are not needed are simply not created. For example, if there are no words that start with z, then the root node will not have a z child. If there are no words that start with aq, then the a child of the root node will have no q child.

Overall, this trie holds the following word counts: (a, 2), (as, 4), (at, 2), (i, 3), (me, 7).

Trie implementation instructions and hints:

You are to define and use a self-referential structure for a trie node. Since you are dealing with lower-case words, each node in your trie will have from 0 to 26 children. Your trie node must keep track of

The count for its word

All of its children (pointers and which letter each child is associated with)

The binary tree in Section 6.5 of K&R may prove a useful reference

You MUST understand the trie data structure before you begin coding. Cowboy coding will lead to regrets!

The nodes in the trie should be allocated using malloc, deallocated using free

You will probably want to use recursion in some of the functions. Think of any node in the trie as the root of a sub-trie: the node itself and all their children (and grandchildren etc).

Parsing the web page:

To get the word counts for a web page, you should use the "getText" function (e.g., the function in indexPage.c that uses getText.py). The getText function will take a URL, a buffer, and the buffer size as arguments. It will fill the buffer with the text from the page given by the URL. It will truncate the text, if needed, to avoid overflowing the buffer. Upon return, the buffer is guaranteed to be null-terminated. The getText function returns the number of bytes filled into the buffer. Your program should be able to handle pages up to 300000 characters long, indexing the first 300000 characters of any pages that are longer than that.

Once getText returns a string, part of your job is to parse the string into terms (i.e., words), which you can count using your trie data structure. For this project, we are only dealing with terms that consist of alphabetic characters. Thus, when parsing the string, you should skip all non-alphabetic characters, including whitespace characters (' ', '\t', ' ', ' '), punctuation, and digits. Every contiguous substring of alphabetic characters forms a term. Note that you should convert all alphabetic characters to lower-case, in order to use your trie.

Parsing example:

Suppose getText returns the string

" Feb. 27

all events at a glance for today

Events

calendar ...

© 2010 IPFW | 2101 E. Coliseum Blvd., Fort Wayne, IN 46805 1-866-597-0010

IPFW is an Equal Opportunity/Equal Access University."

You should add the following words into your trie for that page:

feb

all

events

at

a

glance

for

today

events

calendar

copy

ipfw

e

coliseum

blvd

fort

wayne

in

ipfw

is

an

equal

opportunity

equal

access

university

Each addition will increment that term's count by one.

Program requirements:

The program should take a URL as a command-line argument and print out information as it indexes that page. The main function of indexPage.c should be very, very simple. By keeping "main" simple and putting the code in other functions, you will be able to use your code for the next project with few modifications. Specifically, the main function should do three things:

Call a function that indexes the web page and returns a pointer to your trie

Call a function that prints out the counts of all the words in the trie

Call a function to destroy the trie (free memory) when it is no longer needed

You should also create a function to add a word occurrence to the trie.

For compatibility, the functions to index the web page and print out the word counts must use the exact formatting and messages shown below.

When indexing the web page, print out the web address, followed by each term that you encounter, converted to lower-case, in the order they appear in the web page. Each term should appear on its own line, preceded by a tab. Here is a fictitious example:

http://www.cs.ipfw.edu

computer

science

computer

science

department

...

When printing out the counts of all the words in the trie, print out one line for each word with non-zero count, in alphabetical order. Do not print out words with 0 counts. To traverse the nodes in alphabetical order, you should print out the count for the root node of your trie (if non-zero), then recurse upon each child in alphabetical order. To keep track of the word associated with the current node, you can use a character buffer as a stack: every time you go down a level in the trie (i.e., make a recursive call), you add the corresponding character to the end of the buffer (i.e., push on the stack). When you go back up a level in the trie (i.e., return from a recursive call), you remove a character from the end of the buffer (i.e., pop off the stack). You should keep track of the buffer length and print an error message if the buffer is too small.

Here is an example of the output for the word counts for the example trie shown above:

a: 2

as: 4

at: 2

i: 3

me: 7

Use the format string "%s: %d " to get this output. Note that the first character on these lines must be a lower-case letter (i.e., no leading spaces).

------------------------------getText.py file code-------------------------

#from BeautifulSoup import BeautifulSoup from bs4 import BeautifulSoup import sys import re import pprint import string import socket import errno

doc = sys.stdin.read() soup = BeautifulSoup(doc, "html5lib")

strings = soup.findAll(text=True)

try: for s in strings: cleanStr=s.strip() if(len(cleanStr) > 0): print cleanStr.encode("ascii", "replace") #pprint.pprint(cleanStr) # We close these in the "try" block to avoid # broken pipe errors when the program quits sys.stdout.close() sys.stderr.close() sys.stdin.close()

except socket.error, e: # A socket error: that's okay x=7; except IOError, e: if e.errno == errno.EPIPE: x=7; else: print "IOError"

---------------------IndexPage.c file, where all the code needs to be-----------------

/* File: indexPage.c */ /* Author: Britton Wolfe */ /* Date: September 3rd, 2010 */

/* This program indexes a web page, printing out the counts of words on that page */

#include #include #include

/* TODO: structure definitions */

/* NOTE: int return values can be used to indicate errors (typically non-zero) or success (typically zero return value) */

/* TODO: change this return type */ void indexPage(const char* url);

int addWordOccurrence(const char* word, const int wordLength /* TODO: other parameters you need */);

void printTrieContents(/* TODO: any parameters you need */);

int freeTrieMemory(/* TODO: any parameters you need */);

int getText(const char* srcAddr, char* buffer, const int bufSize);

int main(int argc, char** argv){ /* TODO: write the (simple) main function

/* argv[1] will be the URL to index, if argc > 1 */

return 0; }

/* TODO: define the functions corresponding to the above prototypes */

/* TODO: change this return type */ void indexPage(const char* url) {}

int addWordOccurrence(const char* word, const int wordLength /* TODO: other parameters you need */) {}

void printTrieContents(/* TODO: any parameters you need */) {}

int freeTrieMemory(/* TODO: any parameters you need */) {}

/* You should not need to modify this function */ int getText(const char* srcAddr, char* buffer, const int bufSize){ FILE *pipe; int bytesRead;

snprintf(buffer, bufSize, "curl -s \"%s\" | python getText.py", srcAddr);

pipe = popen(buffer, "r"); if(pipe == NULL){ fprintf(stderr, "ERROR: could not open the pipe for command %s ", buffer); return 0; }

bytesRead = fread(buffer, sizeof(char), bufSize-1, pipe); buffer[bytesRead] = '\0';

pclose(pipe);

return bytesRead; }

Thank you in advance!! It needs to be able to compile!!

0 2 3 0 4 0 2 3 0 4

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Ace Your Homework with AI

Get the answers you need in no time with our AI-driven, step-by-step assistance

Get Started

Recommended Textbook for

Students also viewed these Databases questions

Question

explain how conventions may conflict with each other;

Answered: 1 week ago

Question

socialist egalitarianism which resulted in wage levelling;

Answered: 1 week ago

Question

soyuznye (all-Union, controlling enterprises directly from Moscow);

Answered: 1 week ago