Answered step by step
Verified Expert Solution
Link Copied!

Question

1 Approved Answer

The Problem There are lots of ways to try to compare the similarity of documents. One way is to focus on the character n-grams that

The Problem

There are lots of ways to try to compare the similarity of documents. One way is to focus on the character n-grams that occur in two tracts of text.

n-grams and text comparison

The n-grams found in a text consists of every sequence of n characters. We are going to record the bigrams found in every word (space separated element) and use the counts from two documents to find their similarity.

Consider the sample sentence fragment:

"the rat race" In order, the bigrams (n=2) would be: th, he, ra, at, ra, ac, ce.

Overall the counts would be th:1, he:1, ra:2, at:1, ce:1. If we take all the spaces out of the sentence to get theratrace, the tri-grams (n=3) would be:

the her era rat atr tra rac ace

Using this information, we could measure the similarity of two documents based on the bigram profile: the list of bigrams and their counts. One (not great) measure is https://en.wikipedia.org/wiki/Cosine_similarity#Ochiai_coefficient . Essentially this would be:

image text in transcribed

Where n is the number of elements in the two quantities A and B. The measure ranges from 0.0 1.0 . We are going to measure this coefficient for various n-grams of provided documents.

Basic Premise

You are going to build a map where the string represents a particular n-gram and the long represents the frequency of finding that n-gram in the document being processed.

Your Tasks

Complete the Project 7 by writing code for the following functions. Details of type for the functions can be found in functions.h (provided for you, see details below). The

map_to_string : returns a single string of the map, where each element is printed as string:long, comma separated elements. No comma at the end, see test cases

vector_to_string : returns a single string of the argument vector>, where each element is printed as string:long, comma

separated elements. No comma at the end! This vector is convenient for sorting, (can't sort a map) as

you'll see below, see test cases

clean_string : for the provided string argument returns a new string where the only

contents are alphabetic characters in lower case of the argument string

generate_ngrams : for the provided argument string, generates all the ngrams of a given n for

that string. Returns a vector of all the ngrams found (no counting at this point, just a

vector of ngrams, could have repeats).

process_line : does the following

o calls clean_string to clean up the provided string argument o calls generate_ngrams on the string (n is provided as an argument) o records the frequency of the ngrams in the argument map reference.

pair_frequency_greaterthan : takes in two pairs and returns whether the first pair is greaterthan the second pair based on the .second of both pairs.

top_n : return a vector> of the argument map that is sorted in frequency order, highest to lowest. Provide only the top n values, n provided as an argument. If you sort the vector to find this ordering, than pair_frequency_greaterthan is useful. It is a two-layered ordering: first by frequency, second alphabetically. Thus you always print the element with the highest frequency. However, if two or more elements are tied in frequency, then print those tied frequency elements in alphabetical order.

pair_string_lessthan : takes in two pair and returns whether the first pair is lessthan the second pair based on the .first of both pairs.

ochiai : calculates the ochiai value for the two argument maps. If you use set_intersection to find the intersection set, then pair_string_lessthan is useful.

Test Cases

Test cases are provided in the subdirectory test of the project directory (because it is getting confusing), at least one for each function . However! It is time you start writing your own tests. The tests I provide are not complete, and we are free to write extra tests for grading purposes. Do a good job and test your code. Just because you pass the one (probably simple) test does not necessarily mean your code is fine. Think about your own testing!!!!

Assignment Notes

You are given the following files:

main.cpp This file includes the main function where the test cases will be run.

Do not modify this file.

functions.h This file is the header file for your functions.cpp file. Do not

modify this file.

input#.txt These four text files will be used to run the test cases.

correct_output_#.txt These text files will be used to grade your output based on the

corresponding input files. Be sure that your output matches these text files exactly to

get credit for the test cases.

You will write only functions.cpp and compile that file using functions.h and

main.cpp in the same directory (as you have done in the lab). You are only turning in

functions.cpp for the project.

Comparing outputs: You can use redirected output to compare the output of your program to

the correct output. Use redirected output and the diff command in the Unix terminal to accomplish this:

Run your executable on the desired input file. Lets assume youre testing input1.txt, the first test case. Redirect the output using the following line when in your proj06 directory: ./a.out output1.txt

Now your output is in the file output1.txt. Compare output1.txt to correct_output1.txt using the following line: diff output1.txt correct_output1.txt

If the two files match, nothing will be output to the terminal. Otherwise, diff will print the two outputs.

main.cpp:

#include using std::cout; using std::cin; using std::endl; using std::boolalpha; #include using std::setprecision; #include using std::map; #include using std::vector; #include using std::pair; #include using std::string; using std::getline; #include using std::ifstream; #include using std::ostream_iterator; #include "functions.h" int main (){ int case_no; ostream_iterator oss(cout, ", "); cout > case_no; switch(case_no){ case 1 : { cin.ignore(100, ' '); string s; getline(cin, s); cout > n; cin.ignore(100, ' '); getline(cin, s); auto v = generate_ngrams(s, n); copy (v.begin(), v.end(), oss); cout  m{ {"a",3}, {"b",2}, {"c",1} }; cout  m; cin >> n; cin.ignore(100, ' '); getline(cin, s); process_line(m, s, n); cout > s >> l; pair p1(s,l); cin >> s >> l; pair p2(s,l); cout > v{ {"a",1}, {"b",2}, {"c",3} }; cout  m; cin >> ngram >> topn; cin.ignore(100, ' '); getline(cin, s); process_line(m, s, ngram); auto v = top_n(m, topn); cout  m1, m2; cin >> ngram >> fname1 >> fname2; ifstream if1(fname1); ifstream if2(fname2); while (getline(if1, line) ) process_line(m1, line, ngram); while (getline(if2, line) ) process_line(m2, line, ngram); cout  

function.h:

#ifndef DOCUMENT_SIM #define DOCUMENT_SIM #include using std::string; #include using std::map; #include using std::pair; #include using std::vector; string map_to_string(map &m); string vector_to_string(vector> &v); string clean_string(string w); vector generate_ngrams(string w, size_t n); void process_line(map& m, string line, size_t n); bool pair_string_lessthan(const pair &p1, const pair &p2); bool pair_frequency_greaterthan(const pair &p1, const pair &p2); vector> top_n(map &m, int num); double ochiai(map &m1, map &m2); // optional, see @461 in piazza double ochiai2(map &m1, map &m2, int num); #endif 

Finally, save your functions.cpp -- your completion of the functions based on main.cpp and functions.h

n CA n B n (A) n(B)

Step by Step Solution

There are 3 Steps involved in it

Step: 1

blur-text-image

Get Instant Access to Expert-Tailored Solutions

See step-by-step solutions with expert insights and AI powered tools for academic success

Step: 2

blur-text-image

Step: 3

blur-text-image

Students also viewed these Databases questions

Question

Prepare and properly label figures and tables for written reports.

Answered: 1 week ago