Question

This is a programming exercise. You will create functions that process a corpus of text files to collect statistics, build a simple index as postings, and support queries.

For the main analysis you will: i) compute, for each unigram and bigram, the main IR statistics: collection frequency (cf), document frequency (df), and the (unnormalized) inverse document frequency as defined by formula 6.7 in the text, idf = log(N/df), where N is the number of documents. All the data should then be output to a single file, with each statistic as a labelled pair, e.g. (cf cvalue) (df dvalue) (idf ivalue). You will also create a separate file containing the postings, which gives, for each converted token, a list of all the document IDs in which the term occurs; assign IDs in the order the documents appear in the input file. You will also need to provide i.b) a function to map IDs back to strings, and functions ii) to retrieve documents, iii) to retrieve terms, and iv) to test your code on a variety of inputs.

When processing the corpus and the queries, you should convert all tokens to lower case and remove punctuation before counting. Leave any hyphens in.
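A minimal sketch of this normalization in Python (the helper name normalize is mine, not part of the assignment), using a regular expression that keeps only letters, digits, whitespace, and hyphens:

    import re

    def normalize(text):
        # Lowercase, then drop every character that is not a letter,
        # digit, whitespace, or hyphen; hyphens are deliberately kept.
        return re.sub(r"[^a-z0-9\s-]", "", text.lower())

    # normalize("Text-mining, in order!") returns "text-mining in order"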

FUNCTION 1 (processCorpus):

The main function "processCorpus" should take as input either a file containing a list of the names of files in the local directory, or the name of a directory. For each test file "filename", create the following four files (shown here for filename = test1), with one converted token or bigram per line followed by the statistics or document names/IDs as specified above.

HW1.test1.unigram.stats

HW1.test1.bigram.stats

For example, if the term "apple" occurs 850 times across 3 documents and N = 1000, HW1.test1.unigram.stats contains the line:

apple (cf 850) (df 3) (idf 2.52)

(Note that 2.52 = log10(1000/3), i.e. the idf uses a base-10 log.) HW1.test1.bigram.stats is similar, except that you count bigrams.

HW1.test1.unigram.postings

HW1.test1.bigram.postings

which, if "apple" occurred in documents 18, 50 and 700, HW1.test1.unigram.postings contains:

apple 18 50 700

FUNCTION 1b (getFname): getFname(id) should return the string corresponding to the name of the file for document ID id.

Python hints: you can get filenames from a directory with "glob". NLTK also has a function for extracting bigrams and functions for creating frequency distributions; see the NLTK book and discussion online. For example: bigrm = list(nltk.bigrams(text.split()))
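A possible skeleton for FUNCTION 1 and 1b along those lines, covering only the unigram path (the bigram path is analogous via nltk.bigrams). The tag parameter and the internal index structures are my own assumptions, not a required design:

    import glob, math, os, re
    from collections import Counter, defaultdict

    docNames = []                  # index = document ID, value = filename
    postings = defaultdict(list)   # term -> list of doc IDs containing it
    termFreq = {}                  # (term, doc_id) -> count in that doc
    N = 0                          # number of documents

    def getFname(doc_id):
        # FUNCTION 1b: map a document ID back to its filename.
        return docNames[doc_id]

    def processCorpus(path, tag="test1"):
        # Accept either a directory or a file listing one filename per line.
        global N
        docNames.clear(); postings.clear(); termFreq.clear()  # fresh run
        if os.path.isdir(path):
            files = sorted(glob.glob(os.path.join(path, "*.txt")))
        else:
            with open(path) as f:
                files = [line.strip() for line in f if line.strip()]
        cf = Counter()             # collection frequency per term
        for doc_id, fname in enumerate(files):
            docNames.append(fname)     # IDs assigned in input order
            with open(fname) as f:
                tokens = re.sub(r"[^a-z0-9\s-]", "", f.read().lower()).split()
            for term, c in Counter(tokens).items():
                postings[term].append(doc_id)
                termFreq[(term, doc_id)] = c
            cf.update(tokens)
        N = len(files)
        with open(f"HW1.{tag}.unigram.stats", "w") as out:
            for term in sorted(cf):
                df = len(postings[term])
                idf = math.log10(N / df)
                out.write(f"{term} (cf {cf[term]}) (df {df}) (idf {idf:.2f})\n")
        with open(f"HW1.{tag}.unigram.postings", "w") as out:
            for term in sorted(postings):
                out.write(term + " " + " ".join(map(str, postings[term])) + "\n")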


FUNCTION 2:

query(item, maxDocs) is a function that takes a string (unigram or bigram) and, based on the currently processed corpus, returns up to maxDocs documents, by document name or ID, that contain the item (i.e. term or bigram), ordered by tf*idf weight.
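A sketch of query under the assumption that processCorpus has built the postings, termFreq, and N structures above; here they are passed in explicitly so the snippet is self-contained (with module-level state they could be dropped to match the two-argument signature):

    import math

    def query(item, maxDocs, postings, termFreq, N):
        # Lowercase the query and collapse whitespace; a full solution
        # would apply the same punctuation stripping as the corpus.
        # Joining on single spaces makes this work for bigrams too,
        # assuming bigram keys are stored as space-joined strings.
        term = " ".join(item.lower().split())
        docs = postings.get(term, [])
        if not docs:
            return []
        idf = math.log10(N / len(docs))
        # Rank the matching documents by tf*idf, highest weight first.
        ranked = sorted(docs, key=lambda d: termFreq[(term, d)] * idf,
                        reverse=True)
        return ranked[:maxDocs]

    # Toy usage: "apple" occurs 3x in doc 0 and 1x in doc 2, N = 4.
    print(query("apple", 2, {"apple": [0, 2]},
                {("apple", 0): 3, ("apple", 2): 1}, 4))   # -> [0, 2]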

GRADUATE STUDENTS (Or for UNDERGRADUATES as Extra Credit):

FUNCTION 3:

Create functions topUnigrams(doc, num) and topBigrams(doc, num) that return the top num unigrams (respectively bigrams) in the given document, ordered by decreasing tf*idf value as computed above. Your functions should work with either the document name or its ID.
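A sketch of topUnigrams under the same assumed index structures; topBigrams would be identical, just reading from the bigram index:

    import math

    def topUnigrams(doc, num, docNames, termFreq, postings, N):
        # `doc` may be a filename or an integer ID; resolve names via the
        # docNames list maintained by processCorpus.
        doc_id = doc if isinstance(doc, int) else docNames.index(doc)
        # tf*idf weight for every term that occurs in this document.
        weights = {term: tf * math.log10(N / len(postings[term]))
                   for (term, d), tf in termFreq.items() if d == doc_id}
        return sorted(weights, key=weights.get, reverse=True)[:num]

    # Toy usage: doc 0 contains "apple" (3x) and "pie" (1x), N = 4.
    print(topUnigrams(0, 1, ["p1.txt"],
                      {("apple", 0): 3, ("pie", 0): 1},
                      {"apple": [0, 2], "pie": [0]}, 4))   # -> ['apple']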

FUNCTION 4 (EVERYONE):

TESTING:

You should write a function testAll() to run your code on several different input scenarios, saving the output in different files, sufficient to demonstrate that your code works properly. You should also create a brief screencast video (under 3 minutes) of your code running, and include either an mp4 or avi file or a link to a cloud location such as YouTube.
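A minimal shape for testAll, assuming the processCorpus and getFname sketches above; the scenario tags and input names here are placeholders, not prescribed files:

    def testAll():
        # Each scenario pairs an output tag with an input source: a file
        # of filenames, a whole directory, etc.
        scenarios = [("test1", "filelist.txt"),
                     ("test2", "corpus_dir"),
                     ("test3", "filelist_small.txt")]
        for tag, source in scenarios:
            processCorpus(source, tag)   # writes HW1.<tag>.* output files
            print(tag, "-> document 0 is", getFname(0))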

DELIVERABLES: In a single ZIP file (only), turn in your code, instructions for running it (and any known limitations), and the saved input and output files. As a separate file, you can provide your video (or a link to it).

Sample input files

p1.txt:

Background: While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.

p2.txt:

Results: We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them.

p3.txt:

To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70-80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.

p4.txt:

Conclusion: We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.

p5.txt:

Text is the predominant medium for information exchange among experts. The volume of biomedical literature is increasing at such a rate that it is difficult to efficiently locate, retrieve and manage relevant information without the use of text-mining applications. In order to share the vast amounts of biomedical knowledge effectively, textual evidence needs to be linked to ontologies as the main repositories of formally represented knowledge. Ontologies are conceptual models that aim to support consistent and unambiguous knowledge sharing and that provide a framework for knowledge integration. An ontology links concept labels to their interpretations, ie specifications of their meanings including concept definitions and relations to other concepts. Apart from relations such as is-a and part-of, generally present in almost any domain, ontologies also model domain-specific relations, eg has-location, clinically-associated-with and has-manifestation are relations specific for the biomedical domain. Therefore, ontologies reflect the structure of the domain and constrain the potential interpretations of terms.
