Question
This is a programming exercise. You will create functions that process a corpus of text files to collect statistics, build a simple index as postings, and support queries.
For the main analysis you will: i) compute, for each unigram and bigram, the main IR statistics: collection frequency (cf), document frequency (df), and the (unnormalized) inverse document frequency as defined by formula 6.7 in the text, log(N/df), where N is the number of documents (the worked example below uses a base-10 log). All of this data should then be output to a single file, with each statistic given as a labelled pair, e.g. (cf cvalue) (df dvalue) (idf ivalue). You will also create a separate file containing the postings, which give, for each converted token, the list of all document ids in which the term occurs. Assign ids in the order the documents appear in the input file. You will also need to provide: 1b) a function to map ids back to strings; ii) a function to retrieve documents; iii) a function to retrieve terms; and iv) a function to test your code on a variety of inputs.
When processing the corpus and the queries, you should convert all tokens to lower case and remove punctuation before counting. Leave any hyphens in.
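A minimal sketch of this conversion, assuming whitespace tokenization and Python's string.punctuation as the punctuation set (the helper names normalize and tokens are illustrative, not part of the assignment):

import string

# every punctuation character except the hyphen gets deleted
_PUNCT = string.punctuation.replace("-", "")
_TABLE = str.maketrans("", "", _PUNCT)

def normalize(text):
    # lowercase and strip punctuation; hyphens are kept
    return text.lower().translate(_TABLE)

def tokens(text):
    # split normalized text on whitespace into converted tokens
    return normalize(text).split()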
FUNCTION 1 (processCorpus):
The main function "processCorpus" should take as input either a file containing a list of the names of files in the local directory, or the name of a directory. For each test input "filename", create the following 4 files, with one converted token or bigram per line, followed by the statistics or document names or ids as specified above.
HW1.test1.unigram.stats
HW1.test1.bigram.stats
where, for a term "apple" occurring 850 times across 3 documents when N = 1000, HW1.test1.unigram.stats contains:
apple (cf 850) (df 3) (idf 2.52)
and HW1.test1.bigram.stats is similar, except that you count bigrams.
HW1.test1.unigram.postings
HW1.test1.bigram.postings
where, if "apple" occurred in documents 18, 50 and 700, HW1.test1.unigram.postings contains:
apple 18 50 700
FUNCTION 1b: getFname(id) should return the string corresponding to the name of the file for the given id.
Python hints: you can get filenames from a directory with "glob". Also, NLTK has a function to get bigrams and functions for creating frequency distributions - see the NLTK book and discussion online, e.g. bigrm = list(nltk.bigrams(text.split()))
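Putting those hints together, one possible sketch of processCorpus and getFname follows. It is a sketch under assumptions, not a definitive implementation: the module-level tables (_id2name, _postings, _tf, _n_docs), the prefix parameter, the base-10 log, and the normalize/tokens helpers from the earlier sketch are all illustrative choices.

import glob, math, os
from collections import Counter, defaultdict
import nltk

_id2name = {}                               # doc id -> file name
_postings = {"unigram": defaultdict(list),  # term -> [doc ids]
             "bigram": defaultdict(list)}
_tf = {"unigram": defaultdict(dict),        # term -> {doc id: count}
       "bigram": defaultdict(dict)}
_n_docs = 0

def getFname(doc_id):
    # FUNCTION 1b: map a document id back to its file name
    return _id2name[doc_id]

def processCorpus(source, prefix="HW1.test1"):
    # source is a directory name or a file listing one corpus file per line;
    # ids are assigned in the order the files appear
    global _n_docs
    _id2name.clear()
    for kind in ("unigram", "bigram"):
        _postings[kind].clear()
        _tf[kind].clear()
    if os.path.isdir(source):
        fnames = sorted(glob.glob(os.path.join(source, "*")))
    else:
        with open(source) as f:
            fnames = [line.strip() for line in f if line.strip()]
    _n_docs = len(fnames)

    cf = {"unigram": Counter(), "bigram": Counter()}
    for doc_id, fname in enumerate(fnames):
        _id2name[doc_id] = fname
        with open(fname) as f:
            toks = tokens(f.read())
        for kind, items in (("unigram", toks),
                            ("bigram", [" ".join(b) for b in nltk.bigrams(toks)])):
            counts = Counter(items)
            cf[kind].update(counts)
            for term, c in counts.items():
                _postings[kind][term].append(doc_id)  # ids stay in input order
                _tf[kind][term][doc_id] = c

    # write the four output files in the formats shown above
    for kind in ("unigram", "bigram"):
        with open(f"{prefix}.{kind}.stats", "w") as out:
            for term in sorted(cf[kind]):
                df = len(_postings[kind][term])
                idf = math.log10(_n_docs / df)
                out.write(f"{term} (cf {cf[kind][term]}) (df {df}) (idf {idf:.2f})\n")
        with open(f"{prefix}.{kind}.postings", "w") as out:
            for term, ids in sorted(_postings[kind].items()):
                out.write(term + " " + " ".join(map(str, ids)) + "\n")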
FUNCTION 2:
query(item, maxDocs) is a function that takes a string (unigram or bigram) and, based on the currently processed corpus, returns up to maxDocs documents, by document name or id, that contain the item (i.e. the term or bigram), ordered by tf*idf weight.
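A possible sketch of query, built on the in-memory structures from the processCorpus sketch above (those names are assumptions of that sketch, not requirements):

import math

def query(item, maxDocs):
    # return up to maxDocs document names containing item
    # (a unigram or a space-separated bigram), ranked by tf*idf
    term = " ".join(tokens(item))              # same conversion as the corpus
    kind = "bigram" if " " in term else "unigram"
    doc_ids = _postings[kind].get(term, [])
    if not doc_ids:
        return []
    idf = math.log10(_n_docs / len(doc_ids))   # formula 6.7: log(N/df)
    ranked = sorted(doc_ids,
                    key=lambda d: _tf[kind][term][d] * idf,
                    reverse=True)
    return [getFname(d) for d in ranked[:maxDocs]]

Note that for a single item the idf factor is the same for every matching document, so this ordering reduces to ordering by raw term frequency; the idf factor is kept explicit to match the assignment's wording.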
GRADUATE STUDENTS (Or for UNDERGRADUATES as Extra Credit):
FUNCTION 3:
Create functions topUnigrams(doc, num) and topBigrams(doc, num) that return the top num unigrams (respectively bigrams) in the given document, ordered by decreasing tf-idf value as computed above. Your functions should work with either the document name or its id.
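A sketch of both functions over the same assumed structures; the shared _topGrams helper and the name-vs-id test are illustrative choices:

import math

def _topGrams(doc, num, kind):
    # accept either a file name or a numeric document id
    # (raises StopIteration if the name is unknown)
    if isinstance(doc, str) and not doc.isdigit():
        doc_id = next(i for i, name in _id2name.items() if name == doc)
    else:
        doc_id = int(doc)
    scored = []
    for term, per_doc in _tf[kind].items():
        if doc_id in per_doc:
            idf = math.log10(_n_docs / len(_postings[kind][term]))
            scored.append((per_doc[doc_id] * idf, term))
    scored.sort(reverse=True)                  # highest tf*idf first
    return [term for _, term in scored[:num]]

def topUnigrams(doc, num):
    return _topGrams(doc, num, "unigram")

def topBigrams(doc, num):
    return _topGrams(doc, num, "bigram")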
FUNCTION 4 (EVERYONE):
TESTING:
You should write a function testAll() that runs your code for several different input scenarios, saving the output in different files, sufficient to demonstrate that your code works properly. You should also create a brief screencast video (under 3 minutes) of your code running, and include either an .mp4 or .avi file or a link to a cloud location such as YouTube.
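A minimal skeleton for testAll; the input names (filelist1.txt, corpus_dir) and the prefixes are placeholders for your own scenarios:

def testAll():
    # run the whole pipeline on a few scenarios and print spot checks
    scenarios = [("HW1.test1", "filelist1.txt"),  # list-of-files input
                 ("HW1.test2", "corpus_dir")]     # directory input
    for prefix, source in scenarios:
        processCorpus(source, prefix)
        print(prefix, "processed,", _n_docs, "documents")
        print("  query('text', 3)           ->", query("text", 3))
        print("  query('text mining', 3)    ->", query("text mining", 3))
        print("  topUnigrams(0, 5)          ->", topUnigrams(0, 5))
        print("  topBigrams(getFname(0), 5) ->", topBigrams(getFname(0), 5))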
DELIVERABLES: In a single ZIP file (only), turn in your code, instructions for running it (and any known limitations), and the saved input and output files. The video (or a link to it) can be provided as a separate file.
Sample input files
p1.txt:
Background: While biomedical text mining is emerging as an important
research area, practical
results have proven difficult to achieve. We believe that an important
first step towards more
accurate text-mining lies in the ability to identify and characterize text
that satisfies various types
of information needs. We report here the results of our inquiry into
properties of scientific text
that have sufficient generality to transcend the confines of a narrow
subject area, while supporting
practical mining of text for factual information. Our ultimate goal is to
annotate a significant corpus
of biomedical text and train machine learning methods to automatically
categorize such text along
certain dimensions that we have defined.
p2.txt:
Results: We have identified five qualitative dimensions that we believe
characterize a broad range
of scientific sentences, and are therefore useful for supporting a general
approach to text-mining:
focus, polarity, certainty, evidence, and directionality. We define these
dimensions and describe the
guidelines we have developed for annotating text with regard to them.
p3.txt:
To examine the effectiveness of the guidelines, twelve annotators
independently annotated the
same set of 101 sentences that were randomly selected from current
biomedical periodicals.
Analysis of these annotations shows 70-80% inter-annotator agreement,
suggesting that our
guidelines indeed present a well-defined, executable and reproducible
task.
p4.txt:
Conclusion: We present our guidelines defining a text annotation task,
along with annotation
results from multiple independently produced annotations, demonstrating
the feasibility of the task.
The annotation of a very large corpus of documents along these guidelines
is currently ongoing.
These annotations form the basis for the categorization of text along
multiple dimensions, to
support viable text mining for experimental results, methodology
statements, and other forms of
information. We are currently developing machine learning methods, to be
trained and tested on
the annotated corpus, that would allow for the automatic categorization of
biomedical text along
the general dimensions that we have presented. The guidelines in full
detail, along with annotated
examples, are publicly available.
p5.txt:
Text is the predominant medium for
information exchange among experts.
The volume of biomedical literature is
increasing at such a rate that it is difficult
to efficiently locate, retrieve and manage
relevant information without the use of
text-mining applications. In order
to share the vast amounts of biomedical
knowledge effectively, textual evidence
needs to be linked to ontologies as the
main repositories of formally represented
knowledge. Ontologies are conceptual
models that aim to support consistent and
unambiguous knowledge sharing and that
provide a framework for knowledge
integration. An ontology links concept
labels to their interpretations, ie
specifications of their meanings including
concept definitions and relations to other
concepts. Apart from relations such as isa
and part-of, generally present in almost
any domain, ontologies also model
domain-specific relations, eg has-location,
clinically-associated-with and has-manifestation
are relations specific for the
biomedical domain. Therefore, ontologies
reflect the structure of the domain and
constrain the potential interpretations of
terms.