Question
We are finally ready to put all the pieces together! We can now measure documents, train our classifier, and score documents per language. Write a function classify_doc(document, lang_counts=default_lang_counts) which takes a string document and a dictionary of normalised lang_counts, and returns the most likely language based on the score of each language.
As before, we have provided an implementation of score_document(document, lang_counts) in a hidden module (already imported). It takes a document and returns a dictionary of scores per language, as in the previous question. We have also provided a number of documents to play with.
Your function should return the language with the highest score. In the event of a tie it should return 'English', since the most common document in the training set is written in English, suggesting that if the document comes from the same source (Wikipedia), it is probably written in English. This is obviously not a perfect assumption, but it is better than nothing given no other information.
But how do we determine a tie? If the two top-ranking scores lie within 1e-10 of one another, then we shall say it's a tie (why do we do this, rather than testing equality directly?).
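One way to see why a tolerance is used: floating-point arithmetic rounds its results, so two scores that are mathematically equal can still differ in their last few bits and fail an == test. The sketch below illustrates this; is_tie is a hypothetical helper name, not part of the exercise, and it treats "within 1e-10" as an absolute difference of at most 1e-10.

```python
import math

TIE_TOLERANCE = 1e-10  # threshold stated in the question


def is_tie(x, y, tol=TIE_TOLERANCE):
    """Return True if x and y differ by at most tol (absolute difference)."""
    # rel_tol=0.0 switches off the relative check so only abs_tol applies.
    return math.isclose(x, y, rel_tol=0.0, abs_tol=tol)


a = 0.1 + 0.2   # mathematically 0.3, but rounding leaves a tiny error
b = 0.3

print(a == b)        # False: exact comparison is fooled by rounding
print(is_tie(a, b))  # True: the tolerance absorbs the rounding error
```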
Your function should behave as follows:
>>> s = open('en_163083.txt').read()
>>> classify_doc(s)
'English'
>>> classify_doc('asdfhlj')
'Icelandic'
>>> s = open('pl_188313.txt').read()
>>> classify_doc(s)
'Polish'
>>> classify_doc('Hello Bob')
'Italian'
How can this be coded in Python?
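A minimal sketch of one possible approach follows. Because the real score_document and default_lang_counts live in the hidden module, the versions below are toy stand-ins (made-up bigram weights and a made-up scorer) included only so the example runs by itself; classify_doc is the part the question asks for. The sketch assumes that "within 1e-10" means an absolute difference below 1e-10 and that at least one language is scored.

```python
TIE_TOLERANCE = 1e-10  # threshold stated in the question

# ASSUMPTION: hypothetical stand-in for the provided default_lang_counts.
default_lang_counts = {
    "English": {"th": 0.6, "he": 0.8},
    "Polish": {"sz": 0.7, "cz": 0.7},
}


def score_document(document, lang_counts=default_lang_counts):
    """ASSUMPTION: toy stand-in for the hidden scorer.

    Sums the weight of every known bigram that occurs in the document.
    """
    doc = document.lower()
    bigrams = {doc[i:i + 2] for i in range(len(doc) - 1)}
    return {
        lang: sum(w for gram, w in counts.items() if gram in bigrams)
        for lang, counts in lang_counts.items()
    }


def classify_doc(document, lang_counts=default_lang_counts):
    """Return the highest-scoring language, or 'English' on a tie."""
    scores = score_document(document, lang_counts)
    # Rank (language, score) pairs from highest to lowest score.
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    # A tie is declared when the top two scores are within the tolerance.
    if len(ranked) >= 2 and abs(ranked[0][1] - ranked[1][1]) < TIE_TOLERANCE:
        return "English"
    return ranked[0][0]
```

With the toy weights above, classify_doc('the') returns 'English' and classify_doc('szcz') returns 'Polish', while a document matching no bigram scores zero everywhere and falls back to 'English'. These results depend on the stand-in scorer, so they will not reproduce the example outputs in the question, which come from the hidden score_document.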