Question

1) First, the client wants to know whether it is possible to use machine learning to identify the topic of a paragraph of text, for some specific topics of interest. Specifically, they want to know whether a given piece of text is about artificial intelligence, movies about artificial intelligence, programming, philosophy or biographies (5 classes in total). They also provide a feature indicating whether each paragraph contains references to a person, an organisation and/or a product, which they think might provide further relevant information.

a. For the avoidance of doubt, the topic to predict is in the column titled category. The input features to use are: paragraph and has_entity.

b. The client will consider the results successful if the model is better than a trivial baseline, if it does not overfit to the training dataset and if, for each class, no more than 10% of the paragraphs get misclassified into an unrelated class (artificial intelligence text being misclassified as programming is the one error they are willing to overlook).

c. They also want to know which other scalar performance metric would be most informative to understand how well the algorithm is performing in general.

2) Then, the client wants to explore whether it would be possible to automatically detect if a given paragraph is written clearly enough. They are planning to use the results to automatically reject edits and additions to the website's knowledge base if they are not clear enough. However, they do not have any labels for this task outside of the first few rows of the dataset. So, they want you to build a prototype by first labelling a subset of the data (they give an optional suggestion of 100 data points), and then building a machine learning algorithm to predict these labels from the text and any other feature, as relevant. Specifically, they want you to use two labels, clear_enough and not_clear_enough, to denote the level of text clarity. You should add your labels in the column called text_clarity. This column will then be your output feature.

a. They have heard there is now a lot of interest in the responsible use of machine learning. So, they would like you to review the ethical implications and risks of using an algorithm to automatically reject users' work (for example, in terms of potential bias). Depending on the risks identified, they are open to applying the algorithm in a different way and are looking for suitable suggestions.

b. The client will develop the prototype further if the algorithm produces results that do not overfit on the training data and are better than simply guessing the majority class all the time. They also want to know your top suggestion for improvement.

c. The client is particularly interested in a prototype that includes some more advanced techniques (the main suggestions given are being able to make use of both labelled and unlabelled data points or using pre-trained word embeddings).

They want you to write the results of your analysis and implementation in a report. More details about what to include in the report are provided below.

The dataset can be downloaded from this Moodle link. The dataset has been adapted to the requirements of this module; the original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License by Wikipedia.

The table below gives some background information about the dataset features. For task 2, you will need to carefully select which extra features, in addition to the paragraph column, can be helpful to use as input.

FEATURE NAME          BRIEF DESCRIPTION
par_id                Unique identifier for each paragraph to classify.
paragraph             Text to classify.
has_entity            Whether the text contains a reference to a product (yes/no), an organisation (yes/no), or a person (yes/no).
lexicon_count         The number of words in the text.
difficult_words       The number of difficult words in the text.
last_editor_gender    The gender of the latest person to edit the text.
category              The category into which the text should be classified.
text_clarity          The clarity level of the text. Very few data points are labelled at first.
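
The success criteria in 1b and 2b (beating a trivial baseline, and keeping unrelated misclassifications at or below 10% per class) can be checked directly from a model's predictions. The following is a minimal sketch using only the Python standard library, not the expected solution. The function names, the `ALLOWED_CONFUSIONS` set and the toy labels are all hypothetical. It assumes the one overlooked error is artificial intelligence text predicted as programming, and it uses macro-averaged F1 as one candidate answer to 1c.

```python
from collections import Counter

# Assumption: the only tolerated confusion (task 1b) is a true
# "artificial intelligence" paragraph predicted as "programming".
ALLOWED_CONFUSIONS = {("artificial intelligence", "programming")}


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(y_true)
    return counts.most_common(1)[0][1] / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unrelated_error_rates(y_true, y_pred, allowed=ALLOWED_CONFUSIONS):
    """For each true class, the fraction predicted as an unrelated class."""
    totals = Counter(y_true)
    errors = Counter()
    for t, p in zip(y_true, y_pred):
        if t != p and (t, p) not in allowed:
            errors[t] += 1
    return {c: errors[c] / totals[c] for c in totals}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen."""
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only; these are not results from the dataset.
y_true = (["artificial intelligence"] * 4
          + ["programming"] * 3
          + ["philosophy"] * 3)
y_pred = ["artificial intelligence", "artificial intelligence",
          "programming", "philosophy",
          "programming", "programming", "programming",
          "philosophy", "philosophy", "biographies"]

rates = unrelated_error_rates(y_true, y_pred)
beats_baseline = accuracy(y_true, y_pred) > majority_baseline_accuracy(y_true)
meets_error_limit = all(rate <= 0.10 for rate in rates.values())

print("beats majority baseline:", beats_baseline)
print("unrelated error rate per class:", rates)
print("meets 10% limit for every class:", meets_error_limit)
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

Macro averaging weights every class equally, so a class the model handles poorly lowers the score even if it is rare. Running the same functions on both the training split and a held-out split gives a simple overfitting check: a large gap between the two scores suggests the model has fit the training data too closely.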
