Question
1) First, the client wants to know whether it is possible to use machine learning to identify the topic of a paragraph of text, for some specific topics of interest. Specifically, they want to know whether a given piece of text is about artificial intelligence, movies about artificial intelligence, programming, philosophy or biographies (five classes in total). They also provide a feature indicating whether each paragraph contains references to a person, an organisation and/or a product, which they think might provide further relevant information.

a) For the avoidance of doubt, the topic to predict is in the column titled category. The input features to use are: paragraph and hasentity.

b) The client will consider the results successful if the model is better than a trivial baseline, if it does not overfit to the training dataset and if, for each class, no more than [...] of the paragraphs get misclassified into an unrelated class (artificial intelligence text being misclassified as programming is the one error they are willing to overlook).

c) They also want to know which other scalar performance metric would be most informative to understand how well the algorithm is performing in general.

2) Then, the client wants to explore whether it would be possible to automatically detect if a given paragraph is written clearly enough. They are planning to use the results to automatically reject edits and additions to the website's knowledge base if they are not clear enough. However, they do not have any labels for this task outside of the first few rows of the dataset. So they want you to build a prototype by first labelling a subset of the data (they give an optional suggestion of [...] data points) and then building a machine learning algorithm to predict these labels from the text and any other feature, as relevant. Specifically, they want you to use two labels, clearenough and notclearenough, to denote the level of text clarity. You should add your labels in the column called textclarity. This column will then be your output feature.

a) They have heard there is now a lot of interest in the responsible use of machine learning. So they would like you to review the ethical implications and risks of using an algorithm to automatically reject users' work (for example, in terms of potential bias). Depending on the risks identified, they are open to applying the algorithm in a different way and are looking for suitable suggestions.

b) The client will develop the prototype further if the algorithm produces results that do not overfit on the training data and are better than simply guessing the majority class all the time. They also want to know your top suggestion for improvement.

c) The client is particularly interested in a prototype that includes some more advanced techniques. The main suggestions given are being able to make use of both labelled and unlabelled data points, or using pretrained word embeddings.

They want you to write the results of your analysis and implementation in a report. More details about what to include in the report are provided below.

The dataset can be downloaded from this Moodle link. The dataset has been adapted to the requirements of this module; the original textual content is licensed under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-ShareAlike License by Wikipedia.

The table below gives some background information about the dataset features. For task [...], you will need to carefully select which extra features, in addition to the paragraph column, can be helpful to use as input.

Feature name: Brief description
parid: Unique identifier for each paragraph to classify.
paragraph: Text to classify.
hasentity: Whether the text contains a reference to a product (yes/no), an organisation (yes/no) or a person (yes/no).
lexiconcount: The number of words in the text.
difficultwords: The number of difficult words in the text.
lasteditorgender: The gender of the latest person to edit the text.
category: The category into which the text should be classified.
textclarity: The clarity level of the text. Very few data points are labelled at first.
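The success criteria for the topic classification task can be checked mechanically once a model has produced predictions. The following is a minimal sketch using only the Python standard library, not the expected solution. The toy labels are invented for illustration, MAX_RATE is a hypothetical placeholder because the actual threshold is not given in the text above, and macro-averaged F1 is shown only as one candidate for the extra scalar metric.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of a trivial model that always predicts the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def misclassification_rates(y_true, y_pred, allowed=()):
    """Map (true, predicted) error pairs to the share of the true class they affect.

    Pairs listed in `allowed` are tolerated errors and are left out.
    """
    totals = Counter(y_true)
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return {
        pair: count / totals[pair[0]]
        for pair, count in errors.items()
        if pair not in allowed
    }


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented toy labels, standing in for held-out predictions of a real model.
y_true = ["artificial intelligence"] * 4 + ["programming"] * 3 + ["philosophy"] * 3
y_pred = [
    "artificial intelligence", "artificial intelligence", "programming", "philosophy",
    "programming", "programming", "programming",
    "philosophy", "philosophy", "biographies",
]

# The one error the client is willing to overlook.
ALLOWED = {("artificial intelligence", "programming")}
# Hypothetical threshold: the real value is missing from the task description.
MAX_RATE = 0.10

rates = misclassification_rates(y_true, y_pred, allowed=ALLOWED)
violations = {pair: rate for pair, rate in rates.items() if rate > MAX_RATE}

print(f"accuracy          = {accuracy(y_true, y_pred):.3f}")
print(f"majority baseline = {majority_baseline_accuracy(y_true):.3f}")
print(f"macro F1          = {macro_f1(y_true, y_pred):.3f}")
print(f"violations        = {violations}")
```

Running the same functions on both the training split and a held-out split gives a simple overfitting check: a large gap between the two scores suggests the model has memorised the training data.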
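For the text clarity task, one way to use both labelled and unlabelled data points is self-training: fit a classifier on the few labelled paragraphs, pseudo-label the unlabelled paragraphs it is confident about, and refit. The sketch below illustrates the idea with a small naive Bayes classifier written in plain Python. It is an assumption-laden illustration rather than a recommended implementation: the example sentences are invented, the 0.9 confidence threshold is arbitrary, and the label strings are copied as they appear in the text above, so the real dataset may spell them differently.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case whitespace tokenisation (deliberately simplistic)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, text):
        """Return a dict mapping each label to its posterior probability."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for w in tokenize(text):
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[label][w] + 1) / denom)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {k: math.exp(v - top) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


def self_train(labelled, unlabelled, threshold=0.9, max_rounds=5):
    """Grow the labelled set with confident pseudo-labels, refitting each round."""
    texts = [text for text, _ in labelled]
    labels = [label for _, label in labelled]
    pool = list(unlabelled)
    model = NaiveBayes().fit(texts, labels)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for text in pool:
            label, p = max(model.predict_proba(text).items(), key=lambda kv: kv[1])
            if p >= threshold:
                texts.append(text)
                labels.append(label)
                added += 1
            else:
                remaining.append(text)
        pool = remaining
        if added == 0:  # nothing confident enough: stop early
            break
        model = NaiveBayes().fit(texts, labels)
    return model, pool


# Invented toy data, for illustration only.
labelled = [
    ("the model learns a function from labelled examples", "clearenough"),
    ("thing does stuff with other stuff somehow", "notclearenough"),
]
unlabelled = [
    "the model learns from examples",
    "stuff does thing somehow",
    "a function of other stuff",
]

model, pool = self_train(labelled, unlabelled)
print("still unlabelled:", pool)
print(model.predict_proba("the model learns from examples"))
```

Pseudo-labels can reinforce the model's early mistakes, so a hand-labelled validation set that is never pseudo-labelled is still needed to compare the result against the majority-class baseline.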