Question
A collection of reviews about comedy movies (data D) contains the following keywords and binary labels for whether each movie was funny (+) or not funny (-). The data are shown below: for example, the cell at the intersection of "Review 1" and "laugh" indicates that the text of Review 1 contains 2 tokens of the word "laugh."

Review | laugh | hilarious | awesome | dull | yawn | bland | Y
1 1 1 0 +
2 0 0 0 +
0 0 0 1 +
0 2 1 0 -
1 2 0

You may find it easier to complete this problem if you copy the data into a spreadsheet and use formulas for calculations, rather than doing the calculations by hand. Please report all scores as log-probabilities, with 3 significant figures. (10 pts)

(a) Assume that you have trained a Naive Bayes model on data D to detect funny vs. not-funny movie reviews. Compute the model's predicted scores for funny and not-funny for the following sentence S (i.e., P(+|S) and P(-|S)), and determine which label the model will apply to S. (4 pts)

S: "This film was hilarious! I didn't yawn once. Not a single bland moment. Every minute was a laugh."

(b) The counts in the original data are sparse and may lead to overfitting, e.g. a strong prior on assigning the "not funny" label to reviews that contain "yawn." What would happen if you applied smoothing? Apply add-1 smoothing and recompute the Naive Bayes model's predicted scores for S. Did the label change? (4 pts)

(c) What is an additional feature that you could extract from the text to improve the classification of sentences like S, and how would it help improve the classification? (2 pts)
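A minimal sketch of how the scores in parts (a) and (b) could be computed is given below, in Python. The score for a label y on sentence S is log P(y) plus the sum over keywords w in S of count(w in S) * log P(w | y); the shared term log P(S) is a constant and can be dropped when comparing the two labels. Because the table above is only partially legible, the per-review counts in reviews are placeholder values rather than the actual data D, and the names VOCAB, train_nb, and score are illustrative.

import math
from collections import defaultdict

# Placeholder counts: the table in the question is only partially legible,
# so these per-review keyword counts are illustrative, NOT the actual data D.
VOCAB = ["laugh", "hilarious", "awesome", "dull", "yawn", "bland"]
reviews = [
    ([2, 1, 1, 0, 0, 0], "+"),   # keyword counts aligned with VOCAB, then label
    ([1, 2, 0, 0, 0, 0], "+"),
    ([0, 0, 1, 0, 0, 1], "+"),
    ([0, 0, 0, 2, 1, 0], "-"),
    ([0, 0, 0, 1, 2, 1], "-"),
]

def _log(p):
    # log(0) is -inf, which is what an unsmoothed model assigns to unseen words
    return math.log(p) if p > 0 else float("-inf")

def train_nb(data, alpha=0.0):
    # Multinomial Naive Bayes; alpha=1.0 gives add-1 (Laplace) smoothing
    doc_counts = defaultdict(int)
    word_counts = defaultdict(lambda: [0] * len(VOCAB))
    for counts, label in data:
        doc_counts[label] += 1
        for i, c in enumerate(counts):
            word_counts[label][i] += c
    log_prior, log_lik = {}, {}
    for label in doc_counts:
        log_prior[label] = math.log(doc_counts[label] / len(data))
        total = sum(word_counts[label]) + alpha * len(VOCAB)
        log_lik[label] = [_log((word_counts[label][i] + alpha) / total)
                          for i in range(len(VOCAB))]
    return log_prior, log_lik

def score(sentence_counts, log_prior, log_lik):
    # log P(y) + sum_w count(w in S) * log P(w | y), for each label y
    return {y: log_prior[y] + sum(c * log_lik[y][i]
                                  for i, c in enumerate(sentence_counts) if c > 0)
            for y in log_prior}

# Keyword counts in S: one token each of "hilarious", "yawn", "bland", "laugh"
s_counts = [1, 1, 0, 0, 1, 1]
print(score(s_counts, *train_nb(reviews, alpha=0.0)))   # part (a), unsmoothed
print(score(s_counts, *train_nb(reviews, alpha=1.0)))   # part (b), add-1 smoothing

With sparse counts, any keyword of S that never appears in a class drives that class's unsmoothed score to log 0 (negative infinity), which is the overfitting issue raised in part (b); setting alpha = 1.0 applies add-1 smoothing, so every keyword receives a non-zero probability and both scores stay finite.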