Question

Posted on Sep 10, 2024

Ann needs to build a text categorization system for spam detection. She has a set of e-mails for which she has the correct category (i.e., spam or good e-mail). With those text samples, she can analyze the distributions of the terms in both categories.

To summarize the information, Ann generates a set of contingency tables, one for each term. Her training set consists of n = 600 e-mails, and the word "free" appears in 110 of them. More precisely, the term "free" appears in 100 spam e-mails and in 10 good e-mails. We can summarize the information in the following table.

To evaluate the discriminative power of each term before building a naive Bayes model, Ann suggests computing the pointwise mutual information (PMI). To Ann, this term selection strategy seems to be a good one because she has seen it described in many books and scientific articles. So, she will use this approach to select the 100 most appropriate terms.

a) How can she estimate the following four probabilities: P[free], P[SPAM], P[free appearing in category SPAM], and P[free | SPAM]? State clearly the values used to estimate these probabilities.

b) She has another term, "system", with the following contingency table. When comparing the PMI values of the terms "free" and "system" for the category SPAM, do you think that one is (or both are) a good discriminator for this category? Justify your answer.
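The question does not write out the PMI formula, so the following is a sketch of the usual definition, with each probability estimated by a relative frequency. The symbols n (number of e-mails), n_t (e-mails containing the term), n_c (e-mails in the category) and n_{tc} (e-mails in the category that contain the term) are introduced here for illustration only, and the base of the logarithm is left open because the source does not state it.

```latex
\[
\mathrm{PMI}(t_k, c_i)
  = \log \frac{P[t_k, c_i]}{P[t_k]\, P[c_i]}
  \approx \log \frac{n_{tc}/n}{(n_t/n)\,(n_c/n)}
  = \log \frac{n \cdot n_{tc}}{n_t \cdot n_c}
\]
```

Under this definition the value is 0 exactly when the estimated joint probability equals the product of the two marginals, positive when the term and the category co-occur more often than independence would predict, and negative otherwise.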

c) According to John, the pointwise mutual information measure is stupid because the value in the numerator, P[t_k, c_i], is equal to the value in the denominator, P[t_k] P[c_i]. In his view, every educated person knows that P[t_k, c_i] is estimated as P[t_k] P[c_i]. Is John's argument always correct? Always incorrect? Sometimes correct, sometimes incorrect? Justify your answer.

d) For this application, which preprocessing do you propose to apply to the raw text before obtaining the contingency table for each term? Justify your choice.
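To make part d) concrete, here is a minimal Python sketch of how a per-term contingency table could be derived from labelled e-mails. The preprocessing shown (lowercasing, keeping alphabetic runs only, counting each term at most once per e-mail) is one illustrative choice rather than the expected answer, and the three sample e-mails, `tokenize` and `contingency_tables` are hypothetical names and data invented for this example.

```python
import re
from collections import Counter

# Assumption: a token is a run of ASCII letters after lowercasing.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Return the set of distinct lowercase terms in one e-mail."""
    return set(TOKEN_RE.findall(text.lower()))


def contingency_tables(emails):
    """Build one 2x2 table per term from (text, is_spam) pairs.

    Each table is a tuple (a, b, c, d):
      a = spam e-mails containing the term
      b = good e-mails containing the term
      c = spam e-mails without the term
      d = good e-mails without the term
    """
    n_spam = n_good = 0
    df_spam, df_good = Counter(), Counter()
    for text, is_spam in emails:
        terms = tokenize(text)  # a set, so each e-mail counts a term once
        if is_spam:
            n_spam += 1
            df_spam.update(terms)
        else:
            n_good += 1
            df_good.update(terms)

    tables = {}
    for term in set(df_spam) | set(df_good):
        a, b = df_spam[term], df_good[term]
        tables[term] = (a, b, n_spam - a, n_good - b)
    return tables


# Hypothetical sample data, for illustration only.
sample = [
    ("FREE offer, click now", True),
    ("Free free FREE prizes", True),
    ("Meeting notes for the system upgrade", False),
]
tables = contingency_tables(sample)
print(tables["free"])
```

Counting e-mails rather than occurrences matches the way the exercise states its numbers ("appears in 110 e-mails"). Lowercasing merges "FREE", "Free" and "free" into one term, which is why the repeated word in the second sample e-mail is counted only once.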

Contingency table for the term "free":

\begin{tabular}{|l|c|c|c|}
\hline
 & SPAM & not SPAM & Total \\ \hline
"free" & 100 & 10 & 110 \\ \hline
not "free" & 150 & 340 & 490 \\ \hline
Total & 250 & 350 & 600 \\ \hline
\end{tabular}

Contingency table for the term "system":

\begin{tabular}{|l|c|c|c|}
\hline
 & SPAM & not SPAM & Total \\ \hline
"system" & 90 & 130 & 220 \\ \hline
not "system" & 160 & 220 & 380 \\ \hline
Total & 250 & 350 & 600 \\ \hline
\end{tabular}
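The arithmetic behind parts a) and b) can be checked with a short Python sketch that reads the four cells of each table and estimates the probabilities as relative frequencies. It assumes the common definition PMI = log(P[t, c] / (P[t] P[c])) with a base-2 logarithm; the exercise does not state the base, and the function name `pmi_from_table` is hypothetical.

```python
import math


def pmi_from_table(a, b, c, d, base=2.0):
    """Estimate probabilities and PMI from a 2x2 contingency table.

    a = e-mails in the category that contain the term
    b = e-mails outside the category that contain the term
    c = e-mails in the category without the term
    d = e-mails outside the category without the term
    """
    n = a + b + c + d
    p_term = (a + b) / n             # e.g. P[free]
    p_class = (a + c) / n            # e.g. P[SPAM]
    p_joint = a / n                  # e.g. P[free, SPAM]
    p_term_given_class = a / (a + c) # e.g. P[free | SPAM]
    # Assumption: base-2 logarithm; another base only rescales the values.
    pmi = math.log(p_joint / (p_term * p_class), base)
    return {
        "p_term": p_term,
        "p_class": p_class,
        "p_joint": p_joint,
        "p_term_given_class": p_term_given_class,
        "pmi": pmi,
    }


# Cell counts taken from the two tables in the question.
free = pmi_from_table(100, 10, 150, 340)
system = pmi_from_table(90, 130, 160, 220)

for name, stats in (("free", free), ("system", system)):
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

With these counts the ratio inside the logarithm is 60000/27500, about 2.18, for "free" (PMI of roughly 1.13 bits) and 54000/55000, about 0.98, for "system" (PMI of roughly -0.03 bits). The second value is close to 0, which is the value PMI takes when the joint probability equals the product of the marginals.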
