Question

Discrete structures: document classification using Bernoulli Naive Bayes. Problems 1 and 2. Thank you.
Worksheet: Document classification using Bernoulli Naive Bayes

Define a dictionary comprising eight words: $w_1$ = goal, $w_2$ = tutor, $w_3$ = variance, $w_4$ = speed, $w_5$ = drink, $w_6$ = defense, $w_7$ = performance, $w_8$ = field.

Consider a set of documents, each of which is related either to Sports (S) or to Informatics (I). A dictionary of 8 words implies that each document can be represented as a sequence of 8 binary elements $f_1, \dots, f_8$ (called a binary vector). Each binary element $f_t$ indicates whether the corresponding word $w_t$ is present in the document or not. For example, the binary vector (0 0 1 0 1 1 0 0) indicates that the document contains the words $w_3$, $w_5$ and $w_6$, because the binary elements $f_3$, $f_5$ and $f_6$ are 1.

The data through which we estimate probabilities or make inferences is called "training" data. It is presented below as a matrix for each category (or "class"), in which each row represents one document. For example, matrix $B^{Sport}$ has 6 rows, which implies there are 6 documents in the class S. How many documents are there in the class I (Informatics)?

$$
B^{Sport} =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\
1 & 0 & 0 & 0 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1
\end{bmatrix}
\qquad
B^{Inf} =
\begin{bmatrix}
0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 0
\end{bmatrix}
$$

Problem

Classify the following documents, represented by binary vectors, into Sports or Informatics:

1. $b_1$ = (1 0 0 1 1 1 0 1)
2. $b_2$ = (0 1 1 0 1 0 1 0)

Algorithm

Given a training set of documents (each labeled with a class, S or I), we can estimate a Bernoulli document classification model as follows:

1. Count the following in the training set (represented by matrices $B^{Sport}$ and $B^{Inf}$ above):
   - $N$, the total number of documents.
   - $N_k$, the number of documents labeled with class $C = k$, for $k = S, I$ (that is, what is the number of documents in class S? In class I?).
   - $n_k(w_t)$, the number of documents of class $C = k$, for $k = S, I$, that contain the word $w_t$ (that is, the frequency of occurrence of word $w_t$ in sports/informatics documents). How many $w_t$'s do we have? 8 (there are 8 words). For example, $n_S(w_1)$ = number of documents in class S that have the word $w_1$ = 3.
2. Estimate $p(w_t \mid C = k) = n_k(w_t) / N_k$ (that is, estimate $p(w_t \mid S)$ and $p(w_t \mid I)$).
3. Estimate $p(C = k) = N_k / N$ (that is, estimate $p(S)$ and $p(I)$).

To classify a new unlabeled document $D$ represented by the binary vector $b_{new} = (f_1, f_2, \dots, f_8)$, estimate the probability $p(C = k \mid D)$ for each class $C = k$ ($k = S, I$) using the equation

$$
p(C = k \mid D) \propto p(C = k) \prod_{t=1}^{8} \Big[ f_t \, p(w_t \mid C = k) + (1 - f_t)\big(1 - p(w_t \mid C = k)\big) \Big]
$$

and determine which of $p(C = S \mid D)$ and $p(C = I \mid D)$ is higher.
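The counting, estimation and classification steps of the worksheet can be sketched in Python as follows. This is a minimal illustration, not the worksheet's official solution: the function names (`word_counts`, `train`, `score`, `classify`) are hypothetical, and the two training matrices are an assumed reading of the worksheet data (6 Sport rows, 5 Informatics rows, consistent with the stated $n_S(w_1) = 3$). Exact fractions are used so the products can be checked by hand. No smoothing is applied, which matches the plain relative-frequency estimates described in the algorithm.

```python
from fractions import Fraction

# ASSUMPTION: training matrices as read from the worksheet
# (6 Sport documents, 5 Informatics documents, 8 words each).
B_SPORT = [
    [1, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
]
B_INF = [
    [0, 1, 1, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 1, 0],
]


def word_counts(matrix):
    """n_k(w_t): number of documents (rows) containing each word (column)."""
    return [sum(column) for column in zip(*matrix)]


def train(classes):
    """Return {label: (prior p(C=k), [p(w_t | C=k) for each word])}."""
    n_total = sum(len(matrix) for matrix in classes.values())  # N
    model = {}
    for label, matrix in classes.items():
        n_k = len(matrix)  # N_k
        likelihoods = [Fraction(c, n_k) for c in word_counts(matrix)]
        model[label] = (Fraction(n_k, n_total), likelihoods)
    return model


def score(model, label, doc):
    """Unnormalised posterior: p(C=k) * prod_t [f_t*p + (1-f_t)*(1-p)]."""
    prior, likelihoods = model[label]
    result = prior
    for f_t, p in zip(doc, likelihoods):
        result *= p if f_t else 1 - p
    return result


def classify(model, doc):
    """Pick the class with the highest unnormalised posterior."""
    return max(model, key=lambda label: score(model, label, doc))


MODEL = train({"S": B_SPORT, "I": B_INF})
B1 = [1, 0, 0, 1, 1, 1, 0, 1]
B2 = [0, 1, 1, 0, 1, 0, 1, 0]

if __name__ == "__main__":
    for name, doc in (("b1", B1), ("b2", B2)):
        scores = {k: float(score(MODEL, k, doc)) for k in MODEL}
        print(name, scores, "->", classify(MODEL, doc))
```

Under the assumed matrices, the unnormalised scores for $b_1$ are 5/891 for S and 8/859375 for I, so $b_1$ is assigned to Sports. For $b_2$ they are 1/3564 for S and 6912/859375 for I, so $b_2$ is assigned to Informatics.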
