Question

1 Approved Answer

Posted on Jul 09, 2024

I am not familiar with R code yet, so please be detailed in the solution.

This entire question uses R. As a part of the investigation in Question 2, the law firm has obtained 14 documents, and has determined the number of errors that they contain and their total word counts. This information is stored in the comma-separated file "Document Summary.csv". It is known that each document summarized was typed by one of suspect A, suspect B, or suspect C. As before, the number of errors introduced to a document by suspect A is assumed to follow a Poisson process with a rate of 3 errors per 1000 words. Errors introduced by suspects B and C also approximately follow a Poisson process, with error rates of 1 and 7 errors per 1000 words, respectively. The goal of this question is to try to identify who typed each document.

Begin by reading "Document Summary.csv" into R. In order for the following code to work, you must have saved "Document Summary.csv" into the working directory of your R session. Otherwise, you can include a full file path.

a) For each of the 14 documents, compute the probability that suspect A would have produced a document of the given length with the same number of errors. Store these 14 numbers in a vector named "ProbA", and then print "ProbA". Do the same thing for suspects B and C, producing and printing vectors "ProbB" and "ProbC".

b) For each document, determine which suspect had the highest probability of producing a document of the given length with the same number of errors. Produce a vector of characters "A", "B", and "C" of length 14 named "MostLikelySuspect", which encodes for each document the suspect that had the highest probability of producing a document of the given length with the same number of errors. Append this vector as a 4th column to DocSum, and give it the name "Most.Likely.Suspect".

c) Preamble: Suppose that it is known that the 14 documents studied were drawn at random from a large pool of documents of which suspect A produced 28.57%, suspect B produced 21.13%, and suspect C produced 50%. According to Bayes' Theorem,

$$P(\text{Suspect } i \text{ typed the document} \mid \text{document is of length } x \text{ and has } y \text{ errors}) = \frac{P(\text{document is of length } x \text{ and has } y \text{ errors} \mid \text{Suspect } i \text{ typed the document})\, P(\text{Suspect } i \text{ typed the document})}{P(\text{document is of length } x \text{ and has } y \text{ errors})}$$

Note that the denominator of the above does not depend on the suspect under consideration, and so when it comes to producing an estimate of the relative likelihood that a particular suspect produced a document given its length and the number of errors, we could just compute the numerator and compare it for different suspects. This numerator is sometimes referred to as the posterior likelihood. This leads to a way of classifying which documents were typed by which suspect: classifying based on the class giving the largest posterior likelihood is known as "Bayes classification", and is a popular starting point in machine learning methods for classification.

Task: For each document and each suspect, compute the posterior likelihood

$$P(\text{document is of length } x \text{ and has } y \text{ errors} \mid \text{Suspect } i \text{ typed the document})\, P(\text{Suspect } i \text{ typed the document})$$

Store these numbers for each suspect in vectors of length 14 named "BayesA", "BayesB", and "BayesC". Print each of these vectors.

d) For each document, determine which suspect has the largest posterior likelihood. Produce a vector of characters "A", "B", and "C" of length 14 named "BayesClass", which encodes for each document the suspect that had the highest posterior likelihood of producing that document. Append this vector as a 5th column to DocSum, and give it the name "BayesClass". Print all 5 columns of DocSum.
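Since the question asks for detailed R code, here is a minimal sketch of how parts a) to d) could be approached. It rests on several assumptions that are not in the original. The column names `Document`, `Words`, and `Errors` are hypothetical, because the actual headers of "Document Summary.csv" are not shown. The four rows of data are made up purely so the code runs on its own. The error count for a document of a given length is taken to be Poisson with mean equal to the rate per 1000 words times the word count divided by 1000. The prior probabilities are copied as stated in part c).

```r
# Hypothetical stand-in for the real data. With the actual file you would use:
#   DocSum <- read.csv("Document Summary.csv")
# The column names Document, Words and Errors are assumptions, not the file's real headers.
DocSum <- data.frame(
  Document = 1:4,
  Words    = c(1000, 2000, 1500, 900),
  Errors   = c(3, 1, 11, 4)
)

# Part a: a rate of r errors per 1000 words gives an expected count of
# r * Words / 1000, and dpois() returns P(exactly `Errors` errors).
ProbA <- dpois(DocSum$Errors, lambda = 3 * DocSum$Words / 1000)
ProbB <- dpois(DocSum$Errors, lambda = 1 * DocSum$Words / 1000)
ProbC <- dpois(DocSum$Errors, lambda = 7 * DocSum$Words / 1000)
print(ProbA)
print(ProbB)
print(ProbC)

# Part b: for each row, find which of the three columns is largest,
# then translate the column index (1, 2, 3) into "A", "B", "C".
Suspects <- c("A", "B", "C")
MostLikelySuspect <- Suspects[apply(cbind(ProbA, ProbB, ProbC), 1, which.max)]
DocSum$Most.Likely.Suspect <- MostLikelySuspect

# Part c: posterior likelihood = likelihood * prior.
# Priors are copied from the problem statement as given.
BayesA <- ProbA * 0.2857
BayesB <- ProbB * 0.2113
BayesC <- ProbC * 0.50
print(BayesA)
print(BayesB)
print(BayesC)

# Part d: same row-wise comparison, now on the posterior likelihoods.
BayesClass <- Suspects[apply(cbind(BayesA, BayesB, BayesC), 1, which.max)]
DocSum$BayesClass <- BayesClass
print(DocSum)
```

On the made-up rows above, the fourth document (900 words, 4 errors) is assigned to suspect A when only the likelihoods are compared, but to suspect C once the priors are included, which illustrates how the prior can change the classification. With the real file, the same lines should work after replacing `Words` and `Errors` with the actual column names, which `names(DocSum)` will show.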