Question

Posted on Oct 10, 2024

CS 486/686 - Assignment #2

Due Date: Wednesday, June 20 by 5pm.

Note: Question 3 may be done in groups of two or three; questions 1 and 2 are to be done individually. No late assignments will be accepted. Please hand in your assignment using the drop boxes by 5pm on the day that it is due. Hand in question 3 separately from questions 1 and 2. Each group should hand in just one copy of Question 3.

#1. (25 points) Cholesterol is a soft, wax-like substance found in all parts of the body. Our bodies need a little bit of cholesterol to work properly. Cholesterol comes in two forms: low-density lipoprotein (LDL), also known as "bad" cholesterol, and high-density lipoprotein (HDL), also known as "good" cholesterol. Too much LDL or "bad" cholesterol can clog your arteries and lead to heart disease. Cholesterol is measured in milligrams per deciliter (mg/dL) and (for our purposes) ranges from 1 to 200 mg/dL. Your doctor must interpret your cholesterol numbers based on other risk factors such as age, family history, smoking, and high blood pressure.

Let L be the true value of the total LDL cholesterol in a patient. Different lab tests are available for estimating the total amount of LDL in a patient's body. Let T1 and T2 be the estimates obtained from two different lab tests. Of course, the tests only provide an estimate and are not 100% accurate. Normally, there are small possibilities of being off by up to five mg/dL in each direction.

For lab test T1 it is important that the patient not eat or drink anything for 12 hours before the test. Unfortunately, with some small probability, a patient will forget or ignore the rule about not eating and drinking and the lab test T1 will undercount by up to 35 mg/dL. Let E1 model the event that the patient does not follow the rule (i.e., E1 is true means the patient forgot or ignored the rule, E1 is false means the patient followed the rule).
For lab test T2, with some small probability the lab technician mixes up the vial of blood with that of another patient with the result that the lab test T2 will be wildly wrong. Let E2 model the event that the lab technician mixes up the vials (i.e., E2 is true means the lab technician mixed up the vials, E2 is false means the lab technician did not mix up the vials).

Consider the three Bayesian networks shown below.

[Figure: three candidate Bayesian networks, labeled (i), (ii), and (iii), each over the variables E1, E2, L, T1, and T2. The edges are not recoverable from this text.]

(a) For each of the Bayesian networks, state whether the network is correct or incorrect, given the above information. Explain why each network is correct or incorrect.

(b) For each of the correct Bayesian networks, state the number of probabilities that would need to be specified for that network. You may assume that each lab test returns a value in the range 1 to 200 mg/dL inclusive. How many probabilities would need to be specified if no conditional independence relations held among the random variables (i.e., if you had to explicitly specify each entry in the joint probability distribution)?

(c) Which of these networks is the best network? Explain.

#2. (25 points) Artificial insemination is a common practice in the breeding of cows, pigs, and dogs. A few weeks after inseminating a cow, a farmer has three possible tests to confirm whether artificial insemination has succeeded; i.e., the cow is pregnant. The first test is a scanning test that has a false positive rate of 1% and a false negative rate of 10%. The second test is a blood test that detects progesterone with a false positive rate of 10% and a false negative rate of 30% (increasing amounts of progesterone are produced during pregnancy; the test either reports that the level of progesterone is detectable or undetectable).
The third test is a urine test that also detects progesterone with a false positive rate of 10% and a false negative rate of 20% (again, the test either reports that the level of progesterone is detectable or undetectable). The probability of a detectable progesterone level is 90% given pregnancy and 1% given no pregnancy. The probability that artificial insemination will impregnate a cow is 87%.

(a) Draw a Bayesian network that best represents the domain using only Boolean random variables. Specify the conditional probability tables as part of your network.

(b) Give an example of a conditional independence relation which holds among the random variables. State your example in English, and more formally as an equation.

(c) Suppose that the farmer inseminates a cow, waits for a few weeks, and then performs all three tests and all three tests come out negative. What is the probability that the cow is pregnant? Show your work.

(d) Suppose the farmer is unhappy with the relatively high probability found in part (c) given that all three tests came out negative. Suppose that the farmer can purchase, at some extra expense, more accurate versions of each test. Which single test do we have to change, and how much more accurate would it need to be, to ensure that the probability of pregnancy would be no more than 5% given three negative tests?

#3. (50 points) You are allowed to submit this particular question in groups of two or three.

"Many online publishers say the next big growth in advertising will emerge from efforts to offer ads based not on the content of a web page, but on knowing who is looking at it." (Washington Post, April 4, 2008).

The task in this question is to construct a Bayesian network that models a web user. The idea is that this model would be used to help decide which ads to place on a web page, where the ads are targeted to an individual web user. The approach has been called "behavioral targeting".
To build this network, you will use some existing software for constructing and querying Bayesian networks. To run the software:

    java -classpath ~cs486/software/JavaBayes/Classes JavaBayes

The software is fairly self-explanatory and contains some online help. Further online documentation is available at:

    http://www.cs.cmu.edu/~javabayes/

As an example, the Holmes example from class can be loaded into the JavaBayes software and queried or modified:

    ~cs486/software/JavaBayes/Examples/Holmes.bif

To design a Bayesian network, you first need a probabilistic model; i.e., a set of random variables which capture the essential elements of the domain. Some of these variables will be evidence variables and others will be hidden variables. In general, you will not have all of the information about a particular web user. Evidence variables for this domain include information about the user that is readily available or could be available, such as demographic information, geographic information, past purchase history, email content, keywords typed into a search engine, and the content of the web page itself. Hidden variables (perhaps) are properties such as education level, income, sense of humor, interests, potential car (or cosmetics or sports equipment or cell phone or ...) buyer, and so forth. Note that some demographic information may be better inferred from the user's behavior rather than directly queried from the user.

You will need to decide on suitable domains for the variables, keeping in mind the need to discretize the values. Of course, some of these variables may influence other variables. In general, if you find that a node has too many predecessors so that the conditional probability table becomes too big, this is a clue that you may need to add an intermediate node. The network you construct should have around 15-20 nodes and 3-4 layers. Your network will be evaluated based on whether it is a reasonable representation of the domain.
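To make the remark about oversized conditional probability tables concrete, here is a minimal Python sketch that counts table entries for a node with and without intermediate nodes. The node names, the number of parents, and the domain sizes are hypothetical and chosen only for illustration; they are not part of the assignment or of JavaBayes.

```python
from math import prod
from typing import Sequence


def cpt_entries(child_values: int, parent_values: Sequence[int]) -> int:
    """Entries in a CPT: one distribution over the child per parent configuration."""
    return child_values * prod(parent_values)


# Hypothetical design 1: a Boolean "car_buyer" node with six 3-valued parents.
direct = cpt_entries(2, [3] * 6)

# Hypothetical design 2: the same six parents feed two 3-valued intermediate
# nodes (three parents each), and "car_buyer" depends only on those two nodes.
intermediate = 2 * cpt_entries(3, [3] * 3) + cpt_entries(2, [3, 3])

print(direct, intermediate)  # 1458 180
```

The count grows with the product of the parents' domain sizes, so splitting the parents across intermediate nodes replaces one large product with several small ones. Whether such a split is reasonable depends on whether the intermediate nodes correspond to something meaningful in the domain.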
Each group should hand in one copy of the following (hand this in separately from questions 1 & 2):

(a) An explanation of the meaning of each variable and the values that it takes. For each evidence variable, state how you propose to gather this information.

(b) For each variable without parents, provide the prior probabilities.

(c) For variables with parents, your five most interesting conditional probability tables. (I recommend that you use the ASCII output of JavaBayes as your starting point and, using a word processor or a text editor, format your chosen tables.)

(d) A graph showing the structure of your Bayesian network. (I recommend that you use the "import" command to do a screen capture of your network as displayed by JavaBayes; type the command "man import" for more information. You may need to do this more than once and tape the pieces together.)

(e) Four or five interesting test cases. For each case, pick a set of observations that you might get in real life, instantiate them as evidence, and (have the system) compute the probabilities of a number of reasonable query nodes. You should hand in a description of the evidence, and the probabilities you got as an answer. If your probabilities do not make sense to you, you should probably go back and revise your network.

(f) Describe how you would use the Bayesian network to make a decision on which ads to place on a web page that the web user has asked to view.

Notes:

(a) Onsite behavioral targeting, where the behavior of a user on a site or a collection of sites is monitored, has been going on for years. More recently and more controversially, network behavioral targeting has also become more widespread. In network behavioral monitoring, an internet service provider does what is called deep-packet inspection (i.e., looks at the content of every packet that you transmit or receive as you roam the web) and uses this information to target ads.

(b) According to a study by Soltani et al. at UC Berkeley, over half of the most popular 100 websites use secret behavior-tracking software to monitor users, mostly without their knowledge. The technique is to use "Flash cookies," which are more persistent than HTTP cookies. Disabling or erasing HTTP cookies, clearing your history, erasing the cache, or browsing using "Private Browsing" still allows Flash cookies to operate fully and track the user. (See: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1446862)

(c) There is a trend to linking offline data (such as your income, credit score, home ownership, what kind of car you drive, and whether you have a pet) with data gathered online during your web browsing. It is estimated that companies (such as Acxiom, Experian, and Datran Media) have up to 1500 pieces of data on every Canadian and American. This data can then be linked with your online data when, for example, a person registers on a Web site or clicks through on an e-mail message from a marketer. (See: "Ads Follow Web Users, and Get Much More Personal," Stephanie Clifford, New York Times, July 31, 2009)
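Returning to question 2(c), the following Python sketch shows one way to check a hand calculation by enumerating over a hidden variable. It is not an official solution. It assumes one particular network structure: the scanning test depends directly on pregnancy, while the blood and urine tests depend only on whether progesterone is detectable, which in turn depends on pregnancy. It also assumes each false positive and false negative rate is conditioned on the test's parent in that structure. All variable and function names are illustrative.

```python
# Assumed structure: Pregnant -> Scan, Pregnant -> Progesterone -> {Blood, Urine}.
P_PREGNANT = 0.87

# P(progesterone detectable | pregnant?)
P_PROG = {True: 0.90, False: 0.01}

# P(scan negative | pregnant?): false negative rate, and 1 - false positive rate.
P_SCAN_NEG = {True: 0.10, False: 0.99}

# P(test negative | progesterone detectable?) under the assumed structure.
P_BLOOD_NEG = {True: 0.30, False: 0.90}
P_URINE_NEG = {True: 0.20, False: 0.90}


def joint_all_negative(pregnant: bool) -> float:
    """P(pregnant?, all three tests negative), summing out progesterone."""
    prior = P_PREGNANT if pregnant else 1.0 - P_PREGNANT
    hidden = 0.0
    for prog in (True, False):
        p_prog = P_PROG[pregnant] if prog else 1.0 - P_PROG[pregnant]
        hidden += p_prog * P_BLOOD_NEG[prog] * P_URINE_NEG[prog]
    return prior * P_SCAN_NEG[pregnant] * hidden


def posterior_pregnant_all_negative() -> float:
    """P(pregnant | all three tests negative) by normalizing the two joints."""
    yes = joint_all_negative(True)
    no = joint_all_negative(False)
    return yes / (yes + no)


posterior = posterior_pregnant_all_negative()
print(f"{posterior:.4f}")
```

Under these assumptions the posterior comes out to roughly 0.10. A different reading of the test error rates, or a different structure, would change the numbers, so the sketch is best used to verify arithmetic for a network you have already justified in part (a).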