Question
Algorithms in Artificial Intelligence (or, under the old name: Introduction to Algorithmic Decision Making), Part 1. Based on slides by David Sarne and Lirong Xia.

Course Tentative Schedule
- Introduction
- Making simple decisions with uncertainty
- Bayesian inference and Bayesian networks
- Decision trees, value of information
- Preferences, rationality, and utilities
- Maximum expected utility
- Decisions with multiple agents: game theory
- Simultaneous and sequential games
- Voting and manipulation

Preliminaries
What are decisions? Is this CS or Economics? What would we want from a computer that makes decisions? Intelligent? Human-like? That is not our focus; it is important to distinguish our focus from work in cognitive science. We want it to be rational (doing the right thing, given what you know) and efficient.
Examples: a Bayesian network; Golden Balls (pause at 3:37); the JetBlue experiment (pause at 3:07).

The Three Main Ingredients
- Environment: fully observable vs. partially observable; single-agent vs. multi-agent; static vs. dynamic; deterministic vs. stochastic; discrete vs. continuous; episodic vs. sequential
- Knowledge representation
- Reasoning system

Making Simple Decisions with Uncertainty
We must have tools to handle uncertainty: partial observability, non-determinism, laziness and ignorance. We would still like to make the right decision.

Example: Building a Housing Complex
An entrepreneur must decide on the size of the housing complex he is going to build: small, medium or large. Profits derive from the level of future demand for housing complexes. Table of payoffs/utilities:

                    states of nature
  alternatives      Low      High
  Small               8         8
  Medium              5        15
  Large             -11        22

What if we do not have probabilities? Possible decision rules (a small code sketch follows the regret table below):
- Optimistic: choose the alternative with the highest possible payoff.
- Conservative: choose the alternative with the maximal minimal payoff.
- Minimax regret: choose the alternative with the minimal maximal regret.
- Insufficient reason: assume that all the states of nature have the same chance of occurring, hence choose the alternative with the maximal sum of payoffs.

Minimax Regret
The best profit for Low is 8 and the best profit for High is 22, so the regret table is:

  alternatives      Low      High
  Small               0        14
  Medium              3         7
  Large              19         0
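The following Python sketch (mine, not the slides') applies the four decision rules to the housing-complex payoff table; only the table's numbers are taken from the slides.

# A small sketch of the four decision rules applied to the payoff table above.
payoffs = {
    "Small":  {"Low": 8,   "High": 8},
    "Medium": {"Low": 5,   "High": 15},
    "Large":  {"Low": -11, "High": 22},
}
states = ["Low", "High"]

# Optimistic (maximax): best possible payoff per alternative.
optimistic = max(payoffs, key=lambda a: max(payoffs[a].values()))

# Conservative (maximin): best worst-case payoff.
conservative = max(payoffs, key=lambda a: min(payoffs[a].values()))

# Minimax regret: regret = best payoff in that state minus own payoff.
best_per_state = {s: max(payoffs[a][s] for a in payoffs) for s in states}
max_regret = {a: max(best_per_state[s] - payoffs[a][s] for s in states) for a in payoffs}
minimax_regret = min(max_regret, key=max_regret.get)

# Insufficient reason: all states equally likely, so compare sums (or averages).
insufficient_reason = max(payoffs, key=lambda a: sum(payoffs[a].values()))

print(optimistic, conservative, minimax_regret, insufficient_reason)
# -> Large, Small, Medium, Medium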
Now Add Probabilities
Probability provides a way of summarizing uncertainty. Probability statements are made with respect to a knowledge state, not with respect to the real world. We will start with probabilistic inference: computing a desired probability from other known probabilities. We will then add utilities to the different actions.

Probability Theory: a Refresher
A sample space Ω contains the states of the world, or equivalently the outcomes; we only need to model the states that are relevant to the problem. A probability mass function P over the sample space satisfies P(A) ≥ 0 for all A ∈ Ω, and Σ_{A∈Ω} P(A) = 1.

Unconditional Probability
The unconditional or prior probability that a proposition A is true is written P(A): the probability of event A in the absence of any other information. For example, if A = getting a 6 by rolling a fair die, then P(A) = 1/6. Propositions may include random variables. Each random variable X has a domain of values, e.g., {red, blue, green}, and P(X=Red) is the probability that X takes the value Red. For example, P(AppAccept=true) = 0.2; for short, we will write P(appAccept) = 0.2.

Conditional Probability
What if we have some evidence? E.g., we have a friend who applied with a much weaker qualification, and that application was accepted. The posterior or conditional probability P(A|B) is the probability of A given that all we know is B, e.g., P(AppAccept=true | weaker application was accepted). If we know B and also know C, we use P(A | B ∧ C).
P(A|B) is the fraction of events where A is true among all the events where B is true:
  P(A|B) = P(A ∧ B) / P(B) = P(A,B) / P(B)

Product Rule
  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

Chain Rule
Applying the product rule repeatedly:
  P(x1, x2, ..., xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... = ∏_{i=1..n} P(xi | x1, ..., xi-1)

Law of Total Probability
If the events Bi partition the sample space:
  P(A) = Σ_i P(Bi) P(A|Bi)

Joint Probability Distribution
A joint probability distribution is a table that assigns probabilities to all possible assignments of values to combinations of variables: P(X1, X2, ..., Xn) assigns a probability to every assignment of values to the variables X1, X2, ..., Xn.
Example: X1 = the status of your application, X2 = the status of your friend's application. Then P(X1, X2) is:

                    X1=Accept    X1=Reject    X1=Wait-list
  X2=Accept            0.15         0.35          0.02
  X2=Reject            0.30         0.02          0.09
  X2=Wait-list         0.02         0.04          0.01

Marginal Probabilities
Summing the columns and rows of the table gives the marginals:
  P(X1): Accept 0.47, Reject 0.41, Wait-list 0.12
  P(X2): Accept 0.52, Reject 0.41, Wait-list 0.07
and the nine joint entries sum to 1.

Bayes Rule
We observed that P(A ∧ B) = P(A|B) P(B) and P(A ∧ B) = P(B|A) P(A). Therefore
  P(B|A) = P(A|B) P(B) / P(A) = P(A,B) / P(A)
which determines P(B|A) given P(A|B), P(B) and P(A). Using the law of total probability,
  P(B|A) = P(A|B) P(B) / Σ_i P(A|Bi) P(Bi)
We now have the probabilistic infrastructure. How can we use it for probabilistic inference?
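As a quick sketch (mine, not the slides'; the row/column orientation is as reconstructed in the table above), the marginals and a conditional probability can be read off the joint distribution:

# Joint distribution P(X1, X2) from the table above; rows are X2, columns are X1.
values = ["Accept", "Reject", "Wait-list"]
joint = {  # joint[(x1, x2)] = P(X1=x1, X2=x2)
    ("Accept", "Accept"): 0.15, ("Reject", "Accept"): 0.35, ("Wait-list", "Accept"): 0.02,
    ("Accept", "Reject"): 0.30, ("Reject", "Reject"): 0.02, ("Wait-list", "Reject"): 0.09,
    ("Accept", "Wait-list"): 0.02, ("Reject", "Wait-list"): 0.04, ("Wait-list", "Wait-list"): 0.01,
}

# Marginals: sum out the other variable.
p_x1 = {v: sum(joint[(v, x2)] for x2 in values) for v in values}   # 0.47, 0.41, 0.12
p_x2 = {v: sum(joint[(x1, v)] for x1 in values) for v in values}   # 0.52, 0.41, 0.07

# Conditional via the definition P(X1 | X2=Accept) = P(X1, X2=Accept) / P(X2=Accept).
p_x1_given_x2_accept = {v: joint[(v, "Accept")] / p_x2["Accept"] for v in values}
print(p_x1, p_x2, p_x1_given_x2_accept)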
The Monty Hall Problem
Suppose you're on a game show, and you're given the choice of three doors: behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?
After the problem appeared in Parade, approximately 10,000 readers, including nearly 1,000 with PhDs, wrote to the magazine, most of them claiming the published solution was wrong. Paul Erdős remained unconvinced until he was shown a computer simulation confirming the predicted result.

Solving Monty Hall with Bayes Rule
Denote by A, B and C the events that the car is behind door A, B or C, and by MA, MB and MC the events that Monty opens that door. Suppose we chose door A. Then
  P(MA|A) = 0,   P(MA|B) = 0,   P(MA|C) = 0
  P(MB|A) = 1/2, P(MB|B) = 0,   P(MB|C) = 1
  P(MC|A) = 1/2, P(MC|B) = 1,   P(MC|C) = 0
In addition, P(A) = P(B) = P(C) = 1/3. If Monty opened door B:
  P(A|MB) = P(MB|A) P(A) / [P(MB|A) P(A) + P(MB|B) P(B) + P(MB|C) P(C)]
          = 0.5·0.33 / (0.5·0.33 + 0·0.33 + 1·0.33) = 0.33
So staying with door A wins with probability 1/3 while switching wins with probability 2/3: for each possible location of the car, staying wins in one case out of three and switching wins in the other two.

Another Example
S: the patient has a stiff neck; M: the patient has meningitis. Meningitis causes a stiff neck 50% of the time. Given P(S|M) = 0.5, P(M) = 1/50,000 and P(S) = 1/20:
  P(M|S) = P(S|M) P(M) / P(S) = 0.0002
If a patient complains about a stiff neck, the probability that she has meningitis is only 0.0002.

How Does Bayes Rule Help Us?
P(S|M) is causal knowledge and P(M|S) is diagnostic knowledge; e.g., S is a symptom and M is a disease. P(S|M) is model based, it reflects the way meningitis works, and it does not change. P(M|S) is diagnostic: it tells us the likelihood of M given the symptom S. Diagnostic knowledge may change with the circumstances, so it is helpful to be able to derive it: if there is an epidemic, the probability of meningitis goes up, and rather than observing P(M|S) again we can compute it from P(S|M).

A Similar Interpretation
Suppose we have some hypothesis H, which can be true or false, and we can estimate the probability P(H) that it is true. If we gather new information, we would like to update our estimate accordingly. We denote the new information by E (evidence). How do we update our beliefs? Using Bayes rule, we compute P(H|E). Why this way? Because it is usually harder to estimate P(H|E) directly, and easier to estimate the probability of observing E given that H is true, that is P(E|H).

Confusing P(H|E) with P(E|H)
One person out of 1,000 suffers from a congenital heart defect. There is a test that can reveal the defect. The test is 100% accurate for people with the defect, and 95% accurate for people without the defect (that is, 5% of those who don't have the defect will be diagnosed as having it). If we test someone at random and the test says she has the heart defect, what is the probability that she indeed has the defect?
Among 60 students from Harvard medical school who were asked, 30 thought the answer is 95%; the average answer was 56%; only 11 students answered correctly.
We define the events:
  A: the person has the congenital heart defect
  B: the test returns a positive answer
  P(¬A) = 0.999,  P(B|¬A) = 0.05,  P(B|A) = 1
  P(A|B) = P(B|A) P(A) / P(B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|¬A) P(¬A)]
         = 0.001 / (0.001 + 0.05·0.999) = 0.01963
The problem: doctors prescribe medicines according to such tests while confusing P(H|E) with P(E|H).

Computing the Denominator P(S)
We wish to avoid computing the denominator in Bayes rule, since it may be hard to obtain. There are two techniques to compute (or avoid computing) P(S): relative likelihoods, and the law of total probability.

Relative Likelihoods
If M (meningitis) and W (whiplash) are two possible explanations:
  P(M|S) = P(S|M) P(M) / P(S)
  P(W|S) = P(S|W) P(W) / P(S)
  P(M|S) / P(W|S) = P(S|M) P(M) / (P(S|W) P(W))
Disadvantages: the ratio alone is not always enough, and there may be many possible explanations.

Law of Total Probability
The second approach uses M and ¬M: check the probability of both M and ¬M given S.
  P(M|S) = P(S|M) P(M) / P(S)
  P(¬M|S) = P(S|¬M) P(¬M) / P(S)
  P(M|S) + P(¬M|S) = 1 (they must sum to 1)
  [P(S|M) P(M) / P(S)] + [P(S|¬M) P(¬M) / P(S)] = 1
  P(S|M) P(M) + P(S|¬M) P(¬M) = P(S)
so P(S) can be calculated this way. This second approach is really normalization: 1/P(S) is a normalization constant that ensures the computed probability values sum to 1, e.g., P(M|S) + P(¬M|S) = 1. Thus, to compute P(M|S) we can compute the numerators (a) P(S|¬M) P(¬M) = P(S,¬M) and (b) P(S|M) P(M) = P(S,M), which give un-normalized values, and then scale them so that they sum to 1. We denote the normalization factor by α.

Combining Evidence
Example: S: the patient has a stiff neck; H: the patient has a severe headache; M: the patient has meningitis. Meningitis causes a stiff neck 50% of the time and a headache 70% of the time. The probability of meningitis should go up if both symptoms are reported. How do we combine such symptoms?

First Suggestion: Inference by Enumeration
The joint probability distribution contains the complete probabilistic belief state, a probability distribution over all the states we consider possible. For the dentistry domain:

                     toothache               ¬toothache
                  catch      ¬catch       catch      ¬catch
  cavity          0.108      0.012        0.072      0.008
  ¬cavity         0.016      0.064        0.144      0.576

  P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2
  P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
  P(cavity | catch, toothache) = P(cavity, catch, toothache) / P(catch, toothache) = 0.108 / (0.108 + 0.016) = 0.871

Issues
The table contains a lot of entries. Modeling difficulties: how do we get so many numbers? Computational difficulties: with n variables whose domains have size d, we need O(d^n) space to store the table and O(d^n) time to compute probabilities.
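A minimal Python sketch (mine) of inference by enumeration over the dentistry joint distribution above:

# Inference by enumeration over the dentistry joint distribution above (a sketch).
from itertools import product

vars_ = ["cavity", "toothache", "catch"]
joint = {  # joint[(cavity, toothache, catch)] -> probability
    (True, True, True): 0.108,  (True, True, False): 0.012,
    (True, False, True): 0.072, (True, False, False): 0.008,
    (False, True, True): 0.016, (False, True, False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def prob(event):
    """Sum the joint entries consistent with a partial assignment, e.g. {'cavity': True}."""
    total = 0.0
    for assignment in product([True, False], repeat=len(vars_)):
        world = dict(zip(vars_, assignment))
        if all(world[v] == val for v, val in event.items()):
            total += joint[assignment]
    return total

p_cavity = prob({"cavity": True})                                   # 0.2
p_cond = prob({"cavity": True, "catch": True, "toothache": True}) / prob({"catch": True, "toothache": True})
print(p_cavity, round(p_cond, 3))                                    # 0.2 0.871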
Independence
A and B are independent (denoted A ⊥ B) iff P(A|B) = P(A), or P(B|A) = P(B), or P(A,B) = P(A) P(B). For example,
  P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
reduces 2^4 - 1 = 15 independent entries to 8 (it is 7 + 1 instead of 2^4 - 1). For n independent biased coins, O(2^n) entries become O(n). Absolute independence is powerful but rare: dentistry is a large field with hundreds of variables, none of which are independent. What to do?

Conditional Independence
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
  (1) P(catch | toothache, cavity) = P(catch | cavity)
The same independence holds if I haven't got a cavity:
  (2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Catch is conditionally independent of Toothache given Cavity, written Catch ⊥ Toothache | Cavity:
  P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Equivalent statements:
  P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
  P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Conditional independence assertions are much more commonly available than absolute independence assertions. There is no obvious connection between independence and conditional independence:
- If A is the result of a fair coin toss, B is the result of another toss of the same coin, and C is the XOR of A and B, then A ⊥ B but not A ⊥ B | C.
- If A is the result of tossing a coin with an unknown bias, B is the result of another toss of the same coin, and C represents the probability that this coin falls on heads, then A ⊥ B | C but not A ⊥ B.

Bayesian Networks
A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions. It describes how variables interact locally; local interactions chain together to give global, indirect interactions. (Graphical models are a somewhat broader class that includes Bayesian networks.)

Syntax
- A set of nodes, one per variable.
- A directed, acyclic graph. The arcs indicate direct influence between variables; formally, they encode conditional independence.
- A conditional distribution for each node given its parents, P(Xi | Parents(Xi)), given as a conditional probability table (CPT).

Example 1
A network over Weather, Cavity, Toothache and Catch, with Cavity the parent of Toothache and Catch:
  P(Cavity=true) = 0.8        P(Weather=true) = 0.4

  Cavity    P(Catch=true | Cavity)        Cavity    P(Toothache=true | Cavity)
  T         0.9                           T         0.8
  F         0.05                          F         0.4

Writing CPT rows for both the true and false values of each child (0.9/0.1 and 0.05/0.95 for Catch, and similarly for Toothache) is redundant; the false rows are determined by the true rows. The topology of the network encodes conditional independence assertions: Weather is independent of the other variables, and Toothache and Catch are conditionally independent given Cavity. Intuitively, Cavity is a direct cause of Toothache and Catch.

The Semantics of BN
Given its parents, each node is conditionally independent of everything except its descendants. An equivalent view: a BN is a compact way to represent the full joint probability distribution:
  P(x1 ∧ x2 ∧ ... ∧ xn) = ∏_{i=1..n} P(xi | parents(Xi))
Why? Suppose X1, X2, ..., Xn are ordered according to the partial order implied by the BN graph; that is, parents(Xi) ⊆ {X1, ..., Xi-1}, or equivalently descendants(Xi) ∩ {X1, ..., Xi-1} = ∅. Then
  P(X1 ∧ X2 ∧ ... ∧ Xn) = ∏_{i=1..n} P(Xi | X1, ..., Xi-1)      (chain rule)
                        = ∏_{i=1..n} P(Xi | parents(Xi))        (conditional independence)
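A minimal Python sketch (mine) of the Example 1 network and its factored joint; the CPT numbers are the ones above, and the assignment evaluated matches the worked example that follows.

# Sketch: the Example 1 network as explicit CPTs, and the factored joint probability.
p_cavity = 0.8
p_weather = 0.4
p_catch_given_cavity = {True: 0.9, False: 0.05}       # P(Catch=true | Cavity)
p_toothache_given_cavity = {True: 0.8, False: 0.4}    # P(Toothache=true | Cavity)

def p(value, p_true):
    """Probability of a Boolean value given P(value=true)."""
    return p_true if value else 1.0 - p_true

def joint(weather, cavity, toothache, catch):
    """P(weather, cavity, toothache, catch) = product of each node given its parents."""
    return (p(weather, p_weather)
            * p(cavity, p_cavity)
            * p(toothache, p_toothache_given_cavity[cavity])
            * p(catch, p_catch_given_cavity[cavity]))

# Matches the worked example below: P(weather, not cavity, toothache, not catch)
print(joint(True, False, True, False))   # 0.4 * 0.2 * 0.4 * 0.95 = 0.0304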
Calculation of Joint Probability
For example,
  P(weather, ¬cavity, toothache, ¬catch)
    = P(weather) P(¬cavity) P(toothache | ¬cavity) P(¬catch | ¬cavity)
    = 0.4 · 0.2 · 0.4 · 0.95 = 0.0304
That is, every BN over a domain implicitly represents some joint distribution over that domain. The advantage is that it is usually easy for a domain expert to decide what direct influences exist, and also easier to elicit the local CPTs.

Example 2
N independent coin flips: P(X1=true) = 0.5, P(X2=true) = 0.5, ..., P(Xn=true) = 0.5. There are no interactions between the variables (absolute independence), so the network has no arcs.
Can every Bayes net represent every full joint distribution? No: for example, only distributions whose variables are absolutely independent can be represented by a Bayes net with no arcs. Can we represent the same full joint with different BNs? Before answering, let's look at another network, and also try to understand how we build BNs.

Example 3
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. The network topology reflects "causal" knowledge: a burglar can set off the alarm; an earthquake can set off the alarm; the alarm can cause Mary to call; the alarm can cause John to call.

Laziness and Ignorance, a Side Note
The probabilities actually summarize a potentially infinite set of circumstances in which the alarm might fail to go off (high humidity, power failure, dead battery, cut wires, a dead mouse stuck inside the bell) or John or Mary might fail to call and report it (out to lunch, on vacation, temporarily deaf, a passing helicopter).

Causality?
Rain causes Traffic; let's build the joint. Could we build it with the arrow reversed? What do the arrows really mean? The topology may happen to encode causal structure, but what it really encodes is conditional independencies. When Bayes nets reflect the true causal patterns they are often simpler (nodes have fewer parents), often easier to think about, and often easier to elicit from experts. But BNs need not actually be causal: sometimes no causal network exists over the domain (e.g., consider the variables Traffic and RoofDrips), and we end up with arrows that reflect correlation, not causation.

Compactness
A CPT for a Boolean variable Xi with k Boolean parents has 2^k rows, one for each combination of parent values, and each row requires one number p for Xi = true (the number for Xi = false is just 1 - p). If each variable has no more than k parents, the complete network requires O(n·2^k) numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution. For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 - 1 = 31). We exploit the fact that the system is locally structured. (A small counting sketch follows the next example.)

Example 3, Again
Consider two different orders for inserting the variables:
(a) MaryCalls, JohnCalls, Alarm, Burglary, Earthquake, using, e.g., P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary | Alarm);
(b) MaryCalls, JohnCalls, Earthquake, Burglary, Alarm.
All the resulting networks represent the exact same distribution. Adding unneeded arcs isn't wrong, it is just inefficient; it is also harder for experts to describe, e.g., P(E | B, A).
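A small sketch (mine; it assumes the burglary-network structure described above with all variables Boolean) of the compactness count:

# Sketch: counting CPT parameters for the burglary network vs. the full joint.
parents = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}
# Each Boolean node needs 2^(#parents) numbers (one per row of its CPT).
bn_numbers = sum(2 ** len(ps) for ps in parents.values())
full_joint_numbers = 2 ** len(parents) - 1
print(bn_numbers, full_joint_numbers)   # 10 vs 31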
Answering Queries
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
  P(b | j, m) = P(b, j, m) / P(j, m)
  P(b, j, m) = P(b, e, a, j, m) + P(b, e, ¬a, j, m) + P(b, ¬e, a, j, m) + P(b, ¬e, ¬a, j, m)
             = P(b) P(e) P(a|b,e) P(j|a) P(m|a) + P(b) P(e) P(¬a|b,e) P(j|¬a) P(m|¬a)
               + P(b) P(¬e) P(a|b,¬e) P(j|a) P(m|a) + P(b) P(¬e) P(¬a|b,¬e) P(j|¬a) P(m|¬a)
We do the same to calculate P(¬b, j, m) and normalize; we should therefore write P(B | j, m) = α P(B, j, m). In the worst case, for a network with n Boolean variables, this takes O(n·2^n) time. This can be improved (to be described shortly), and it turns out to be faster to answer queries this way than from the full joint distribution.

Back to Our Meningitis Example
  M: meningitis, H: severe headache, S: stiff neck
  P(M) = 1/50,000, P(H|M) = 0.7, P(H|¬M) = 0.4, P(S|M) = 0.5, P(S|¬M) = 0.1
  P(M | S ∧ H) = P(M ∧ S ∧ H) / P(S ∧ H)
From the net: P(M ∧ S ∧ H) = P(S|M) P(H|M) P(M), and P(S ∧ H) = P(M ∧ S ∧ H) + P(¬M ∧ S ∧ H).

More on the Topology of BNs
An important question about a BN: can we answer queries about conditional independence and influence from the graph alone? That is, can we infer whether two nodes are independent given certain evidence? Example: X: pressure, Y: rain, Z: traffic. Are X and Z necessarily independent? No: low pressure causes rain, which causes traffic, so X can influence Z and Z can influence X (via Y).

Causal Chains
The configuration X → Y → Z is a causal chain (X: low pressure, Y: rain, Z: traffic), with
  p(x, y, z) = p(x) p(y|x) p(z|y)
Is X independent of Z given Y? Yes:
  p(z | x, y) = p(x, y, z) / p(x, y) = p(x) p(y|x) p(z|y) / (p(x) p(y|x)) = p(z|y)
Evidence along the chain blocks the influence.

Common Cause
Another basic configuration: two effects of the same cause, X ← Y → Z (Y: project due, X: newsgroup busy, Z: lab full). Are X and Z independent? No. Are X and Z independent given Y? Yes:
  p(z | x, y) = p(x, y, z) / p(x, y) = p(y) p(x|y) p(z|y) / (p(y) p(x|y)) = p(z|y)
Observing the cause blocks the influence between the effects.

Common Effect
The last configuration: two causes of one effect, X → Y ← Z (a v-structure; X: raining, Z: ballgame, Y: traffic). Are X and Z independent? Yes: the ballgame and the rain cause traffic, but they are not correlated (we still need to prove they must be independent; try it!). Are X and Z independent given Y? No: seeing traffic puts the rain and the ballgame in competition as explanations. This is backwards from the other cases: observing an effect activates influence between its possible causes.

Connection Types
  Configuration      X ind. Z?    X ind. Z given Y?
  Causal chain       No           Yes
  Common cause       No           Yes
  Common effect      Yes          No
Any complex example can be analyzed using these three canonical cases (a small numeric check appears after the exercise below).

Reachability (the Bayes Ball)
Shade the evidence nodes, start at the source node, and try to reach the target by search, where a state is a node together with the arc along which it was entered. The successor function: at unobserved nodes the ball passes to any child, and to any parent if it arrived from a child; at observed nodes it passes from parent to parent. If you can't reach a node, it is conditionally independent of the start node.
Example (for the network in the slide's figure, over L, T, T', R and B):
  L ind. T' given T? Yes.  L ind. B? Yes.  L ind. B given T? No.  L ind. B given T'? No.  L ind. B given T and R? Yes.

Exercise
A network over four Boolean variables: H (hardworking) → G (good grader); H and G → R (excellent recommendation); R → J (landed a good job).
  P(H=true) = 0.1

  H    P(G=true | H)         H      G      P(R=true | H, G)         R      P(J=true | R)
  T    0.4                   false  false  0.2                      false  0.2
  F    0.8                   false  true   0.9                      true   0.7
                             true   false  0.3
                             true   true   0.8

What can be inferred from the structure?
  i:   P(H, G) = P(H) P(G)?
  ii:  P(J | R, H) = P(J | R)?
  iii: P(J) = P(J | H)?
Q: What is the value of P(H=true, G=true, R=false, J=false)?
A: P(H, G, ¬R, ¬J) = P(H) P(G|H) P(¬R|H,G) P(¬J|H,G,¬R)
                   = P(H) P(G|H) P(¬R|H,G) P(¬J|¬R) = 0.1 · 0.4 · 0.2 · 0.8 = 0.0064
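A numeric sketch (mine; the CPT numbers are invented for illustration) that checks the connection-types table above by building tiny three-variable joints and testing (conditional) independence:

# Sketch: numerically checking the three canonical configurations with made-up CPTs.
from itertools import product

def check(joint):
    """Return (is X ind. of Z, is X ind. of Z given Y) for a joint over (x, y, z)."""
    def p(**ev):
        return sum(pr for (x, y, z), pr in joint.items()
                   if all({"x": x, "y": y, "z": z}[k] == v for k, v in ev.items()))
    ind = all(abs(p(x=x, z=z) - p(x=x) * p(z=z)) < 1e-9
              for x, z in product([True, False], repeat=2))
    ind_given_y = all(abs(p(x=x, y=y, z=z) * p(y=y) - p(x=x, y=y) * p(y=y, z=z)) < 1e-9
                      for x, y, z in product([True, False], repeat=3))
    return ind, ind_given_y

B = [True, False]
q = lambda v, t: t if v else 1 - t   # P of a Boolean value given P(true) = t

# Causal chain X -> Y -> Z
chain = {(x, y, z): q(x, .3) * q(y, .8 if x else .2) * q(z, .9 if y else .1)
         for x, y, z in product(B, repeat=3)}
# Common cause X <- Y -> Z
cause = {(x, y, z): q(y, .4) * q(x, .7 if y else .1) * q(z, .6 if y else .2)
         for x, y, z in product(B, repeat=3)}
# Common effect X -> Y <- Z
effect = {(x, y, z): q(x, .3) * q(z, .5) * q(y, .9 if (x and z) else .2)
          for x, y, z in product(B, repeat=3)}

print(check(chain))   # (False, True)
print(check(cause))   # (False, True)
print(check(effect))  # (True, False)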
Q: What if we want to add another variable, C = "has the right connections"?
A: C gets its own prior P(C=true) and becomes an additional parent of R and of J: R's CPT grows to P(R=true | H, G, C) with eight rows (one per combination of H, G and C), and J's CPT becomes P(J=true | R, C) with four rows. All of the new entries (shown as ?? on the slide) still have to be elicited.

Answering Queries, Part II
I'm at work, and neighbors John and Mary both call to say my alarm is ringing. Sometimes it's set off by minor earthquakes. Is there a burglar?
  P(b | j, m) = P(b, j, m) / P(j, m)
  P(b, j, m) = Σ_{e,a} P(b, e, a, j, m)
             = Σ_{e,a} P(b) P(e) P(a | b, e) P(j | a) P(m | a)
             = P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a)
In the worst case, for a network with n Boolean variables, this is O(2^n), and there is still a lot of redundant work in the computation tree.

Variable Elimination
Idea: interleave joining and marginalizing. We track objects called factors; the initial factors are the local CPTs, one per node, instantiated by the evidence. In our example we wanted to compute P(b | j, m) = P(b, j, m) / P(j, m), which is equivalent to computing P(B, j, m) and normalizing:
  P(B | j, m) = α P(B, j, m) = α Σ_{e,a} P(B, e, a, j, m)
              = α Σ_{e,a} P(B) P(e) P(a | B, e) P(j | a) P(m | a)
              = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)
Thus the factors are f1(B) = P(B), f2(E) = P(E), f3(A,B,E) = P(A | B, E), f4(A) = P(j | A), f5(A) = P(m | A):

  B   P(B)         E   P(E)         A   P(j|A)       A   P(m|A)
  T   0.001        T   0.002        T   0.90         T   0.70
  F   0.999        F   0.998        F   0.05         F   0.01

  E   B   A   P(A | B, E)
  T   T   T   0.95
  T   F   T   0.29
  F   T   T   0.94
  F   F   T   0.001
  T   T   F   0.05
  T   F   F   0.71
  F   T   F   0.06
  F   F   F   0.999

and we can write P(B | j, m) = α f1(B) Σ_e f2(E) Σ_a f3(A,B,E) f4(A) f5(A).
In general, to compute P(Q | E1=e1, ..., Ek=ek):
- Start with the initial factors, the local CPTs instantiated by the evidence (if an instantiated CPT becomes one-valued, discard the factor).
- While there are still hidden variables (neither Q nor evidence): pick a hidden variable H, join all factors mentioning H, and eliminate (sum out) H; if the resulting factor becomes one-valued, discard it.
- Join all remaining factors and normalize.

Operation 1: Join Factors
Just like a database join: take all factors over the joining variable and build a new factor over the union of their variables. In our example, joining on A gives f4(A) · f5(A) = f6(A):

  A   f6(A) = P(j, m | A)
  T   0.90 · 0.70 = 0.63
  F   0.05 · 0.01 = 0.0005

and then f6(A) · f3(A,B,E) = f7(A,B,E):

  E   B   A   f7(A,B,E)
  T   T   T   0.95 · 0.63 = 0.5985
  T   F   T   0.29 · 0.63 = 0.1827
  F   T   T   0.94 · 0.63 = 0.5922
  F   F   T   0.001 · 0.63 = 0.00063
  T   T   F   0.05 · 0.0005 = 0.000025
  T   F   F   0.71 · 0.0005 = 0.000355
  F   T   F   0.06 · 0.0005 = 0.00003
  F   F   F   0.999 · 0.0005 = 0.0004995

Operation 2: Eliminate
Take a factor and sum out a variable (marginalization); this shrinks the factor to a smaller one. In our example, after the join on A, f8(B,E) = Σ_a f7(A,B,E):

  E   B   f8(B,E)
  T   T   0.5985 + 0.000025 = 0.598525
  T   F   0.1827 + 0.000355 = 0.183055
  F   T   0.5922 + 0.00003 = 0.59223
  F   F   0.00063 + 0.0004995 = 0.0011295

Next Variable: E
So far we have P(B | j, m) = α f1(B) Σ_e f2(E) f8(B,E). Now choose E, join and marginalize: f9(B,E) = f2(E) · f8(B,E) and f10(B) = Σ_e f9(B,E):

  E   B   f9(B,E)                             B   f10(B)
  T   T   0.002 · 0.598525 = 0.00119705       T   0.00119705 + 0.59104554 = 0.59224259
  T   F   0.002 · 0.183055 = 0.00036611       F   0.00036611 + 0.00112724 = 0.00149335
  F   T   0.998 · 0.59223 = 0.59104554
  F   F   0.998 · 0.0011295 = 0.00112724
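A brute-force Python sketch (mine) that checks this walkthrough by enumerating the factored joint directly, as in "Answering Queries, Part II" above; its result should agree with the final normalization step that follows.

# Sketch: brute-force check of P(B | j, m) for the burglary network, using the CPTs above.
from itertools import product

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}  # keyed by (b, e)
P_J, P_M = {True: 0.90, False: 0.05}, {True: 0.70, False: 0.01}

def p(value, p_true):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    return p(b, P_B) * p(e, P_E) * p(a, P_A[(b, e)]) * p(j, P_J[a]) * p(m, P_M[a])

# Sum out the hidden variables E and A, with evidence j = m = true.
unnormalized = {b: sum(joint(b, e, a, True, True) for e, a in product([True, False], repeat=2))
                for b in [True, False]}
alpha = 1.0 / sum(unnormalized.values())
posterior = {b: alpha * v for b, v in unnormalized.items()}
print(posterior)   # roughly {True: 0.284, False: 0.716}, matching the final step below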
Next Variable: B
We now have P(B | j, m) = α f1(B) f10(B), so we only need to join and normalize:

  B   f1(B)     f10(B)         f1(B) · f10(B)      normalized
  T   0.001     0.59224259     0.00059224259       0.284171971
  F   0.999     0.00149335     0.00149185665       0.715828028

That is, the chance of a burglary, given calls from both neighbors, is about 28%.

A More Complex Join
Suppose we need to join on A the factors f1(A,B,C) and f2(A,B,D):

  A   B   C   f1           A   B   D   f2
  T   T   T   0.9          T   T   T   0.1
  T   F   T   0.8          T   F   T   0.2
  F   T   T   0.7          F   T   T   0.3
  F   F   T   0.6          F   F   T   0.4
  T   T   F   0.5          T   T   F   0.5
  T   F   F   0.4          T   F   F   0.6
  F   T   F   0.3          F   T   F   0.7
  F   F   F   0.2          F   F   F   0.8

Note that we need to match the values of B as well: the joined factor f(A,B,C,D) has sixteen rows, each the product of the f1 and f2 rows that agree on A and B, e.g., f(T,T,T,T) = 0.9·0.1, f(T,F,T,T) = 0.8·0.2, f(F,T,T,T) = 0.7·0.3, ..., f(F,F,F,F) = 0.2·0.8.

Analysis of Variable Elimination
Advantage: it saves time by marginalizing variables as soon as possible and caching the results. The time and space complexity depend on the largest factor constructed during the operation of the algorithm. The largest factor is determined by the order in which variables are eliminated and by the structure of the network, so the order of elimination is important. It is intractable to find the optimal order, but there are good heuristics. For general networks, even VE's worst-case time is exponential in the number of variables; for singly connected BNs, the computation takes time linear in the total number of CPT entries (hence linear in the number of nodes if CPT sizes are bounded).

One More Observation
Suppose we want to calculate P(j | b):
  P(j | b) = α Σ_{e,a,m} P(b) P(e) P(a | b, e) P(j | a) P(m | a)
           = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) Σ_m P(m | a)
But what is Σ_m P(m | a)? It is 1. That is, we can remove any leaf node that is not a query variable or an evidence variable. We can apply the same reasoning to the resulting tree, and we eventually find that every variable that is not an ancestor of a query variable or an evidence variable is irrelevant to the query.

More Irrelevant Variables
Why do we get empty factors during VE? Consider the query P(E=T | D=T) in a network over A, B, C, D, E whose initial factors (after instantiating the evidence D=T) are f1(A), f2(B), f3(A,B), f4(C), f5(E,C). Joining and summing out A leaves f2(B), f4(C), f5(E,C), f6(B); joining and summing out B leaves f4(C), f5(E,C) and an empty factor f7(). Note that E ⊥ A | D and also E ⊥ B | D, and thus P(E=T | A=a, B=b, D=T) = P(E=T | D=T). All the variables that are conditionally independent of the query variables given the evidence variables are irrelevant to the query. How do we find them? With the Bayes ball. What do we do with them? Discard any factor that contains them.

Naive Bayes
Let's generalize the meningitis example: suppose we have a cause C that directly influences a number of effects E1, ..., En, all of which are conditionally independent given the cause. The full joint distribution can be written as
  P(C, E1, ..., En) = P(C) ∏_i P(Ei | C)
It is called a naive Bayes model because it is often used (as a simplifying assumption) in cases where the effect variables are not actually conditionally independent given the cause variable. In practice, naive Bayes systems can work surprisingly well, even when the conditional independence assumption is not true. What can we model with naive Bayes?

  Domain                           Classes                  Effects (evidence)
  E-mail                           Spam, not spam           Words used in the message
  Handwritten digit recognition    0, 1, ..., 9             Pixel intensities in the image
  News article topics              Politics, sports, ...    Words used in the article
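A minimal Python sketch (mine; the class priors, word probabilities and the words themselves are invented) of a naive Bayes spam model like the first row of the table above:

# Sketch of a naive Bayes spam classifier over a few binary word features.
# All numbers here are invented for illustration.
p_class = {"spam": 0.3, "ham": 0.7}                      # P(C)
p_word_given_class = {                                   # P(word appears | C)
    "spam": {"free": 0.60, "meeting": 0.05, "viagra": 0.30},
    "ham":  {"free": 0.05, "meeting": 0.40, "viagra": 0.001},
}

def posterior(words_present):
    """P(C | evidence) = alpha * P(C) * product over i of P(E_i | C), over the listed words."""
    scores = {}
    for c, prior in p_class.items():
        score = prior
        for word, p_true in p_word_given_class[c].items():
            score *= p_true if word in words_present else 1.0 - p_true
        scores[c] = score
    alpha = 1.0 / sum(scores.values())
    return {c: alpha * s for c, s in scores.items()}

print(posterior({"free", "viagra"}))     # heavily favors "spam"
print(posterior({"meeting"}))            # heavily favors "ham"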
What to Do with Naive Bayes?
1. Create a naive Bayes model: we need the local probability estimates. We could elicit them from a human; better, we can estimate them from observations. This is called parameter estimation, or more generally learning.
2. Use the naive Bayes model to estimate the probability of the causes given observations of the effects. This is a specific kind of probabilistic inference and requires just a simple computation (next slide). From it we can also get the most likely cause, which is called prediction, or classification.
These are the basic tasks of machine learning.

Naive Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values D = ⟨x1, x2, ..., xn⟩, into one of the classes c ∈ C:
  c_MAP = argmax_{c∈C} P(c | x1, ..., xn)
        = argmax_{c∈C} P(x1, ..., xn | c) P(c) / P(x1, ..., xn)
        = argmax_{c∈C} P(x1, ..., xn | c) P(c)
        = argmax_{c∈C} P(c) ∏_i P(xi | c)

Maximum Likelihood Hypothesis
If all hypotheses are a priori equally likely, we can omit the P(c) term:
  c_ML = argmax_{c∈C} ∏_i P(xi | c)

Summary
Bayesian networks provide a natural representation for (causally induced) conditional independence. Topology + CPTs = a compact representation of the joint distribution, and they are generally easy for domain experts to construct. There is a lot more that we didn't cover: continuous variables, approximate inference, undirected graphs, dynamic BNs, and learning BN structure/CPTs from perfect/imperfect data.

What is Probability, Anyway?
Different philosophical positions:
- Frequentism: numbers only come from repeated experiments. As we flip a coin lots of times, we see experimentally that it comes out heads half the time.
- Objectivism: probabilities are a real part of the universe. Maybe true at the level of quantum mechanics, but most of us agree that the result of a coin flip is (usually) determined by the initial conditions plus classical mechanics.
- Subjectivism: probabilities merely reflect agents' beliefs, e.g., "there is a probability of 0.0000001% that aliens will land on Earth next year."
There is some critique of the use of subjective probabilities: it is hard to test their correctness or accuracy, and different experts will give different values for the same phenomena. Nevertheless, they are frequently used, e.g., the probability of an earthquake (by insurance companies) or of sports outcomes (by gambling agencies). Sampling-based probabilities can also be inaccurate: they might miss a trend if they are heavily based on history, or be too general. Subjective probabilities can be based on models.

[Several slides in Hebrew, only partially recoverable here, discuss confirmation bias: two sequences of ten coin flips, H H T H T H T T H T and T T T T T T T T T T, each occur with probability 1/1024, yet the all-tails sequence is perceived as "special"; a sports offseason example follows.]

Now Add Utilities
It is assumed that a decision-maker has preferences between the different possible outcomes of his actions. We use utility theory to represent and reason about preferences. Utility theory says that every state has a degree of usefulness, or utility, to a decision-maker, and that the decision-maker will prefer states with higher utility.
Utility(state1) > Utility(state2) means that state1 is preferred by the decision-maker over state2. Note: the utility of a state is relative to the decision-maker. To summarize: utilities are functions from outcomes to real numbers that describe an agent's preferences.

Decision Theory
Decision theory = probability theory (which deals with chance) + utility theory (which deals with outcomes). The fundamental idea of decision theory is that a decision-maker is rational if and only if it chooses the action that yields the highest expected utility, averaged over all the possible outcomes of the action. This is called the principle of Maximum Expected Utility (MEU).

Decisions Under Uncertainty
Imagine driving in Los Angeles:
- Plan 1, the 105 freeway: an 80% chance it will succeed. The 105 is quick if flowing smoothly, but you are stuck for an hour if there is an accident. If Plan 1 succeeds the outcome is very good, but failure is very bad.
- Plan 2, the 90 freeway: a 70% chance it will succeed. The 90 is somewhat quick, but not so bad even if slow. If Plan 2 succeeds the outcome is good (not as good as Plan 1's success), but failure is not that bad.
Which one will you pick: Plan 1, because it has a high chance of success, or Plan 2, because it has a low consequence of failure? We make a choice among actions or plans given two key elements: the chance of success or failure, and the consequence (cost or reward) of success or failure.

Basic Principle
A decision-maker needs to select an action. Let Ac = {A1, A2, ..., Ai} be the set of actions and OUT = {out1, out2, ...} the set of all possible outcomes. E.g., possible actions: plan1, plan2; possible outcomes: getting home early, getting home somewhat early, stuck in traffic. P(outj | Ai) is the probability of outcome outj when performing action Ai, and U(outj) is the utility associated with each outcome.

Our Example
plan1 and plan2 are actions.
- Plan 1 uses the 105 freeway: P(home-early | plan1) = 0.8, P(stuck-105 | plan1) = 0.2; quick if flowing, stuck for an hour if slow; U(home-early) = 100, U(stuck-105) = -1000.
- Plan 2 uses the 90 freeway: P(home-somewhat-early | plan2) = 0.7, P(stuck-90 | plan2) = 0.3; quick if flowing, not bad even if slow; U(home-somewhat-early) = 50, U(stuck-90) = -10.

The expected utility of an action Ai is
  EU(Ai) = Σ_{outj ∈ OUT} U(outj) · P(outj | Ai)
and we choose the action that maximizes it:
  A* = argmax_{Ai ∈ Ac} Σ_{outj ∈ OUT} U(outj) · P(outj | Ai)

Application of the Principle
  EU(Plan1) = P(home-early | plan1) · U(home-early) + P(stuck-105 | plan1) · U(stuck-105) = 0.8·100 + 0.2·(-1000) = -120
  EU(Plan2) = P(home-somewhat-early | plan2) · U(home-somewhat-early) + P(stuck-90 | plan2) · U(stuck-90) = 0.7·50 + 0.3·(-10) = 32
EU(Plan2) is higher, so choose Plan 2.

Another View: a Decision Network
A decision node is where you play; a chance node is where nature plays. From the decision node we choose Plan1 or Plan2. Plan1 leads to success (reward 100) with probability 0.8 and failure (reward -1000) with probability 0.2; Plan2 leads to success (reward 50) with probability 0.7 and failure (reward -10) with probability 0.3:
  EU(Plan1) = 100·0.8 - 1000·0.2 = -120
  EU(Plan2) = 50·0.7 - 10·0.3 = 32
What are some differences from standard AI game-tree search?
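A minimal Python sketch (mine) of the MEU computation for the two plans above:

# A minimal sketch of the MEU principle for the two driving plans above.
actions = {
    "plan1": {"home-early": 0.8, "stuck-105": 0.2},
    "plan2": {"home-somewhat-early": 0.7, "stuck-90": 0.3},
}
utility = {"home-early": 100, "stuck-105": -1000, "home-somewhat-early": 50, "stuck-90": -10}

def expected_utility(action):
    return sum(p * utility[outcome] for outcome, p in actions[action].items())

best = max(actions, key=expected_utility)
print({a: expected_utility(a) for a in actions}, "->", best)   # plan1: -120, plan2: 32 -> plan2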
Another Decision Network Example (in class)
This example demonstrates how to build a more complex decision network, demonstrates the use of a consultant, and introduces the value of sampled/perfect information.

Which System Should We Buy?
You are the CEO of a company that is examining the possibility of replacing its current computer system with a new one.
Possible actions: a1, buy system A; a2, buy system B; a3, keep the current system.
Costs: buying system A costs $30,000; buying system B costs $22,000.
Profit: if the new system is successful, the company saves $50,000.
Survey from other customers: system A is successful in 90% of the cases, and system B in 80% of the cases.

Decision Network
(Amounts below are in thousands of dollars.) Keeping the current system (a3) yields 0. Buying A (a1): success with probability 0.9 yields 50 - 30 = 20, failure with probability 0.1 yields -30. Buying B (a2): success with probability 0.8 yields 50 - 22 = 28, failure with probability 0.2 yields -22. (Should we write -50 on the failure branches instead?)
  EMV(a1) = 0.9·20 + 0.1·(-30) = 15
  EMV(a2) = 0.8·28 + 0.2·(-22) = 18
  EMV of replacing = max(15, 18) = 18, and EMV of the overall decision = max(18, 0) = 18.

Adding a Consultant
You, as the CEO, should choose action a2, with an EMV of 18. Well, not exactly: many organizations hire a consultant to better estimate how well the system would work with the specific settings of the organization. Suppose hiring a consultant costs $3,000. However, he is not perfect; his advice Yi (Y1 = buy A, Y2 = buy B, Y3 = do nothing) depends on the true state of nature θj according to the following conditional probability table P(Yi | θj):

  State of nature                   Y1 (buy A)   Y2 (buy B)   Y3 (do nothing)   Sum
  θ1 = A succeeds, B succeeds          50%          50%            0%             1
  θ2 = A succeeds, B fails             70%          30%            0%             1
  θ3 = A fails, B succeeds             25%          75%            0%             1
  θ4 = A fails, B fails                 5%           0%           95%             1

The New Decision Network
The new question is: should you hire this consultant, and if so, should you listen to his advice? The "hire" branch costs 3 and leads to a chance node over the advice Y1, Y2, Y3; the "don't hire" branch is worth the 18 computed above. For each advice Yi we again choose among a1, a2 and a3, but with the probabilities of the states θj updated by the advice. For example, on the Y3 branch the posterior puts all the mass on θ4 (both systems fail), so EMV(a1) = -30, EMV(a2) = -22, and the best choice is a3 with EMV = 0.

Filling the Rest of the Network
The utilities depend on your action and on nature's state; the probabilities depend on the advice and on nature's state. What should you do for Y1 and Y2? First, calculate the joint probability table, recalling that P(Yi ∩ θj) = P(Yi | θj) · P(θj):

  State     Y1 (buy A)           Y2 (buy B)           Y3 (do nothing)       P(θj)
  θ1        0.72·0.5 = 0.36      0.72·0.5 = 0.36      0                     0.9·0.8 = 0.72
  θ2        0.18·0.7 = 0.126     0.18·0.3 = 0.054     0                     0.9·0.2 = 0.18
  θ3        0.08·0.25 = 0.02     0.08·0.75 = 0.06     0                     0.1·0.8 = 0.08
  θ4        0.02·0.05 = 0.001    0                    0.02·0.95 = 0.019     0.1·0.2 = 0.02
  P(Yi)     0.507                0.474                0.019                 1

Then calculate the posterior probabilities using Bayes rule, P(θj | Yi) = P(θj ∩ Yi) / P(Yi):

  State     Y1                     Y2                     Y3
  θ1        0.36/0.507 = 0.71      0.36/0.474 = 0.759     0
  θ2        0.126/0.507 = 0.249    0.054/0.474 = 0.114    0
  θ3        0.02/0.507 = 0.039     0.06/0.474 = 0.127     0
  θ4        0.001/0.507 = 0.002    0                      0.019/0.019 = 1
  Sum       1                      1                      1

Filling the Network for Y1
Given advice Y1, system A succeeds with probability 0.71 + 0.249 = 0.959 and system B succeeds with probability 0.71 + 0.039 = 0.749, so
  EMV(a1 | Y1) = 0.959·20 + 0.041·(-30) = 17.95
  EMV(a2 | Y1) = 0.749·28 + 0.251·(-22) = 15.45
and the best action after advice Y1 is a1, worth 17.95.

Filling the Network for Y2
Given advice Y2, system A succeeds with probability 0.759 + 0.114 = 0.873 and system B succeeds with probability 0.759 + 0.127 = 0.886, so
  EMV(a1 | Y2) = 0.873·20 + 0.127·(-30) = 13.65
  EMV(a2 | Y2) = 0.886·28 + 0.114·(-22) = 22.3
and the best action after advice Y2 is a2, worth 22.3.

Finalizing the Decision Network
You now know what to do with each possible piece of advice, but should you hire the consultant at all?
  EMV(with the advice, before the fee) = P(Y1)·17.95 + P(Y2)·22.3 + P(Y3)·0 = 0.507·17.95 + 0.474·22.3 + 0.019·0 = 19.67

Value of Information
  EVSI = 19.67 - 18 = 1.67
That is, the consultant's advice is worth $1,670 to you, so you are not going to pay $3,000 for it. But what if you had a perfect consultant? How much would you be willing to pay him? With perfect information you would choose the best action for each state of nature: a2 for θ1 (28), a1 for θ2 (20), a2 for θ3 (28) and a3 for θ4 (0), so the expected value under certainty is
  0.72·28 + 0.18·20 + 0.08·28 + 0.02·0 = 26
  EVPI = 26 - 18 = 8
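A Python sketch (mine) that reproduces the consultant analysis above; encoding the states θ as pairs (A succeeds?, B succeeds?) is the reconstruction used in the tables.

# Sketch reproducing the consultant analysis above (amounts in thousands of dollars).
payoff = {  # payoff[action](A_ok, B_ok)
    "a1": lambda a_ok, b_ok: 20 if a_ok else -30,
    "a2": lambda a_ok, b_ok: 28 if b_ok else -22,
    "a3": lambda a_ok, b_ok: 0,
}
# States theta1..theta4 = (A succeeds?, B succeeds?), with priors from the survey.
prior = {(True, True): 0.72, (True, False): 0.18, (False, True): 0.08, (False, False): 0.02}
advice_given_state = {  # P(Y | theta)
    (True, True):   {"Y1": 0.50, "Y2": 0.50, "Y3": 0.00},
    (True, False):  {"Y1": 0.70, "Y2": 0.30, "Y3": 0.00},
    (False, True):  {"Y1": 0.25, "Y2": 0.75, "Y3": 0.00},
    (False, False): {"Y1": 0.05, "Y2": 0.00, "Y3": 0.95},
}

def emv(action, dist):
    return sum(p * payoff[action](*state) for state, p in dist.items())

best_without_info = max(emv(a, prior) for a in payoff)                      # 18
p_advice = {y: sum(prior[s] * advice_given_state[s][y] for s in prior) for y in ["Y1", "Y2", "Y3"]}
emv_with_advice = 0.0
for y, py in p_advice.items():
    if py == 0:
        continue
    posterior = {s: prior[s] * advice_given_state[s][y] / py for s in prior}
    emv_with_advice += py * max(emv(a, posterior) for a in payoff)          # about 19.66 in total
value_under_certainty = sum(p * max(payoff[a](*s) for a in payoff) for s, p in prior.items())  # 26
print(round(emv_with_advice - best_without_info, 2),   # EVSI ~ 1.66 (the slides' rounded posteriors give 1.67)
      value_under_certainty - best_without_info)        # EVPI = 8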
Summary of Terms
- EMV (Expected Monetary Value): the expected utility of a chance node, where utility = money. Following the principle of MEU, the value of a decision node is the maximum EMV of its children.
- EVSI (Expected Value of Sampled Information): the maximum EMV with the sampled information minus the maximum EMV without any information. Also called EVII (Expected Value of Imperfect Information).
- EVPI (Expected Value of Perfect Information): the expected value under certainty minus the maximum EMV, where the expected value under certainty is Σ_j P(θj) · U(outj of the best action for θj).

R&D Analysis Using a Decision Network
The R&D goal is to find a cheaper way to produce a product. There are two possible technologies, α and β, and it is possible to eliminate the uncertainty about each by doing the R&D. The interest rate is 10%.
Which option should we choose: invest in α, or invest in β? Can we do better? Try sequential search: which option should we develop first, and what should we do if the first one fails? It seems that α dominates β: lower cost for R&D, shorter duration, higher expected utility, higher minimum utility, lower variance in the utility.

Example
(The slide shows a sequential decision tree: α costs 15, takes one period, and yields savings of 100 or 55 with probability 0.5 each; β costs 20, takes two periods, and yields savings of 240 with probability 0.2 and 0 with probability 0.8; after observing the first technology's outcome one can stop or develop the other.)
  Developing α alone: -15 + (1/1.1)·[0.5·100 + 0.5·55] = 55.45
  Developing β after α returned 55: -20 + (1/1.1²)·[0.2·240 + 0.8·55] = 56.03
  Developing β after α returned 100: -20 + (1/1.1²)·[0.2·240 + 0.8·100] = 85.8 (the tree marks this branch "stop": it is better to keep the 100)
  α first, continuing with β only after the 55 outcome: -15 + (1/1.1)·[0.5·100 + 0.5·56.03] = 55.9
  β first, continuing with α if β fails: -20 + (1/1.1²)·[0.2·240 + 0.8·55.45] = 56.33

Foundations of Decision Theory
Why is maximizing expected utility the rational way to act? Why not minimize the worst possible loss? And why should a utility function with the required properties exist at all?

Preferences
An agent chooses among prizes (A, B, etc.) and lotteries, i.e., situations with uncertain prizes; note that A and B can themselves be lotteries. Notation: L = [p, A; 1-p, B] is the lottery that yields A with probability p and B with probability 1-p.
  A ≻ B: A is strictly preferred to B
  A ~ B: indifference between A and B
  A ≿ B: A is strictly preferred to, or indifferent with, B

Constraints on Rational Preferences
We want some constraints on preferences before we call them rational. For example, an agent with intransitive preferences, (A ≻ B) ∧ (B ≻ C) ∧ (C ≻ A), can be induced to give away all of its money: if B ≻ C, an agent holding C would pay (say) 1 cent to get B; if A ≻ B, an agent holding B would pay 1 cent to get A; if C ≻ A, an agent holding A would pay 1 cent to get C, and around we go.
The preferences of a rational agent must obey the axioms of rationality: for all A, B, C,
- Orderability: exactly one of (A ≻ B), (B ≻ A), (A ~ B) holds.
- Transitivity: (A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C).
- Continuity.
- Substitutability.
- Monotonicity: A ≻ B ⇒ (p ≥ q ⇔ [p, A; 1-p, B] ≿ [q, A; 1-q, B]).
- Decomposability: [p, A; 1-p, [q, B; 1-q, C]] ~ [p, A; (1-p)q, B; (1-p)(1-q), C].

MEU Principle
Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]: given any preferences satisfying these axioms, there exists a real-valued function U such that
  U(A) > U(B) ⇔ A ≻ B,   U(A) = U(B) ⇔ A ~ B,   and   U([p1, S1; ...; pn, Sn]) = Σ_i pi · U(Si).
In other words, once the probabilities and utilities of the possible outcome states are specified, the utility of a compound lottery involving those states is completely determined. An agent can act rationally, that is, consistently with its preferences, only by choosing an action that maximizes expected utility.
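A small Python sketch (mine; the cyclic preference and the one-cent trade price are purely illustrative) of the money pump against intransitive preferences mentioned under "Constraints on Rational Preferences" above:

# Sketch: the "money pump" against intransitive preferences described above.
# The cyclic preference A > B > C > A and the 1-cent trade price are illustrative.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # left item is strictly preferred to right

def trade(holding, offered, money, price=0.01):
    """The agent pays `price` to swap `holding` for `offered` whenever it prefers `offered`."""
    if (offered, holding) in prefers and money >= price:
        return offered, money - price
    return holding, money

holding, money = "C", 1.00
for _ in range(300):   # keep offering the item the agent currently prefers to what it holds
    for offered in ["A", "B", "C"]:
        holding, money = trade(holding, offered, money)
print(holding, round(money, 2))   # the agent has cycled through A, B, C and is nearly broke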
Additional Notes on Utilities
Utilities are just a representation: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities. In other words, the existence of a utility function that describes an agent's preference behavior does not necessarily mean that the agent is explicitly maximizing that utility function in its own deliberations. The utility function is also not unique: if U(S) is a utility function of an agent, then for any constants a, b with a positive, U'(S) = aU(S) + b is also a utility function of the agent.

The Utility of Money
Suppose an agent gives you a choice. Choice 1: you get $1,000,000. Choice 2: the agent tosses a coin; if heads you win $3,000,000, and if tails you get nothing. What is your choice? Simple expected-value calculations give EU(Choice1) = $1,000,000 and EU(Choice2) = 0.5·$0 + 0.5·$3,000,000 = $1,500,000. So why do we prefer the first choice?

Answer: Risk Aversion
We are risk averse. Our utility functions for money look like this: the first million means a lot, U($1M) = 10; the second million not so much, U($2M) = 15 (not 20); the third million even less so, U($3M) = 18 (not 30). If we plot the amount of money on the x-axis and utility on the y-axis, we get a concave curve, and then
  EU(Choice1) = U($1M) = 10
  EU(Choice2) = 0.5·U($0) + 0.5·U($3M) = 0.5·0 + 0.5·18 = 9
That is why we prefer the sure $1M.

Risk Averse, Risk Neutral, Risk Seeking
(The slide plots three utility-of-money curves.) With a concave curve, U''(x) < 0 (risk averse), EU(Choice1) = 10 > EU(Choice2) = 9. With a linear curve, U''(x) = 0 (risk neutral), EU(Choice2) = 15 > EU(Choice1) = 10. With a convex curve, U''(x) > 0 (risk seeker), EU(Choice2) = 25 > EU(Choice1) = 10.

More Risk Aversion
The key is that the slope of the utility function is continuously decreasing. Thus we will refuse to play a monetarily fair bet. Example: suppose we start with x dollars and are offered a game with a 0.5 chance to win 1000 dollars and a 0.5 chance to lose 1000 dollars (c = 1000). The expected monetary gain or loss is zero (the bet is monetarily fair), so we should be neutral to it, but it seems we are not. Why? For a concave utility function,
  U(x) > U(x + c)/2 + U(x - c)/2 = EU(playing the game)
so not playing is preferred.

When Will People Be Risk Seekers?
Someone already $10,000,000 in debt might well accept a gamble on a fair coin with a gain of $10,000,000 for heads and a loss of $20,000,000 for tails.

Insurance
Consider the lottery L = [0.5, $1000; 0.5, $0]. What is its expected monetary value (EMV)? $500. What is its certainty equivalent (CE), i.e., the monetary value acceptable in lieu of the lottery, the amount satisfying U(CE(L)) = U(L)? About $400 for most people regarding L. The difference of $100 is the insurance premium. There is an insurance industry because people will pay to reduce their risk; if everyone were risk-neutral, no insurance would be needed.
Because people ascribe different utilities to different amounts of money, insurance agreements can increase both parties' expected utility. Example: you own a car, and your lottery is LY = [0.8, $0; 0.2, -$200], i.e., a 20% chance of crashing. You do not want -$200! Insurance costs $50, and your utilities are UY($0) = 0, UY(-$50) = -150, UY(-$200) = -1000. Then
  UY(LY) = 0.8·UY($0) + 0.2·UY(-$200) = -200, while UY(-$50) = -150
so you prefer to buy the insurance.
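A small Python sketch (mine; the piecewise-linear interpolation of the utility curve is an assumption for illustration) of the two risk-aversion examples above:

# Sketch: risk aversion in the two examples above.
# Utility-of-money points are taken from the slides; the interpolation is illustrative.
def u_money(m):
    """Piecewise-linear interpolation of the concave curve U($0)=0, U($1M)=10, U($2M)=15, U($3M)=18."""
    points = [(0, 0), (1_000_000, 10), (2_000_000, 15), (3_000_000, 18)]
    for (m0, u0), (m1, u1) in zip(points, points[1:]):
        if m <= m1:
            return u0 + (u1 - u0) * (m - m0) / (m1 - m0)
    return points[-1][1]

eu_choice1 = u_money(1_000_000)                              # 10
eu_choice2 = 0.5 * u_money(0) + 0.5 * u_money(3_000_000)     # 9
print(eu_choice1, eu_choice2)                                # the sure million wins

# Car insurance: your utilities for the relevant dollar amounts.
u_you = {0: 0, -50: -150, -200: -1000}
eu_no_insurance = 0.8 * u_you[0] + 0.2 * u_you[-200]         # -200
eu_insurance = u_you[-50]                                    # -150
print(eu_no_insurance, eu_insurance)                         # buying insurance has higher expected utility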
The insurance company buys the risk: its lottery is LI = [0.8, $50; 0.2, -$150], i.e., the $50 revenue plus your LY. The insurer is risk-neutral, U(L) = U(EMV(L)), so
  UI(LI) = U(0.8·50 + 0.2·(-150)) = U($10) > U($0)
and the agreement increases both parties' expected utility.

Decision Theory and AI
In a sense, the MEU principle could be seen as defining all of AI: all an intelligent agent has to do is calculate the various quantities, maximize utility over its actions, and away it goes. So is there no need for research in AI? The MEU principle formalizes the general notion that the agent should do the right thing, but it goes only a small distance toward a full operationalization of that advice. Estimating the state of the world requires perception, learning, knowledge representation, and inference. Computing the probabilities requires a complete causal model of the world, and inference in (very large) Bayesian networks is NP-hard. Computing the outcome utilities often requires searching or planning, because an agent may not know how good a state is until it knows where it can get to from that state. So decision theory is not a panacea that solves the AI problem, but it does provide a useful framework.