Question

1 Approved Answer

Posted on Sep 30, 2024

Performance Difference Lemma and Policy Improvement Theorem (a) Let's first prove a useful lemma, called the Performance Difference Lemma: For any policies $\pi, \pi'$, and any


Performance Difference Lemma and Policy Improvement Theorem

(a) Let's first prove a useful lemma, called the Performance Difference Lemma: For any policies $\pi, \pi'$, and any state $s \in \mathcal{S}$,

$$V^{\pi'}(s) - V^{\pi}(s) = \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d_s^{\pi'}}\left[A^{\pi}(s', \pi')\right],$$

where $d_s^{\pi'}$ is the (normalized) discounted occupancy induced by policy $\pi'$ when starting from state $s$, i.e.

$$d_s^{\pi'}(s') = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s' \mid s_0 = s, \pi'),$$

and $A^{\pi}(s', \pi')$ is defined as

$$A^{\pi}(s', \pi') := \mathbb{E}_{a' \sim \pi'(s')}\left[Q^{\pi}(s', a')\right] - V^{\pi}(s').$$

$A^{\pi}$ is referred to as the advantage function of $\pi$.

Hint: To prove this lemma, consider a sequence of (possibly non-stationary) policies $\{\pi_i\}_{i \geq 0}$, where $\pi_0 = \pi$ and $\pi_\infty = \pi'$. For any intermediate $i$, $\pi_i$ is the non-stationary policy that follows $\pi'$ for the first $i$ time steps (i.e. time steps $t$ such that $0 \leq t < i$) and switches to $\pi$ afterwards (i.e. $t \geq i$), as shown in the figure below comparing $\pi_{i=1}$ and $\pi_{i=2}$.

[Figure: trajectories under $\pi_{i=1}$ and $\pi_{i=2}$, starting from $s_0 = s$ at $t = 0$ and marking which of the actions $a_0, a_1, a_2$ are drawn from $\pi'$ and which from $\pi$.]

Now we can rewrite the LHS of the statement as:

$$V^{\pi'}(s) - V^{\pi}(s) = \sum_{i=0}^{\infty} \left(V^{\pi_{i+1}}(s) - V^{\pi_i}(s)\right).$$

For each term $\left(V^{\pi_{i+1}}(s) - V^{\pi_i}(s)\right)$ on the RHS, observe that $\pi_{i+1}$ and $\pi_i$ are both identical to $\pi'$ for the first $i$ time steps, which induces the same state distribution at time step $i$, $\Pr(s_i \mid s_0 = s, \pi')$. They are also both identical to $\pi$ starting from state $s_{i+1}$ at time step $i+1$; so conditioned on $(s_i = s', a_i = a)$, the expected total reward for the remainder of the trajectory is $\gamma^i Q^{\pi}(s', a)$. Therefore, we have

$$V^{\pi_{i+1}}(s) - V^{\pi_i}(s) = \gamma^i \sum_{s'} \Pr(s_i = s' \mid s_0 = s, \pi') \left(\mathbb{E}_{a_i \sim \pi'(s')}\left[Q^{\pi}(s', a_i)\right] - \mathbb{E}_{a_i \sim \pi(s')}\left[Q^{\pi}(s', a_i)\right]\right),$$

because the difference between $V^{\pi_{i+1}}$ and $V^{\pi_i}$ only starts from $s'$ at time step $i$, where $\pi_{i+1}$ and $\pi_i$ choose action $a_i$ according to $\pi'$ and $\pi$, respectively.

(b) Using the performance difference lemma, prove the policy improvement theorem, i.e. prove $V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s)$ for all $s$, where $\pi_{k+1}$ and $\pi_k$ are consecutive policies in policy iteration.
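Before attempting the proofs, it can help to check both claims numerically on a small example. The sketch below is not part of the original problem: it builds a random tabular MDP (the sizes, the seed, and the helper names `random_dist`, `evaluate`, and `occupancy` are all illustrative assumptions), assumes the normalized form of the occupancy with the matching 1/(1 - gamma) factor in the lemma, and approximates the infinite sums by long finite iterations.

```python
import random

def random_dist(rng, n):
    """Hypothetical helper: a random probability vector of length n."""
    x = [rng.random() + 1e-3 for _ in range(n)]
    z = sum(x)
    return [v / z for v in x]

def evaluate(P, R, pi, gamma, sweeps=2000):
    """Iterative policy evaluation; returns (V^pi, Q^pi) as nested lists."""
    n_s, n_a = len(R), len(R[0])
    V = [0.0] * n_s
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        Q = [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
    return V, Q

def occupancy(P, pi, gamma, s0, horizon=2000):
    """Normalized discounted occupancy d_{s0}^{pi}, truncated at `horizon`."""
    n_s, n_a = len(pi), len(pi[0])
    mu = [0.0] * n_s          # state distribution at time t under pi
    mu[s0] = 1.0
    d = [0.0] * n_s
    weight = 1.0 - gamma      # (1 - gamma) * gamma^t
    for _ in range(horizon):
        for s in range(n_s):
            d[s] += weight * mu[s]
        mu = [sum(mu[s] * pi[s][a] * P[s][a][t]
                  for s in range(n_s) for a in range(n_a))
              for t in range(n_s)]
        weight *= gamma
    return d

# Assumed toy problem: 4 states, 3 actions, discount 0.9, fixed seed.
rng = random.Random(0)
n_s, n_a, gamma = 4, 3, 0.9
P = [[random_dist(rng, n_s) for _ in range(n_a)] for _ in range(n_s)]
R = [[rng.random() for _ in range(n_a)] for _ in range(n_s)]
pi = [random_dist(rng, n_a) for _ in range(n_s)]
pi_new = [random_dist(rng, n_a) for _ in range(n_s)]

V, Q = evaluate(P, R, pi, gamma)
V_new, _ = evaluate(P, R, pi_new, gamma)

# Advantage of pi evaluated on pi_new's action choice: A^pi(s, pi_new).
adv = [sum(pi_new[s][a] * Q[s][a] for a in range(n_a)) - V[s]
       for s in range(n_s)]

# Part (a): gap between the two sides of the lemma, for every start state.
gaps = []
for s0 in range(n_s):
    d = occupancy(P, pi_new, gamma, s0)
    rhs = sum(d[s] * adv[s] for s in range(n_s)) / (1.0 - gamma)
    gaps.append(abs((V_new[s0] - V[s0]) - rhs))
max_gap = max(gaps)

# Part (b): one policy-iteration step (greedy w.r.t. Q^pi) never hurts.
greedy = []
for s in range(n_s):
    best = max(range(n_a), key=lambda a: Q[s][a])
    greedy.append([1.0 if a == best else 0.0 for a in range(n_a)])
V_greedy, _ = evaluate(P, R, greedy, gamma)
improved = all(V_greedy[s] >= V[s] - 1e-9 for s in range(n_s))

print("max |LHS - RHS| over start states:", max_gap)
print("greedy policy at least as good everywhere:", improved)
```

The check for part (b) mirrors the intended argument: for the greedy policy, the advantage $A^{\pi_k}(s', \pi_{k+1})$ is non-negative at every state, so the right-hand side of the lemma is non-negative. A passing check on one random instance is only a sanity test and does not replace the proof.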