Setting
We consider the following setting:
- Discrete actions
- Discrete states
- Bounded rewards
- Stationary policies
- Infinite horizon
The optimal policy is defined as:

$$\pi^* \in \arg\max_\pi V^\pi(s), \quad \forall s \in S \tag{1}$$

and the optimal value function is:

$$V^*(s) = \max_\pi V^\pi(s), \quad \forall s \in S \tag{2}$$

There can be a whole set of policies that achieve the maximum, but there is only one optimal value function:

$$V^* = V^{\pi^*} \tag{3}$$
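To make the setting concrete, here is a minimal Python sketch on a hypothetical 2-state, 2-action toy MDP (the arrays `R`, `T`, the discount `gamma`, and the helper `policy_value` are illustrative assumptions, not part of the question). It evaluates $V^\pi$ for each deterministic stationary policy by solving the linear system $V^\pi = R_\pi + \gamma P_\pi V^\pi$, so one can see directly which policies attain the maximum in (1)-(2).

```python
import numpy as np

# Hypothetical 2-state, 2-action toy MDP (an illustration, not part of the
# question): R[s, a] are bounded rewards, T[s, a, s'] transition probabilities.
gamma = 0.9
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

def policy_value(pi):
    """V^pi for a deterministic stationary policy pi: S -> A, obtained by
    solving the linear system V = R_pi + gamma * P_pi V."""
    S = len(pi)
    R_pi = np.array([R[s, pi[s]] for s in range(S)])
    P_pi = np.array([T[s, pi[s]] for s in range(S)])
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Enumerate every deterministic stationary policy and compare V^pi state by
# state; the componentwise maximum over policies is V* from Eq. (2).
for pi in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pi, policy_value(pi))
```

On this toy instance a single policy happens to attain the maximum at every state; in general several can, but by (3) they all share the same value function.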
Question
How can we prove that there exists at least one $\pi^*$ that satisfies (1) simultaneously for all $s \in S$?
Outline of proof
1. Construct the Bellman optimality equation as a temporary surrogate definition of the optimal value function; in step 2 we prove that it is equivalent to the definition via Eq. (2):
$$V^*(s) = \max_{a \in A}\left[R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^*(s')\right] \tag{4}$$
2. Prove the equivalence between the optimal value function defined via Eq. (4) and via Eq. (2). (Note that we actually only need the necessity direction in the proof, since sufficiency is clear because we constructed Eq. (4) from Eq. (2).)
3. Prove that Eq. (4) has a unique solution.
4. By step 2, the solution obtained in step 3 is also a solution of Eq. (2), so it is the optimal value function.
5. From the optimal value function, we can recover an optimal policy by choosing a maximizing action in Eq. (4) for each state (see the numerical sketch after this list).
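As a numerical companion to steps 3-5 of this outline, here is a hedged sketch on the same hypothetical toy MDP as above (the arrays and the tolerance are illustrative assumptions). It solves Eq. (4) by repeatedly applying the Bellman update until it stops changing, then reads off a greedy policy from the maximizing actions.

```python
import numpy as np

# Solve Eq. (4) by iterating the Bellman update to its fixed point, then
# recover a greedy (optimal) policy from the maximizing actions.
gamma = 0.9
R = np.array([[1.0, 0.0], [0.0, 2.0]])                 # R[s, a]
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])               # T[s, a, s']

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * T @ V          # Q[s, a] = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
    V_new = Q.max(axis=1)          # right-hand side of Eq. (4)
    if np.max(np.abs(V_new - V)) < 1e-10:
        V = V_new
        break
    V = V_new

pi_star = Q.argmax(axis=1)         # step 5: pick a maximizing action in each state
print("V* ~", V, "   greedy policy:", pi_star)
```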
Details of the steps
Step 1
$$V^*(s) = V^{\pi^*}(s) = \mathbb{E}_{a \sim \pi^*}\left[Q^{\pi^*}(s,a)\right] \le \max_{a \in A} Q^{\pi^*}(s,a)$$
If there were a state $\tilde{s}$ such that $V^{\pi^*}(\tilde{s}) \ne \max_{a \in A} Q^{\pi^*}(\tilde{s},a)$, we could choose a better policy by maximizing $Q^*(s,a) = Q^{\pi^*}(s,a)$ over $a$, contradicting the optimality of $\pi^*$. Hence equality holds at every state, and expanding $Q^{\pi^*}(s,a) = R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^{\pi^*}(s')$ gives exactly Eq. (4).
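A small numerical illustration of this improvement argument, again on the hypothetical toy MDP used above (a sketch under those assumptions, not part of the proof): starting from a deliberately suboptimal policy, acting greedily with respect to $Q^\pi$ improves the value at the state where $V^\pi(s) < \max_a Q^\pi(s,a)$ and never hurts elsewhere.

```python
import numpy as np

# Illustration of the step-1 argument: where V^pi(s) < max_a Q^pi(s, a),
# acting greedily with respect to Q^pi gives a policy that is at least as
# good everywhere and strictly better at that state.
gamma = 0.9
R = np.array([[1.0, 0.0], [0.0, 2.0]])                 # R[s, a]
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])               # T[s, a, s']

def policy_value(pi):
    """Exact V^pi for a deterministic stationary policy pi: S -> A."""
    R_pi = R[np.arange(2), pi]
    P_pi = T[np.arange(2), pi]
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

pi = np.array([0, 0])                                  # a deliberately suboptimal policy
V_pi = policy_value(pi)
Q_pi = R + gamma * T @ V_pi                            # Q^pi(s, a)
print("V^pi          :", V_pi)
print("max_a Q^pi    :", Q_pi.max(axis=1))             # strictly larger at state 1
pi_greedy = Q_pi.argmax(axis=1)                        # improve by maximizing Q^pi over a
print("V^{pi_greedy} :", policy_value(pi_greedy))      # dominates V^pi componentwise
```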
Step 2
(⇒) Follows from step 1.
(⇐) That is: if $\tilde{V}$ satisfies $\tilde{V}(s) = \max_{a \in A}\left[R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, \tilde{V}(s')\right]$, then $\tilde{V}(s) = V^*(s) = \max_\pi V^\pi(s), \forall s \in S$.
Define the optimal Bellman operator (written $\mathcal{T}$ to distinguish it from the transition function $T$) as
$$\mathcal{T}V(s) = \max_{a \in A}\left[R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V(s')\right] \tag{5}$$
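For reference, a minimal sketch of what the operator in Eq. (5) looks like in code (the function name `bellman_operator` and the array shapes are assumptions made for illustration):

```python
import numpy as np

def bellman_operator(V, R, T, gamma):
    """Optimal Bellman operator of Eq. (5).

    R has shape (S, A), T has shape (S, A, S) with T(s, a, .) a probability
    distribution, V has shape (S,).  Returns the vector
    (TV)(s) = max_a [ R(s, a) + gamma * sum_s' T(s, a, s') V(s') ].
    """
    return (R + gamma * T @ V).max(axis=1)

# Tiny usage example on random data (illustration only):
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
R = rng.random((S, A))
T = rng.random((S, A, S))
T /= T.sum(axis=-1, keepdims=True)   # normalize so T(s, a, .) sums to 1
print(bellman_operator(np.zeros(S), R, T, gamma))
```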
So our goal is to prove that if $\tilde{V} = \mathcal{T}\tilde{V}$, then $\tilde{V} = V^*$. We show this by combining two results, following Puterman [1]:
a) If $\tilde{V} \ge \mathcal{T}\tilde{V}$, then $\tilde{V} \ge V^*$.
b) If $\tilde{V} \le \mathcal{T}\tilde{V}$, then $\tilde{V} \le V^*$.
Proof:
a)
For any $\pi = (d_1, d_2, \ldots)$,
$$\tilde{V} \ge \mathcal{T}\tilde{V} = \max_d\left[R_d + \gamma P_d \tilde{V}\right] \ge R_{d_1} + \gamma P_{d_1}\tilde{V}$$
Here $d$ is a decision rule (the action profile at a specific time), $R_d$ is the vector of immediate rewards induced by $d$, and $P_d$ is the transition matrix induced by $d$.
By induction, for any $n$,
$$\tilde{V} \ge R_{d_1} + \sum_{i=1}^{n-1}\gamma^i P^\pi_i R_{d_{i+1}} + \gamma^n P^\pi_n \tilde{V}$$
where $P^\pi_j$ denotes the $j$-step transition matrix under $\pi$.
Since
$$V^\pi = R_{d_1} + \sum_{i=1}^{\infty}\gamma^i P^\pi_i R_{d_{i+1}},$$
we have
$$\tilde{V} - V^\pi \ge \gamma^n P^\pi_n \tilde{V} - \sum_{i=n}^{\infty}\gamma^i P^\pi_i R_{d_{i+1}} \to 0 \quad \text{as } n \to \infty,$$
because the rewards are bounded and $\gamma < 1$. So $\tilde{V} \ge V^\pi$, and since this holds for any $\pi$, we conclude that
$$\tilde{V} \ge \max_\pi V^\pi = V^*.$$
b)
Follows from step 1.
Step 3
The optimal Bellman operator is a contraction in the $L_\infty$ norm, cf. [2].
Proof:
Fix any $s$ and assume without loss of generality that $\mathcal{T}V_1(s) \ge \mathcal{T}V_2(s)$. Then
$$\left|\mathcal{T}V_1(s) - \mathcal{T}V_2(s)\right| = \max_{a \in A}\left[R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V_1(s')\right] - \max_{a' \in A}\left[R(s,a') + \gamma \sum_{s' \in S} T(s,a',s')\, V_2(s')\right] \overset{(*)}{\le} \max_{a \in A}\left[\gamma \sum_{s' \in S} T(s,a,s')\left(V_1(s') - V_2(s')\right)\right] \le \gamma \left\|V_1 - V_2\right\|_\infty,$$
where in $(*)$ we used the fact that
$$\max_a f(a) - \max_{a'} g(a') \le \max_a\left[f(a) - g(a)\right].$$
Thus, by the Banach fixed point theorem, $\mathcal{T}$ has a unique fixed point.
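As a sanity check on the contraction claim (not a substitute for the proof), one can verify the inequality $\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \le \gamma\|V_1 - V_2\|_\infty$ numerically on a randomly generated MDP; the sizes and seed below are arbitrary assumptions.

```python
import numpy as np

# Empirical check of ||T V1 - T V2||_inf <= gamma * ||V1 - V2||_inf
# on a random MDP (illustration only).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
R = rng.random((S, A))
T = rng.random((S, A, S))
T /= T.sum(axis=-1, keepdims=True)   # make T(s, a, .) a probability distribution

def bellman_operator(V):
    return (R + gamma * T @ V).max(axis=1)

for _ in range(1000):
    V1, V2 = rng.normal(size=S), rng.normal(size=S)
    lhs = np.max(np.abs(bellman_operator(V1) - bellman_operator(V2)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    assert lhs <= rhs + 1e-12
print("contraction inequality held on 1000 random pairs")
```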
References
[1] Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. (2016).
[2] A. Lazaric. http://researchers.lille.inria.fr/~lazaric/Webpage/MVA-RL_Course14_files/slides-lecture-02-handout.pdf