341. Policy Iteration
An alternative to value iteration for infinite-horizon MDPs. Alternates between evaluating the current policy and improving it.
341.1. Algorithm
Initialize π(s) ← arbitrary action
Repeat:
# Policy evaluation: compute J^π by solving the linear system
# J^π(s) = c(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) J^π(s')
# i.e., (I - γ P_π) J^π = c_π (|S| × |S| linear system)
# Policy improvement: greedy w.r.t. J^π
For each s ∈ S:
π'(s) ← argmin over a of { c(s, a) + γ Σ_{s'} P(s' | s, a) J^π(s') }
If π' = π: stop
π ← π'
Return π*
341.2. Two phases per iteration
-
Policy evaluation: solve for the value function of the current policy . This is a linear system — exact, no approximation.
-
Policy improvement: greedy update — pick the action minimizing the Bellman expression using the current .
If policy improvement changes nothing, the current is optimal (Bellman optimality is satisfied).
341.3. Why it terminates in finite steps
Each iteration strictly improves the policy (in value) unless already optimal. There are only deterministic policies, so the algorithm must terminate. In practice, it converges in very few iterations — often or fewer, regardless of .
341.4. Cost per iteration
- Policy evaluation: (solve linear system) or fewer with iterative methods
- Policy improvement:
So per iteration, policy iteration is more expensive than value iteration. But it needs fewer iterations. Total cost is comparable.
341.5. Modified policy iteration
A middle ground: don’t fully solve the linear system in step 1. Instead, do value-iteration-like sweeps to approximate :
Repeat k times:
For each s ∈ S:
J(s) ← c(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) J(s')
Then improve. With : equivalent to value iteration. With : full policy iteration. Tuning trades off per-iteration cost vs number of iterations.
341.6. Comparison with value iteration
| Value iteration | Policy iteration | |
|---|---|---|
| Per-iter cost | low | high |
| Number of iterations | many | few |
| Convergence | asymptotic () | exact terminator (finite policy space) |
| Memory | for | for and |
| Practical | simpler, more common in DP literature | favored for small / mid state spaces |
341.7. Why policy iteration converges fast
The policy improvement theorem says that greedy update produces a policy with value at least as good componentwise — strictly better at some state if the policy wasn’t already optimal. In high-discount problems, value iteration can take thousands of iterations; policy iteration usually takes 10–50.
341.8. Linear-programming reformulation
Both value iteration and policy iteration solve the same fixed-point problem:
This LP has variables and constraints. Solving it directly is a third approach (and the connection to LP duality reveals deep structure of MDPs).
341.9. See also
- Value Iteration — the alternative
- Bellman Equation
- Stochastic DP
- Linear Programming — LP reformulation