341. Policy Iteration

An alternative to value iteration for infinite-horizon MDPs. Alternates between evaluating the current policy and improving it.

341.1. Algorithm

Initialize π(s) ← arbitrary action

Repeat:
    # Policy evaluation: compute J^π by solving the linear system
    #     J^π(s) = c(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) J^π(s')
    # i.e., (I - γ P_π) J^π = c_π        (|S| × |S| linear system)

    # Policy improvement: greedy w.r.t. J^π
    For each s ∈ S:
        π'(s) ← argmin over a of { c(s, a) + γ Σ_{s'} P(s' | s, a) J^π(s') }

    If π' = π: stop
    π ← π'

Return π*

341.2. Two phases per iteration

  1. Policy evaluation: solve for the value function of the current policy . This is a linear system — exact, no approximation.

  2. Policy improvement: greedy update — pick the action minimizing the Bellman expression using the current .

If policy improvement changes nothing, the current is optimal (Bellman optimality is satisfied).

341.3. Why it terminates in finite steps

Each iteration strictly improves the policy (in value) unless already optimal. There are only deterministic policies, so the algorithm must terminate. In practice, it converges in very few iterations — often or fewer, regardless of .

341.4. Cost per iteration

So per iteration, policy iteration is more expensive than value iteration. But it needs fewer iterations. Total cost is comparable.

341.5. Modified policy iteration

A middle ground: don’t fully solve the linear system in step 1. Instead, do value-iteration-like sweeps to approximate :

Repeat k times:
    For each s ∈ S:
        J(s) ← c(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) J(s')

Then improve. With : equivalent to value iteration. With : full policy iteration. Tuning trades off per-iteration cost vs number of iterations.

341.6. Comparison with value iteration

Value iteration Policy iteration
Per-iter cost low high
Number of iterations many few
Convergence asymptotic () exact terminator (finite policy space)
Memory for for and
Practical simpler, more common in DP literature favored for small / mid state spaces

341.7. Why policy iteration converges fast

The policy improvement theorem says that greedy update produces a policy with value at least as good componentwise — strictly better at some state if the policy wasn’t already optimal. In high-discount problems, value iteration can take thousands of iterations; policy iteration usually takes 10–50.

341.8. Linear-programming reformulation

Both value iteration and policy iteration solve the same fixed-point problem:

This LP has variables and constraints. Solving it directly is a third approach (and the connection to LP duality reveals deep structure of MDPs).

341.9. See also