Policy Iteration

341. Policy Iteration

An alternative to value iteration for infinite-horizon MDPs. Alternates between evaluating the current policy and improving it.

341.1. Algorithm

Initialize π(s) ← arbitrary action

Repeat:
    # Policy evaluation: compute J^π by solving the linear system
    #     J^π(s) = c(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) J^π(s')
    # i.e., (I - γ P_π) J^π = c_π        (|S| × |S| linear system)

    # Policy improvement: greedy w.r.t. J^π
    For each s ∈ S:
        π'(s) ← argmin over a of { c(s, a) + γ Σ_{s'} P(s' | s, a) J^π(s') }

    If π' = π: stop
    π ← π'

Return π*

341.2. Two phases per iteration

Policy evaluation: solve for the value function of the current policy . This is a linear system — exact, no approximation.
Policy improvement: greedy update — pick the action minimizing the Bellman expression using the current .

If policy improvement changes nothing, the current is optimal (Bellman optimality is satisfied).

341.3. Why it terminates in finite steps

Each iteration strictly improves the policy (in value) unless already optimal. There are only deterministic policies, so the algorithm must terminate. In practice, it converges in very few iterations — often or fewer, regardless of .

341.4. Cost per iteration

Policy evaluation: (solve linear system) or fewer with iterative methods
Policy improvement:

So per iteration, policy iteration is more expensive than value iteration. But it needs fewer iterations. Total cost is comparable.

341.5. Modified policy iteration

A middle ground: don’t fully solve the linear system in step 1. Instead, do value-iteration-like sweeps to approximate :

Repeat k times:
    For each s ∈ S:
        J(s) ← c(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) J(s')

Then improve. With : equivalent to value iteration. With : full policy iteration. Tuning trades off per-iteration cost vs number of iterations.

341.6. Comparison with value iteration

	Value iteration	Policy iteration
Per-iter cost	low	high
Number of iterations	many	few
Convergence	asymptotic ()	exact terminator (finite policy space)
Memory	for	for and
Practical	simpler, more common in DP literature	favored for small / mid state spaces

341.7. Why policy iteration converges fast

The policy improvement theorem says that greedy update produces a policy with value at least as good componentwise — strictly better at some state if the policy wasn’t already optimal. In high-discount problems, value iteration can take thousands of iterations; policy iteration usually takes 10–50.

341.8. Linear-programming reformulation

Both value iteration and policy iteration solve the same fixed-point problem:

This LP has variables and constraints. Solving it directly is a third approach (and the connection to LP duality reveals deep structure of MDPs).

341.9. See also

Value Iteration — the alternative
Bellman Equation
Stochastic DP
Linear Programming — LP reformulation