340. Value Iteration
The standard algorithm for solving infinite-horizon Markov Decision Processes. Iteratively applies the Bellman equation until the value function converges.
340.1. Infinite-horizon discounted MDP
States , actions , transition , immediate cost , discount factor .
Objective: minimize expected discounted total cost over an infinite horizon:
Since rewards / costs in the far future are exponentially discounted, the sum converges.
340.2. Bellman optimality equation (infinite horizon)
This is a fixed-point equation: where is the Bellman operator. The operator is a contraction (with contraction factor ), so the fixed point is unique and reachable by iteration.
340.3. Algorithm
Initialize J(s) ← 0 (or any function)
Repeat until convergence:
For each s ∈ S:
J'(s) ← min over a of { c(s, a) + γ Σ_{s'} P(s' | s, a) J(s') }
J ← J'
Return optimal policy:
π*(s) ← argmin over a of { c(s, a) + γ Σ_{s'} P(s' | s, a) J(s') }
Convergence test: stop when .
340.4. Why it converges
The Bellman operator is a -contraction under the sup-norm:
By Banach fixed-point theorem, iterating from any starting converges geometrically to the unique fixed point :
340.5. Convergence rate
Geometric in . To get within of takes about
If is close to 1 (long-horizon problems), convergence is slow. Acceleration tricks:
- Gauss-Seidel updates (update in-place using already-updated neighbors)
- Prioritized sweeping (update states with large Bellman residuals first)
- Approximate value iteration (function approximation for large state spaces)
340.6. Cost per iteration
— for each state (), for each action (), sum over next states ().
340.7. Value iteration vs policy iteration
| Value iteration | Policy iteration | |
|---|---|---|
| Iterates on | Value function | Policy |
| Per iteration cost | Cheap () | Expensive (full policy evaluation: solve linear system) |
| Iterations to converge | Many (geometric in ) | Few (often iterations) |
| Total cost | Comparable in practice | Comparable |
| Anytime? | Yes — returns a usable at any point | Only at policy-iteration end |
See Policy Iteration for the alternative.
340.8. Example application
Inventory management with infinite horizon: state = on-hand inventory, action = order quantity, transition = stochastic demand. Value iteration finds the optimal policy.