Value Iteration

340. Value Iteration

The standard algorithm for solving infinite-horizon Markov Decision Processes. Iteratively applies the Bellman equation until the value function converges.

340.1. Infinite-horizon discounted MDP

States , actions , transition , immediate cost , discount factor .

Objective: minimize expected discounted total cost over an infinite horizon:

Since rewards / costs in the far future are exponentially discounted, the sum converges.

340.2. Bellman optimality equation (infinite horizon)

This is a fixed-point equation: where is the Bellman operator. The operator is a contraction (with contraction factor ), so the fixed point is unique and reachable by iteration.

340.3. Algorithm

Initialize J(s) ← 0 (or any function)

Repeat until convergence:
    For each s ∈ S:
        J'(s) ← min over a of { c(s, a) + γ Σ_{s'} P(s' | s, a) J(s') }
    J ← J'

Return optimal policy:
    π*(s) ← argmin over a of { c(s, a) + γ Σ_{s'} P(s' | s, a) J(s') }

Convergence test: stop when .

340.4. Why it converges

The Bellman operator is a -contraction under the sup-norm:

By Banach fixed-point theorem, iterating from any starting converges geometrically to the unique fixed point :

340.5. Convergence rate

Geometric in . To get within of takes about

If is close to 1 (long-horizon problems), convergence is slow. Acceleration tricks:

Gauss-Seidel updates (update in-place using already-updated neighbors)
Prioritized sweeping (update states with large Bellman residuals first)
Approximate value iteration (function approximation for large state spaces)

340.6. Cost per iteration

— for each state (), for each action (), sum over next states ().

340.7. Value iteration vs policy iteration

	Value iteration	Policy iteration
Iterates on	Value function	Policy
Per iteration cost	Cheap ()	Expensive (full policy evaluation: solve linear system)
Iterations to converge	Many (geometric in )	Few (often iterations)
Total cost	Comparable in practice	Comparable
Anytime?	Yes — returns a usable at any point	Only at policy-iteration end

See Policy Iteration for the alternative.

340.8. Example application

Inventory management with infinite horizon: state = on-hand inventory, action = order quantity, transition = stochastic demand. Value iteration finds the optimal policy.