340. Value Iteration

The standard algorithm for solving infinite-horizon Markov Decision Processes. Iteratively applies the Bellman equation until the value function converges.

340.1. Infinite-horizon discounted MDP

States , actions , transition , immediate cost , discount factor .

Objective: minimize expected discounted total cost over an infinite horizon:

Since rewards / costs in the far future are exponentially discounted, the sum converges.

340.2. Bellman optimality equation (infinite horizon)

This is a fixed-point equation: where is the Bellman operator. The operator is a contraction (with contraction factor ), so the fixed point is unique and reachable by iteration.

340.3. Algorithm

Initialize J(s) ← 0 (or any function)

Repeat until convergence:
    For each s ∈ S:
        J'(s) ← min over a of { c(s, a) + γ Σ_{s'} P(s' | s, a) J(s') }
    J ← J'

Return optimal policy:
    π*(s) ← argmin over a of { c(s, a) + γ Σ_{s'} P(s' | s, a) J(s') }

Convergence test: stop when .

340.4. Why it converges

The Bellman operator is a -contraction under the sup-norm:

By Banach fixed-point theorem, iterating from any starting converges geometrically to the unique fixed point :

340.5. Convergence rate

Geometric in . To get within of takes about

If is close to 1 (long-horizon problems), convergence is slow. Acceleration tricks:

340.6. Cost per iteration

— for each state (), for each action (), sum over next states ().

340.7. Value iteration vs policy iteration

Value iteration Policy iteration
Iterates on Value function Policy
Per iteration cost Cheap () Expensive (full policy evaluation: solve linear system)
Iterations to converge Many (geometric in ) Few (often iterations)
Total cost Comparable in practice Comparable
Anytime? Yes — returns a usable at any point Only at policy-iteration end

See Policy Iteration for the alternative.

340.8. Example application

Inventory management with infinite horizon: state = on-hand inventory, action = order quantity, transition = stochastic demand. Value iteration finds the optimal policy.

340.9. See also