339. Stochastic DP

Extends deterministic DP to problems with random transitions. The framework underlying Markov Decision Processes (MDPs), reinforcement learning, and most multi-period stochastic optimization in operations.

339.1. Setup

At stage in state :

After the last decision: terminal cost .

339.2. Policy (not plan)

A fixed plan no longer works — you don’t know which state will realize at each step. Instead, a policy is a function:

Total cost is a random variable; we minimize its expectation:

339.3. Stochastic Bellman equation

Same argument as the deterministic case (additivity + principle of optimality, now applied to expected cost):

Equivalently:

The expectation over replaces the deterministic transition from the deterministic Bellman.

339.4. Side-by-side comparison

339.5. Worked example

From state at stage , action “right” leads to:

If action “left” gives , optimal action at = left.

339.6. Examples

339.7. Finite vs infinite horizon

339.8. Curse of dimensionality + curse of randomness

Stochastic DP inherits the dimensionality curse from deterministic DP, plus the cost of computing expectations over the next-state distribution. Approximations:

339.9. See also