339. Stochastic DP
Extends deterministic DP to problems with random transitions. The framework underlying Markov Decision Processes (MDPs), reinforcement learning, and most multi-period stochastic optimization in operations.
339.1. Setup
At stage in state :
- Choose action
- Pay immediate cost
- Next state is random: — transition probability depending on current state and action
After the last decision: terminal cost .
339.2. Policy (not plan)
A fixed plan no longer works — you don’t know which state will realize at each step. Instead, a policy is a function:
Total cost is a random variable; we minimize its expectation:
339.3. Stochastic Bellman equation
Same argument as the deterministic case (additivity + principle of optimality, now applied to expected cost):
Equivalently:
The expectation over replaces the deterministic transition from the deterministic Bellman.
339.4. Side-by-side comparison
339.5. Worked example
From state at stage , action “right” leads to:
- with probability (immediate cost , )
- with probability (immediate cost , )
If action “left” gives , optimal action at = left.
339.6. Examples
- (s, S) inventory policy — order up to when inventory drops below ; optimality is a stochastic-DP result
- Multi-period newsvendor — stochastic demand each period, carry inventory between periods
- Revenue management — DP on (remaining capacity, time-to-go)
- Optimal stopping — secretary problem, asset selling, exercise of American options
- Reinforcement learning — model-free / sample-based stochastic DP
339.7. Finite vs infinite horizon
- Finite horizon: just solve backward in time as in deterministic case
- Infinite horizon: leads to value iteration / policy iteration
339.8. Curse of dimensionality + curse of randomness
Stochastic DP inherits the dimensionality curse from deterministic DP, plus the cost of computing expectations over the next-state distribution. Approximations:
- Sampling: replace exact expectation with Monte Carlo estimate
- Approximate DP / Neuro-DP: function approximation for
- Reinforcement learning: sample-based, model-free
339.9. See also
- Bellman Equation — deterministic version
- Value Iteration
- Policy Iteration
- Markov Chains — underlying state-transition model