Stochastic DP

339. Stochastic DP

Extends deterministic DP to problems with random transitions. The framework underlying Markov Decision Processes (MDPs), reinforcement learning, and most multi-period stochastic optimization in operations.

339.1. Setup

At stage in state :

Choose action
Pay immediate cost
Next state is random: — transition probability depending on current state and action

After the last decision: terminal cost .

339.2. Policy (not plan)

A fixed plan no longer works — you don’t know which state will realize at each step. Instead, a policy is a function:

Total cost is a random variable; we minimize its expectation:

339.3. Stochastic Bellman equation

Same argument as the deterministic case (additivity + principle of optimality, now applied to expected cost):

Equivalently:

The expectation over replaces the deterministic transition from the deterministic Bellman.

339.4. Side-by-side comparison

339.5. Worked example

From state at stage , action “right” leads to:

with probability (immediate cost , )
with probability (immediate cost , )

If action “left” gives , optimal action at = left.

339.6. Examples

(s, S) inventory policy — order up to when inventory drops below ; optimality is a stochastic-DP result
Multi-period newsvendor — stochastic demand each period, carry inventory between periods
Revenue management — DP on (remaining capacity, time-to-go)
Optimal stopping — secretary problem, asset selling, exercise of American options
Reinforcement learning — model-free / sample-based stochastic DP

339.7. Finite vs infinite horizon

Finite horizon: just solve backward in time as in deterministic case
Infinite horizon: leads to value iteration / policy iteration

339.8. Curse of dimensionality + curse of randomness

Stochastic DP inherits the dimensionality curse from deterministic DP, plus the cost of computing expectations over the next-state distribution. Approximations:

Sampling: replace exact expectation with Monte Carlo estimate
Approximate DP / Neuro-DP: function approximation for
Reinforcement learning: sample-based, model-free

339.9. See also

Bellman Equation — deterministic version
Value Iteration
Policy Iteration
Markov Chains — underlying state-transition model