Validation

279. Validation

Tests for building confidence that a system-dynamics model represents reality well enough to be useful. Sterman (2000, Ch. 21) gives a canonical 12-test battery, grouped below into structure-oriented and behavior-oriented tests.

279.1. Structure-oriented tests (does the model capture what it should?)

Boundary adequacy: are the right things inside vs outside the model?
Structure assessment: do the relationships match real-world causal structure?
Dimensional consistency: units balance in every equation
Parameter assessment: parameter values consistent with literature and expert estimates
Extreme conditions test: behavior stays sensible at extremes (zero inventory, infinite capacity)
Integration error test: halve the time step $\Delta t$; if behavior changes materially, the time step is too coarse
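
The integration error test can be sketched in a few lines. The example below is a minimal illustration, not a prescribed procedure: the one-stock model, the parameter values, the 1% tolerance, and the names `simulate` and `integration_error_test` are all assumptions made for the sketch, and it compares only the final stock value rather than the whole trajectory.

```python
def simulate(dt, t_end=4.0, stock0=100.0, inflow=5.0, drain_time=4.0):
    """Euler-integrate d(stock)/dt = inflow - stock / drain_time (toy model)."""
    stock = stock0
    for _ in range(round(t_end / dt)):
        stock += dt * (inflow - stock / drain_time)
    return stock


def integration_error_test(dt, rel_tol=0.01):
    """Return (passed, relative_change) after halving the time step."""
    coarse = simulate(dt)
    fine = simulate(dt / 2)
    rel_change = abs(coarse - fine) / max(abs(fine), 1e-12)
    return rel_change < rel_tol, rel_change


for dt in (2.0, 0.0625):
    ok, change = integration_error_test(dt)
    verdict = "pass" if ok else "too coarse"
    print(f"dt={dt}: relative change {change:.4f} -> {verdict}")
```

In practice the comparison would cover every variable of interest over the full run, and the tolerance would depend on the purpose of the model.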

279.2. Behavior-oriented tests (does the model produce realistic output?)

Behavior reproduction: match historical / observed data
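Behavior anomaly: changing or deleting a model assumption should produce anomalous behavior, which shows that the assumption matters
Family member: the same structure, re-parameterized, should reproduce the behavior of other instances of the same kind of system
Surprise behavior: the model should reveal dynamic patterns that were not built in or previously recognized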
Sensitivity analysis: behavior changes “reasonably” with parameter perturbations
System improvement: model leads to useful policy insight

279.3. Quantitative fit: Theil’s U decomposition

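For predicted $\hat{y}_{t}$ vs actual $y_{t}$ over $n$ data points, with $MSE = \frac{1}{n} \sum_{t} (\hat{y}_{t} - y_{t})^{2}$, Theil's inequality coefficient is:

$$U = \frac{\sqrt{MSE}}{\sqrt{\frac{1}{n} \sum_{t} \hat{y}_{t}^{2}} + \sqrt{\frac{1}{n} \sum_{t} y_{t}^{2}}}$$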

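The decomposition splits MSE into three components, each expressed as a fraction of MSE (means $\bar{\hat{y}}, \bar{y}$; standard deviations $\sigma_{\hat{y}}, \sigma_{y}$; correlation $r$ between $\hat{y}_{t}$ and $y_{t}$):

$U^{M} = (\bar{\hat{y}} - \bar{y})^{2} / MSE$ (bias): systematic mean error
$U^{S} = (\sigma_{\hat{y}} - \sigma_{y})^{2} / MSE$ (variance): mismatch in the amplitude of variation
$U^{C} = 2 (1 - r) \sigma_{\hat{y}} \sigma_{y} / MSE$ (covariance): imperfect point-by-point correlation (phase shifts, noise)

$U^{M} + U^{S} + U^{C} = 1$. Ideal model: $U^{M}$ and $U^{S}$ small, $U^{C}$ large. The remaining error is then unsystematic: the model captures the mean and the amplitude of variation and differs from the data only point by point.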
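
A minimal sketch of the decomposition follows, using the standard identity $MSE = (\bar{\hat{y}} - \bar{y})^{2} + (\sigma_{\hat{y}} - \sigma_{y})^{2} + 2 (1 - r) \sigma_{\hat{y}} \sigma_{y}$ with population (divide-by-$n$) moments. The function name `theil_decomposition` and the short series are illustrative assumptions, not data from any study.

```python
from math import sqrt


def theil_decomposition(predicted, actual):
    """Return (U_M, U_S, U_C), the bias, variance and covariance shares of MSE."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    # Population (divide-by-n) moments, so the three shares sum to exactly 1.
    sd_p = sqrt(sum((p - mean_p) ** 2 for p in predicted) / n)
    sd_a = sqrt(sum((a - mean_a) ** 2 for a in actual) / n)
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual)) / n
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n
    if mse == 0:
        raise ValueError("perfect fit: the decomposition is undefined")
    u_m = (mean_p - mean_a) ** 2 / mse
    u_s = (sd_p - sd_a) ** 2 / mse
    # 2 * (1 - r) * sd_p * sd_a, written with the covariance to avoid dividing by sd.
    u_c = 2 * (sd_p * sd_a - cov) / mse
    return u_m, u_s, u_c


actual = [10.0, 12.0, 15.0, 13.0, 11.0, 14.0]
biased = [a + 2.0 for a in actual]                 # constant offset: pure bias
noisy = [10.5, 11.0, 15.5, 13.5, 10.0, 14.5]       # hypothetical model output

for label, series in (("biased", biased), ("noisy", noisy)):
    u_m, u_s, u_c = theil_decomposition(series, actual)
    print(f"{label}: U_M={u_m:.3f} U_S={u_s:.3f} U_C={u_c:.3f}")
```

A series that differs from the data by a constant offset puts all of its error in $U^{M}$, which is the signature of a systematic bias.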

279.4. Behavior modes matter more than point fits

For SD models, qualitative behavior (does it overshoot? oscillate? saturate?) is usually more valuable than precise fit. A model that predicts the right pattern of bullwhip with the wrong amplitude is better than one with the right amplitude but the wrong pattern.

Sterman emphasizes: don’t fixate on RMSE; check that the behavior modes match.
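
One crude way to compare behavior modes rather than point errors is to classify each trajectory by how often it crosses its goal level. The sketch below is only an illustration: the classifier `behavior_mode`, its three labels, and the example series are assumptions made for this example, not a method from Sterman.

```python
def behavior_mode(series, goal, tol=1e-6):
    """Classify a trajectory by the number of times it crosses the goal level."""
    # Points that sit on the goal (within tol) are ignored.
    signs = [1 if x - goal > tol else -1 for x in series if abs(x - goal) > tol]
    crossings = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    if crossings == 0:
        return "goal-seeking"
    if crossings == 1:
        return "overshoot"
    return "oscillation"


goal = 10.0
data = [0, 4, 8, 13, 14, 10.5, 7, 9, 11, 10]     # hypothetical observations
model = [0, 6, 11, 12, 9, 8.5, 10.5, 10]         # hypothetical model output
smooth = [0, 5, 7.5, 8.75, 9.4, 9.7]             # hypothetical first-order response

print(behavior_mode(data, goal), behavior_mode(model, goal), behavior_mode(smooth, goal))
```

Here the model and the data share the same mode despite different amplitudes, while the smooth series would fit neither pattern however small its point errors.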

279.5. Calibration methods

For data-fitting:

FIMLOF (Full-Information Maximum Likelihood via Optimal Filtering): rigorous statistical calibration that uses a Kalman filter to account for measurement and process noise
Vensim Powell search: derivative-free direction-set (Powell) optimization of a payoff such as RMSE
Bayesian calibration: posterior distributions over parameters given data

For decision-rule parameters (e.g., beer-game $\alpha_{S}, \alpha_{SL}$): typically estimated by least squares from gameplay data.
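
As a rough illustration of the least-squares step, the sketch below assumes a linear anchor-and-adjust ordering rule, orders = expected demand + $\alpha_{S}$ (desired inventory - inventory) + $\alpha_{SL}$ (desired supply line - supply line), with known desired levels and no clipping of orders at zero. The rule's form, the function name `fit_ordering_rule`, and the synthetic gameplay numbers are all assumptions made for the example; a real calibration would also have to estimate the desired levels and handle the non-negativity constraint.

```python
def fit_ordering_rule(orders, expected_demand, inventory, supply_line,
                      desired_inventory, desired_supply_line):
    """Ordinary least squares for (alpha_S, alpha_SL) via the 2x2 normal equations."""
    y = [o - d for o, d in zip(orders, expected_demand)]
    x1 = [desired_inventory - s for s in inventory]        # inventory gap
    x2 = [desired_supply_line - sl for sl in supply_line]  # supply-line gap
    a11 = sum(v * v for v in x1)
    a22 = sum(v * v for v in x2)
    a12 = sum(u * v for u, v in zip(x1, x2))
    b1 = sum(u * v for u, v in zip(x1, y))
    b2 = sum(u * v for u, v in zip(x2, y))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear; parameters not identifiable")
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det


# Synthetic, noise-free gameplay generated from known parameters (illustrative only).
true_alpha_s, true_alpha_sl = 0.3, 0.1
inventory = [12, 8, 5, 9, 14, 16]
supply_line = [10, 14, 18, 15, 9, 7]
expected_demand = [4, 4, 6, 8, 8, 8]
orders = [d + true_alpha_s * (12 - s) + true_alpha_sl * (12 - sl)
          for d, s, sl in zip(expected_demand, inventory, supply_line)]

alpha_s, alpha_sl = fit_ordering_rule(orders, expected_demand, inventory,
                                      supply_line, 12, 12)
print(f"alpha_S={alpha_s:.3f}, alpha_SL={alpha_sl:.3f}")
```

Because the synthetic data contain no noise, the fit recovers the generating parameters; with real gameplay data the residuals would indicate how well the rule describes a player's decisions.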

279.6. Common pitfalls

Over-fitting to historical data: an SD model has many degrees of freedom
Ignoring extreme tests: the model “fits” but breaks at boundaries
Skipping unit checks: unit errors are a common source of model bugs
Confidence over-reach: SD models yield insight into behavior, not point forecasts

279.7. Sensitivity analysis types

Univariate: vary one parameter at a time; identify high-leverage variables
Monte Carlo / Latin Hypercube: sample parameter combinations; build envelopes of model behavior
Tornado diagram: rank parameters by impact on a metric
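
The univariate and tornado approaches can be combined in a short sketch: perturb each parameter by a fixed percentage, record the swing in a metric, and sort by swing. The toy one-stock model, the ±10% range, and the names `final_stock` and `tornado` are assumptions made for illustration.

```python
def final_stock(inflow, drain_time, initial_stock, dt=0.125, t_end=10.0):
    """Toy metric: stock at t_end for d(stock)/dt = inflow - stock / drain_time."""
    stock = initial_stock
    for _ in range(round(t_end / dt)):
        stock += dt * (inflow - stock / drain_time)
    return stock


def tornado(model, baseline, rel_change=0.10):
    """Vary one parameter at a time by +/- rel_change; rank by swing in the metric."""
    rows = []
    for name, value in baseline.items():
        low = model(**{**baseline, name: value * (1 - rel_change)})
        high = model(**{**baseline, name: value * (1 + rel_change)})
        rows.append((name, low, high, abs(high - low)))
    return sorted(rows, key=lambda row: row[3], reverse=True)


baseline = {"inflow": 5.0, "drain_time": 4.0, "initial_stock": 100.0}
ranking = tornado(final_stock, baseline)
for name, low, high, swing in ranking:
    print(f"{name:>14}: low={low:.2f} high={high:.2f} swing={swing:.2f}")
```

A one-at-a-time sweep like this misses interactions between parameters, which is why Monte Carlo or Latin Hypercube sampling is used when combinations matter.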

279.8. See also

System Dynamics overview
Numerical Integration — integration-error test
Monte Carlo Simulation — for sensitivity