Multiple Linear Regression

188. Multiple Linear Regression

Causal vs. extrapolative. Time-series methods (SMA, ETS, ARIMA) build a forecast from the demand series alone, extracting level, trend, and seasonality from $𝑥_{1}, \dots, 𝑥_{𝑡}$. Causal models flip the framing: demand is driven by exogenous regressors (price, weather, promotions, demographics, day-of-week, …) and we fit a relationship between demand and those regressors.

𝑌_{𝑖} = 𝛽_{0} + 𝛽_{1} 𝑥_{1, 𝑖} + \dots + 𝛽_{𝑘} 𝑥_{𝑘, 𝑖} + 𝜀_{𝑖}, \qquad 𝜀_{𝑖} \overset{\text{iid}}{\sim} 𝒩(0, 𝜎^{2})

Model parameters (unknown — to be estimated)
- $𝛽_{0} \in ℝ$ : intercept
- $𝛽_{𝑗} \in ℝ$ , $𝑗 = 1, \dots, 𝑘$ : slope on regressor $𝑥_{𝑗}$
- $𝜎^{2} > 0$ : noise variance
Data
- $(𝑥_{1, 𝑖}, \dots, 𝑥_{𝑘, 𝑖}, 𝑦_{𝑖})$ for $𝑖 = 1, \dots, 𝑛$ : one row per observation
Output
- ${\hat{𝑏}}_{0}, {\hat{𝑏}}_{1}, \dots, {\hat{𝑏}}_{𝑘}$ : fitted coefficients
- ${\hat{𝑦}}_{𝑖} = {\hat{𝑏}}_{0} + \sum_{𝑗 = 1}^{𝑘} {\hat{𝑏}}_{𝑗} 𝑥_{𝑗, 𝑖}$ : predicted value
- $𝑒_{𝑖} = 𝑦_{𝑖} - {\hat{𝑦}}_{𝑖}$ : residual

188.0.1. Why “least squares”?

Three candidate loss functions for the residuals $𝑒_{𝑖} = 𝑦_{𝑖} - {\hat{𝑦}}_{𝑖}$ :

Loss	What it captures	Verdict
$\sum_{𝑖} 𝑒_{𝑖}$	Bias only	Positive and negative errors cancel — useless for accuracy
$\sum_{𝑖} \| 𝑒_{𝑖} \|$	Bias + accuracy	No closed-form minimiser — intractable analytically (OK numerically)
$\sum_{𝑖} 𝑒_{𝑖}^{2}$	Bias + accuracy + smooth	Closed form via calculus ✓
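
To make the first row of the table concrete, here is a minimal sketch comparing the three losses on a toy series. The data and the names `losses`, `y_hat_perfect`, and `y_hat_flat` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0])

# Two hypothetical predictors: a perfect fit, and a flat line at the mean.
y_hat_perfect = y.copy()
y_hat_flat = np.full_like(y, y.mean())

def losses(y, y_hat):
    """Return (sum of errors, sum of absolute errors, sum of squared errors)."""
    e = y - y_hat
    return e.sum(), np.abs(e).sum(), (e ** 2).sum()

loss_perfect = losses(y, y_hat_perfect)
loss_flat = losses(y, y_hat_flat)

print("perfect fit:", loss_perfect)
print("flat line  :", loss_flat)
```

The raw sum of errors is zero for both predictors, so it cannot tell the perfect fit from the flat line; the absolute and squared losses can.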

OLS picks $\hat{𝑏}$ to minimise the Residual Sum of Squares:

RSS (𝑏_{0}, \dots, 𝑏_{𝑘}) = \sum_{𝑖 = 1}^{𝑛} {(𝑦_{𝑖} - 𝑏_{0} - 𝑏_{1} 𝑥_{1, 𝑖} - \dots - 𝑏_{𝑘} 𝑥_{𝑘, 𝑖})}^{2}

For the single-regressor case ( $𝑘 = 1$ ), setting the partial derivatives of RSS with respect to $𝑏_{0}$ and $𝑏_{1}$ to zero gives a clean closed form:

{\hat{𝑏}}_{1} = \frac{\sum_{𝑖 = 1}^{𝑛} (𝑥_{𝑖} - \bar{𝑥}) (𝑦_{𝑖} - \bar{𝑦})}{\sum_{𝑖 = 1}^{𝑛} {(𝑥_{𝑖} - \bar{𝑥})}^{2}}, {\hat{𝑏}}_{0} = \bar{𝑦} - {\hat{𝑏}}_{1} \bar{𝑥}
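
The closed form translates almost line for line into code. The following is a sketch on made-up data; the helper name `simple_ols` and the five data points are assumptions chosen for illustration.

```python
import numpy as np

def simple_ols(x, y):
    """Closed-form OLS for one regressor: returns (intercept, slope)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations of x from its mean
    dy = y - y.mean()          # deviations of y from its mean
    b1 = (dx * dy).sum() / (dx ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = simple_ols(x, y)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

As a cross-check, `np.polyfit(x, y, 1)` fits the same straight line and should return the same slope and intercept up to floating-point error.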

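For the general case with several regressors, the same exercise in matrix form yields the normal equations $𝑋^{𝑇} 𝑋 \hat{𝑏} = 𝑋^{𝑇} 𝑦$. When $𝑋^{𝑇} 𝑋$ is invertible (no regressor is an exact linear combination of the others), the solution is: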

\hat{𝑏} = {(𝑋^{𝑇} 𝑋)}^{- 1} 𝑋^{𝑇} 𝑦

where $𝑋$ is the $𝑛 \times (𝑘 + 1)$ design matrix (a column of ones for the intercept, plus one column per regressor).
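
A short sketch of the matrix solution on simulated data follows. The "true" coefficients, noise level, and sample size are arbitrary assumptions used to show that the estimator recovers them. Solving the normal equations directly is fine for a small, well-conditioned problem; `np.linalg.lstsq` is usually the numerically safer route.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 200, 2
beta = np.array([1.0, 2.0, -0.5])   # hypothetical true (beta_0, beta_1, beta_2)
sigma = 0.1                         # hypothetical noise standard deviation

x = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), x])          # column of ones + one column per regressor
y = X @ beta + rng.normal(scale=sigma, size=n)

# Route 1: solve the normal equations X'X b = X'y (no explicit inverse).
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: least-squares solver, more robust when X'X is ill-conditioned.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", b_normal)
print("lstsq           :", b_lstsq)
```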

188.0.2. Goodness of fit ( $𝑅^{2}$ )

Provided the model includes an intercept, total variation in $𝑦$ around its mean splits cleanly into the part the model explains and the part it leaves behind:

\underbrace{\sum_{𝑖 = 1}^{𝑛} {(𝑦_{𝑖} - \bar{𝑦})}^{2}}_{\text{TSS (total)}} = \underbrace{\sum_{𝑖 = 1}^{𝑛} {({\hat{𝑦}}_{𝑖} - \bar{𝑦})}^{2}}_{\text{ESS (explained)}} + \underbrace{\sum_{𝑖 = 1}^{𝑛} {(𝑦_{𝑖} - {\hat{𝑦}}_{𝑖})}^{2}}_{\text{RSS (residual)}}

Here RSS is the same residual sum of squares that OLS minimises, and ESS is the explained sum of squares. The coefficient of determination is the fraction of variation the model explains:

𝑅^{2} = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \in [0, 1]

$𝑅^{2} = 1$ : model fits perfectly (all residuals zero)
$𝑅^{2} = 0$ : model no better than predicting $\bar{𝑦}$ for every observation

Cautions on $𝑅^{2}$ :

$𝑅^{2}$ never decreases when you add a regressor → use adjusted $𝑅^{2} = 1 - (1 - 𝑅^{2}) \frac{𝑛 - 1}{𝑛 - 𝑘 - 1}$ when comparing models of different sizes.
High $𝑅^{2}$ does not imply causality — only that variation co-moves.
Each coefficient still needs its own t-test / confidence interval; the overall F-test (which can be written in terms of $𝑅^{2}$ ) only says “at least one slope coefficient is non-zero”.
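
The decomposition and the first caution can be checked numerically. This is a sketch on simulated data: the helper `fit_summary`, the variable names, and the useless regressor `junk` are all assumptions for illustration. The sums of squares use neutral names (`ss_total`, `ss_explained`, `ss_residual`) to avoid any clash of abbreviations.

```python
import numpy as np

def fit_summary(X, y):
    """Fit OLS (X must already contain the column of ones) and report fit statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    k = p - 1                                   # number of regressors, excluding intercept
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ b
    ss_total = ((y - y.mean()) ** 2).sum()
    ss_explained = ((y_hat - y.mean()) ** 2).sum()
    ss_residual = ((y - y_hat) ** 2).sum()
    r2 = 1.0 - ss_residual / ss_total
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return {"ss_total": ss_total, "ss_explained": ss_explained,
            "ss_residual": ss_residual, "r2": r2, "adj_r2": adj_r2}

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
junk = rng.normal(size=n)                       # regressor unrelated to y by construction
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

small = fit_summary(np.column_stack([np.ones(n), x]), y)
big = fit_summary(np.column_stack([np.ones(n), x, junk]), y)

print("R2     without / with junk:", small["r2"], big["r2"])
print("adj R2 without / with junk:", small["adj_r2"], big["adj_r2"])
```

Plain $𝑅^{2}$ cannot go down when `junk` is added, whereas adjusted $𝑅^{2}$ pays a penalty for the extra parameter and may fall.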

Example

Trend + summer dummy on monthly demand (20 months, Jan year 1 — Aug year 2).

Model — level, linear trend, and a summer indicator:

𝐹_{𝑖} = 𝛽_{0} + 𝛽_{1} \cdot {Period}_{𝑖} + 𝛽_{2} \cdot {Summer}_{𝑖}, 𝑖 = 1, \dots, 20

where ${Period}_{𝑖} \in \{1, \dots, 20\}$ is a linear time index and ${Summer}_{𝑖} \in \{0, 1\}$ flags May–Aug. In the data table these two regressors appear as columns $𝑡$ and $𝑆$.

Data

Mo.	$𝑦$	$𝑡$	$𝑆$	Mo.	$𝑦$	$𝑡$	$𝑆$
Jan	3 025	1	0	Nov	3 499	11	0
Feb	3 047	2	0	Dec	3 598	12	0
Mar	3 079	3	0	Jan	3 596	13	0
Apr	3 136	4	0	Feb	3 721	14	0
May	3 454	5	1	Mar	3 745	15	0
Jun	3 661	6	1	Apr	3 650	16	0
Jul	3 554	7	1	May	4 157	17	1
Aug	3 692	8	1	Jun	4 221	18	1
Sep	3 407	9	0	Jul	4 238	19	1
Oct	3 410	10	0	Aug	4 008	20	1

Fit (e.g. numpy.linalg.lstsq, statsmodels.OLS, or Excel’s regression tool):

{\hat{𝐹}}_{𝑖} = 2969.14 + 48.03 \cdot {Period}_{𝑖} + 303.51 \cdot {Summer}_{𝑖}

Diagnostics

$𝑅^{2} = 0.958$ , adj. $𝑅^{2} = 0.953$ , residual standard error $\approx 79.21$
All three coefficients have $𝑝 < 0.001$ (t-stats: intercept $79.8$ , period $15.0$ , summer $8.05$ )

Interpretation

Intercept ( $2969$ ): baseline demand at $Period = 0$ in a non-summer month
Period ( $48$ ): underlying trend adds $\approx 48$ units per month
Summer ( $304$ ): summer months run $\approx 304$ units above the trend, holding period fixed

Forecast next month (Sep, $𝑖 = 21$ , $𝑆 = 0$ ):

{\hat{𝐹}}_{21} = 2969.14 + 48.03 (21) + 303.51 (0) = 3977.77
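
The worked example can be reproduced with a few lines of NumPy. This is a sketch, assuming the 20 demand values are exactly as tabulated above; the variable names are illustrative. The fitted coefficients and the forecast should land close to the reported figures, though the last digits may differ slightly depending on rounding in the tabulated data.

```python
import numpy as np

# Monthly demand, Jan year 1 through Aug year 2, copied from the data table.
y = np.array([3025, 3047, 3079, 3136, 3454, 3661, 3554, 3692, 3407, 3410,
              3499, 3598, 3596, 3721, 3745, 3650, 4157, 4221, 4238, 4008],
             dtype=float)

period = np.arange(1, 21)                       # linear time index 1..20
month = (period - 1) % 12 + 1                   # calendar month 1..12
summer = np.isin(month, [5, 6, 7, 8]).astype(float)   # 1 for May-Aug, else 0

# Design matrix: intercept, trend, summer dummy.
X = np.column_stack([np.ones(len(y)), period, summer])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b
r2 = 1.0 - ((y - y_hat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Forecast for period 21 (September, so the summer dummy is 0).
f21 = np.array([1.0, 21.0, 0.0]) @ b

print("coefficients (intercept, period, summer):", np.round(b, 2))
print("R^2:", round(r2, 3))
print("forecast for period 21:", round(f21, 2))
```

Note that the forecast only needs future values of the regressors, which here are known in advance: the next period index and the calendar month.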

188.0.3. When to reach for OLS

Strengths:

Coefficients are interpretable — each one quantifies a specific driver
Confidence intervals, t-tests, and F-tests come for free
Easy to fold in exogenous regressors (weather, promotions, demographics, …) that pure time-series methods cannot consume
Plays nicely with categorical predictors via dummy variables

Limitations:

Treats every observation equally — no down-weighting of stale data the way SES does
Linear in coefficients (transformations of $𝑥$ help, but the structure is rigid)
Assumes errors are iid normal; positive serial correlation in time-series residuals makes the usual OLS standard errors too small, so t-tests look more significant than they are (fix: Newey–West SEs, ARMA errors, or fall back to ETS)
Forecasting future $𝑦$ requires future $𝑥$ — fine when “Period” is the regressor, awkward when it’s “weather”
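
A quick screen for the serial-correlation limitation is the lag-1 autocorrelation of the residuals, taken in time order. The sketch below uses a hypothetical helper `lag1_autocorr` and two made-up residual sequences; it is a rough diagnostic, not a formal test.

```python
import numpy as np

def lag1_autocorr(e):
    """Sample lag-1 autocorrelation of residuals ordered in time."""
    e = np.asarray(e, dtype=float)
    e = e - e.mean()
    return (e[1:] * e[:-1]).sum() / (e ** 2).sum()

# Made-up residual patterns for illustration.
drifting = [-2.0, -1.0, 0.0, 1.0, 2.0]      # neighbours share a sign: positive correlation
alternating = [1.0, -1.0, 1.0, -1.0]        # neighbours flip sign: negative correlation

r_drifting = lag1_autocorr(drifting)
r_alternating = lag1_autocorr(alternating)

print("drifting   :", r_drifting)
print("alternating:", r_alternating)
```

Values near zero are consistent with the independence assumption; values well away from zero suggest the reported standard errors should not be taken at face value.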
