Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.981
Method:                 Least Squares   F-statistic:                     833.9
Date:                Tue, 14 Feb 2023   Prob (F-statistic):           4.35e-40
Time:                        22:28:59   Log-Likelihood:                -1.8360
No. Observations:                  50   AIC:                             11.67
Df Residuals:                      46   BIC:                             19.32
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0039      0.089     56.097      0.000       4.824       5.183
x1             0.4920      0.014     35.763      0.000       0.464       0.520
x2             0.4534      0.054      8.384      0.000       0.345       0.562
x3            -0.0189      0.001    -15.628      0.000      -0.021      -0.016
==============================================================================
Omnibus:                        3.399   Durbin-Watson:                   2.157
Prob(Omnibus):                  0.183   Jarque-Bera (JB):                3.157
Skew:                          -0.605   Prob(JB):                        0.206
Kurtosis:                       2.773   Cond. No.                         221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
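As a cross-check on the estimation step, the coefficients in the table can be reproduced directly from the least-squares solution. The sketch below is not part of the original notebook; it only assumes the X and y arrays defined above.
# Least-squares coefficients via the Moore-Penrose pseudo-inverse of X;
# these should agree with olsres.params up to numerical precision.
beta_hat = np.linalg.pinv(X) @ y
print(beta_hat)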
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.53199687 4.98668018 5.40550525 5.76376136 6.04565573 6.24690829
6.37545485 6.45014327 6.49763682 6.54803347 6.62992089 6.76567957
6.96780577 7.23685842 7.56136729 7.91971751 8.28370084 8.6231502
8.91089605 9.12723024 9.26314402 9.32180816 9.31805199 9.27592698
9.22475517 9.19431161 9.20993244 9.2883522 9.43495443 9.64288714
9.89418856 10.16273827 10.41854863 10.63269173 10.78205213 10.85312467
10.84423333 10.76580659 10.63866408 10.49059593 10.351797 10.24990586
10.20546394 10.22854058 10.31707866 10.45723263 10.62564321 10.79327537
10.93018857 11.01045603]
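For in-sample points, predicting on the original design matrix is the same as reading the fittedvalues attribute of the results object; a quick check, assuming the ypred and olsres objects above.
# In-sample predictions are also stored on the results object as fittedvalues.
print(np.allclose(ypred, olsres.fittedvalues))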
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[11.0066231 10.88269213 10.65644437 10.36840101 10.07190218 9.82004745
9.65269518 9.58670382 9.61180436 9.69311455]
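If you also need uncertainty around the out-of-sample predictions, the results object exposes get_prediction, which returns standard errors plus confidence and prediction intervals. A minimal sketch, assuming olsres and Xnew from the cells above.
# Point predictions with confidence and prediction intervals for the new sample.
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))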
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fcb1b2969d0>

Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier.
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2 inside the formula.
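Inside a formula, operators such as + and ** belong to the formula language rather than to Python, so wrapping an expression in I() asks the formula backend to evaluate it as plain arithmetic. A small illustrative sketch, assuming patsy is the formula backend; the printed column names are what patsy typically produces, not output copied from this notebook.
import patsy
# With I(), the whole arithmetic expression becomes a single regressor column,
# e.g. ['Intercept', 'x1', 'np.sin(x1)', 'I((x1 - 5) ** 2)'].
dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1 - 5) ** 2)", data)
print(dm.design_info.column_names)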
[9]:
res.params
[9]:
Intercept 5.003900
x1 0.491987
np.sin(x1) 0.453415
I((x1 - 5) ** 2) -0.018876
dtype: float64
Now we only have to pass the single variable x1, and we get the transformed right-hand side variables automatically.
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 11.006623
1 10.882692
2 10.656444
3 10.368401
4 10.071902
5 9.820047
6 9.652695
7 9.586704
8 9.611804
9 9.693115
dtype: float64
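Because the formula is stored with the fitted model, predict rebuilds np.sin(x1) and I((x1 - 5) ** 2) from the new x1 values alone, so the result should match the array-based out-of-sample prediction computed earlier. A quick check, assuming the res, x1n, and ynewpred objects above.
# The formula-based prediction should agree with ynewpred from the array interface.
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))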