Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
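
The noise above is drawn without a fixed seed, so the numbers printed below will differ from run to run. A reproducible variant would use a seeded NumPy Generator; a minimal sketch (the seed value 12345 is an arbitrary choice, not from the original notebook):

rng = np.random.default_rng(12345)  # fixed seed so the draw is reproducible
y = y_true + sig * rng.standard_normal(size=nsample)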

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.981
Method:                 Least Squares   F-statistic:                     833.9
Date:                Tue, 14 Feb 2023   Prob (F-statistic):           4.35e-40
Time:                        22:28:59   Log-Likelihood:                -1.8360
No. Observations:                  50   AIC:                             11.67
Df Residuals:                      46   BIC:                             19.32
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0039      0.089     56.097      0.000       4.824       5.183
x1             0.4920      0.014     35.763      0.000       0.464       0.520
x2             0.4534      0.054      8.384      0.000       0.345       0.562
x3            -0.0189      0.001    -15.628      0.000      -0.021      -0.016
==============================================================================
Omnibus:                        3.399   Durbin-Watson:                   2.157
Prob(Omnibus):                  0.183   Jarque-Bera (JB):                3.157
Skew:                          -0.605   Prob(JB):                        0.206
Kurtosis:                       2.773   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
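
Everything reported in the summary table is also available programmatically as attributes or methods of the results object; for example:

print(olsres.params)      # estimated coefficients (compare with the true beta above)
print(olsres.bse)         # standard errors of the coefficient estimates
print(olsres.conf_int())  # 95% confidence intervals (alpha=0.05 by default)
print(olsres.rsquared)    # the R-squared shown in the table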

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.53199687  4.98668018  5.40550525  5.76376136  6.04565573  6.24690829
  6.37545485  6.45014327  6.49763682  6.54803347  6.62992089  6.76567957
  6.96780577  7.23685842  7.56136729  7.91971751  8.28370084  8.6231502
  8.91089605  9.12723024  9.26314402  9.32180816  9.31805199  9.27592698
  9.22475517  9.19431161  9.20993244  9.2883522   9.43495443  9.64288714
  9.89418856 10.16273827 10.41854863 10.63269173 10.78205213 10.85312467
 10.84423333 10.76580659 10.63866408 10.49059593 10.351797   10.24990586
 10.20546394 10.22854058 10.31707866 10.45723263 10.62564321 10.79327537
 10.93018857 11.01045603]

Create a new sample of explanatory variables, Xnew, predict, and plot.

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.0066231  10.88269213 10.65644437 10.36840101 10.07190218  9.82004745
  9.65269518  9.58670382  9.61180436  9.69311455]
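
predict returns point predictions only. For uncertainty around the out-of-sample values, the results object also exposes get_prediction, which provides standard errors and confidence/prediction intervals; a minimal sketch:

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # columns include mean, mean_ci_*, obs_ci_*
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])  # prediction intervals for new observations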

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fcb1b2969d0>
[Figure: data points, true curve, and OLS prediction (in-sample and out-of-sample)]
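
Since the fitted red curve continues past the last observed x1 value, it can help to mark where the estimation sample ends; a small optional addition, not part of the original notebook:

ax.axvline(x1[-1], color="k", linestyle="--", linewidth=1)  # boundary of the estimation sample
fig  # re-display the updated figure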

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I() function to indicate use of the identity transform. I.e., we do not want any formula expansion magic from using **2.

[9]:
res.params
[9]:
Intercept           5.003900
x1                  0.491987
np.sin(x1)          0.453415
I((x1 - 5) ** 2)   -0.018876
dtype: float64
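
As a quick consistency check (not part of the original notebook), the formula-based fit should reproduce the array-based fit above, since both use the same four regressors in the same order:

print(np.allclose(res.params, olsres.params))  # expected to print True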

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.006623
1    10.882692
2    10.656444
3    10.368401
4    10.071902
5     9.820047
6     9.652695
7     9.586704
8     9.611804
9     9.693115
dtype: float64
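
The formula interface also carries through to get_prediction, so interval estimates for the out-of-sample points can be requested with the raw x1 values; a sketch (as with predict, the formula transform is applied to the passed variables):

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # mean predictions with confidence and prediction intervals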