FRM 1 QA forecasting


New Member
Page 149, 2nd column, of the GARP book on quantitative analysis states

that it is useless to use polynomial functions higher than degree 2 for forecasting, because the statistical program has the function to minimize the sum of squares, and therefore may set the coefficients beta3, beta4, etc. to nil.

But by that logic it would also set beta1 to nil...
I don't understand how the logic works. Is someone willing to help me out?


David Harper CFA FRM

Staff member
Hi @Alicante82 I don't see/recall that specific assertion, but it does remind me of Diebold Chapter 5, where he describes the problem of overfitting with a polynomial regression model. The basic (although hardly the only) way to calibrate parameters is to minimize the residual sum of squares (RSS; i.e., the sum of the squared vertical distances from each observation to the trendline, just like a univariate regression; aka, CLRM). The univariate regression, y = mX + b = β(0) + β(1)*TIME(t), is already a "first-degree polynomial regression." Unless there is absolutely no linear relation (ρ = 0), the OLS best-fit line is unlikely to return β(1) = zero; the line is likely to have some slope. In any case, there are only the intercept and slope as parameters to fit the line, so there is only so much fitting that can be done.

If we add "degrees" (time^2, time^3, ...), the line becomes very flexible. Please note: you can try this in Excel up to polynomial power 6 (select the scattered data, then Add Trendline). I am not aware of why additional betas going to zero would be a problem per se. But I am aware that, as below, beyond the 2nd or 3rd power it easily becomes over-fitting to the data (and the coefficients can become ridiculously large or small). This is much easier to see visually: in any scatterplot, higher-order polynomials just look overfitted, because they are so super-trendy that they can't possibly predict out-of-sample. The practical issue is: does the line that fits the historical data (aka, training set) really predict out of sample (aka, test set)? I hope that's helpful!
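To make the fitting mechanics concrete, here is a minimal sketch in Python/NumPy (the series, random seed, and degrees are all my own invented assumptions, not from Diebold) that fits trend polynomials of increasing degree by OLS and shows the in-sample SSR falling at each step:

```python
import numpy as np

# Made-up data: a noisy linear trend over 30 periods. Any real series would do.
rng = np.random.default_rng(42)
t = np.arange(1.0, 31.0)                        # TIME(t) = 1..30
y = 2.0 + 0.5 * t + rng.normal(0.0, 2.0, t.size)

ssrs = []
for p in range(1, 7):                           # degrees 1..6, as in Excel's Add Trendline
    coefs = np.polyfit(t, y, deg=p)             # OLS fit of a degree-p polynomial trend
    resid = y - np.polyval(coefs, t)
    ssrs.append(float(np.sum(resid ** 2)))
    print(f"degree {p}: in-sample SSR = {ssrs[-1]:.2f}")
```

Because each higher-degree model nests the lower one, the minimized SSR can only stay the same or fall as the degree rises; the improving in-sample fit is mechanical, not evidence of forecasting power.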
"Selecting forecasting models on the basis of MSE or any of the equivalent forms discussed above -- that is, using in-sample MSE to estimate the out-of-sample 1-step-ahead MSE -- turns out to be a bad idea. In-sample MSE can't rise when more variables are added to a model, and typically it will fall continuously as more variables are added. To see why, consider the fitting of polynomial trend models. In that context, the number of variables in the model is linked to the degree of the polynomial (call it p):
T(t) = β(0) + β(1)*TIME(t) + β(2)*TIME(t)^2 + β(3)*TIME(t)^3 + ... + β(p)*TIME(t)^p

We’ve already considered the cases of p=1 (linear trend) and p=2 (quadratic trend), but there’s nothing to stop us from fitting models with higher powers of time included. As we include higher powers of time, the sum of squared residuals can’t rise, because the estimated parameters are explicitly chosen to minimize the sum of squared residuals. The last-included power of time could always wind up with an estimated coefficient of zero; to the extent that the estimate is anything else, the sum of squared residuals must have fallen. Thus, the more variables we include in a forecasting model, the lower the sum of squared residuals will be, and therefore the lower MSE will be, and the higher R2 will be. The reduction in MSE as higher powers of time are included in the model occurs even if they are in fact of no use in forecasting the variable of interest. Again, the sum of squared residuals can’t rise, and due to sampling error it’s very unlikely that we’d get a coefficient of exactly zero on a newly-included variable even if the coefficient is zero in population.

The effects described above go under various names, including in-sample overfitting and data mining, reflecting the idea that including more variables in a forecasting model won't necessarily improve its out-of-sample forecasting performance, although it will improve the model's "fit" on historical data. The upshot is that MSE is a biased estimator of out-of-sample 1-step-ahead prediction error variance, and the size of the bias increases with the number of variables included in the model. The direction of the bias is downward -- in-sample MSE provides an overly-optimistic (that is, too small) assessment of out-of-sample prediction error variance.

To reduce the bias associated with MSE and its relatives, we need to penalize for degrees of freedom used." -- Diebold, Chapter 5, page 83
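Diebold's point that in-sample MSE is an overly optimistic estimator can be illustrated with a quick train/test split. This is only an illustrative sketch (Python/NumPy; the data, seed, and split point are my own assumptions, not Diebold's): fit the trend on the first part of the sample only, then compare MSE on the held-out observations:

```python
import numpy as np

# Made-up noisy linear series; first 24 points are the "in-sample" training
# set, the last 6 are the held-out "out-of-sample" test set.
rng = np.random.default_rng(1)
t = np.arange(1.0, 31.0)
y = 2.0 + 0.5 * t + rng.normal(0.0, 2.0, t.size)
t_in, y_in = t[:24], y[:24]
t_out, y_out = t[24:], y[24:]

results = {}
for p in (1, 2, 6):
    coefs = np.polyfit(t_in, y_in, deg=p)       # fit on training data only
    mse_in = float(np.mean((y_in - np.polyval(coefs, t_in)) ** 2))
    mse_out = float(np.mean((y_out - np.polyval(coefs, t_out)) ** 2))
    results[p] = (mse_in, mse_out)
    print(f"degree {p}: in-sample MSE = {mse_in:.2f}, out-of-sample MSE = {mse_out:.2f}")
```

The in-sample MSE necessarily holds or falls as the degree rises, while the out-of-sample MSE typically deteriorates badly for the high-degree fit, because the super-trendy polynomial extrapolates wildly beyond the training window.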
Hi David, in your answer to @Alicante82 you quoted Diebold's Chapter 5, which contains the statement "The last-included power of time could always wind up with an estimated coefficient of zero." My interpretation is that it simply tells us the coefficient can take any value, even zero, in order to reduce the SSR. But then there is another statement: "Again, the sum of squared residuals can't rise, and due to sampling error it's very unlikely that we'd get a coefficient of exactly zero on a newly-included variable even if the coefficient is zero in population." What does the second statement mean? What does sampling error mean? Doesn't the second statement contradict the first one? First they said it can be zero, but now it is written that it can't be zero. Can you please help me understand that?

And can someone tell me what out-of-sample one-step-ahead mean squared prediction error means? Is it similar to the sample variance that we used to find earlier by dividing the summation by n−1? Is the interpretation here the same, that we divide SSR by n−k (where k is the number of parameters estimated) to find the out-of-sample MSE? And I know one-step-ahead means one period ahead, but let's say I wish to forecast two steps ahead. How would I do that?

MSE = (summation of e²)/(N−k), where e(t) = y(t) − ŷ(t). Would that t become t+2? And one more thing: the value of y(t) is known only because I have divided the past data into in-sample and out-of-sample sets. But in reality I can't know what y(t) is, since it is in the unknown future, right? And is that why we penalize the degrees of freedom of the in-sample MSE? Can someone correct me if my interpretation is wrong?
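On the degrees-of-freedom penalty mentioned above: Diebold's penalized estimator s² divides the sum of squared residuals by T − k rather than T, where k is the number of estimated parameters. A small sketch (Python/NumPy, with invented data; applying the penalty to a trend regression is my own illustration of the idea, not a worked example from the book) comparing plain in-sample MSE with the penalized version:

```python
import numpy as np

# Made-up noisy linear trend over T = 40 periods.
rng = np.random.default_rng(7)
T = 40
t = np.arange(1.0, T + 1.0)
y = 1.0 + 0.3 * t + rng.normal(0.0, 1.5, T)

mses, s2s = [], []
for p in (1, 2, 3, 4):
    k = p + 1                                   # intercept plus p trend coefficients
    resid = y - np.polyval(np.polyfit(t, y, deg=p), t)
    ssr = float(np.sum(resid ** 2))
    mses.append(ssr / T)                        # plain in-sample MSE: falls mechanically as p grows
    s2s.append(ssr / (T - k))                   # penalized s^2: need not fall when a useless term is added
    print(f"p={p}: MSE = {mses[-1]:.3f}, s^2 = {s2s[-1]:.3f}")
```

Since T − k < T, the penalized s² always sits above the plain MSE, and the gap widens as more parameters are spent; that widening gap is the penalty for degrees of freedom used.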