Overview
Weighting. A key question in forming an exponential smoothing line is how much to count the current period, the previous period, and earlier periods when estimating the value in the next period. Exponential smoothing software allows up to four weighting parameters to be assigned:
Selecting which parameters are needed. The researcher begins by considering which parameters need to be set. Only alpha needs to be set if there is no seasonality, no dying-out (damped) trend, and no trend at all (the series varies randomly about its mean). For alpha to apply, there does need to be autocorrelation (memory), meaning that adjacent points are not random but tend to be relatively close together, even if there is no overall trend. Autocorrelation, trends, damping, and seasonality can all be assessed by looking at a sequence plot of the time series, obtained in SPSS by selecting Graphs, Sequence. (The Sequence Plot dialog box also provides for logarithmic transformations, differencing, and seasonal differencing.)
Grid search for needed parameters. To help estimate smoothing parameters, SPSS provides a grid search function. One chooses Analyze, Time Series, Exponential Smoothing from the menus to reach the Exponential Smoothing dialog box, where the grid search is requested. The grid search function causes SPSS to create a sequence of equally spaced values for alpha and, for each value, to calculate a measure of how well the predictions agree with the actual values. The parameters that produce the smallest SSE (sum of squared errors) are the best-fitting parameters. By default, SPSS displays the 10 best-fitting sets of parameters and their corresponding SSE values. (Warning: if you are estimating more than one parameter, the size of the grid grows exponentially with the number of parameters.) After the grid search comes up with the optimal parameter setting, SPSS adds two new series to your file. The series fit_1 contains the predicted values from the exponential smoothing, and err_1 contains the errors. Select Graphs, Sequence, from the SPSS menus to obtain a sequence plot of the new smoothed series fit_1. The original series amount and the fit_1 forecasts are both shown, and their correspondence indicates the degree to which the exponential smoothing forecasts are tracking actual values.
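The grid search idea can be sketched outside SPSS. The following is a minimal illustration in Python, not a reproduction of the SPSS algorithm: it assumes simple (alpha-only) exponential smoothing, initialises the level at the first observation, and uses made-up data. The function names and the series values are hypothetical; only the names amount, fit_1, and err_1 echo the text above.

```python
def ses_fit(series, alpha):
    """One-step-ahead forecasts from simple exponential smoothing."""
    level = series[0]  # assumption: start the level at the first observation
    fits = []
    for y in series:
        fits.append(level)                        # forecast made before seeing y
        level = alpha * y + (1 - alpha) * level   # update the smoothed level
    return fits


def sse(series, fits):
    """Sum of squared one-step-ahead errors."""
    return sum((y - f) ** 2 for y, f in zip(series, fits))


def grid_search_alpha(series, start=0.0, stop=1.0, step=0.1, keep=10):
    """Try equally spaced alpha values; return the `keep` best (SSE, alpha) pairs."""
    n_steps = int(round((stop - start) / step))
    grid = [round(start + i * step, 10) for i in range(n_steps + 1)]
    results = [(sse(series, ses_fit(series, a)), a) for a in grid]
    return sorted(results)[:keep]


# Hypothetical data standing in for the series "amount".
amount = [12.0, 13.5, 12.8, 14.1, 15.0, 14.6, 15.9, 16.4, 16.0, 17.2]

best = grid_search_alpha(amount)
best_sse, best_alpha = best[0]
fit_1 = ses_fit(amount, best_alpha)                 # predicted values
err_1 = [y - f for y, f in zip(amount, fit_1)]      # errors (actual minus predicted)
```

With two or more parameters the same loop would be nested once per parameter, which is why the grid size grows so quickly.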
Models. SPSS provides exponential smoothing in the menu system under Analyze, Time Series, Exponential Smoothing. The Exponential Smoothing dialog box allows the user to select from four models:
Residual analysis. The SPSS exponential smoothing procedure automatically adds the variable err_1 to the file. This series is simply the difference between the actual value and the predicted value. It can be plotted by selecting Graphs, Sequence from the SPSS menus. This plot is inspected to ensure that residuals are randomly distributed. A finding of non-randomness indicates the model is inadequate.
Inspecting the series. Selecting Graphs, Sequence, from the SPSS menu creates a plot of the series. By inspecting the series, the researcher gains a rough impression of whether it would be reasonable to think that some sort of curve might be fitted to the pattern displayed.
Fitting curves. Select Analyze, Regression, Curve Estimation from the SPSS menus. In the Curve Estimation dialog box's "Models" section, one may check the type of curve wanted: linear, power, quadratic, cubic, inverse, logistic, exponential, or other. Only one independent is allowed. Click the Save button in the dialog to save predicted and/or residual values for each model, for purposes of later comparison. Note that the Save button allows the researcher to specify the range of observations to be predicted.
Output. For each model selected, SPSS output will show these parameters:
Thus the formula for a quadratic model will be Dependent = b1*case + b2*case-squared + b0, where case is the sequential case number (representing the time variable).
Validation. While selecting the model with the highest R-squared is tempting, it is not the recommended method. For instance, on the estimation data a cubic model will never have a lower R-squared than a quadratic model, because the quadratic is a special case of the cubic. The recommended method for selecting which model is best is cross-validation. That is, the formulas for each model based on the estimation dataset are applied to the hold-out dataset, then the R-squared values are compared based on output for the hold-out dataset. Alternatively, the determination may be made graphically by overlaying sequence plots of both models for the hold-out dataset.
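The hold-out comparison described above can be sketched as follows. This is an illustrative Python version, not SPSS output: the data are invented, the split into nine estimation cases and three hold-out cases is an arbitrary assumption, and the helper functions (solve, polyfit, predict, r_squared) are hypothetical names written here for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def polyfit(t, y, degree):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    size = degree + 1
    xtx = [[sum(ti ** (i + j) for ti in t) for j in range(size)] for i in range(size)]
    xty = [sum(yi * ti ** i for ti, yi in zip(t, y)) for i in range(size)]
    return solve(xtx, xty)


def predict(coefs, t):
    """Evaluate b0 + b1*case + b2*case**2 + ... for each case number."""
    return [sum(b * ti ** k for k, b in enumerate(coefs)) for ti in t]


def r_squared(y, yhat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


# Hypothetical series; "case" is the sequential case number.
case = list(range(1, 13))
dependent = [3.1, 4.9, 7.2, 10.1, 12.8, 17.2, 21.0, 26.1, 30.9, 37.2, 42.8, 50.1]

est_t, est_y = case[:9], dependent[:9]     # estimation dataset
hold_t, hold_y = case[9:], dependent[9:]   # hold-out dataset

quad = polyfit(est_t, est_y, 2)
cubic = polyfit(est_t, est_y, 3)

r2_est_quad = r_squared(est_y, predict(quad, est_t))
r2_est_cubic = r_squared(est_y, predict(cubic, est_t))
r2_hold_quad = r_squared(hold_y, predict(quad, hold_t))
r2_hold_cubic = r_squared(hold_y, predict(cubic, hold_t))
```

The estimation-sample R-squared cannot favor the quadratic, so the informative comparison is between r2_hold_quad and r2_hold_cubic.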
Leading Indicator Regression. While simple curve-fitting uses time (the sequential case number) as the predictor variable, in some settings one or more leading indicators may be available. A leading indicator, of course, is a variable whose value in the present period is a good predictor of the dependent variable in a future period.
Using cross-correlation to identify leading indicators and lags. The cross-correlation function (CCF) is found in the SPSS menu system under Graphs, Time Series, Cross-Correlations. The Cross-Correlations dialog box allows the researcher to select any or all time series variables and to apply first or second order differencing (or higher, but that is rarely done), as well as natural log transforms. One might apply cross-correlation to a suspected leading indicator and to the dependent variable. Upon clicking OK, Cross-Correlations will yield a cross-correlation plot in which the x axis is lags. The lag with the greatest correlation will have the highest bar. To put it another way, a good leading indicator will have a high bar on one of the positive lags (1 lag ahead or greater).
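What the cross-correlation plot computes can be illustrated with a short Python sketch. The data are constructed for the example (sales is built from inquiries three periods earlier, plus three arbitrary starting values), the variable names are hypothetical, and the estimator shown, which divides by the full-series sums of squares, is one common convention and is not claimed to match SPSS digit for digit.

```python
def cross_correlation(x, y, lag):
    """Correlation between x at time t and y at time t + lag.

    A positive lag means x leads y by that many periods.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sum((v - mean_x) ** 2 for v in x) ** 0.5
    sd_y = sum((v - mean_y) ** 2 for v in y) ** 0.5
    if lag >= 0:
        pairs = zip(x[:n - lag], y[lag:])
    else:
        pairs = zip(x[-lag:], y[:n + lag])
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (sd_x * sd_y)


# Constructed example: the suspected leading indicator and the dependent variable.
inquiries = [4, 9, 2, 7, 12, 5, 10, 3, 8, 14, 6, 4]
sales = [21, 21, 21] + [2 * v + 5 for v in inquiries[:-3]]

lags = range(-6, 7)
ccf = {k: cross_correlation(inquiries, sales, k) for k in lags}
best_lag = max(ccf, key=ccf.get)   # lag with the highest bar

for k in lags:
    print(f"lag {k:+d}: {ccf[k]: .3f}")
```

In this constructed series the highest bar falls at lag +3, which is the lag that would be carried into the next step.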
Creating the leading indicator variable. After a leading indicator is found and the optimal number of lags ahead it predicts is determined, the next step is to create a new variable which for any given time period contains the value of the indicator from the proper number of lags ago. In SPSS Trends, select Transform, Create Time Series. In the Create Time Series dialog box, move the indicator variable into the variables list. Then highlight the contents of the Name text box and type the name you want for the new variable. Then choose the Lag function from the Function drop-down list. The Order text box shows a value of 1. Highlight this and replace it with a higher lag value if CCF so indicated. Click Change. The New Variables list will now contain something like "leadvar=LAG(inquiries,3)", for the case where the newly created leading indicator variable "leadvar" is the "inquiries" variable with a lag of 3 time periods. Click OK to create the new time series. (Note that since there is a lag of 3 in this example, the first three observations will have a period, representing a missing value, since the file lacks information about the indicator prior to observation 1. Other observations will equal the value of "inquiries" three rows higher.)
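The effect of the Lag function can be shown in a few lines of Python. This is only an illustration of the resulting column, with None standing in for the SPSS missing-value period; the function name lag and the data values are hypothetical.

```python
def lag(series, order):
    """Shift a series down by `order` rows, padding the top with missing values."""
    return [None] * order + list(series[:len(series) - order])


inquiries = [4, 9, 2, 7, 12, 5, 10, 3, 8, 14, 6, 4]
leadvar = lag(inquiries, 3)   # analogous to leadvar=LAG(inquiries,3)

for row, (orig, lagged) in enumerate(zip(inquiries, leadvar), start=1):
    print(row, orig, lagged)
```

Each non-missing value of leadvar equals the value of inquiries three rows higher, and the first three rows are missing.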
Linear regression. Select Analyze, Regression, Linear, from the SPSS Base menu system. Follow normal regression procedures to specify the dependent variable and to make the new leading indicator variable the independent. If cross-validation is to be used (recommended, see below), regress only the estimation cases, saving a hold-out portion of the time series for validation. Note that time series regression frequently violates the regression assumption of uncorrelated errors. When this happens, the significance levels and goodness-of-fit statistics reported by Linear Regression are unreliable. Nonetheless, one can still use the regression equation to make forecasts on the basis of a leading indicator. The regression coefficients themselves are not biased by the autocorrelated errors.
Cross-validation. To apply the linear regression model to all observations in the time series, including the hold-out data, select Transform, Compute from the SPSS menu. In the Compute Variable dialog box, let the Target Variable be a new variable such as "predict". In the Numeric Expression text box enter the regression formula computed in the linear regression step above. It will take a form such as "32 + 1.54*leadvar", where "32" is the constant, "1.54" is the b coefficient for leadvar, and "leadvar" is the leading indicator, which will be the lagged version of some other variable (of "inquiries" in the example above). Click OK to create the new variable "predict". Go to Data, Select Cases, and select All Cases. For graphical cross-validation, obtain a sequence plot for "predict" and the dependent variable, to visually inspect how well "predict" tracks the dependent not only for the estimation cases but also for the hold-out validation observations.
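The regression and Compute steps can be sketched together in Python. The data are invented, the split (rows 4 to 9 for estimation, rows 10 to 12 held out) is an arbitrary assumption, and the names ols, predict, and holdout_errors are hypothetical; the constant and b coefficient come from the made-up data, not from the "32 + 1.54*leadvar" example.

```python
def ols(x, y):
    """Constant and slope of a one-predictor least-squares regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
    return mean_y - b * mean_x, b


# Hypothetical data: the dependent variable and the lagged leading indicator.
sales = [21, 20, 22, 14, 22, 10, 19, 30, 14, 26, 11, 20]
leadvar = [None, None, None, 4, 9, 2, 7, 12, 5, 10, 3, 8]

est_rows = range(3, 9)     # estimation cases (rows with a non-missing leadvar)
hold_rows = range(9, 12)   # hold-out cases reserved for validation

constant, b = ols([leadvar[i] for i in est_rows], [sales[i] for i in est_rows])

# Equivalent of Transform, Compute with predict = constant + b*leadvar,
# applied to all cases; rows with a missing leadvar stay missing.
predict = [None if v is None else constant + b * v for v in leadvar]

holdout_errors = [sales[i] - predict[i] for i in hold_rows]
est_residuals = [sales[i] - predict[i] for i in est_rows]
```

Plotting predict against sales over all rows, or inspecting holdout_errors, shows whether the equation tracks the dependent beyond the cases used to estimate it.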
The Autoregression procedure displays final parameters and goodness-of-fit statistics. The b coefficients and their significance in an autoregression model (after autocorrelation has been removed) may be compared with the corresponding coefficients in a simple regression model. Independents which were shown to be weak or insignificant in a simple regression model may be revealed to be significant in an autoregression model. The parameter estimates in an autoregression model are much more likely to represent the "true" relationships since correlated errors are taken into account. Autoregression is discussed further in Chapter 9 of the SPSS Trends manual.
The values of the p and q parameters may be inferred by looking at autocorrelation and partial autocorrelation functions as discussed below.
Autocorrelation and partial autocorrelation functions (ACF and PACF) can also be used to estimate p and q. Specifically, ACF and PACF plots plot deviations from zero autocorrelation by time period: the larger the positive or negative autocorrelation for a period, the longer the plot line to the right (positive) or left (negative) of zero. ACF and PACF are obtained in SPSS under Graphs/Time Series/Autocorrelations.
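For readers who want to see what lies behind the bars in these plots, the sample ACF and PACF can be computed directly. The sketch below is illustrative Python with invented data; it uses the usual sample autocorrelation formula and the Durbin-Levinson recursion for the partial autocorrelations, which is one standard method and is not claimed to be the exact routine SPSS uses.

```python
def acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
            for k in range(max_lag + 1)]


def pacf(series, max_lag):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(series, max_lag)
    partial = [1.0]
    prev = []  # coefficients of the order k-1 autoregression
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        partial.append(phi_kk)
    return partial


# Hypothetical series.
series = [2.0, 2.9, 3.4, 3.1, 2.2, 1.6, 1.9, 2.8,
          3.6, 3.3, 2.5, 1.8, 1.7, 2.4, 3.2, 3.5]

a = acf(series, 5)
p = pacf(series, 5)
for k in range(1, 6):
    print(f"lag {k}: ACF {a[k]: .3f}  PACF {p[k]: .3f}")
```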
Other rules of thumb:
AIC is the Akaike Information Criterion and is a goodness of fit measure used to assess which of two ARIMA models is better, when both have acceptable residuals. The lower the AIC, the better the model. However, this comparison may only be made with nested models. (Ex., ARIMA(0,1,0) is nested under ARIMA(1,1,0); however, ARIMA(1,0,1) and ARIMA(0,1,0) cannot be compared by AIC because neither is nested under the other.) There are also other goodness of fit measures less commonly used for this purpose, such as BIC (Bayesian Information Criterion) or SBC (Schwarz Bayesian Criterion).
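As a rough illustration of how AIC trades fit against the number of parameters, the sketch below uses the common least-squares form AIC = n ln(SSE/n) + 2k, which holds up to an additive constant under an assumption of normally distributed errors. SPSS computes AIC from the model log-likelihood, so its reported values will generally differ from these; the function name and the SSE figures are hypothetical.

```python
import math


def aic_from_sse(sse, n, k):
    """Least-squares form of AIC (up to an additive constant).

    sse: sum of squared residuals, n: number of residuals,
    k: number of estimated parameters.
    """
    return n * math.log(sse / n) + 2 * k


# Hypothetical residual sums of squares for two nested models on the same data.
aic_small = aic_from_sse(sse=100.0, n=50, k=1)
aic_large = aic_from_sse(sse=90.0, n=50, k=2)
preferred = "larger model" if aic_large < aic_small else "smaller model"
```

The extra parameter is worthwhile by this criterion only when the drop in SSE outweighs the penalty of 2 per parameter.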
In SPSS, one can also get case summary output using Analyze, Reports, Case Summaries. By doing this for cases before and after the intervention, SPSS will compute the median number of dependent variable units per time period before and after the intervention (ex., the median number of abortions per year).
Control variables in intervention analysis. When additional independent variables beyond the intervention variable are added to the equation, these serve as controls. That is, the b coefficient of the intervention variable then reflects the intervention variable's effect on the dependent controlling for other variables in the equation, just as with ordinary regression, because the b coefficients are partial coefficients. Thus, for instance, if median income were added as a control in the example above, the value of b would be the mean change in the number of abortions per year attributable to the partial birth abortion ban, controlling for median income and controlling for autoregressive and moving average effects (if specified in the ARIMA model).
More technically, significance tests of OLS regression estimates assume non-autocorrelation of the error terms. Error terms at sequential points in the series should constitute a random series. It is also assumed that the mean of the error terms will be zero (because estimates fall half above and half below the actual values), and the variance of the error terms will be constant throughout the time series. When, as in many time series, the value of a datum at time t largely determines the value of the subsequent datum at time t + 1, a dependency exists linking the error terms and the non-autocorrelation assumption is violated. The practical effect is that the significance of OLS estimates is computed to be far better than actual, leading the researcher to think that significant relationships exist when they do not. The Durbin-Watson test is the standard test for autocorrelation.
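The Durbin-Watson statistic itself is simple to compute from a residual series: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. Values near 2 are consistent with no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation. The Python sketch below uses invented residuals and covers only the statistic, not the critical-value tables needed for the formal test.

```python
def durbin_watson(residuals):
    """Durbin-Watson d statistic for a sequence of regression residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


# Hypothetical residual patterns.
runs = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]     # long runs: positive autocorrelation
alternating = [1.0, -1.0, 1.0, -1.0]         # sign flips: negative autocorrelation

d_runs = durbin_watson(runs)                 # 4/6, well below 2
d_alternating = durbin_watson(alternating)   # 12/4 = 3.0, well above 2
```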
Firebaugh (1997: 16) warns researchers to remember that one possible cause of group trends is differential recruitment into the groups rather than group effects per se. That is, Democrats and non-Democrats might be diverging on the abortion issue, for instance, because Democrats are attracting a larger percentage of pro-choice individuals to their ranks over time.
Often trends exhibit both individual and turnover effects. Firebaugh (1997: 22) recommends the construction of a cohort-by-period data array to assess relative effects. This is simply a table in which column 1 is the cohort ranges (ex., born 1950-1960, 1961-1970, etc.). Subsequent columns are the percentages (ex., percentage favoring legalization of marijuana) for each of the repeated surveys (ex., 1985 survey, 1990 survey, etc.). The last columns are the percent changes between surveys (ex., percent change between 1985 and 1990, between 1990 and 1995, etc.). Such a data array allows the researcher to visually inspect changes by cohort by period using the same expectations discussed above.
An alternative decomposition approach is to use regression, though this requires the researcher to assume that within-cohort changes are linear and additive. All respondents in all surveys are cumulated into a single dataset for this analysis. The regression formula sets the response of a given respondent in a given survey year equal to a constant plus a regression coefficient times year (of the survey for the given respondent) plus a regression coefficient times cohort (birth year of the given respondent). The regression coefficient for year is the within-cohort slope, and the estimated effect of within-cohort change equals this coefficient times the difference between the year of the last survey and the year of the first (ex., 1995 - 1985 = 10). The regression coefficient for cohort is the cross-cohort slope, and the estimated effect of cohort turnover equals this coefficient times the average year of birth in the last survey minus the average year of birth in the first survey. The ratio of the two effects is the ratio of importance of individual versus turnover effects. The two effects will sum approximately (but not exactly) to the aggregate effect. A proof of this is given by Firebaugh (1997: 25-26), who provides more extensive discussion and additional strategies.
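Written out, the decomposition just described takes the following form. The notation is an assumption introduced here for readability (y for the response of respondent i, b_0, b_1, b_2 for the constant and the two coefficients, bars for survey means) and is not taken from Firebaugh.

```latex
\begin{aligned}
y_i &= b_0 + b_1\,\mathrm{year}_i + b_2\,\mathrm{cohort}_i + e_i \\
\text{within-cohort effect} &= b_1\,\bigl(\mathrm{year}_{\text{last}} - \mathrm{year}_{\text{first}}\bigr) \\
\text{turnover effect} &= b_2\,\bigl(\overline{\mathrm{cohort}}_{\text{last}} - \overline{\mathrm{cohort}}_{\text{first}}\bigr) \\
\text{aggregate change} &\approx \text{within-cohort effect} + \text{turnover effect}
\end{aligned}
```

For surveys in 1985 and 1995, for instance, the within-cohort effect is simply 10 times b_1.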
There is no solution to this identification problem, only strategies for trying to deal with it. Sometimes it is possible to fix the value of the regression coefficient for one of the effects, typically to zero. For instance, in a study of bisexual vs. homosexual preferences, one might assume that the age effect was zero and all changes over time were due to period and cohort effects. One could then use period and cohort as independents in a regression in which sexual preference was the dependent, but results would be invalid if the assumption of no age effect were false. A way of wrestling with such assumptions is to run three regressions, each time fixing one of the effects (age, cohort, period) to zero, then examining the resulting coefficients to assess whether, on the basis of external information, all three models seem plausible. One may find, for instance, that a regression coefficient approaches zero for one of the models, yet one has reason to believe that the effect for that coefficient does indeed exist, meaning that that model is not plausible.
In regard to computational aspects of ARIMA, McDowell et al. cite the original FORTRAN programs distributed by Glass et al. (p. 14). Output is presented for Box-Tiao time series with SCRUNCH, and a citation to the software is provided (pp. 26-27). Writing in 1980, largely before the PC revolution, the authors simply note that BMDP, MINITAB, SAS, and SPSS all implement ARIMA modeling. They recommend BMDP for instructional purposes and in an appendix provide the addresses for six vendors: BMDP, SAS, SPSS, MINITAB, IMSL, and PACK. There is no discussion of the command-level use of any of these packages, either in general or for specific recommended analytic strategies.