This dissertation employs multiple regression analysis of the kind described in any standard statistics textbook. All models are expressed in the form of the algebraic equation:
Y = a0 + a1*X1 + a2*X2 + … + ak*Xk + u
where,
Y is the dependent variable (variable to be explained) on the left‑hand side of the equation
X1,…,Xk are the independent or explanatory variables on the right‑hand side of the equation, where k represents the number of such variables
a0 is a constant coefficient (similar to the y‑intercept in a simple algebraic equation)
a1,…,ak are the coefficients of the independent variables
u is the error term
Multiple regression analysis determines the coefficients a0,…,ak that provide the “best fit” when sets of data are inserted into the equation. Since the data used in the multiple regression analyses in this dissertation are annual time-series data, a “set of data” is the unique combination of dependent and independent variables that occurs in any particular year, say 1696. “N” represents the number of sets of data under observation.
Since the calculated set of coefficients a0,…,ak is only a “best fit,” any particular set of data will rarely fit the equation exactly, so an error term u, which can be either positive or negative and differs for each set of data, is included in the equation.
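To make the procedure concrete, the following is a minimal sketch in Python using the statsmodels library; the variable names and the simulated annual data are purely illustrative and are not the series analyzed in this dissertation.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical annual time series: N = 30 "sets of data," one per year.
# (These simulated series stand in for the dissertation's actual data.)
rng = np.random.default_rng(0)
N = 30
X1 = rng.normal(size=N)               # first explanatory variable
X2 = rng.normal(size=N)               # second explanatory variable
u = rng.normal(scale=0.5, size=N)     # random error term
Y = 1.0 + 2.0 * X1 - 1.5 * X2 + u     # dependent variable

# Stack the explanatory variables and prepend a column of ones for a0.
X = sm.add_constant(np.column_stack([X1, X2]))

# Ordinary least squares chooses a0, a1, a2 to minimize the sum of
# squared errors -- the "best fit" described above.
results = sm.OLS(Y, X).fit()
print(results.params)                 # estimated coefficients a0, a1, a2
```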
The multiple regression results (whether presented in equations or tables) list the values of the calculated coefficients a0,…,ak, together with their t‑statistics, either in the right-hand column (Tables I, III, and IV) or in parentheses underneath each coefficient (Table II). A “t‑statistic” is a measure of confidence that the listed coefficient is not merely random (i.e., is significantly different from zero), and is calculated by dividing the coefficient by its standard error (the estimated standard deviation of the coefficient). For most of the analyses presented here, a t‑statistic of roughly 2 or greater indicates that there is less than a 5% chance that the coefficient is purely random. Lower t‑statistics indicate a much greater chance of randomness, and 5% is considered by most econometricians the maximum degree of chance acceptable when assessing statistical significance. Coefficients that have less than a 5% chance of being random, and are thus statistically significant at the 5% level, are marked with an asterisk (*). Where t‑statistics are listed in parentheses under each coefficient (as in Table II), they are presented as absolute values (without + or ‑ signs) simply for ease of reading, because a t‑statistic always has the same sign as its coefficient.
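As a worked illustration of that calculation, the sketch below divides a made-up coefficient by its standard error; the numbers are hypothetical and do not come from any table in this dissertation.

```python
# Hand computation of a t-statistic (hypothetical numbers):
coefficient = 0.84
standard_error = 0.40
t_statistic = coefficient / standard_error   # = 2.1

# A |t| of roughly 2 or more corresponds here to significance at the
# 5% level, so this coefficient would be marked with an asterisk (*).
print(t_statistic)
```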
A measure commonly presented in econometric analyses is the R2 statistic, the fraction of the variation in the dependent variable that is explained by the regression. Normally the higher the R2 the better the model, but the R2 statistic can be quite deceptive and is not sufficient for judging the statistical significance of any particular model, because it depends heavily on the type of model and data being tested. A model that uses individual-level data and has an R2 of 0.15 may be much better than a model that uses aggregate-level data and has an R2 of 0.90. A model with many independent variables relative to N (such as Table III) can also have a very high R2 compared with a model that has fewer independent variables relative to N (such as Table IV). When comparing two similar models estimated on the same data, however, R2 provides a quick check of which model fits better.
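For reference, R2 can be computed directly from the residuals as one minus the ratio of unexplained to total variation in the dependent variable; the sketch below uses hypothetical observed and predicted values.

```python
import numpy as np

# Hypothetical observed values of Y and the model's predictions.
y = np.array([2.0, 3.1, 4.2, 4.8, 6.1])
y_hat = np.array([2.2, 2.9, 4.0, 5.1, 5.9])

ss_res = np.sum((y - y_hat) ** 2)        # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in Y
r_squared = 1.0 - ss_res / ss_tot        # fraction of variation explained
print(round(r_squared, 3))
```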
Because we are dealing with time-series data, which often tend to be cyclical in nature, one of the most important statistics in every table presented is the “Durbin-Watson” or “D.W.” statistic. A major assumption of multiple regression analysis is that the error term (u) for any set of data is random. For annual time-series data, however, this year’s error term is often related to last year’s error term: if the model overpredicted last year, for example, it will often tend to overpredict this year as well. When the model tends to overpredict for a few years and then underpredict for a few years, this is called positive autocorrelation, which is the most common problem in time-series data. When the model bounces back and forth every year between overprediction and underprediction, this is called negative autocorrelation, which is much less common. Autocorrelation can be due to problems either with the model (model specification error) or with the data (measurement error), and there are statistical ways to test for and correct it.
The Durbin‑Watson statistic measures autocorrelation on a scale of 0 to 4, where 0 indicates perfect positive autocorrelation, 2 indicates no autocorrelation, and 4 indicates perfect negative autocorrelation. As with t‑statistics, the question is whether the departure from 2 is statistically significant at the 5% level; the critical values depend heavily on the length of the time series and on the number of independent variables in the model. For the models and data examined in this dissertation, there was no statistically significant positive or negative autocorrelation.
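For reference, the statistic is the ratio of the sum of squared year-to-year changes in the residuals to the sum of squared residuals; the sketch below, using hypothetical residuals, checks the hand computation against the statsmodels implementation.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Hypothetical regression residuals from an annual time series.
e = np.array([0.5, 0.3, -0.2, -0.4, 0.1, 0.6, -0.3, 0.2])

# D.W. = sum of squared successive differences in the error term,
# divided by the sum of squared errors; values near 2 suggest no
# autocorrelation.
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw, durbin_watson(e))   # the two computations agree
```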
[cite]