Sunday, September 8, 2019

How to Interpret Linear Regression in SPSS

1. Introduction

Linear regression analysis is probably one of the most common terms you have heard in your graduate studies. For some, the term may evoke fascination and excitement because of the potential of the method. For others, it may bring apprehension especially if this is the first time they have heard about it. In any case, I hope by now you have a good idea of what regression analysis is all about.

With the advent of computers, linear regression analysis is one of those quantitative methods that are now easy to conduct, since the computational aspects of the method are well taken care of. What is not taken care of is (a) the correct specification of the model we will use in the regression analysis, and (b) the interpretation of the results of the regression analysis.

2. The Model

Linear regression analysis starts with a model expressed as a general function. In the example we will use throughout this post, the model is:

Wage = f(age, work experience, education, gender, sector)

This model assumes that the wage a person receives is a function of his/her age, work experience, education (which may be measured in years of schooling completed), gender (male or female), and sector (private or public). The dependent variable in this model is wage, and the independent or explanatory variables are age, work experience, education, gender, and sector. Note that two of the explanatory variables are categorical, that is, either-or. What is often done with this type of variable is to assign a number, say 0 for male and 1 for female, and 0 for the private and 1 for the public sector. Another term for such a variable is a dummy variable.

If we assume that the relationship between the dependent and the explanatory variables is linear, we can express the specific function as:


Wage = B1 + B2*Age + B3*work experience + B4*education + B5*gender + B6*sector + ε

We then collect data through, say, a sample survey. In the example we are using, the sample size is 1,993.
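For readers who want to check the same specification outside SPSS, here is a minimal sketch in Python using statsmodels, assuming a hypothetical data file wages.csv with columns named wage, age, work_ex, education, gender (0 = male, 1 = female), and pub_sec (0 = private, 1 = public):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file and column names; the actual SPSS data set may differ.
df = pd.read_csv("wages.csv")

# Ordinary least squares with the same specification as the model above.
model = smf.ols("wage ~ age + work_ex + education + gender + pub_sec", data=df)
results = model.fit()
print(results.summary())  # reports the F statistic, R-square, and coefficients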

It is important to note that the method of linear regression analysis is based on the following five key assumptions (stated here for two-variable linear regression analysis), which will be discussed in the next post:

(a) Linearity
(b) Multivariate normality
(c) No or little multicollinearity
(d) No auto-correlation
(e) Homoscedasticity

The reason we enumerate these assumptions is that the tests we conduct in linear regression analysis depend on them. In addition, all the problems we encounter in linear regression analysis may be traced to a violation of one or more of the above assumptions.


3. Linear Regression Analysis SPSS Results and their Explanation

After you have successfully run the regression in SPSS, the results are displayed as a series of tables. Below are some of these tables and their explanations.

3.1 The ANOVA table

When you do linear regression analysis, this is the first table you should examine, because it gives you an indication of the goodness-of-fit of your model to the data. The ANOVA table for our example model is reproduced below.

It answers the question: did the model explain the deviations in the dependent variable? The answer is found in the last column of the ANOVA table, labeled Sig. The value in this column is the prob-value or p-value of the computed F, which is the figure in the second-to-last column.

We will not go into how the p-value is derived or computed, since this is one of the bases of hypothesis testing which you learn in your statistics courses. The traditional way, of course, is to look up the tabular value of F(5, 1987, 0.05) in the F-table and compare it with the computed F-value, which is 414.262. The p-value does not require you to look at the F-table; instead, you compare the p-value with your specified level of significance, usually 0.05 or 0.01. If the p-value is less than your level of significance, you reject your null hypothesis; otherwise, you fail to reject it.
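As an illustration only (this is not part of the SPSS output), both the tabular F value and the p-value for this test can be computed in Python with scipy:

from scipy import stats

f_computed = 414.262              # computed F from the ANOVA table
df_regression, df_residual = 5, 1987

# Tabular value of F(5, 1987, 0.05), i.e. the 95th percentile of the F distribution
f_critical = stats.f.ppf(0.95, df_regression, df_residual)

# p-value: probability of an F at least this large when the null hypothesis is true
p_value = stats.f.sf(f_computed, df_regression, df_residual)

print(f_critical)  # roughly 2.2
print(p_value)     # effectively zero, which is why SPSS reports Sig. = .000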


Table 1. ANOVA

              Sum of Squares      df     Mean Square         F      Sig.
Regression        54,514.39        5       10,902.88   414.262      .000
Residual          52,295.48     1987           26.32
Total            106,809.90     1992

From the above table, we can see that the p-value, that is, the Sig., is less than 0.05 (and 0.01); therefore, we reject the null hypothesis. Thus, we conclude that the model fits the data.

Note that the degrees of freedom of the Regression Sum of Squares is equal to the number of variables minus one; since we have a six-variable regression, the df of the Regression Sum of Squares is 6 − 1 = 5. The degrees of freedom of the Total Sum of Squares (TSS) is the sample size minus one, or 1993 − 1 = 1992. The degrees of freedom of the Residual (or Error) Sum of Squares is the df of the TSS minus the df of the Regression Sum of Squares, or 1992 − 5 = 1987.

The Regression Mean Square (RMS) is the Regression Sum of Squares divided by the df of the Regression Sum of Squares. The Residual Mean Square is the Residual Sum of Squares divided by the df of the Residual Sum of Squares. The results of these operations appear in the ANOVA table under the Mean Square column.
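The arithmetic behind these degrees of freedom and mean squares can be reproduced with a few lines of Python, using the sums of squares quoted from Table 1:

n = 1993                      # sample size
k = 5                         # number of explanatory variables
tss = 106809.9                # Total Sum of Squares
reg_ss = 54514.39             # Regression Sum of Squares
res_ss = 52295.48             # Residual Sum of Squares

df_reg = k                    # 5
df_total = n - 1              # 1992
df_res = df_total - df_reg    # 1987

reg_ms = reg_ss / df_reg      # Regression Mean Square, about 10,902.9
res_ms = res_ss / df_res      # Residual Mean Square, about 26.3
f_stat = reg_ms / res_ms      # about 414, matching the F in the ANOVA table

print(df_reg, df_res, df_total)
print(round(reg_ms, 2), round(res_ms, 2), round(f_stat, 2))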

What exactly happened in the ANOVA? Well, what the ANOVA really does is to compare two models:


(1) Wage = B1 + B2*Age + B3*work experience + B4*education +  B5*gender + B6*sector
and
(2) Wage = B1

This is because when we do the ANOVA we are actually conducting an F-test, whose null hypothesis is:

H0: β2 = β3 = β4 = β5 = β6 = 0

The alternative hypothesis of course is:

Ha: at least one of β2, β3, β4, β5, β6 ≠ 0

If the F test is not significant, that is, if we fail to reject the null hypothesis, then what we have is Model 2, that is, Wage = B1, instead of Model 1. This means, of course, that the model is useless: the explanatory variables cannot predict the behavior of the dependent variable. If the F test is significant, we reject the null hypothesis and accept the alternative hypothesis, which means that the model can explain the variation in the dependent variable. In regression analysis, therefore, we are very happy when we reject the null hypothesis.
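In code, this is a comparison of the full model against the intercept-only model. Here is a hedged sketch using statsmodels and the same hypothetical wages.csv as before:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("wages.csv")  # hypothetical data file, as in the earlier sketch

# Model 1: the full specification
full = smf.ols("wage ~ age + work_ex + education + gender + pub_sec", data=df).fit()

# Model 2: intercept only, i.e. Wage = B1
restricted = smf.ols("wage ~ 1", data=df).fit()

# F test that all slope coefficients are jointly zero
f_stat, p_value, df_diff = full.compare_f_test(restricted)
print(f_stat, p_value, df_diff)  # the same F and Sig. reported in the ANOVA table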

We now explain the other terms in the ANOVA table. The rows are Regression, Residual, and Total, and the Sum of Squares column gives the corresponding sums of squares: for the Total we have 106,809.9, for the Residual 52,295.48, and for the Regression 54,514.39. These terms and values are explained as follows:

Total Sum of Squares (TSS): This is the total deviations or variations in the dependent variable. The objective of regression is to explain these variations. This is done by finding the best β’s that can minimize the sum of the squares of these deviations or variations.

Explained Sum of Squares (ESS) (in ANOVA it appears as Regression). The ESS gives us the amount of TSS that could be explained by the model. Naturally, the more the model can explain the variation in the dependent variable the better.

Residual Sum of Squares (RSS): The RSS is the amount that could not be explained by the model. It is also equal to TSS − ESS.

Note that R² is ESS divided by TSS. It captures the percent of the deviation from the mean in the dependent variable that could be explained by the model.
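As a quick check in Python, using the ANOVA figures quoted above:

ess = 54514.39     # Regression (explained) Sum of Squares
tss = 106809.9     # Total Sum of Squares
r_square = ess / tss
print(round(r_square, 3))  # about 0.510, i.e. 51% of the variation is explained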

3.2 The Model Summary Table

alt="Model Summary in SPSS. ANOVA table in linear regression"

R – Square
R – square measures the proportion of the variation in the dependent variable (wage) that is explained by the variations in the independent variables. In the above table, 51% of the variation in the dependent variable is explained. R – square takes values from 0 to 1.

Adjusted R – Square
The adjusted R – square measures the proportion of the variance in the dependent variable that is explained by the variations in the independent variables, adjusted for the degrees of freedom. In the above table, 50.9% of the variance is explained. Adjusted R – square can take a negative value.


The Adjusted R – square came about because of several problems associated with the use of R². First, all our statistical results follow from the initial assumption that the model is correct, and we have no statistical procedure to compare alternative specifications. Second, R² is sensitive to the number of independent variables included in the model: the addition of more independent variables to the regression equation can never lower R² and is likely to raise it, because additional independent variables leave the TSS unchanged but can never lower, and are likely to increase, the ESS. Finally, the interpretation of R² becomes difficult when the model is formulated with a zero intercept.

The difficulty with R² as a measure of goodness of fit is that it pertains to the explained and unexplained variation in Y and therefore does not account for the number of degrees of freedom in the problem. A natural solution is to work with variances, not variations, thus eliminating the dependence of goodness of fit on the number of independent variables in the model.

Adjusted R – square is computed as:

R²adj = 1 − (1 − R²) × (n − 1) / (n − k)

where n is the sample size and k is the number of independent variables.

Note that from the above formula, we have:

(a) If k = 1, then R² = R²adj.
(b) If k is greater than 1, then R² ≥ R²adj.
(c) R²adj can be negative.
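Plugging the figures from this example into the formula (a quick check in Python; the value SPSS reports may differ slightly in later decimal places):

n = 1993           # sample size
k = 5              # number of independent variables
r_square = 0.510   # from the Model Summary table

adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k)
print(round(adj_r_square, 3))  # about 0.509, consistent with the Adjusted R Square reported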


Standard Error of Estimate

The Standard Error of the Estimate measures the dispersion of the observed values of the dependent variable around the regression line; it is the square root of the Residual Mean Square. The Standard Error of the Estimate divided by the mean of the dependent variable is the Coefficient of Variation. If this value is more than 10%, that is, more than 10% of the mean, it is considered high.
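Because it is the square root of the Residual Mean Square, the Standard Error of the Estimate can be recovered from the ANOVA figures above (illustrative only; the exact value appears in the SPSS Model Summary table):

res_ss = 52295.48          # Residual Sum of Squares from the ANOVA table
df_res = 1987              # residual degrees of freedom
res_ms = res_ss / df_res   # Residual Mean Square, about 26.3
see = res_ms ** 0.5        # Standard Error of the Estimate, about 5.1
print(round(see, 2))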

3.3 The Coefficients Table

alt="Coefficient Table. ANOVA table in linear regression"

The B Coefficients
The B coefficients, or regression coefficients, measure the effect of each individual independent variable on the dependent variable. For example, an additional year in school increases the wage by 0.777, being in the public sector increases the wage by 1.741, and being a woman decreases the wage by 2.030 (that is, if female = 1 and male = 0).


The Reliability of Individual Coefficients
The columns labeled t and Sig. provide the information on the significance of a particular independent variable. As before, the Sig. is the p-value. If the p-value or Sig. is greater than 0.05 (that is, if the level of significance you set is 5%), then the coefficient estimate is not reliable because it has too much dispersion or variance. Note that the test you are conducting here is the t-test, with the null hypothesis βi = 0.
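The t statistic for a coefficient is simply the estimate divided by its standard error. The sketch below illustrates this in Python; the standard error is hypothetical, since the Std. Error column of Table 3 is not quoted in the text:

from scipy import stats

b_education = 0.777     # coefficient of education from Table 3
se_education = 0.03     # hypothetical standard error, for illustration only
df_residual = 1987      # residual degrees of freedom from the ANOVA table

t_stat = b_education / se_education
p_value = 2 * stats.t.sf(abs(t_stat), df_residual)  # two-sided Sig.

# Reject H0: beta = 0 when the p-value is below the chosen significance level
print(round(t_stat, 2), p_value)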


Confidence Interval
The confidence interval columns give the range of values of B within the 95% confidence limits. Loosely speaking, this means we are 95% confident that the true value of the coefficient falls within this interval.
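A 95% confidence interval is the coefficient plus or minus the critical t value times its standard error (again using the hypothetical standard error from the previous sketch):

from scipy import stats

b = 0.777               # coefficient of education from Table 3
se = 0.03               # hypothetical standard error, for illustration only
df_residual = 1987

t_crit = stats.t.ppf(0.975, df_residual)  # about 1.96 for large df
lower, upper = b - t_crit * se, b + t_crit * se
print(round(lower, 3), round(upper, 3))   # the 95% confidence limits for B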

4. The Linear Regression Analysis Equation

The linear regression analysis equation is found in Table 3; specifically, the coefficients are in column B. Thus, we can express the linear regression analysis equation as:


WAGE = - 1.820 + 0.118 AGE + 0.777 EDUCATION - 2.030 GENDER + 1.741 PUB_SEC + 0.100 WORK_EX


This equation means that a worker's wage can be forecast by plugging his/her values of AGE, EDUCATION, GENDER, PUB_SEC, and WORK_EX into the equation above.

Thus, a unit increase in the age of the worker increases his/her wage by 0.118, and an additional year of schooling completed increases the wage by 0.777.

On the other hand, being female (that is, if we use female = 1, male = 0) decreases the wage by 2.030, while working in the public sector increases the wage rate by 1.741, and a unit increase in work experience increases the wage rate by 0.100.
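As a worked illustration, the fitted equation can be used to predict the wage of a hypothetical worker, say a 30-year-old woman with 12 years of schooling and 5 years of work experience who is employed in the public sector:

def predict_wage(age, education, gender, pub_sec, work_ex):
    # Coefficients taken from the fitted equation above
    return (-1.820 + 0.118 * age + 0.777 * education
            - 2.030 * gender + 1.741 * pub_sec + 0.100 * work_ex)

# gender = 1 (female), pub_sec = 1 (public sector)
print(predict_wage(age=30, education=12, gender=1, pub_sec=1, work_ex=5))  # about 11.3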

Note that this interpretation applies only to linear regression and to the specific equation used. For example, an equation that is linear in logarithms will have a different interpretation.



References

1. Vijay Gupta, Regression Explained in Simple Terms, Vijay Gupta Publication, 2000 (most of the above notes came from this publication).
2. Robert S. Pindyck and Daniel L. Rubinfeld, Econometric Models and Economic Forecasts, 2nd edition, McGraw-Hill International Editions, 1981.



#LinearRegressionAnalysis
#HowtoInterpretLinearRegressionAnalysisinSPSS

