Chapter Outline
4.1 Robust Regression Methods
4.1.1 Regression with Robust Standard Errors
4.1.2 Using Proc Genmod for Clustered Data
4.1.3 Robust Regression
4.1.4 Quantile Regression
4.2 Constrained Linear Regression
4.3 Regression with Censored or Truncated Data
4.3.1 Regression with Censored Data
4.3.2 Regression with Truncated Data
4.4 Regression with Measurement Error
4.5 Multiple Equation Regression Models
4.5.1 Seemingly Unrelated Regression
4.5.2 Multivariate Regression
4.6 Summary
In this chapter we will explore a number of SAS procedures that go beyond OLS. This chapter is a bit different from the others in that it covers a number of different concepts, some of which may be new to you. These extensions have much of the look and feel of OLS but will provide you with additional tools for working with linear models.
The topics will include robust regression methods, constrained linear regression, regression with censored and truncated data, regression with measurement error, and multiple equation models.
4.1 Robust Regression Methods
It seems to be a rare dataset that meets all of the assumptions underlying multiple regression. We know that failure to meet assumptions can lead to biased estimates of coefficients and especially biased estimates of the standard errors. This fact explains a lot of the activity in the development of robust regression methods.
The idea behind robust regression methods is to make adjustments in the estimates that take into account some of the flaws in the data itself. We are going to look at four robust methods: regression with robust standard errors, regression with clustered data, robust regression, and quantile regression.
Before we look at these approaches, let's look at a standard OLS regression using the elementary school academic performance index (elemapi2.dta) dataset. We will look at a model that predicts the 2000 API score (api00) using the average class size in K through 3 (acs_k3), average class size 4 through 6 (acs_46), the percent of fully credentialed teachers (full), and the size of the school (enroll). First let's look at the descriptive statistics for these variables. Note the missing values for acs_k3 and acs_46.
proc means data = "c:\sasreg\elemapi2" mean std max min;
  var api00 acs_k3 acs_46 full enroll;
run;
The MEANS Procedure

Variable           Mean          Std Dev          Minimum          Maximum
---------------------------------------------------------------------------
api00       647.6225000      142.2489610      369.0000000      940.0000000
acs_k3       19.1608040        1.3686933       14.0000000       25.0000000
acs_46       29.6851385        3.8407840       20.0000000       50.0000000
full         84.5500000       14.9497907       37.0000000      100.0000000
enroll      483.4650000      226.4483847      130.0000000          1570.00
---------------------------------------------------------------------------
Below we see the regression predicting api00 from acs_k3 acs_46 full and enroll. We see that all of the variables are significant except for acs_k3.
proc reg data = "c:\sasreg\elemapi2";
  model api00 = acs_k3 acs_46 full enroll;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: api00

                        Analysis of Variance
                              Sum of        Mean
Source            DF         Squares      Square    F Value    Pr > F
Model              4         3071909      767977      61.01    <.0001
Error            390         4909501       12588
Corrected Total  394         7981410

Root MSE           112.19832    R-Square    0.3849
Dependent Mean     648.65063    Adj R-Sq    0.3786
Coeff Var           17.29719

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      -5.20041     84.95492      -0.06      0.9512
acs_k3        1       6.95438      4.37110       1.59      0.1124
acs_46        1       5.96601      1.53105       3.90      0.0001
full          1       4.66822      0.41425      11.27      <.0001
enroll        1      -0.10599      0.02695      -3.93      <.0001
Since the regression procedure is interactive and we haven't issued the quit command, we can test both of the class size variables, and we find the overall test of these two variables is significant.
test acs_k3 = acs_46 = 0; run;
Test 1 Results for Dependent Variable api00

                            Mean
Source          DF        Square    F Value    Pr > F
Numerator        2        139437      11.08    <.0001
Denominator    390         12588
Here is the residual versus fitted plot for this regression. Notice that the pattern of the residuals is not exactly as we would hope. The spread of the residuals is somewhat wider toward the middle right of the graph than at the left, where the variability of the residuals is somewhat smaller, suggesting some heteroscedasticity.
plot r.*p.; run;
Here is the index plot of Cook's D for this regression. We see 4 points that are somewhat high in both their leverage and their residuals.
plot cookd.*obs.; run;
None of these results are dramatic problems, but the plot of residual vs. predicted value suggests that there might be some outliers and some possible heteroscedasticity and the index plot of Cook's D shows some points in the upper right quadrant that could be influential. We might wish to use something other than OLS regression to estimate this model. In the next several sections we will look at some robust regression methods.
4.1.1 Regression with Robust Standard Errors
The SAS proc reg includes an option called acov on the model statement that estimates the asymptotic covariance matrix of the estimates under the hypothesis of heteroscedasticity. Standard errors obtained from this asymptotic covariance matrix are considered more robust and can deal with a collection of minor concerns about failure to meet assumptions, such as minor problems with normality, heteroscedasticity, or some observations that exhibit large residuals, leverage, or influence.
With the acov option, the point estimates of the coefficients are exactly the same as in ordinary OLS, but the standard errors are calculated from the asymptotic covariance matrix. Here is the same regression as above using the acov option. We also use SAS ODS (Output Delivery System) to output the parameter estimates along with the asymptotic covariance matrix. We then calculate the robust standard errors in a data step, merge them with the parameter estimates using proc sql, and compute the corresponding t-values and probabilities. Note the changes in the standard errors and t-tests (but no change in the coefficients). In this particular example, using robust standard errors did not change any of the conclusions from the original OLS regression. We should also mention that the robust standard errors include a finite-sample correction (the factor 395/390, i.e. N/(N-k), in the data step below).
proc reg data = "c:\sasreg\elemapi2";
  model api00 = acs_k3 acs_46 full enroll / acov;
  ods output ACovEst = estcov;
  ods output ParameterEstimates = pest;
run;
quit;

data temp_dm;
  set estcov;
  drop model dependent;
  array a(5) intercept acs_k3 acs_46 full enroll;
  array b(5) std1-std5;
  b(_n_) = sqrt((395/390)*a(_n_));
  std = max(of std1-std5);
  keep variable std;
run;

proc sql;
  select pest.variable, estimate, stderr, tvalue, probt,
         std as robust_stderr,
         estimate/robust_stderr as tvalue_rb,
         (1 - probt(abs(estimate/robust_stderr), 394))*2 as probt_rb
    from pest, temp_dm
    where pest.variable = temp_dm.variable;
quit;
                                                   robust_
Variable     Estimate      StdErr   tValue    Probt    stderr    tvalue_rb    probt_rb
---------------------------------------------------------------------------------------
Intercept    -5.20041    84.95492    -0.06   0.9512  86.66308     -0.06001     0.95218
acs_k3        6.95438     4.37110     1.59   0.1124  4.620599     1.505082    0.133104
acs_46        5.96601     1.53105     3.90   0.0001  1.573214     3.792246    0.000173
full          4.66822     0.41425    11.27   <.0001  0.414681     11.25737           0
enroll       -0.10599     0.02695    -3.93   <.0001  0.028015     -3.78331    0.000179
4.1.2 Using Proc Genmod for Clustered Data
As described in Chapter 2, OLS regression assumes that the residuals are independent. The elemapi2 dataset contains data on 400 schools that come from 37 school districts. It is very possible that the scores within each school district may not be independent, and this could lead to residuals that are not independent within districts. SAS proc genmod is used to model correlated data. We can use the class statement and the repeated statement to indicate that the observations are clustered into districts (based on dnum) and that the observations may be correlated within districts, but would be independent between districts.
proc genmod data = "c:\sasreg\elemapi2";
  class dnum;
  model api00 = acs_k3 acs_46 full enroll;
  repeated subject = dnum / type=ind;
run;
quit;
The GENMOD Procedure
               Analysis Of GEE Parameter Estimates
                Empirical Standard Error Estimates

                       Standard      95% Confidence
Parameter   Estimate      Error          Limits            Z    Pr > |Z|
Intercept    -5.2004   119.5172   -239.450    229.0490  -0.04     0.9653
acs_k3        6.9544     6.7726    -6.3196     20.2284   1.03     0.3045
acs_46        5.9660     2.4839     1.0976     10.8344   2.40     0.0163
full          4.6682     0.6904     3.3151      6.0213   6.76     <.0001
enroll       -0.1060     0.0421    -0.1886     -0.0234  -2.51     0.0119
As with the regression with robust standard errors, the estimates of the coefficients are the same as the OLS estimates, but the standard errors take into account that the observations within districts are non-independent. Even though the standard errors are larger in this analysis, the three variables that were significant in the OLS analysis are significant in this analysis as well.
We notice that the standard error estimates given here differ from those Stata produces with regress and the cluster option. This is because Stata applies an additional finite-sample adjustment. We can do some SAS programming to make the same adjustment. The adjusted variance is a constant times the variance obtained from the empirical standard error estimates. This particular constant is
(N-1)/(N-k) * M/(M-1), where N is the number of observations, M is the number of clusters (districts), and k is the number of parameters in the model.
data em;
  set 'c:\sasreg\elemapi2';
run;
proc genmod data = em;
  class dnum;
  model api00 = acs_k3 acs_46 full enroll;
  repeated subject = dnum / type=ind covb;
  ods output geercov = gcov;
  ods output GEEEmpPEst = parms;
run;
quit;
proc sql;
  select count(dnum), count(distinct dnum) into :n, :m from em;
quit;
proc sql;
  select count(prm1) into :k from gcov;
quit;

data gcov_ad;
  set gcov;
  array all(*) _numeric_;
  do i = 1 to dim(all);
    all(i) = all(i)*((&n-1)/(&n-&k))*(&m/(&m-1));
    if i = _n_ then std_ad = sqrt(all(i));
  end;
  drop i;
  keep std_ad;
run;

data all;
  merge parms gcov_ad;
run;
proc print data = all noobs;
run;
Parm        Estimate      Stderr     LowerCL     UpperCL      Z     ProbZ     std_ad
Intercept    -5.2004    119.5172    -239.450    229.0490  -0.04    0.9653    121.778
acs_k3        6.9544      6.7726     -6.3196     20.2284   1.03    0.3045      6.901
acs_46        5.9660      2.4839      1.0976     10.8344   2.40    0.0163      2.531
full          4.6682      0.6904      3.3151      6.0213   6.76    <.0001      0.703
enroll       -0.1060      0.0421     -0.1886     -0.0234  -2.51    0.0119      0.043
4.1.3 Robust Regression
In SAS we cannot simply run a single procedure to perform a robust regression using iteratively reweighted least squares. In order to perform a robust regression, we have to write our own macro. To this end, ATS has written a macro called robust_hb.sas. This macro first uses Huber weights and later switches to biweights, which is why it is called robust_hb, where h and b stand for Huber and biweight, respectively. Robust regression assigns a weight to each observation, with higher weights given to better behaved observations. In fact, extremely deviant cases, those with Cook's D greater than 1, can have their weights set to missing so that they are not included in the analysis at all.
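To give a sense of how such weights behave, here is a small sketch (not the code inside robust_hb.sas) showing how Huber and biweight weights could be computed from residuals scaled by a robust measure of spread. The data set resids and its variables r and s are made-up names, and the tuning constants 1.345 and 4.685 are the conventional choices, used here purely for illustration; the macro's own cutoffs may differ.

* Sketch: Huber and biweight weight functions applied to scaled residuals.
  The data set resids with residual r and robust scale s (e.g. based on the
  MAD) is assumed for illustration only.;
data weights;
  set resids;
  u = r / s;                                    /* scaled residual */
  /* Huber weights: full weight near the center, downweight the tails */
  if abs(u) <= 1.345 then w_huber = 1;
  else w_huber = 1.345 / abs(u);
  /* Tukey biweight: smooth downweighting, zero weight for extreme cases */
  if abs(u) <= 4.685 then w_biwt = (1 - (u/4.685)**2)**2;
  else w_biwt = 0;
run;

In iteratively reweighted least squares, weights like these are passed to proc reg through a weight statement and the fit is repeated until the estimates stabilize, which is what the robust_hb macro does with its final weight variable _w2_.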
The macro robust_hb.sas uses another macro called mad.sas to generate MAD (median absolute deviation) during the iteration process. We will include both macros to perform the robust regression analysis as shown below. Note that in this analysis both the coefficients and the standard errors differ from the original OLS regression.
%include 'c:\sasreg\mad.sas';
%include 'c:\sasreg\robust_hb.sas';
%robust_hb("c:\sasreg\elemapi2", api00, acs_k3 acs_46 full enroll, .01, 0.00005, 10);
The REG Procedure
Model: MODEL1
Dependent Variable: api00
Weight: _w2_

                        Analysis of Variance
                              Sum of        Mean
Source            DF         Squares      Square    F Value    Pr > F
Model              4         3053248      763312      73.19    <.0001
Error            390         4067269       10429
Corrected Total  394         7120516

Root MSE           102.12196    R-Square    0.4288
Dependent Mean     647.38090    Adj R-Sq    0.4229
Coeff Var           15.77463

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      -6.79858     80.45406      -0.08      0.9327
acs_k3        1       6.11031      4.15015       1.47      0.1417
acs_46        1       6.25588      1.45280       4.31      <.0001
full          1       4.79575      0.39212      12.23      <.0001
enroll        1      -0.10924      0.02558      -4.27      <.0001
If you compare the robust regression results (directly above) with the OLS results previously presented, you can see that the coefficients and standard errors are quite similar, and the t values and p values are also quite similar. Despite the minor problems that we found in the data when we performed the OLS analysis, the robust regression analysis yielded quite similar results suggesting that indeed these were minor problems. Had the results been substantially different, we would have wanted to further investigate the reasons why the OLS and robust regression results were different, and among the two results the robust regression results would probably be the more trustworthy.
Let's look at the predicted (fitted) values (p), the residuals (r), and the leverage (hat) values (h). The macro robust_hb generates a final data set, _tempout_, containing the original data together with the predicted values, raw residuals, and leverage values.
Now, let's check on the various predicted values and the weighting. First, we will sort by _w2_, the weight generated at the last iteration, and look at the first 10 observations. Notice that the smallest weights are near one-half but quickly get into the .6 range. The first five values are missing due to missing values of the predictors.
proc sort data = _tempout_;
  by _w2_;
run;
proc print data = _tempout_ (obs=10);
  var snum api00 p r h _w2_;
run;
Obs    snum    api00          p           r          h       _w2_
  1     116      513          .           .          .          .
  2    3072      763          .           .          .          .
  3    3055      590          .           .          .          .
  4    4488      521          .           .          .          .
  5    4534      445          .           .          .          .
  6     637      447    733.426    -286.144   0.003657    0.55684
  7    5387      892    611.709     280.464   0.002278    0.57183
  8    2267      897    622.062     275.512   0.009887    0.58481
  9      65      903    631.930     271.731   0.010198    0.59466
 10    3759      585    842.664    -257.473   0.040009    0.63128
Now, let's look at the 10 observations with the largest weights. The weights for observations with snum 1678, 4486 and 1885 are all very close to one, since the residuals are fairly small. Note that the observations above with the lowest weights are also those with the largest residuals (residuals over 200), while the observations below with the highest weights have very small residuals (all less than about 4 in absolute value).
proc sort data = _tempout_;
  by descending _w2_;
run;
proc print data = _tempout_ (obs=10);
  var snum api00 p r h _w2_;
run;
Obs    snum    api00          p           r          h       _w2_
  1    1678      497    497.065     0.18424   0.023137    1.00000
  2    4486      706    706.221     0.20151   0.013740    1.00000
  3    1885      605    606.066    -0.41421   0.013745    1.00000
  4    3535      705    703.929     1.16143   0.004634    0.99999
  5    3024      727    729.278    -2.01559   0.010113    0.99997
  6    3700      717    719.803    -2.43802   0.007317    0.99996
  7    1596      536    533.918     2.77985   0.012143    0.99995
  8     181      724    728.068    -3.53209   0.013665    0.99992
  9    1949      696    691.898     3.96736   0.020426    0.99990
 10    5917      735    731.141     4.03336   0.005831    0.99990
After using macro robust_hb.sas, we can use the dataset _tempout_ to create some graphs for regression diagnostic purposes. For example, we can create a graph of residuals versus fitted (predicted) with a line at zero. This plot looks much like the OLS plot, except that in the OLS all of the observations would be weighted equally, but as we saw above the observations with the greatest residuals are weighted less and hence have less influence on the results.
axis1 order = (-300 to 300 by 100) label=(a=90) minor=none;
axis2 order = (300 to 900 by 300) minor=none;
symbol v=star h=0.8 c=blue;
proc gplot data = _tempout_;
  plot r*p = 1 / haxis=axis2 vaxis=axis1 vref=0;
  label r = "Residual";
  label p = "Fitted Value";
run;
quit;
4.1.4 Quantile Regression
Quantile regression in general, and median regression in particular, might be considered as an alternative to robust regression. SAS does quantile regression using a little bit of proc iml. Inside proc iml we call a subroutine named LAV, which performs a median regression in which the coefficients are estimated by minimizing the absolute deviations from the median. Of course, as an estimate of central tendency, the median is a resistant measure that is not as greatly affected by outliers as is the mean. It is not clear that median regression is a resistant estimation procedure; in fact, there is some evidence that it can be affected by high leverage values.
Here is what the quantile regression looks like using SAS proc iml. The first data step makes sure that the data set proc iml takes as input does not have any missing values. Inside proc iml we first generate the matrices needed for the regression computation and then call the LAV subroutine. After calling LAV we calculate the predicted values and residuals. Finally, we create a data set called _temp_ containing the dependent variable and all the predictors, plus the predicted values and residuals.
data elemapi2;
  set "c:\sasreg\elemapi2";
  cons = 1;
  if api00 ~= . & acs_k3 ~= . & acs_46 ~= . & full ~= . & enroll ~= .;
run;

proc iml;
  /* least absolute values */
  use elemapi2;
  read all;
  a = cons || acs_k3 || acs_46 || full || enroll;
  b = api00;
  opt = { . 3 0 1 };
  call lav(rc, xr, a, b, , opt);   /* print out the estimates */
  opt = { . 0 . 1 };
  call lav(rc, xr, a, b, , opt);   /* no printout, but creates xr */
  pred = a*t(xr);
  resid = b - pred;
  create _temp_ var { api00 cons acs_k3 acs_46 full enroll pred resid };
  append;
quit;
LAV (L1) Estimation
Start with LS Solution
Start Iter: gamma=284.75134813 ActEqn=395

Iter    N Huber    Act Eqn    Rank      Gamma         L1(x)      F(Gamma)
  31         36          5       5     0.0000    36268.1049    36240.6335

Algorithm converged
Objective Function L1(x) = 36268.104941
Number Search Directions = 68
Number Refactorizations  = 2
Total Execution Time     = 0.0200
Necessary and sufficient optimality condition satisfied.

L1 Solution with ASE
Est    17.1505029    1.2690656888    7.2240793844    5.3238408715    -0.124573396
ASE    123.531545    6.3559394265    2.2262729207     0.602359504    0.0391932684
The coefficient and standard error for acs_k3 are considerably different as compared to OLS (the coefficients are 1.2 vs 6.9 and the standard errors are 6.4 vs 4.3). The coefficients and standard errors for the other variables are also different, but not as dramatically different. Nevertheless, the quantile regression results indicate that, like the OLS results, all of the variables except acs_k3 are significant. Using the data set _temp_ we created above we obtain a plot of residuals vs. predicted values shown below.
symbol v=star h=0.8 c=blue;
axis1 order = (-300 to 300 by 100) label=(a=90) minor=none;
axis2 order = (300 to 900 by 300) minor=none;
proc gplot data = _temp_;
  plot resid*pred = 1 / vaxis=axis1 haxis=axis2 vref=0;
  label resid = "Residual";
  label pred  = "Fitted Value";
run;
quit;
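As an aside, more recent SAS releases include proc quantreg, which fits quantile (including median) regression directly without proc iml. Here is a minimal sketch, assuming that procedure is available in your version of SAS; it uses the elemapi2 work data set created above.

proc quantreg data = elemapi2;
  model api00 = acs_k3 acs_46 full enroll / quantile = 0.5;  /* median regression */
run;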
4.2 Constrained Linear Regression
Let's begin this section by looking at a regression model using the hsb2 dataset. The hsb2 file is a sample of 200 cases from the High School and Beyond Study (Rock, Hilton, Pollack, Ekstrom & Goertz, 1985). It includes the following variables: id female race ses schtyp prog read write math science socst. The variables read write math science socst are the results of standardized tests on reading, writing, math, science and social studies (respectively), and the variable female is coded 1 if female, 0 if male.
Let's start by doing an OLS regression where we predict socst score from read, write, math, science and female (gender).
proc reg data="c:sasreghsb2"; model socst = read write math science female ; run;
The REG Procedure
Model: MODEL1
Dependent Variable: socst

                        Analysis of Variance
                              Sum of          Mean
Source            DF         Squares        Square    F Value    Pr > F
Model              5           10949    2189.85150      35.44    <.0001
Error            194           11987      61.78834
Corrected Total  199           22936

Root MSE             7.86056    R-Square    0.4774
Dependent Mean      52.40500    Adj R-Sq    0.4639
Coeff Var           14.99963

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1       7.33934      3.65024       2.01      0.0457
read          1       0.37840      0.08063       4.69      <.0001
write         1       0.38587      0.08893       4.34      <.0001
math          1       0.13033      0.08938       1.46      0.1464
science       1      -0.03339      0.08187      -0.41      0.6838
female        1      -0.35326      1.24537      -0.28      0.7770
Notice that the coefficients for read and write are very similar, which makes sense since they are both measures of language ability. Also, the coefficients for math and science are similar (in that they are both not significantly different from 0). Suppose that we have a theory that suggests that read and write should have equal coefficients. We can test the equality of the coefficients using the test statement.
test read = write; run;
Test 1 Results for Dependent Variable socst

                            Mean
Source          DF        Square    F Value    Pr > F
Numerator        1       0.19057       0.00    0.9558
Denominator    194      61.78834
The test result indicates that there is no significant difference in the coefficients for the reading and writing scores. Since it appears that the coefficients for math and science are also equal, let's test the equality of those as well.
test math = science; run;
Test 2 Results for Dependent Variable socst

                            Mean
Source          DF        Square    F Value    Pr > F
Numerator        1      89.63950       1.45    0.2299
Denominator    194      61.78834
Let's now perform both of these tests together, simultaneously testing that the coefficient for read equals write and that math equals science, using the mtest statement.
mtest math - science, read - write; run;
Multivariate Test 1

        Multivariate Statistics and Exact F Statistics
                      S=1    M=0    N=96

Statistic                      Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda             0.99257173       0.73         2       194    0.4852
Pillai's Trace            0.00742827       0.73         2       194    0.4852
Hotelling-Lawley Trace    0.00748386       0.73         2       194    0.4852
Roy's Greatest Root       0.00748386       0.73         2       194    0.4852
Note this second test has 2 df, since it is testing both of the hypotheses listed, and this test is not significant, suggesting these pairs of coefficients are not significantly different from each other. We can estimate regression models where we constrain coefficients to be equal to each other. For example, let's begin on a limited scale and constrain read to equal write. Proc reg uses the restrict statement to accomplish this.
proc reg data="c:sasreghsb2"; model socst = read write math science female ; restrict read=write; run;
The REG Procedure
Model: MODEL1
Dependent Variable: socst
NOTE: Restrictions have been applied to parameter estimates.

                        Analysis of Variance
                              Sum of          Mean
Source            DF         Squares        Square    F Value    Pr > F
Model              4           10949    2737.26674      44.53    <.0001
Error            195           11987      61.47245
Corrected Total  199           22936

Root MSE             7.84044    R-Square    0.4774
Dependent Mean      52.40500    Adj R-Sq    0.4667
Coeff Var           14.96124

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1       7.35415      3.63118       2.03      0.0442
read          1       0.38185      0.05139       7.43      <.0001
write         1       0.38185      0.05139       7.43      <.0001
math          1       0.13030      0.08915       1.46      0.1454
science       1      -0.03328      0.08164      -0.41      0.6840
female        1      -0.32962      1.16736      -0.28      0.7780
RESTRICT     -1     -25.51302    458.21611      -0.06      0.9558*

* Probability computed using beta distribution.
Notice that the coefficients for read and write are identical, along with their standard errors, t-tests, etc. Also note that the model degrees of freedom for the F test is four, not five, as in the OLS model. This is because only one coefficient is estimated for read and write; it is as if we had regressed socst on a single variable equal to the sum of read and write, as the sketch below illustrates. Notice also that the Root MSE is slightly higher for the constrained model, but only slightly higher. This is because we have forced the model to estimate coefficients for read and write that are not as good at minimizing the Sum of Squares Error (the coefficients that would minimize the SSE would be the coefficients from the unconstrained model).
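One way to see this equivalence is to create a variable equal to read plus write and use it in place of the two separate scores; its coefficient should reproduce the common coefficient (.38185) from the restricted model. The combined variable name rw_sum below is made up for illustration.

* Sketch: the restriction read = write is equivalent to regressing socst on
  the sum of read and write (rw_sum is an illustrative name).;
data hsb2_sum;
  set "c:\sasreg\hsb2";
  rw_sum = read + write;
run;

proc reg data = hsb2_sum;
  model socst = rw_sum math science female;
run;
quit;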
Next, we will define a second constraint, setting math equal to science together with the first constraint we set before.
proc reg data="c:sasreghsb2"; model socst = read write math science female ; restrict read = write, math = science; run;
The REG Procedure
Model: MODEL1
Dependent Variable: socst
NOTE: Restrictions have been applied to parameter estimates.

                        Analysis of Variance
                              Sum of          Mean
Source            DF         Squares        Square    F Value    Pr > F
Model              3           10860    3619.84965      58.75    <.0001
Error            196           12077      61.61554
Corrected Total  199           22936

Root MSE             7.84956    R-Square    0.4735
Dependent Mean      52.40500    Adj R-Sq    0.4654
Coeff Var           14.97864

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1       7.50566      3.63323       2.07      0.0402
read          1       0.38604      0.05133       7.52      <.0001
write         1       0.38604      0.05133       7.52      <.0001
math          1       0.04281      0.05192       0.82      0.4107
science       1       0.04281      0.05192       0.82      0.4107
female        1      -0.20087      1.16383      -0.17      0.8631
RESTRICT     -1     -15.36285    458.82638      -0.03      0.9734*
RESTRICT     -1     547.24384    454.01563       1.21      0.2290*

* Probability computed using beta distribution.
Now the coefficients for read = write and math = science, and the model degrees of freedom have dropped to three. Again, the Root MSE is slightly larger than in the prior model, but we should emphasize only very slightly larger. If indeed the population coefficients for read = write and math = science, then these combined (constrained) estimates may be more stable and generalize better to other samples. So although these estimates may lead to a slightly higher standard error of prediction in this sample, they may generalize better to the population from which they came.
4.3 Regression with Censored or Truncated Data
Analyzing data that contain censored values or are truncated is common in many research disciplines. According to Hosmer and Lemeshow (1999), a censored value is one whose value is incomplete due to random factors for each subject. A truncated observation, on the other hand, is one which is incomplete due to a selection process in the design of the study.
We will begin by looking at analyzing data with censored values.
4.3.1 Regression with Censored Data
In this example we have a variable called acadindx which is a weighted combination of standardized test scores and academic grades. The maximum possible score on acadindx is 200 but it is clear that the 16 students who scored 200 are not exactly equal in their academic abilities. In other words, there is variability in academic ability that is not being accounted for when students score 200 on acadindx. The variable acadindx is said to be censored, in particular, it is right censored.
Let's look at the example. We will begin by looking at a description of the data, some descriptive statistics, and correlations among the variables.
proc means data = "c:\sasreg\acadindx";
run;
The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
id         200     100.5000000      57.8791845       1.0000000     200.0000000
female     200       0.5450000       0.4992205               0       1.0000000
reading    200      52.2300000      10.2529368      28.0000000      76.0000000
writing    200      52.7750000       9.4785860      31.0000000      67.0000000
acadindx   200     172.1850000      16.8173987     138.0000000     200.0000000
-------------------------------------------------------------------------------
proc means data = "c:\sasreg\acadindx" N;
  where acadindx = 200;
run;
The MEANS Procedure

Variable     N
--------------
id          16
female      16
reading     16
writing     16
acadindx    16
--------------
proc corr data = "c:\sasreg\acadindx" nosimple noprob;
  var acadindx female reading writing;
run;
The CORR Procedure

4 Variables:  acadindx female reading writing

       Pearson Correlation Coefficients, N = 200

            acadindx     female    reading    writing
acadindx     1.00000   -0.08210    0.71309    0.66256
female      -0.08210    1.00000   -0.05308    0.25649
reading      0.71309   -0.05308    1.00000    0.59678
writing      0.66256    0.25649    0.59678    1.00000
Now, let's run a standard OLS regression on the data and generate predicted scores in p1.
proc reg data = "c:\sasreg\acadindx";
  model acadindx = female reading writing;
  output out = reg1 p = p1;
run;
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: acadindx

                        Analysis of Variance
                              Sum of          Mean
Source            DF         Squares        Square    F Value    Pr > F
Model              3           34994         11665     107.40    <.0001
Error            196           21288     108.61160
Corrected Total  199           56282

Root MSE            10.42169    R-Square    0.6218
Dependent Mean     172.18500    Adj R-Sq    0.6160
Coeff Var            6.05261

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      96.11841      4.48956      21.41      <.0001
female        1      -5.83250      1.58821      -3.67      0.0003
reading       1       0.71842      0.09315       7.71      <.0001
writing       1       0.79057      0.10410       7.59      <.0001
Proc lifereg is one of the procedures in SAS that can be used for regression with censored data. Its syntax is similar to proc reg, with the addition of a variable indicating whether an observation is censored. Therefore, we first create a data set containing the censoring information.
data tobit_model;
  set "c:\sasreg\acadindx";
  censor = ( acadindx >= 200 );
run;

proc lifereg data = tobit_model;
  model acadindx*censor(1) = female reading writing / d=normal;
  output out = reg2 p = p2;
run;
The LIFEREG Procedure

               Model Information
Data Set                     WORK.TOBIT_MODEL
Dependent Variable           acadindx
Censoring Variable           censor
Censoring Value(s)           1
Number of Observations       200
Noncensored Values           184
Right Censored Values        16
Left Censored Values         0
Interval Censored Values     0
Name of Distribution         Normal
Log Likelihood               -718.0636168

Algorithm converged.

        Type III Analysis of Effects
                          Wald
Effect      DF      Chi-Square    Pr > ChiSq
female       1         14.0654        0.0002
reading      1         60.8529        <.0001
writing      1         54.1655        <.0001

                       Analysis of Parameter Estimates
                           Standard       95% Confidence       Chi-
Parameter  DF   Estimate      Error           Limits          Square    Pr > ChiSq
Intercept   1    92.7378     4.8034     83.3233    102.1524   372.74        <.0001
female      1    -6.3473     1.6924     -9.6644     -3.0302    14.07        0.0002
reading     1     0.7777     0.0997      0.5823      0.9731    60.85        <.0001
writing     1     0.8111     0.1102      0.5951      1.0271    54.17        <.0001
Scale       1    10.9897     0.5817      9.9067     12.1912
Let's merge the two data sets we created to compare the predicted scores p1 and p2.
data compare;
  merge reg1 reg2;
  by id;
run;
proc means data = compare;
  var acadindx p1 p2;
run;
The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
acadindx   200     172.1850000      16.8173987     138.0000000     200.0000000
p1         200     172.1850000      13.2608696     142.3820725     201.5311070
p2         200     172.7040275      14.0029221     141.2210940     203.8540576
-------------------------------------------------------------------------------
It shows that the censored regression model predicted values have a larger standard deviation and a greater range of values. When we look at a listing of p1 and p2 for all students who scored the maximum of 200 on acadindx, we see that in every case the censored regression model predicted value is greater than the OLS predicted value. These predictions represent an estimate of what the variability would be if the values of acadindx could exceed 200.
proc print data = compare; var acadindx p1 p2; where acadindx = 200; run;
Obs    acadindx         p1         p2
 32         200    179.175    179.620
 57         200    192.681    194.329
 68         200    201.531    203.854
 80         200    191.831    193.577
 82         200    188.154    189.563
 88         200    186.573    187.940
 95         200    195.997    198.176
100         200    186.933    188.108
132         200    197.578    199.798
136         200    189.459    191.144
143         200    191.185    192.833
157         200    191.614    193.477
161         200    180.251    181.008
169         200    182.275    183.367
174         200    191.614    193.477
200         200    187.662    189.421
Proc lifereg handles right-censored, left-censored, and interval-censored data. Other procedures in SAS, such as proc qlim, can also perform censored regression analysis.
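For comparison, a tobit-style model for these right-censored data could also be fit with proc qlim. A minimal sketch follows, assuming proc qlim is available in your SAS version; the upper bound of 200 is the censoring point described above.

proc qlim data = "c:\sasreg\acadindx";
  model acadindx = female reading writing;
  endogenous acadindx ~ censored(ub=200);   /* treat scores of 200 as right censored */
run;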
4.3.2 Regression with Truncated Data
Truncation occurs when some observations are not included in the analysis because of the value of a variable. We will illustrate analysis with truncation using the dataset acadindx that was used in the previous section.
Let's imagine that in order to get into a special honors program, students need to score at least 160 on acadindx. So we will drop all observations in which the value of acadindx is less than or equal to 160. Now, let's estimate the same model that we used in the section on censored data, only this time we will pretend that a 200 for acadindx is not censored.
proc reg data = "c:\sasreg\acadindx";
  model acadindx = female reading writing;
  where acadindx > 160;
run;
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: acadindx

                        Analysis of Variance
                              Sum of          Mean
Source            DF         Squares        Square    F Value    Pr > F
Model              3      8074.79638    2691.59879      33.01    <.0001
Error            140           11416      81.54545
Corrected Total  143           19491

Root MSE             9.03025    R-Square    0.4143
Dependent Mean     180.42361    Adj R-Sq    0.4017
Coeff Var            5.00503

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1     125.63547      5.89156      21.32      <.0001
female        1      -5.23849      1.61563      -3.24      0.0015
reading       1       0.44111      0.09635       4.58      <.0001
writing       1       0.58733      0.11508       5.10      <.0001
It is clear that the estimates of the coefficients are distorted due to the fact that 56 observations are no longer in the dataset. This amounts to restriction of range on both the response variable and the predictor variables. For example, the coefficient for writing dropped from .79 to .58. What this means is that if our goal is to find the relation between acadindx and the predictor variables in the population, then the truncation of acadindx in our sample is going to lead to biased estimates. A better approach to analyzing these data is to use truncated regression. In SAS this can be accomplished using proc qlim, an experimental procedure first available in SAS version 8.1. Proc qlim (Qualitative and LImited dependent variable model) analyzes univariate (and multivariate) limited dependent variable models in which the dependent variable takes discrete values or is observed only over a limited range of values.
data trunc_model;
  set "c:\sasreg\acadindx";
  y = .;
  if acadindx > 160 & acadindx ~= . then y = acadindx;
run;

proc qlim data = trunc_model;
  model y = female reading writing;
  endogenous y ~ truncated (lb=160);
run;
The QLIM Procedure

             Summary Statistics of Continuous Responses

                                                             N Obs    N Obs
                      Standard                Lower   Upper  Lower    Upper
Variable       Mean      Error    Type        Bound   Bound  Bound    Bound
y          180.4236  11.674837    Truncated     160

                 Model Fit Summary
Number of Endogenous Variables             1
Endogenous Variable                        y
Number of Observations                   144
Missing Values                            56
Log Likelihood                    -510.00768
Maximum Absolute Gradient         3.87471E-6
Number of Iterations                      14
AIC                                     1030
Schwarz Criterion                       1045

Algorithm converged.

                   Parameter Estimates
                              Standard                Approx
Parameter      Estimate          Error    t Value    Pr > |t|
Intercept    110.289206       8.673847      12.72      <.0001
female        -6.099602       1.925245      -3.17      0.0015
reading        0.518179       0.116829       4.44      <.0001
writing        0.766164       0.152620       5.02      <.0001
_Sigma         9.803571       0.721646      13.59      <.0001
The coefficients from proc qlim are closer to the OLS results based on the complete data; for example, the coefficient for writing is .77, which is close to the original OLS result of .79. However, the results are still somewhat different for the other variables. For example, the coefficient for reading is .52 in proc qlim as compared to .72 in the original OLS with the unrestricted data, though it is better than the OLS estimate of .44 with the restricted data. While proc qlim may improve the estimates on a restricted data file as compared to OLS, it is certainly no substitute for analyzing the complete unrestricted data file.
4.4 Regression with Measurement Error
As you will most likely recall, one of the assumptions of regression is that the predictor variables are measured without error. The problem is that measurement error in predictor variables leads to underestimation (attenuation) of the regression coefficients.
This section is under development.
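In the meantime, a small simulation can illustrate the attenuation that measurement error produces. Everything in the sketch below (the data set name, the true slope of .5, and the error variances) is made up for illustration; the point is simply that the slope estimated from the error-laden predictor is biased toward zero.

* Sketch: simulate a predictor measured with error and compare the slopes.;
data measerr;
  call streaminit(123);
  do i = 1 to 1000;
    x_true = rand('normal', 50, 10);            /* error-free predictor        */
    x_obs  = x_true + rand('normal', 0, 10);    /* same predictor plus noise   */
    y      = 10 + 0.5*x_true + rand('normal', 0, 5);
    output;
  end;
run;

proc reg data = measerr;
  true:     model y = x_true;   /* slope should be near .5                      */
  observed: model y = x_obs;    /* slope attenuated, here toward roughly .25    */
run;
quit;

With these made-up variances the reliability of x_obs is about .5, so the expected attenuated slope is about half the true slope.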
4.5 Multiple Equation Regression Models
If a dataset has enough variables, we may want to estimate more than one regression model. For example, we may want to predict y1 from x1 and also predict y2 from x2. Even though there are no variables in common, these two models are not independent of one another because the data come from the same subjects. This is an example of one type of multiple equation regression known as seemingly unrelated regression. We can estimate the coefficients and obtain standard errors taking into account the correlated errors in the two models. An important feature of multiple equation models is that we can test predictors across equations.
Another example of multiple equation regression is if we wished to predict y1, y2 and y3 from x1 and x2. This is a three equation system, known as multivariate regression, with the same predictor variables for each model. Again, we have the capability of testing coefficients across the different equations.
Multiple equation models are a powerful extension to our data analysis tool kit.
4.5.1 Seemingly Unrelated Regression
Let's continue using the hsb2 data file to illustrate the use of seemingly unrelated regression. This time let's look at two regression models.
science = math female write = read female
It is the case that the errors (residuals) from these two models would be correlated. This would be true even if the predictor female were not found in both models. The errors would be correlated because all of the values of the variables are collected on the same set of observations. This is a situation tailor-made for seemingly unrelated regression using proc syslin with the sur option. With proc syslin we can estimate both models simultaneously while accounting for the correlated errors, leading to efficient estimates of the coefficients and standard errors. The syntax is as follows.
proc syslin data="c:sasreghsb2" sur ; science: model science = math female ; write: model write = read female ; run;
The first part of the output consists of the OLS estimate for each model. Here is the OLS estimate for the first model.
The SYSLIN Procedure
Ordinary Least Squares Estimation

Model SCIENCE
Dependent Variable science

                        Analysis of Variance
                              Sum of        Mean
Source            DF         Squares      Square    F Value    Pr > F
Model              2        7993.550    3996.775      68.38    <.0001
Error            197        11513.95    58.44645
Corrected Total  199        19507.50

Root MSE             7.64503    R-Square    0.40977
Dependent Mean      51.85000    Adj R-Sq    0.40378
Coeff Var           14.74451

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      18.11813     3.167133       5.72      <.0001
math          1      0.663190     0.057872      11.46      <.0001
female        1      -2.16840     1.086043      -2.00      0.0472
And here is the OLS estimate for the second model.
The SYSLIN Procedure
Ordinary Least Squares Estimation

Model WRITE
Dependent Variable write

                        Analysis of Variance
                              Sum of        Mean
Source            DF         Squares      Square    F Value    Pr > F
Model              2        7856.321    3928.161      77.21    <.0001
Error            197        10022.55    50.87591
Corrected Total  199        17878.88

Root MSE             7.13273    R-Square    0.43942
Dependent Mean      52.77500    Adj R-Sq    0.43373
Coeff Var           13.51537

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      20.22837     2.713756       7.45      <.0001
read          1      0.565887     0.049385      11.46      <.0001
female        1      5.486894     1.014261       5.41      <.0001
Proc syslin with option sur also gives an estimate of the correlation between the errors of the two models. Here is the corresponding output.
The SYSLIN Procedure
Seemingly Unrelated Regression Estimation

        Cross Model Covariance
              SCIENCE       WRITE
SCIENCE       58.4464      7.8908
WRITE          7.8908     50.8759

        Cross Model Correlation
              SCIENCE       WRITE
SCIENCE       1.00000     0.14471
WRITE         0.14471     1.00000

        Cross Model Inverse Correlation
              SCIENCE       WRITE
SCIENCE       1.02139    -0.14780
WRITE        -0.14780     1.02139

        Cross Model Inverse Covariance
              SCIENCE       WRITE
SCIENCE      0.017476    -.002710
WRITE        -.002710    0.020076

System Weighted MSE           0.9981
Degrees of freedom               394
System Weighted R-Square      0.3875

Finally, we have the seemingly unrelated regression estimation for our models. Note that both the estimates of the coefficients and their standard errors are different from the OLS model estimates shown above.
The SYSLIN Procedure
Seemingly Unrelated Regression Estimation

Model SCIENCE
Dependent Variable science

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      20.13265     3.149485       6.39      <.0001
math          1      0.625141     0.057528      10.87      <.0001
female        1      -2.18934     1.086038      -2.02      0.0452

Model WRITE
Dependent Variable write

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      21.83439     2.698827       8.09      <.0001
read          1      0.535484     0.049091      10.91      <.0001
female        1      5.453748     1.014244       5.38      <.0001
Now that we have estimated our models let's test the predictor variables. The test for female combines information from both models. The tests for math and read are actually equivalent to the t-tests above except that the results are displayed as F-tests.
proc syslin data = "c:\sasreg\hsb2" sur;
  science: model science = math female;
  write:   model write   = read female;
  female:  stest science.female = write.female = 0;
  math:    stest science.math = 0;
  read:    stest write.read = 0;
run;
Test Results for Variable FEMALE
  Num DF    Den DF    F Value    Pr > F
       2       394      18.48    0.0001

Test Results for Variable MATH
  Num DF    Den DF    F Value    Pr > F
       1       394     118.31    0.0001

Test Results for Variable READ
  Num DF    Den DF    F Value    Pr > F
       1       394     119.21    0.0001
Now, let's estimate 3 models where we use the same predictors in each model as shown below.
read = female prog1 prog3 write = female prog1 prog3 math = female prog1 prog3
Here the variables prog1 and prog3 are dummy variables for the variable prog. Let's generate these variables before estimating our three models using proc syslin.
data hsb2;
  set "c:\sasreg\hsb2";
  prog1 = (prog = 1);
  prog3 = (prog = 3);
run;

proc syslin data = hsb2 sur;
  model1: model read  = female prog1 prog3;
  model2: model write = female prog1 prog3;
  model3: model math  = female prog1 prog3;
run;
The OLS regression estimates of our three models are as follows.
<some output omitted>
The SYSLIN Procedure
Ordinary Least Squares Estimation

Model MODEL1
Dependent Variable read

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      56.82950     1.170562      48.55      <.0001
female        1      -1.20858     1.327672      -0.91      0.3638
prog1         1      -6.42937     1.665893      -3.86      0.0002
prog3         1      -9.97687     1.606428      -6.21      <.0001

Model MODEL2
Dependent Variable write

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      53.62162     1.042019      51.46      <.0001
female        1      4.771211     1.181876       4.04      <.0001
prog1         1      -4.83293     1.482956      -3.26      0.0013
prog3         1      -9.43807     1.430021      -6.60      <.0001

Model MODEL3
Dependent Variable math

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      57.10551     1.036890      55.07      <.0001
female        1      -0.67377     1.176059      -0.57      0.5674
prog1         1      -6.72394     1.475657      -4.56      <.0001
prog3         1      -10.3217     1.422983      -7.25      <.0001
These regressions provide fine estimates of the coefficients and standard errors but these results assume the residuals of each analysis are completely independent of the other. Also, if we wish to test female, we would have to do it three times and would not be able to combine the information from all three tests into a single overall test.
Now let's see the output of the estimate using seemingly unrelated regression.
The SYSLIN Procedure
Seemingly Unrelated Regression Estimation

Model MODEL1
Dependent Variable read

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      56.82950     1.170562      48.55      <.0001
female        1      -1.20858     1.327672      -0.91      0.3638
prog1         1      -6.42937     1.665893      -3.86      0.0002
prog3         1      -9.97687     1.606428      -6.21      <.0001

Model MODEL2
Dependent Variable write

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      53.62162     1.042019      51.46      <.0001
female        1      4.771211     1.181876       4.04      <.0001
prog1         1      -4.83293     1.482956      -3.26      0.0013
prog3         1      -9.43807     1.430021      -6.60      <.0001

Model MODEL3
Dependent Variable math

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      57.10551     1.036890      55.07      <.0001
female        1      -0.67377     1.176059      -0.57      0.5674
prog1         1      -6.72394     1.475657      -4.56      <.0001
prog3         1      -10.3217     1.422983      -7.25      <.0001
Note that the coefficients are identical in the OLS results and in the seemingly unrelated regression estimates; because every equation has the same predictors, the standard errors can also differ from OLS only slightly, despite the correlation among the residuals in the multiple equations (here they agree to the precision shown).
In addition to getting more appropriate standard errors, we can test the effects of the predictors across the equations. We can test the hypothesis that the coefficient for female is 0 for all three outcome variables, as shown below.
proc syslin data = hsb2 sur;
  model1: model read  = female prog1 prog3;
  model2: model write = female prog1 prog3;
  model3: model math  = female prog1 prog3;
  female: stest model1.female = model2.female = model3.female = 0;
run;
Test Results for Variable FEMALE
  Num DF    Den DF    F Value    Pr > F
       3       588      11.63    0.0001
We can also test the hypothesis that the coefficient for female is 0 for just read and math.
proc syslin data = hsb2 sur;
  model1: model read  = female prog1 prog3;
  model2: model write = female prog1 prog3;
  model3: model math  = female prog1 prog3;
  f1: stest model1.female = model3.female = 0;
run;
Test Results for Variable F1
  Num DF    Den DF    F Value    Pr > F
       2       588       0.42    0.6599
We can also test the hypothesis that the coefficients for prog1 and prog3 are 0 for all three outcome variables, as shown below.
proc syslin data = hsb2 sur;
  model1: model read  = female prog1 prog3;
  model2: model write = female prog1 prog3;
  model3: model math  = female prog1 prog3;
  progs: stest model1.prog1 = model2.prog1 = model3.prog1 = 0,
               model1.prog3 = model2.prog3 = model3.prog3 = 0;
run;
Test Results for Variable PROGS
  Num DF    Den DF    F Value    Pr > F
       6       588      11.83    0.0001
4.5.2 Multivariate Regression
Let's now use multivariate regression via proc reg to look at the same analysis that we saw in the proc syslin example above, estimating the following 3 models.
read = female prog1 prog3 write = female prog1 prog3 math = female prog1 prog3
Below we use proc reg to predict read, write and math from female, prog1 and prog3. The top part of the output (omitted here) gives an overall summary of the fit for each outcome variable, much as proc syslin does. The lower part of the output looks similar to the proc syslin sur output; however, when you compare the standard errors you see that the results are not exactly the same. These are the OLS standard errors, so the results below do not take into account the correlations among the residuals (as the proc syslin sur results do).
proc reg data =hsb2; model read write math = female prog1 prog3 ; run;
The REG Procedure
[Some output omitted]

Dependent Variable: read

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      56.82950      1.17056      48.55      <.0001
female        1      -1.20858      1.32767      -0.91      0.3638
prog1         1      -6.42937      1.66589      -3.86      0.0002
prog3         1      -9.97687      1.60643      -6.21      <.0001

Dependent Variable: write

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      53.62162      1.04202      51.46      <.0001
female        1       4.77121      1.18188       4.04      <.0001
prog1         1      -4.83293      1.48296      -3.26      0.0013
prog3         1      -9.43807      1.43002      -6.60      <.0001

Dependent Variable: math

                      Parameter Estimates
                    Parameter     Standard
Variable     DF      Estimate        Error    t Value    Pr > |t|
Intercept     1      57.10551      1.03689      55.07      <.0001
female        1      -0.67377      1.17606      -0.57      0.5674
prog1         1      -6.72394      1.47566      -4.56      <.0001
prog3         1     -10.32168      1.42298      -7.25      <.0001
Now, let's test female. Note that female was statistically significant in only one of the three equations. Using the mtest statement after proc reg allows us to test female across all three equations simultaneously. And, guess what? It is significant. This is consistent with what we found using seemingly unrelated regression estimation.
female: mtest female=0; run;
Multivariate Test: female

        Multivariate Statistics and Exact F Statistics
                     S=1    M=0.5    N=96

Statistic                      Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda             0.84892448      11.51         3       194    <.0001
Pillai's Trace            0.15107552      11.51         3       194    <.0001
Hotelling-Lawley Trace    0.17796108      11.51         3       194    <.0001
Roy's Greatest Root       0.17796108      11.51         3       194    <.0001
We can also test prog1 and prog3, both separately and combined. Remember these are multivariate tests.
prog1: mtest prog1 = 0; run;
Multivariate Test: prog1

        Multivariate Statistics and Exact F Statistics
                     S=1    M=0.5    N=96

Statistic                      Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda             0.89429287       7.64         3       194    <.0001
Pillai's Trace            0.10570713       7.64         3       194    <.0001
Hotelling-Lawley Trace    0.11820192       7.64         3       194    <.0001
Roy's Greatest Root       0.11820192       7.64         3       194    <.0001
prog3: mtest prog3 = 0; run;
Multivariate Test: prog3

        Multivariate Statistics and Exact F Statistics
                     S=1    M=0.5    N=96

Statistic                      Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda             0.75267026      21.25         3       194    <.0001
Pillai's Trace            0.24732974      21.25         3       194    <.0001
Hotelling-Lawley Trace    0.32860304      21.25         3       194    <.0001
Roy's Greatest Root       0.32860304      21.25         3       194    <.0001
prog: mtest prog1 = prog3 =0; run; quit;
Multivariate Test: prog

        Multivariate Statistics and F Approximations
                      S=2    M=0    N=96

Statistic                      Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda             0.73294667      10.87         6       388    <.0001
Pillai's Trace            0.26859190      10.08         6       390    <.0001
Hotelling-Lawley Trace    0.36225660      11.68         6     256.9    <.0001
Roy's Greatest Root       0.35636617      23.16         3       195    <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
Proc syslin with the sur option and proc reg both allow you to test multi-equation models while taking into account the fact that the equations are not independent. Proc syslin with the sur option allows you to get estimates for each equation that adjust for the non-independence of the equations, and it allows you to estimate equations that don't necessarily have the same predictors. By contrast, proc reg is restricted to equations that have the same set of predictors, and the estimates it provides for the individual equations are the same as the OLS estimates. However, proc reg allows you to perform more traditional multivariate tests of predictors.
4.6 Summary
This chapter has covered a variety of topics that go beyond ordinary least squares regression, but there still remain a variety of topics we wish we could have covered, including the analysis of survey data, dealing with missing data, panel data analysis, and more. And, for the topics we did cover, we wish we could have gone into even more detail. One of our main goals for this chapter was to help you be aware of some of the techniques that are available in SAS for analyzing data that do not fit the assumptions of OLS regression and some of the remedies that are possible. If you are a member of the UCLA research community, and you have further questions, we invite you to use our consulting services to discuss issues specific to your data analysis.