This document illustrates estimation with clustered standard errors in both Stata and R. The pdf version shows both Stata and R output, while the html version shows only R output. The purpose is to illustrate R counterparts to the procedures documented in Petersen (2009) and the companion website, which uses Stata. I do not explain the econometric issues or make any claims about the superiority of an estimator.1 This is purely an exercise in mimicking results across statistical packages, something that seems to be a Frequently Asked Question.2
The commands in Stata and R differ, as the languages differ. In Stata, statistical procedures are commands, with vce(robust) and vce(cluster) as options. In R, coefficient estimation is a function, and the output from this estimation can then be used as the input to another function that computes the covariance matrix. These two components (estimates and covariance matrix) can then be fed to a function that computes standard errors and p-values. To an R user, Stata may seem inflexible and hard-coded, while to a Stata user, R may seem needlessly complicated.
In what follows, standard Stata calculations are used for single clustering. The function for double clustering, cluster2, was written by Mitchell Petersen and obtained from his web site. This was also the source for the other double-clustering functions: logit2.ado, probit2.ado, and tobit2.ado.
In all cases, standard and widely adopted R packages are used to compute both single and double clusters. For examples of clustering with R, see the documentation for the sandwich package, especially Berger, Graham, and Zeileis (2017). The approach to computing clustered standard errors is identical in all cases we consider. We obtain coefficient estimates (e.g., using lm) and then use vcovCL from the sandwich package to compute the standard errors. By default, vcovCL computes robust standard errors, as does the robust option in Stata. Optionally, vcovCL can cluster along one or more dimensions. The coefficient and covariance estimates are then fed to lmtest::coeftest, which returns the estimates, standard errors, and p-values. This procedure also accommodates bootstrapping, using the vcovBS function, but I haven't yet experimented with that capability.
Here are examples producing identical estimates and roughly the same output in Stata and R.
Stata:
use "data/petersen.dta"
reg y x, robust
reg y x, vce(cluster firm)
R:
library(sandwich)
library(lmtest)
data(PetersenCL)
reg = lm(y ~ x, data=PetersenCL) ## estimate the regression
print(coeftest(reg, vcovCL(reg, type='HC1')), digits=6)
print(coeftest(reg, vcovCL(reg, type='HC1', cluster=~firm)), digits=6)
The R function lm generates OLS estimates. The regression object, reg, is then an argument when computing standard errors using vcovCL.3
This document was created by knitting the Rmarkdown document stata_and_R_clustering.Rmd. (If you have Stata installed, RStudio can execute Stata code and include the output in a document.) Tables in Stata were produced using outreg2, and those in R using stargazer (Hlavac 2018).
Here are baseline calculations without clustering and calculating robust standard errors.
Results are in Table .
use "data/petersen.dta"
regress y x
outreg2 using tmp, replace ctitle(OLS) auto(5)
regress y x, robust
outreg2 using tmp, ctitle(Robust OLS) auto(5)
. use "data/petersen.dta"
. regress y x
Source | SS df MS Number of obs = 5,000
-------------+---------------------------------- F(1, 4998) = 1310.74
Model | 5270.664 1 5270.664 Prob > F = 0.0000
Residual | 20097.6389 4,998 4.02113624 R-squared = 0.2078
-------------+---------------------------------- Adj R-squared = 0.2076
Total | 25368.3029 4,999 5.07467552 Root MSE = 2.0053
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.034833 .0285833 36.20 0.000 .9787977 1.090869
_cons | .0296797 .0283593 1.05 0.295 -.025917 .0852764
------------------------------------------------------------------------------
. outreg2 using tmp, replace ctitle(OLS) auto(5)
dir : seeout
. regress y x, robust
Linear regression Number of obs = 5,000
F(1, 4998) = 1328.17
Prob > F = 0.0000
R-squared = 0.2078
Root MSE = 2.0053
------------------------------------------------------------------------------
| Robust
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.034833 .0283952 36.44 0.000 .9791665 1.0905
_cons | .0296797 .0283607 1.05 0.295 -.0259196 .0852791
------------------------------------------------------------------------------
. outreg2 using tmp, ctitle(Robust OLS) auto(5)
dir : seeout
The variable reg contains the full OLS regression output, which is used in subsequent calculations. We save the output from the coeftest function. Results are in Table .
library(sandwich)
library(lmtest)
data(PetersenCL)
reg <- lm(y ~ x, data = PetersenCL) ## `reg` is used throughout
regols = coeftest(reg) ## OLS
print(regols, digits=6)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0296797 0.0283593 1.04656 0.29535
x 1.0348334 0.0285833 36.20414 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
regolsr = coeftest(reg, vcovCL) ## Robust std errors
print(regolsr, digits=6)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0296797 0.0283607 1.04651 0.29538
x 1.0348334 0.0283952 36.44401 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Stata results in this section are in Table . R results are in Table .
use "data/petersen.dta"
regress y x, vce(cluster year)
outreg2 using tmp, ctitle(Cluster: year) auto(5)
. use "data/petersen.dta"
. regress y x, vce(cluster year)
Linear regression Number of obs = 5,000
F(1, 9) = 960.59
Prob > F = 0.0000
R-squared = 0.2078
Root MSE = 2.0053
(Std. Err. adjusted for 10 clusters in year)
------------------------------------------------------------------------------
| Robust
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.034833 .0333889 30.99 0.000 .9593025 1.110364
_cons | .0296797 .0233867 1.27 0.236 -.0232247 .0825842
------------------------------------------------------------------------------
. outreg2 using tmp, ctitle(Cluster: year) auto(5)
dir : seeout
The vcovCL function takes as arguments the estimated model, the cluster variable, and the type of clustering calculation. We don't need separate statements for the covariance matrix and the coefficient test, but writing it this way perhaps illustrates what's going on.
v_year = vcovCL(reg, type='HC1', cluster = ~year)
reg_year = coeftest(reg, v_year)
print(reg_year, digits=6)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0296797 0.0233867 1.26908 0.20447
x 1.0348334 0.0333889 30.99332 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both of the following syntaxes would also work:
reg_year1 = coeftest(reg, vcovCL(reg, cluster= ~year, type='HC1'))
reg_year2 = coeftest(reg, vcovCL, cluster= ~year, type='HC1')
all.equal(reg_year1, reg_year2)
[1] TRUE
Stata results in this section are in Table . R results are in Table .
Results are in Table .
use "data/petersen.dta"
regress y x, vce(cluster firm)
outreg2 using tmp, ctitle(Cluster: firm) auto(5)
. use "data/petersen.dta"
. regress y x, vce(cluster firm)
Linear regression Number of obs = 5,000
F(1, 499) = 418.32
Prob > F = 0.0000
R-squared = 0.2078
Root MSE = 2.0053
(Std. Err. adjusted for 500 clusters in firmid)
------------------------------------------------------------------------------
| Robust
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.034833 .0505957 20.45 0.000 .9354265 1.13424
_cons | .0296797 .0670127 0.44 0.658 -.1019821 .1613415
------------------------------------------------------------------------------
. outreg2 using tmp, ctitle(Cluster: firm) auto(5)
dir : seeout
v_firm = vcovCL(reg, type="HC1", cluster = ~firm)
reg_firm = coeftest(reg, v_firm)
print(reg_firm, digits=6)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0296797 0.0670127 0.4429 0.65786
x 1.0348334 0.0505957 20.4530 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Stata results in this section are in Table . R results are in Table .
This uses Petersen's cluster2 function.
use "data/petersen.dta"
cluster2 y x, fcluster(firm) tcluster(year)
outreg2 using tmp, ctitle(Cluster: both) auto(5)
. use "data/petersen.dta"
. cluster2 y x, fcluster(firm) tcluster(year)
command cluster2 is unrecognized
r(199);
end of do-file
r(199);
The sandwich package handles double clustering in the same way as single clustering:
v_both = vcovCL(reg, type='HC1', cluster = ~firm+year)
reg_both = coeftest(reg, v_both)
print(reg_both, digits=6)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0296797 0.0650639 0.45616 0.64829
x 1.0348334 0.0535580 19.32173 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This uses Petersen’s fm.ado function from his website.
use "data/test_data.dta"
tsset firm year
fm y x
outreg2 using tmp, ctitle(Fama-Macbeth) auto(5) tex(fragment)
. use "data/test_data.dta"
. tsset firm year
panel variable: firmid (strongly balanced)
time variable: year, 1 to 10
delta: 1 unit
. fm y x
command fm is unrecognized
r(199);
end of do-file
r(199);
The plm package handles panel models. Results from the Stata Fama-Macbeth function above can be replicated with a standard panel model using a grouped means estimator (the plm::pmg function in R).4
library(plm)
fpmg <- pmg(y~x, data=PetersenCL, index=c("year","firm")) ##Fama-MacBeth
reg_fm <- coeftest(fpmg)
print(coeftest(fpmg), digits=6)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0312780 0.0233565 1.33916 0.18058
x 1.0355861 0.0333416 31.05989 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The standard Stata and R procedures produce identical answers for logit and tobit. This is also true for probit, provided that the same numerical procedure is used for estimation. Maximum likelihood and glm estimates differ slightly, but are the same in Stata and R.
Petersen (2009) includes examples of double clustering for limited dependent models. To illustrate double clustering with limited dependent models, we will create two new variables: yl, which equals 1 if \(y>0\) and 0 otherwise, and ytrunc, which equals y if \(y > 0\) and 0 otherwise. I'm not sure how to think about clustering in the context of limited dependent models, but this stackoverflow post suggests that clustered standard errors for binary response models are not necessarily wrong but don't really make sense.5
With R, the same procedure used above works with the glm function, which handles limited dependent models. We first estimate the regression without a correction for clustering, then use vcovCL to compute cluster-corrected standard errors.
Stata results in this section are in Table . R results are in Table .
use "data/petersen.dta"
gen yl = (y > 0)
logit yl x
outreg2 using tmp2, ctitle(Logit,) auto(5) replace
logit yl x, vce(robust)
outreg2 using tmp2, ctitle(Logit,robust) auto(5)
logit yl x, vce(cluster firm)
outreg2 using tmp2, ctitle(Logit,cluster: firm) auto(5)
logit yl x, vce(cluster year)
outreg2 using tmp2, ctitle(Logit,cluster: year) auto(5)
logit2 yl x, fcluster(firm) tcluster(year)
outreg2 using tmp2, ctitle(Logit,cluster: both) auto(5) tex(frag)
. use "data/petersen.dta"
. gen yl = (y > 0)
. logit yl x
Iteration 0: log likelihood = -3464.8895
Iteration 1: log likelihood = -3133.2384
Iteration 2: log likelihood = -3133.1693
Iteration 3: log likelihood = -3133.1693
Logistic regression Number of obs = 5,000
LR chi2(1) = 663.44
Prob > chi2 = 0.0000
Log likelihood = -3133.1693 Pseudo R2 = 0.0957
------------------------------------------------------------------------------
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .8118898 .0346105 23.46 0.000 .7440544 .8797251
_cons | .035946 .0302484 1.19 0.235 -.0233398 .0952318
------------------------------------------------------------------------------
. outreg2 using tmp2, ctitle(Logit,) auto(5) replace
dir : seeout
. logit yl x, vce(robust)
Iteration 0: log pseudolikelihood = -3464.8895
Iteration 1: log pseudolikelihood = -3133.2384
Iteration 2: log pseudolikelihood = -3133.1693
Iteration 3: log pseudolikelihood = -3133.1693
Logistic regression Number of obs = 5,000
Wald chi2(1) = 561.72
Prob > chi2 = 0.0000
Log pseudolikelihood = -3133.1693 Pseudo R2 = 0.0957
------------------------------------------------------------------------------
| Robust
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .8118898 .0342562 23.70 0.000 .7447489 .8790306
_cons | .035946 .0302642 1.19 0.235 -.0233707 .0952627
------------------------------------------------------------------------------
. outreg2 using tmp2, ctitle(Logit,robust) auto(5)
dir : seeout
. logit yl x, vce(cluster firm)
Iteration 0: log pseudolikelihood = -3464.8895
Iteration 1: log pseudolikelihood = -3133.2384
Iteration 2: log pseudolikelihood = -3133.1693
Iteration 3: log pseudolikelihood = -3133.1693
Logistic regression Number of obs = 5,000
Wald chi2(1) = 239.03
Prob > chi2 = 0.0000
Log pseudolikelihood = -3133.1693 Pseudo R2 = 0.0957
(Std. Err. adjusted for 500 clusters in firmid)
------------------------------------------------------------------------------
| Robust
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .8118898 .0525134 15.46 0.000 .7089653 .9148142
_cons | .035946 .0599127 0.60 0.549 -.0814808 .1533728
------------------------------------------------------------------------------
. outreg2 using tmp2, ctitle(Logit,cluster: firm) auto(5)
dir : seeout
. logit yl x, vce(cluster year)
Iteration 0: log pseudolikelihood = -3464.8895
Iteration 1: log pseudolikelihood = -3133.2384
Iteration 2: log pseudolikelihood = -3133.1693
Iteration 3: log pseudolikelihood = -3133.1693
Logistic regression Number of obs = 5,000
Wald chi2(1) = 953.59
Prob > chi2 = 0.0000
Log pseudolikelihood = -3133.1693 Pseudo R2 = 0.0957
(Std. Err. adjusted for 10 clusters in year)
------------------------------------------------------------------------------
| Robust
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .8118898 .0262916 30.88 0.000 .7603592 .8634203
_cons | .035946 .0280314 1.28 0.200 -.0189946 .0908866
------------------------------------------------------------------------------
. outreg2 using tmp2, ctitle(Logit,cluster: year) auto(5)
dir : seeout
. logit2 yl x, fcluster(firm) tcluster(year)
command logit2 is unrecognized
r(199);
end of do-file
r(199);
We create the limited dependent variables in R:
PetersenCL$yl <- (PetersenCL$y > 0)
PetersenCL$ytrunc <- ifelse(PetersenCL$y > 0, PetersenCL$y, 0)
reg.logit = glm(yl ~ x, data=PetersenCL, family=binomial(link='logit'))
print(summary(reg.logit), digits=6)
Call:
glm(formula = yl ~ x, family = binomial(link = "logit"), data = PetersenCL)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.28674 -1.07261 0.53530 1.05492 2.28143
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0359460 0.0302484 1.18836 0.23469
x 0.8118898 0.0346105 23.45789 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6929.779 on 4999 degrees of freedom
Residual deviance: 6266.339 on 4998 degrees of freedom
AIC: 6270.339
Number of Fisher Scoring iterations: 4
logit0 = coeftest(reg.logit)
logit_robust = coeftest(reg.logit, vcovCL(reg.logit, type='HC0'))
logit_firm = coeftest(reg.logit, vcovCL(reg.logit, type='HC0', cluster=~firm))
logit_year = coeftest(reg.logit, vcovCL(reg.logit, type='HC0', cluster=~year))
logit_both = coeftest(reg.logit, vcovCL(reg.logit, type='HC0', cluster=~year+firm))
print(logit_both, digits=6)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0359460 0.0588165 0.61116 0.5411
x 0.8118898 0.0477014 17.02026 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In both Stata and R, probit estimation can be performed using maximum likelihood or iteratively reweighted least squares (IRLS). The different methods give slightly different answers. This causes confusion, as illustrated in this stackoverflow post.
In this section we compare probit estimates obtained using the probit functions available in Stata and R, which use maximum likelihood, with those obtained using the glm functions in Stata and R, which (optionally, in the case of Stata) use IRLS. In each case, when the numerical procedure is the same, results are identical between Stata and R.
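As a compact sketch of this comparison in R (assuming the PetersenCL data and the yl indicator defined below), the two estimation routes can be run side by side; any small discrepancy in the trailing digits reflects the numerical method, not the model:

```r
library(sandwich)         ## for the PetersenCL data
library(sampleSelection)  ## for the maximum-likelihood probit
data(PetersenCL)
PetersenCL$yl <- (PetersenCL$y > 0)

## IRLS via glm vs. direct maximum likelihood via sampleSelection::probit
fit_irls <- glm(yl ~ x, data = PetersenCL,
                family = binomial(link = "probit"))
fit_ml <- probit(yl ~ x, data = PetersenCL)

## Place the coefficient vectors next to each other for comparison
cbind(IRLS = coef(fit_irls), ML = coef(fit_ml))
```

The same side-by-side check could be applied to the standard errors, which is where the two procedures visibly diverge in the output below.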
We do probit estimation using Stata's glm function with the irls option. We have four sets of probit estimates from Stata. We perform only single clustering in this example, as the vce(cluster) option can take only one variable.
Results are in Table .
use "data/petersen.dta"
gen yl = (y > 0)
glm yl x, irls family(binomial) link(probit)
outreg2 using tmp4, ctitle(Probit,) auto(5) replace
glm yl x, irls family(binomial) link(probit) robust
outreg2 using tmp4, ctitle(Probit,robust) auto(5)
glm yl x, irls family(binomial) link(probit) vce(cluster firm)
outreg2 using tmp4, ctitle(Probit,cluster: firm) auto(5)
glm yl x, irls family(binomial) link(probit) vce(cluster year)
outreg2 using tmp4, ctitle(Probit,cluster: year) auto(5) tex(frag)
. use "data/petersen.dta"
. gen yl = (y > 0)
. glm yl x, irls family(binomial) link(probit)
Iteration 1: deviance = 6266.887
Iteration 2: deviance = 6265.648
Iteration 3: deviance = 6265.648
Iteration 4: deviance = 6265.648
Generalized linear models No. of obs = 5,000
Optimization : MQL Fisher scoring Residual df = 4,998
(IRLS EIM) Scale parameter = 1
Deviance = 6265.647945 (1/df) Deviance = 1.253631
Pearson = 4999.876207 (1/df) Pearson = 1.000375
Variance function: V(u) = u*(1-u) [Bernoulli]
Link function : g(u) = invnorm(u) [Probit]
BIC = -36303.28
------------------------------------------------------------------------------
| EIM
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .496622 .0202253 24.55 0.000 .4569812 .5362629
_cons | .0224236 .0184727 1.21 0.225 -.0137823 .0586294
------------------------------------------------------------------------------
. outreg2 using tmp4, ctitle(Probit,) auto(5) replace
dir : seeout
. glm yl x, irls family(binomial) link(probit) robust
Iteration 1: deviance = 6266.887
Iteration 2: deviance = 6265.648
Iteration 3: deviance = 6265.648
Iteration 4: deviance = 6265.648
Generalized linear models No. of obs = 5,000
Optimization : MQL Fisher scoring Residual df = 4,998
(IRLS EIM) Scale parameter = 1
Deviance = 6265.647945 (1/df) Deviance = 1.253631
Pearson = 4999.876207 (1/df) Pearson = 1.000375
Variance function: V(u) = u*(1-u) [Bernoulli]
Link function : g(u) = invnorm(u) [Probit]
BIC = -36303.28
------------------------------------------------------------------------------
| Semirobust
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .496622 .0201411 24.66 0.000 .4571462 .5360979
_cons | .0224236 .0184753 1.21 0.225 -.0137874 .0586345
------------------------------------------------------------------------------
. outreg2 using tmp4, ctitle(Probit,robust) auto(5)
dir : seeout
. glm yl x, irls family(binomial) link(probit) vce(cluster firm)
Iteration 1: deviance = 6266.887
Iteration 2: deviance = 6265.648
Iteration 3: deviance = 6265.648
Iteration 4: deviance = 6265.648
Generalized linear models No. of obs = 5,000
Optimization : MQL Fisher scoring Residual df = 4,998
(IRLS EIM) Scale parameter = 1
Deviance = 6265.647945 (1/df) Deviance = 1.253631
Pearson = 4999.876207 (1/df) Pearson = 1.000375
Variance function: V(u) = u*(1-u) [Bernoulli]
Link function : g(u) = invnorm(u) [Probit]
BIC = -36303.28
(Std. Err. adjusted for 500 clusters in firmid)
------------------------------------------------------------------------------
| Semirobust
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .496622 .0306577 16.20 0.000 .436534 .55671
_cons | .0224236 .036582 0.61 0.540 -.0492759 .094123
------------------------------------------------------------------------------
. outreg2 using tmp4, ctitle(Probit,cluster: firm) auto(5)
dir : seeout
. glm yl x, irls family(binomial) link(probit) vce(cluster year)
Iteration 1: deviance = 6266.887
Iteration 2: deviance = 6265.648
Iteration 3: deviance = 6265.648
Iteration 4: deviance = 6265.648
Generalized linear models No. of obs = 5,000
Optimization : MQL Fisher scoring Residual df = 4,998
(IRLS EIM) Scale parameter = 1
Deviance = 6265.647945 (1/df) Deviance = 1.253631
Pearson = 4999.876207 (1/df) Pearson = 1.000375
Variance function: V(u) = u*(1-u) [Bernoulli]
Link function : g(u) = invnorm(u) [Probit]
BIC = -36303.28
(Std. Err. adjusted for 10 clusters in year)
------------------------------------------------------------------------------
| Semirobust
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .496622 .015463 32.12 0.000 .4663151 .5269289
_cons | .0224236 .0163695 1.37 0.171 -.0096601 .0545072
------------------------------------------------------------------------------
. outreg2 using tmp4, ctitle(Probit,cluster: year) auto(5) tex(frag)
tmp4.tex
dir : seeout
.
Here we present results for simple robust standard errors, along with both single and double clustering. Results are in Table .
reg.probit <- glm(yl ~ x, data=PetersenCL, family=binomial(link='probit'))
print(summary(reg.probit), digits=6)
Call:
glm(formula = yl ~ x, family = binomial(link = "probit"), data = PetersenCL)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.370821 -1.075240 0.519401 1.057323 2.363382
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0224236 0.0184727 1.21388 0.2248
x 0.4966220 0.0202253 24.55449 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6929.779 on 4999 degrees of freedom
Residual deviance: 6265.648 on 4998 degrees of freedom
AIC: 6269.648
Number of Fisher Scoring iterations: 4
probitglm0 = coeftest(reg.probit)
probitglm_robust <- coeftest(reg.probit, vcovCL, type='HC0')
print(probitglm_robust, digits=6)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0224236 0.0184753 1.2137 0.22486
x 0.4966220 0.0201411 24.6571 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
probitglm_firm <- coeftest(reg.probit, vcovCL, type='HC0', cluster=~firm)
print(probitglm_firm, digits=6)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0224236 0.0365820 0.61297 0.5399
x 0.4966220 0.0306577 16.19893 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
probitglm_year <- coeftest(reg.probit, vcovCL, type='HC0', cluster=~year)
print(probitglm_year, digits=6)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0224236 0.0163695 1.36984 0.17074
x 0.4966220 0.0154630 32.11683 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
probitglm_both <- coeftest(reg.probit, vcovCL, type='HC0', cluster=~firm+year)
print(probitglm_both, digits=6)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0224236 0.0355650 0.6305 0.52837
x 0.4966220 0.0278089 17.8584 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We obtain identical estimates in Stata and R when using the sampleSelection::probit function in R. It is notable that the sandwich package works correctly with the sampleSelection package. Results are in Table .
use "data/petersen.dta"
gen yl = (y > 0)
probit yl x
outreg2 using tmp5, ctitle(OLS) auto(5) replace
probit yl x, robust
outreg2 using tmp5, ctitle(Probit,robust) auto(5)
probit yl x, vce(cluster firm)
outreg2 using tmp5, ctitle(Probit,cluster: firm) auto(5)
probit yl x, vce(cluster year)
outreg2 using tmp5, ctitle(Probit,cluster: year) auto(5)
probit2 yl x, fcluster(firm) tcluster(year)
outreg2 using tmp5, ctitle(Probit,cluster: both) auto(5) tex(frag)
. use "data/petersen.dta"
. gen yl = (y > 0)
. probit yl x
Iteration 0: log likelihood = -3464.8895
Iteration 1: log likelihood = -3132.8482
Iteration 2: log likelihood = -3132.824
Iteration 3: log likelihood = -3132.824
Probit regression Number of obs = 5,000
LR chi2(1) = 664.13
Prob > chi2 = 0.0000
Log likelihood = -3132.824 Pseudo R2 = 0.0958
------------------------------------------------------------------------------
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .496622 .0202469 24.53 0.000 .4569389 .5363051
_cons | .0224236 .0184721 1.21 0.225 -.0137811 .0586282
------------------------------------------------------------------------------
. outreg2 using tmp5, ctitle(OLS) auto(5) replace
dir : seeout
. probit yl x, robust
Iteration 0: log pseudolikelihood = -3464.8895
Iteration 1: log pseudolikelihood = -3132.8482
Iteration 2: log pseudolikelihood = -3132.824
Iteration 3: log pseudolikelihood = -3132.824
Probit regression Number of obs = 5,000
Wald chi2(1) = 605.40
Prob > chi2 = 0.0000
Log pseudolikelihood = -3132.824 Pseudo R2 = 0.0958
------------------------------------------------------------------------------
| Robust
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .496622 .0201839 24.60 0.000 .4570623 .5361817
_cons | .0224236 .018474 1.21 0.225 -.0137848 .0586319
------------------------------------------------------------------------------
. outreg2 using tmp5, ctitle(Probit,robust) auto(5)
dir : seeout
. probit yl x, vce(cluster firm)
Iteration 0: log pseudolikelihood = -3464.8895
Iteration 1: log pseudolikelihood = -3132.8482
Iteration 2: log pseudolikelihood = -3132.824
Iteration 3: log pseudolikelihood = -3132.824
Probit regression Number of obs = 5,000
Wald chi2(1) = 261.28
Prob > chi2 = 0.0000
Log pseudolikelihood = -3132.824 Pseudo R2 = 0.0958
(Std. Err. adjusted for 500 clusters in firmid)
------------------------------------------------------------------------------
| Robust
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .496622 .0307234 16.16 0.000 .4364053 .5568388
_cons | .0224236 .0365799 0.61 0.540 -.0492718 .0941189
------------------------------------------------------------------------------
. outreg2 using tmp5, ctitle(Probit,cluster: firm) auto(5)
dir : seeout
. probit yl x, vce(cluster year)
Iteration 0: log pseudolikelihood = -3464.8895
Iteration 1: log pseudolikelihood = -3132.8482
Iteration 2: log pseudolikelihood = -3132.824
Iteration 3: log pseudolikelihood = -3132.824
Probit regression Number of obs = 5,000
Wald chi2(1) = 1028.43
Prob > chi2 = 0.0000
Log pseudolikelihood = -3132.824 Pseudo R2 = 0.0958
(Std. Err. adjusted for 10 clusters in year)
------------------------------------------------------------------------------
| Robust
yl | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .496622 .015486 32.07 0.000 .46627 .526974
_cons | .0224236 .0163607 1.37 0.171 -.0096428 .0544899
------------------------------------------------------------------------------
. outreg2 using tmp5, ctitle(Probit,cluster: year) auto(5)
dir : seeout
. probit2 yl x, fcluster(firm) tcluster(year)
command probit2 is unrecognized
r(199);
end of do-file
r(199);
In R, a direct maximum likelihood probit function is available in the sampleSelection package. Results are in Table .
library(sampleSelection)
reg.probit <- probit(yl ~ x, data=PetersenCL)
print(summary(reg.probit), digits=6)
--------------------------------------------
Probit binary choice model/Maximum Likelihood estimation
Newton-Raphson maximisation, 4 iterations
Return code 1: gradient close to zero
Log-Likelihood: -3132.82
Model: Y == 'TRUE' in contrary to 'FALSE'
5000 observations (2454 'negative' and 2546 'positive') and 2 free parameters (df = 4998)
Estimates:
Estimate Std. error t value Pr(> t)
(Intercept) 0.0224236 0.0184721 1.21391 0.22478
x 0.4966220 0.0202469 24.52836 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Significance test:
chi2(1) = 664.131 (p=1.8878e-146)
--------------------------------------------
probitml0 <- coeftest(reg.probit)
probitml_robust <- coeftest(reg.probit, vcovCL)
probitml_firm <- coeftest(reg.probit, vcovCL, type='HC0', cluster=~firm)
probitml_year <- coeftest(reg.probit, vcovCL, type='HC0', cluster=~year)
probitml_both <- coeftest(reg.probit, vcovCL, type='HC0', cluster=~firm+year)
print(probitml0, digits=6)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0224236 0.0184721 1.21391 0.22484
x 0.4966220 0.0202469 24.52836 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
print(probitml_both, digits=6)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0224236 0.0355594 0.63059 0.52834
x 0.4966220 0.0278631 17.82363 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I'm not sure what outreg2 is doing with the tobit output in this example, but the tables are off. Suggestions welcome. Results are in Tables – .
use "data/petersen.dta"
gen ytrunc = y
replace ytrunc = 0 if y < 0
tobit ytrunc x, ll(0)
outreg2 using tmp3a, ctitle(Tobit,) auto(5) onecol tex(frag) replace
tobit ytrunc x, ll(0) vce(cluster firm)
outreg2 using tmp3b, ctitle(Tobit,cluster: firm) auto(5) onecol tex(frag) replace
tobit2 ytrunc x, ll(0) fcluster(firm) tcluster(year)
outreg2 using tmp3c, ctitle(Tobit,cluster: both) auto(5) tex(frag) onecol replace
. use "data/petersen.dta"
. gen ytrunc = y
. replace ytrunc = 0 if y < 0
(2,454 real changes made)
. tobit ytrunc x, ll(0)
Refining starting values:
Grid node 0: log likelihood = -8052.5473
Fitting full model:
Iteration 0: log likelihood = -8052.5473
Iteration 1: log likelihood = -7129.4277
Iteration 2: log likelihood = -6957.7436
Iteration 3: log likelihood = -6953.127
Iteration 4: log likelihood = -6953.1162
Iteration 5: log likelihood = -6953.1162
Tobit regression Number of obs = 5,000
Uncensored = 2,546
Limits: lower = 0 Left-censored = 2,454
upper = +inf Right-censored = 0
LR chi2(1) = 879.38
Prob > chi2 = 0.0000
Log likelihood = -6953.1162 Pseudo R2 = 0.0595
------------------------------------------------------------------------------
ytrunc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.002673 .0345256 29.04 0.000 .9349881 1.070359
_cons | .0362914 .0360107 1.01 0.314 -.0343054 .1068881
-------------+----------------------------------------------------------------
var(e.ytrunc)| 4.067103 .1248859 3.829496 4.319453
------------------------------------------------------------------------------
. outreg2 using tmp3a, ctitle(Tobit,) auto(5) onecol tex(frag) replace
tmp3a.tex
dir : seeout
. tobit ytrunc x, ll(0) vce(cluster firm)
Refining starting values:
Grid node 0: log likelihood = -8052.5473
Fitting full model:
Iteration 0: log pseudolikelihood = -8052.5473
Iteration 1: log pseudolikelihood = -7129.4277
Iteration 2: log pseudolikelihood = -6957.7436
Iteration 3: log pseudolikelihood = -6953.127
Iteration 4: log pseudolikelihood = -6953.1162
Iteration 5: log pseudolikelihood = -6953.1162
Tobit regression Number of obs = 5,000
Uncensored = 2,546
Limits: lower = 0 Left-censored = 2,454
upper = +inf Right-censored = 0
F( 1, 4999) = 324.97
Prob > F = 0.0000
Log pseudolikelihood = -6953.1162 Pseudo R2 = 0.0595
(Std. Err. adjusted for 500 clusters in firmid)
------------------------------------------------------------------------------
| Robust
ytrunc | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 1.002673 .0556206 18.03 0.000 .8936325 1.111714
_cons | .0362914 .072831 0.50 0.618 -.1064894 .1790722
-------------+----------------------------------------------------------------
var(e.ytrunc)| 4.067103 .2335828 3.634017 4.551803
------------------------------------------------------------------------------
. outreg2 using tmp3b, ctitle(Tobit,cluster: firm) auto(5) onecol tex(frag) re
> place
tmp3b.tex
dir : seeout
. tobit2 ytrunc x, ll(0) fcluster(firm) tcluster(year)
command tobit2 is unrecognized
r(199);
end of do-file
r(199);
Results are in Table .
library(censReg)
reg.tobit <- censReg(ytrunc ~ x, data=PetersenCL, left=0)
tobit0 <- coeftest(reg.tobit)
tobit_robust <- coeftest(reg.tobit, vcovCL, type='HC0')
tobit_firm <- coeftest(reg.tobit, vcovCL, type='HC0', cluster=~firm)
tobit_year <- coeftest(reg.tobit, vcovCL, type='HC0', cluster=~year)
tobit_both <- coeftest(reg.tobit, vcovCL, type='HC0', cluster=~firm+year)
print(summary(reg.tobit), digits=6)
Call:
censReg(formula = ytrunc ~ x, left = 0, data = PetersenCL)
Observations:
Total Left-censored Uncensored Right-censored
5000 2454 2546 0
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) 0.0362914 0.0360107 1.00779 0.31355
x 1.0026733 0.0345256 29.04148 < 2e-16 ***
logSigma 0.7014655 0.0153532 45.68864 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Newton-Raphson maximisation, 5 iterations
Return code 2: successive function values within tolerance limit
Log-likelihood: -6953.12 on 3 Df
print(tobit_both, digits=6)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0362914 0.0707035 0.51329 0.60777
x 1.0026733 0.0592672 16.91785 < 2e-16 ***
logSigma 0.7014655 0.0294918 23.78510 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following tables summarize the results for Stata and R. The code that produced these results is in the preceding sections. Stata tables were produced using outreg2 and R tables were produced using stargazer. For details, see the Rmarkdown version of this document.
|              | OLS         | Robust OLS  | Cluster: year | Cluster: firm | Cluster: both | Fama-Macbeth |
|--------------|-------------|-------------|---------------|---------------|---------------|--------------|
|              | (1)         | (2)         | (3)           | (4)           | (5)           | (6)          |
| x            | 1.034830*** | 1.034830*** | 1.034830***   | 1.034830***   | 1.034830***   | 1.035590***  |
|              | (0.028583)  | (0.028395)  | (0.033389)    | (0.050596)    | (0.053558)    | (0.033342)   |
| Constant     | 0.029680    | 0.029680    | 0.029680      | 0.029680      | 0.029680      | 0.031278     |
|              | (0.028359)  | (0.028361)  | (0.023387)    | (0.067013)    | (0.065064)    | (0.023356)   |
| Cluster:     | None        | Robust      | Year          | Firm          | Firm+Year     |              |
| Observations | 5,000       | 5,000       | 5,000         | 5,000         | 5,000         | 5,000        |
| R2           | 0.207766    | 0.207766    | 0.207766      | 0.207766      | 0.207766      | 0.207498     |

Note: *p<0.1; **p<0.05; ***p<0.01
|              | Logit       | Robust      | Cluster: firm | Cluster: year | Cluster: both |
|--------------|-------------|-------------|---------------|---------------|---------------|
|              | (1)         | (2)         | (3)           | (4)           | (5)           |
| x            | 0.811890*** | 0.811890*** | 0.811890***   | 0.811890***   | 0.811890***   |
|              | (0.034611)  | (0.034256)  | (0.052513)    | (0.026292)    | (0.047701)    |
| Constant     | 0.035946    | 0.035946    | 0.035946      | 0.035946      | 0.035946      |
|              | (0.030248)  | (0.030264)  | (0.059913)    | (0.028031)    | (0.058816)    |
| Cluster:     | None        | Robust      | Firm          | Year          | Firm+Year     |
| Observations | 5,000       | 5,000       | 5,000         | 5,000         | 5,000         |

Note: *p<0.1; **p<0.05; ***p<0.01
Stata results in this section are in Table . R results are in Table .
|              | Probit      | Robust      | Cluster: firm | Cluster: year | Cluster: both |
|--------------|-------------|-------------|---------------|---------------|---------------|
|              | (1)         | (2)         | (3)           | (4)           | (5)           |
| x            | 0.496622*** | 0.496622*** | 0.496622***   | 0.496622***   | 0.496622***   |
|              | (0.020225)  | (0.020141)  | (0.030658)    | (0.015463)    | (0.027809)    |
| Constant     | 0.022424    | 0.022424    | 0.022424      | 0.022424      | 0.022424      |
|              | (0.018473)  | (0.018475)  | (0.036582)    | (0.016370)    | (0.035565)    |
| Cluster:     | None        | Robust      | Firm          | Year          | Firm+Year     |
| Observations | 5,000       | 5,000       | 5,000         | 5,000         | 5,000         |

Note: *p<0.1; **p<0.05; ***p<0.01
Stata results in this section are in Table . R results are in Table .
|              | Probit      | Robust      | Cluster: firm | Cluster: year | Cluster: both |
|--------------|-------------|-------------|---------------|---------------|---------------|
|              | (1)         | (2)         | (3)           | (4)           | (5)           |
| x            | 0.496622*** | 0.496622*** | 0.496622***   | 0.496622***   | 0.496622***   |
|              | (0.020247)  | (0.020184)  | (0.030723)    | (0.015486)    | (0.027863)    |
| Constant     | 0.022424    | 0.022424    | 0.022424      | 0.022424      | 0.022424      |
|              | (0.018472)  | (0.018474)  | (0.036580)    | (0.016361)    | (0.035559)    |
| Cluster:     | None        | Robust      | Firm          | Year          | Firm+Year     |
| Observations | 5,000       | 5,000       | 5,000         | 5,000         | 5,000         |

Note: *p<0.1; **p<0.05; ***p<0.01
|              | Tobit       | Robust      | Cluster: firm | Cluster: year | Cluster: both |
|--------------|-------------|-------------|---------------|---------------|---------------|
|              | (1)         | (2)         | (3)           | (4)           | (5)           |
| x            | 1.002670*** | 1.002670*** | 1.002670***   | 1.002670***   | 1.002670***   |
|              | (0.034526)  | (0.034107)  | (0.055621)    | (0.039777)    | (0.059267)    |
| logSigma     | 0.701465*** | 0.701465*** | 0.701465***   | 0.701465***   | 0.701465***   |
|              | (0.015353)  | (0.016150)  | (0.028716)    | (0.017493)    | (0.029492)    |
| Constant     | 0.036291    | 0.036291    | 0.036291      | 0.036291      | 0.036291      |
|              | (0.036011)  | (0.035942)  | (0.072831)    | (0.031408)    | (0.070704)    |
| Cluster:     | None        | Robust      | Firm          | Year          | Firm+Year     |
| Observations | 5,000       | 5,000       | 5,000         | 5,000         | 5,000         |

Note: *p<0.1; **p<0.05; ***p<0.01
In order to focus on the econometric commands, the code chunks displayed in the text hid the commands necessary to produce the Stata output tables. In this section we present the complete code, including the outreg2 commands used to produce the tables. These listings also show the R print statements that produce terminal output similar to Stata’s default terminal output.
The procedure for producing formatted tables differs between Stata and R. In Stata, table options are specified along with each regression, and the resulting file is then read into the final document. In R, regression results are saved in a variable, and table options are specified when the table is created.
## Base OLS and robust
use "data/petersen.dta"
regress y x
outreg2 using tmp, replace ctitle(OLS) auto(5)
regress y x, robust
outreg2 using tmp, ctitle(Robust OLS) auto(5)
## Cluster by year
use "data/petersen.dta"
regress y x, vce(cluster year)
outreg2 using tmp, ctitle(Cluster: year) auto(5)
## cluster by firm
use "data/petersen.dta"
regress y x, vce(cluster firm)
outreg2 using tmp, ctitle(Cluster: firm) auto(5)
## cluster by both
use "data/petersen.dta"
cluster2 y x, fcluster(firm) tcluster(year)
outreg2 using tmp, ctitle(Cluster: both) auto(5)
## fama-macbeth
use "data/test_data.dta"
tsset firm year
fm y x
outreg2 using tmp, ctitle(Fama-Macbeth) auto(5) tex(fragment)
Because the table options have been specified in each regression, with LaTeX it is only necessary to use \input{tmp} to create the formatted Table .
Each set of commands below runs a regression and computes standard errors, storing the results in a variable (OLS estimates in reg, cluster-by-year estimates in reg_year, etc.). The print statements included below create output that resembles Stata’s regression output; they are not necessary to create the formatted table.
## Base OLS and robust
library(sandwich)
library(lmtest)
data(PetersenCL)
reg <- lm(y ~ x, data = PetersenCL) ## `reg` is used throughout
regols = coeftest(reg) ## OLS
print(regols, digits=6)
regolsr = coeftest(reg, vcovCL) ## Robust std errors
print(regolsr, digits=6)
## Cluster by year
v_year = vcovCL(reg, type='HC1', cluster = ~year)
reg_year = coeftest(reg, v_year)
print(reg_year, digits=6)
## cluster by firm
v_firm = vcovCL(reg, type="HC1", cluster = ~firm)
reg_firm = coeftest(reg, v_firm)
print(reg_firm, digits=6)
## cluster by both
v_both = vcovCL(reg, type='HC1', cluster = ~firm+year)
reg_both = coeftest(reg, v_both)
print(reg_both, digits=6)
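For intuition about what vcovCL computes: the one-way cluster-robust covariance is the sandwich (X'X)^-1 [sum over clusters of (X_g'u_g)(X_g'u_g)'] (X'X)^-1, and the two-way ("firm and year") version follows the Cameron, Gelbach, and Miller inclusion-exclusion rule. Below is a by-hand sketch of that arithmetic on simulated data. It is written in Python/NumPy purely to keep the linear algebra explicit; it is not the sandwich package's code, the variable names are mine, and the Stata-style small-sample factor shown may differ slightly from vcovCL's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated panel: 50 firms x 10 years, with a firm effect in x and in the error
n_firm, n_year = 50, 10
firm = np.repeat(np.arange(n_firm), n_year)
year = np.tile(np.arange(n_year), n_firm)
x = rng.normal(size=n_firm * n_year) + rng.normal(size=n_firm)[firm]
y = 1.0 * x + rng.normal(size=n_firm * n_year) + rng.normal(size=n_firm)[firm]

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)    # OLS coefficients
u = y - X @ beta                            # residuals

def cluster_vcov(X, u, groups):
    """One-way cluster-robust covariance: bread @ meat @ bread, where the
    meat sums outer products of the within-cluster score sums X_g' u_g."""
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    ids = np.unique(groups)
    meat = np.zeros((k, k))
    for g in ids:
        s = X[groups == g].T @ u[groups == g]
        meat += np.outer(s, s)
    # Stata-style small-sample factor: G/(G-1) * (n-1)/(n-k)
    adj = len(ids) / (len(ids) - 1) * (n - 1) / (n - k)
    return adj * bread @ meat @ bread

V_firm = cluster_vcov(X, u, firm)
V_year = cluster_vcov(X, u, year)
# Two-way clustering by inclusion-exclusion: add the two one-way matrices and
# subtract the covariance clustered on the firm-year intersection (here each
# observation is its own cluster, so the meat reduces to X' diag(u^2) X)
V_int = cluster_vcov(X, u, np.arange(len(u)))
V_both = V_firm + V_year - V_int

se_firm = np.sqrt(np.diag(V_firm))
```

In R, vcovCL(reg, cluster = ~firm+year) performs this inclusion-exclusion internally; see the sandwich vignette for the exact small-sample adjustments it applies.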
## fama-macbeth
library(plm)
fpmg <- pmg(y~x, data=PetersenCL, index=c("year","firm")) ##Fama-MacBeth
reg_fm <- coeftest(fpmg)
print(coeftest(fpmg), digits=6)
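The pmg call with index=c("year","firm") yields Fama-MacBeth estimates: a separate cross-sectional regression is run for each year, the reported coefficient is the time-series average of the yearly estimates, and the standard error is the time-series standard deviation of those estimates divided by sqrt(T). A minimal sketch of that procedure on simulated data (written in Python to keep the steps explicit; this is not the plm implementation, and the data-generating values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated panel: 10 years x 500 firms, true intercept 0.03 and slope 1
n_year, n_firm = 10, 500
yearly = []                                   # one (intercept, slope) per year
for t in range(n_year):
    x = rng.normal(size=n_firm)
    y = 0.03 + 1.0 * x + rng.normal(size=n_firm)
    X = np.column_stack([np.ones(n_firm), x])
    yearly.append(np.linalg.lstsq(X, y, rcond=None)[0])  # cross-sectional OLS

yearly = np.array(yearly)                     # shape (T, 2)
fm_coef = yearly.mean(axis=0)                 # time-series average of estimates
fm_se = yearly.std(axis=0, ddof=1) / np.sqrt(n_year)
```

The averaging over yearly cross-sections is why the Fama-MacBeth slope (1.035590 in the table below) differs slightly from the pooled OLS slope even though both use the same data.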
Table is produced by the stargazer function. There are other functions that can produce tables from regression objects, including texreg, apsrtable, and pander.
stargazer(OLS=reg, 'OLS robust'=reg, Year=reg, Firm=reg,
'Firm+Year'=reg, 'Fama-Macbeth'=fpmg
,se=list(regols[,2], regolsr[,2], reg_year[,2], reg_firm[,2],
reg_both[,2], reg_fm[,2])
,type=output_type
,header=FALSE
,column.labels=c('OLS', 'Robust OLS', 'Cluster: year',
'Cluster: firm', 'Cluster: both', 'Fama-Macbeth')
,title='Vanilla and clustered standard errors for OLS, and Fama-Macbeth: R'
,no.space=TRUE ## no blank line between estimates
,digits=6
,column.sep.width='2pt'
,add.lines=list(c('Cluster:', 'None', 'Robust', 'Year', 'Firm', 'Firm+Year'))
,dep.var.caption=''
,dep.var.labels=''
,dep.var.labels.include=FALSE
,model.names=FALSE ## controls "coefficient" string on top
,model.numbers=TRUE
,omit.stat=c('f', 'ser', 'adj.rsq')
,intercept.bottom=TRUE
,label='tbl:ols:r'
,font.size='footnotesize'
)
Berger, Susanne, Nathaniel Graham, and Achim Zeileis. 2017. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” https://cran.r-project.org/web/packages/sandwich/vignettes/sandwich-CL.pdf.
Hlavac, Marek. 2018. “Stargazer: Well-Formatted Regression and Summary Statistics Tables.” https://CRAN.R-project.org/package=stargazer.
Petersen, Mitchell A. 2009. “Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches.” Review of Financial Studies 22 (1): 435–80.
Berger, Graham, and Zeileis (2017), a vignette from the sandwich package, provides an overview of the topic and has numerous references.↩
There are numerous posts online comparing results in Stata and R, with some matching and some failing to match. Examples include this Stackoverflow post and this Princeton tutorial.↩
The vcovCL function is object-oriented, so it adapts to whatever regression object it receives. One benefit of this approach is that the author of an estimation package can rely on the sandwich package for standard errors. Similarly, improvements in the sandwich package automatically accrue to estimation packages.↩
This is discussed in this blog post.↩
As I understand it, the point is that robust standard errors are correct if the model is otherwise specified and estimated correctly. However, with a limited dependent model, the initial coefficient estimates depend on the underlying covariance, so it doesn’t make sense to estimate the model and then go back and correct the covariance. If I misunderstand please let me know!↩