Panel Data Regression

Issue with Pooled OLS

assuming \(i\) is the index for different city and \(t\) is the index for time. We are trying to see the relationship between house price \(HP\) and crime rate. \[HP_{i,t} = \beta_0+\beta_1 Crime_{i,t} + V_t + \alpha_i + u_{i,t}\] where \(V_t\) is time dependent variable which represent the time trend across all cities, and \(\alpha_i\) is the city specific variable that is fixed across time (e.g. geographic, demographic, race, education). if \(\alpha_i\) is not in the OLS, then it would end up in the error term, and \(cov(Crime_{i,t}, \alpha) \neq0\) (a city’s crime rate can be related to its city specific variables). In this case the \(\beta\) is biased and inconsistent.

First Difference Approach (FD)

\[ \begin{aligned} \Delta HP_{i,t} &= \beta_0-\beta_0 + \beta_1 \Delta Crime_{i,t} + V_{t} - V_{t-1} + \alpha_i-\alpha_i + \Delta u_{i,t}\\ &= \beta_1 \Delta Crime_{i,t} + V_{t} - V_{t-1} + \Delta u_{i,t} \end{aligned} \]

Now the city’s specific effect has been removed, and \(\beta_1\) is consistent estimated this way.

Fixed Effect Approach (FE)

\[HP_{i,t} = \beta_0+\beta_1 Crime_{i,t} + \beta_2 Unemployment_{i,t} + \alpha_i + u_{i,t}\] The average of the house price for each city accross time is: \[\bar{HP_i} = \frac{1}{T} \sum_{t=1}^{T}HP_{i,t}\] \[\bar{\alpha_{i}} = \frac{1}{T}T\alpha_{i} = \alpha_i\]

Thus,

\[\bar{HP_i} = \beta_0 + \beta_1 \bar{Crime_i}+\beta_2 \bar{Unemployment_i}+\alpha_i+\bar{u_i}\]

Then

\[HP_{i,t}-\bar{HP_i} = \beta_1 (Crime_{i,t}-\bar{Crime_i}) + \beta_2 (Unemployment_{i,t}-\bar{Unemployment_i}) + (u_{i,t}-\bar{u_i})\]

Note that both First Difference and Fixed Effect can successfully estimate the coefficients for crime rate \(\beta_1\) & employment rate \(\beta_2\), it tells us nothing about the city specific characteristics variables \(\alpha_i\)

Dummy Variables Estimator

\[HP_{i,t} = \beta_0+\beta_1 Crime_{i,t} + \beta_2 Unemployment_{i,t} + \alpha_i + u_{i,t}\]

\[HP_{i,t} = \beta_0+\beta_1 Crime_{i,t} + \beta_2 Unemployment_{i,t} + \mu_1d_2 + \mu_2d_3 + u_{i,t}\] where \(d_2 = 1\) , when \(i=2\), otherwise \(d_2=0\). This way, we can actually estimate \(\alpha_i\), however, if we have larger number of different cities, then this approach becomes unrealistic.

R Squared in FE/LSDV

\[\tilde{HP_{i,t}} = \beta_1\tilde{Crime_{i,t}}+\beta_2\tilde{Unemployment_{i,t}}+\tilde{u_{i,t}}\]

\(R^2\) here for fixed effect model means the variation of \(HP_{i,t}\) explained by the model relative to \(HP_i\).

For LSDV, the high \(R^2\) is not surprised and not very indicative. Even without the crime and unemployment rate, we still have dummy variables (city specific) and year in the model.

First Difference v.s. Fixed Effect

\[T=2: FD = FE\]

\[T \geq3 : FD \neq FE\]

Both \(\hat\beta_{FD}\) and \(\hat\beta_{FE}\) are unbiased, so we compare the efficiency.

\[HP_{i,t} = \beta_0+\beta_1 Crime_{i,t} + \beta_2 Unemployment_{i,t} + \alpha_i + u_{i,t}\]

\(cov(u_{i,t}, u_{i,t-1}) = 0\) then \[ \begin{aligned} Cov(\Delta u_{i,t}, \Delta u_{i,t-1}) &= Cov(u_{i,t}-u_{i,t-1},u_{i,t-1}-u_{i,t-2}) \\ &= Cov(u_{i,t-1},u_{i,t-1}) \\ &= -Var(u_{i,t-1}) \end{aligned} \] Therefore, if we have serial uncorrelated idiosyncratic errors \(u_{i,t}\) then \(\Delta u_{i,t}\) is serially correlated, and therefore fixed effect is preferred. However, if we have serial correlated \(u_{i,t}\), then it depends on the \(\rho\) of \(u_{i,t} = \rho u_{i,t-1} + \epsilon_{i,t}\).

However, in both method, we don’t actually estimate the \(u_{i,t}\), so we don’t know if they are serially correlated. We can do use the estimation from first difference:

if \(Cov(\Delta \hat u_{i,t})\) is significant negative, then FE is favorable if \(Cov(\Delta \hat u_{i,t}) = 0\) then FD is favorable.

In the end, it is better to do both, and examine the difference of these two estimators.

If \(T>N\),where \(N\) is small, then FE is quite sensitive. If \(X_{i,t}\) is unit root, then FE is subject to spurious regression.

If \(T\) is very large, then FE is less sensitive than FD with respect to violation of strict exogeneity (\(Cov(u_{i,t}, X_{i,s})\))

Random Effects

\[HP_{i,t} = \beta_0+\beta_1 Crime_{i,t} + \beta_2 Unemployment_{i,t} + \alpha_i + u_{i,t}\]

If \(Cov(\alpha_i, X_{i,t}) = 0\) then \(\beta_{OLS}\) is consistent, but we have serial correlation in the error terms. To address this, we do FGLS which is called the random effect for Panel Data.

\[Cov(\alpha_i+u_{i,t}, \alpha_i+u_{i,s}) = Var(\alpha_i)\]

\[HP_{i,t} - \lambda \bar{HP_{i,t}} = \beta_0(1-\lambda)+\beta_1 (Crime_{i,t}-\lambda Crime_{i,t}) + \beta_2 (Unemployment_{i,t} \lambda Unemployment_{i,t}) + \alpha_i (1-\lambda) + u_{i,t} (1-\lambda)\] \[\lambda = 1-\frac{\sigma_{u}^2}{\sigma_{u}^2+T \sigma_{\alpha}^2}^{1/2}\]

note that when \(\lambda = 1\), then it’s fixed effect.

Steps of feasible GLS (Random Effect):

estimate \(\hat{\lambda}\) by FE or Pooled OLS
Use Pooled OLS on the transformed system

We can only use random effect if \(Cov(\alpha_i, X_{i,t}) = 0\)

Benefits of Random Effects

Benefits of random effects \(-\) time constant variables

\[HP_{i,t} = \beta_0+\beta_1 Crime_{i,t} + \beta_2 Unemployment_{i,t} + \beta_3 Geography_{i} + \beta_4 Race_{i}+ \alpha_i + u_{i,t}\]

Time constant variables are not possible to be estimated using FE or FD. Since \(\lambda\) lays between 0 and 1, and therefore the Time constant variables are not going to disappear in the transformed equation.

Hausman Test

Hausman Test for Random Effects vs Fixed Effects.

Null Hypothesis \(H_0 = Cov(\alpha_i, X_{i,t}) = 0\) \[w = \frac{(\hat{\beta_{FE}^{*}}-\hat{\beta_{RE}^{*}})^2} {Var(\hat{\beta_{FE})}-Var(\hat{\beta_{RE})}} \sim \chi_1^2\] under \(H_0\)

Intuition, if Null Hypothesis is true, then

\(\hat{\beta_{RE}}\), \(\hat{\beta_{FE}}\) are consistent, and therefore the numarator of \(w\) statistics should be close to 0.
\(SE(\hat{\beta_{RE}}) < SE(\hat{\beta_{FE}})\) and therefore the denominator of the \(w\) statistics should be large.