Adjust the MLE in a binary regression — adjust

This function computes the asymptotic inflation and variance of the MLE of a binary regression when both $n$ and $p$ are large, assuming that covariates $X$ are multivariate Gaussian.

Usage

adjust_binary(glm_output, verbose = TRUE, echo = TRUE, ...)

Arguments

glm_output: An object from the class glm, for example the output from the glm function
verbose: If TRUE, print progress at every step.
echo: If TRUE, return the input glm_output.
...: Additional arguments.

Value

A list with elements

glm_output: If echo = TRUE, returns the input glm_output.
gamma_hat: Estimated signal strength $\gamma_0$.
param: Estimated paramters $(\alpha_\star, \lambda_\star, \sigma_\star, b_\star)$. If the model does not contain an intercept, returns $(\alpha_\star, \lambda_\star, \sigma_\star)$.
intercept: Does the model contain an intercept?
tau_hat: Estimated conditional standard deviation.
coef_adj: Adjusted MLE: this is $\hat{\beta}^{\mathrm{MLE}} / \alpha_\star$.
std_adj: Estimated standard error of $\hat{\beta}_j$: $\sigma_\star / \tau_j$.
coef_unadj: Unadjusted MLE.
std_unadj: Unadjusted standard error. This is output from the glm function.

Details

This function estimates coefficients of a binary regression by adjusting the MLE based on the theory when $n$ and $p$ grows to infinity, and the covariates are multivariate Gaussian. In this setting, the MLE of one variable $\beta_j$ is $$ \hat{\beta}^{\mathrm{MLE}}_j - \alpha_\star \beta_j \approx \mathcal{N}(0, \sigma_\star^2 / \tau_j^2). $$ where $\tau_j^2$ is the conditional variance of $x_j$ given all the other variables. This implies $\hat{\beta}_j^\mathrm{Adj} = \hat{\beta}^{MLE}_j / \alpha_\star$ is unbiased for the coefficient $\beta_j$. adjust_binary computes $\hat{\beta}_j^\mathrm{Adj}$ and returns it in the vector coef_adj.

Additional arguments

B: Number of repetitions at each kappa in ProbeFrontier algorithm.
tol: tol parameter when solving for signal strength and intercept. see signal_strength.

Notes

The variables should be centered to have zero mean.
The input to this function should be an object from the class glm, and it should include family specifying the link function. Currently, the algorithm can handle logit and probit link.
if estimated $\hat{\gamma}<0.0001$, we set $\gamma=0$.

References

The Asymptotic Distribution of the MLE in High-dimensional Logistic Models: Arbitrary Covariance, Qian Zhao, Pragya Sur and Emmanuel J. Candes, arXiv:2001.09351

Examples

# Problem size
n <- 1000L
p <- 300L
# Generate data matrix
X <- matrix(rnorm(n*p, 0, 1), n, p) / sqrt(p)
# Sample parameters
beta <- rep(c(0, 1), p / 2) * 2
# Generate response
Y <- rbinom(n, 1, 1 / (1 + exp(- X %*% beta)))
fit <- glm(Y ~ X + 0, family = binomial, x = TRUE, y = TRUE)
adjusted_fit <- adjust_binary(fit)
#> Link function:  logit 
#> ------ Begin Probe Frontior ------
#> kappa =  0.405 ; pi_hat =  0.3 
#> kappa =  0.4575 ; pi_hat =  1 
#> kappa =  0.43125 ; pi_hat =  0.7 
#> kappa =  0.418125 ; pi_hat =  0.7 
#> kappa =  0.4115625 ; pi_hat =  0.5 
#> kappa =  0.4148438 ; pi_hat =  0.6 
#> kappa =  0.4132031 ; pi_hat =  0.6 
#> kappa =  0.4123828 ; pi_hat =  0.7 
#> Found kappa_s =  0.4097962 
#> ------- 
#> -- Finding signal strength and the intercept -- 
#>  No intercept, estimated signal strength gamma_hat =  1.267636 
#>  ---- 
#> Found parameters:
#>  alpha_s lambda_s  sigma_s 
#> 1.766048 4.471162 2.795273 
# Adjusted MLE
head(adjusted_fit$coef_unadj)
#>        X1        X2        X3        X4        X5        X6 
#> -2.044095  1.215670  3.915436  6.992635 -5.281657  9.027532