Skip to contents

This function computes the asymptotic inflation and variance of the MLE of a binary regression when both \(n\) and \(p\) are large, assuming that covariates \(X\) are multivariate Gaussian.

Usage

adjust_binary(glm_output, verbose = TRUE, echo = TRUE, ...)

Arguments

glm_output

An object from the class glm, for example the output from the glm function

verbose

If TRUE, print progress at every step.

echo

If TRUE, return the input glm_output.

...

Additional arguments.

Value

A list with elements

glm_output

If echo = TRUE, returns the input glm_output.

gamma_hat

Estimated signal strength \(\gamma_0\).

param

Estimated paramters \((\alpha_\star, \lambda_\star, \sigma_\star, b_\star)\). If the model does not contain an intercept, returns \((\alpha_\star, \lambda_\star, \sigma_\star)\).

intercept

Does the model contain an intercept?

tau_hat

Estimated conditional standard deviation.

coef_adj

Adjusted MLE: this is \(\hat{\beta}^{\mathrm{MLE}} / \alpha_\star\).

std_adj

Estimated standard error of \(\hat{\beta}_j\): \(\sigma_\star / \tau_j\).

coef_unadj

Unadjusted MLE.

std_unadj

Unadjusted standard error. This is output from the glm function.

Details

This function estimates coefficients of a binary regression by adjusting the MLE based on the theory when \(n\) and \(p\) grows to infinity, and the covariates are multivariate Gaussian. In this setting, the MLE of one variable \(\beta_j\) is $$ \hat{\beta}^{\mathrm{MLE}}_j - \alpha_\star \beta_j \approx \mathcal{N}(0, \sigma_\star^2 / \tau_j^2). $$ where \(\tau_j^2\) is the conditional variance of \(x_j\) given all the other variables. This implies \(\hat{\beta}_j^\mathrm{Adj} = \hat{\beta}^{MLE}_j / \alpha_\star\) is unbiased for the coefficient \(\beta_j\). adjust_binary computes \(\hat{\beta}_j^\mathrm{Adj}\) and returns it in the vector coef_adj.

Additional arguments

B

Number of repetitions at each kappa in ProbeFrontier algorithm.

tol

tol parameter when solving for signal strength and intercept. see signal_strength.

Notes

  • The variables should be centered to have zero mean.

  • The input to this function should be an object from the class glm, and it should include family specifying the link function. Currently, the algorithm can handle logit and probit link.

  • if estimated \(\hat{\gamma}<0.0001\), we set \(\gamma=0\).

References

The Asymptotic Distribution of the MLE in High-dimensional Logistic Models: Arbitrary Covariance, Qian Zhao, Pragya Sur and Emmanuel J. Candes, arXiv:2001.09351

Examples

# Problem size
n <- 1000L
p <- 300L
# Generate data matrix
X <- matrix(rnorm(n*p, 0, 1), n, p) / sqrt(p)
# Sample parameters
beta <- rep(c(0, 1), p / 2) * 2
# Generate response
Y <- rbinom(n, 1, 1 / (1 + exp(- X %*% beta)))
fit <- glm(Y ~ X + 0, family = binomial, x = TRUE, y = TRUE)
adjusted_fit <- adjust_binary(fit)
#> Link function:  logit 
#> ------ Begin Probe Frontior ------
#> kappa =  0.405 ; pi_hat =  0.3 
#> kappa =  0.4575 ; pi_hat =  1 
#> kappa =  0.43125 ; pi_hat =  0.7 
#> kappa =  0.418125 ; pi_hat =  0.7 
#> kappa =  0.4115625 ; pi_hat =  0.5 
#> kappa =  0.4148438 ; pi_hat =  0.6 
#> kappa =  0.4132031 ; pi_hat =  0.6 
#> kappa =  0.4123828 ; pi_hat =  0.7 
#> Found kappa_s =  0.4097962 
#> ------- 
#> -- Finding signal strength and the intercept -- 
#>  No intercept, estimated signal strength gamma_hat =  1.267636 
#>  ---- 
#> Found parameters:
#>  alpha_s lambda_s  sigma_s 
#> 1.766048 4.471162 2.795273 
# Adjusted MLE
head(adjusted_fit$coef_unadj)
#>        X1        X2        X3        X4        X5        X6 
#> -2.044095  1.215670  3.915436  6.992635 -5.281657  9.027532