# glmhd: statistical inference in high-dimensional binary regression

#### Qian Zhao

#### 2023-03-29

Source:`vignettes/my-vignette.Rmd`

`my-vignette.Rmd`

In this vignette we will show you how to use functions in the
`glmhd`

package to estimate the bias and variance of high
dimensional logistic MLE (maximum likelihood estimates). “High
dimension” refers to the setting when the number of observations \(n\) and number of variables \(p\) are both large.

#### Warm-up

As a recurring example, we consider a logistic regression model.

```
# Sample from a logistic model
# n - number of samples
# beta - coefficient vector
# R - cholesky decomposition of the covariance matrix (R^t R) of the variables
# adjust - if TRUE, computes the adjusted p-value for H0: beta1 = 0
sample_logistic <- function(n, beta, R, adjust = FALSE){
p <- length(beta) # number of variables
# Generate data matrix
X <- matrix(rnorm(n * p, 0, 1), n, p) %*% R / sqrt(n)
# Sample response
Y <- rbinom(n, 1, 1 / (1 + exp(- X %*% beta)))
# Logistic regression
fit <- glm(Y ~ X + 0, family = binomial, x = TRUE, y = TRUE)
if(!adjust){
# Returns the p-th fitted coefficient and its standard error estimate
c(fit$coef[p], summary(fit)$coef[p, 2])
}else{
# Returns the adjusted p-value for the first coordinate
adjusted_fit <- adjust_glm(fit, verbose = FALSE, echo = TRUE)
summary(adjusted_fit)$coef[1,4]
}
}
```

We pick \(n = 1000\) and \(p = 200\), and set the variables to be from multivariate Gaussian from a AR(1) model with \(\rho = 0.5\). For simplicity, we fix the the first half variables to be null and the second half have fixed coefficients \(\beta = 4\).

We repeat this function \(B = 200\) times, and draw a histogram of the MLE of the \(p\)th coefficient.

`beta_hat <- replicate(B <- 200, sample_logistic(n, beta, R))`

As we can see, the estimated \(\hat{\beta}\) is not centered at the true value \(\beta = 4\) (black line). The average \(\hat{\beta}\) is at 5.247, which is larger. If we look at a single simulation below, the estimated standard error of \(\hat{\beta}_p\) is 4.175, which is smaller than the observed standard deviation 5.434 from the independent samples.

```
X <- matrix(rnorm(n * p, 0, 1), n, p) %*% R / sqrt(n)
Y <- rbinom(n, 1, 1 / (1 + exp(- X %*% beta)))
fit <- glm(Y ~ X + 0, family = binomial)
```

```
summary(fit)$coef[p,2] # estimated std. from inverse Fisher information
#> [1] 4.17
```

In general, as \(n, p\to\infty\) at
a fixed ratio \(p / n\to \kappa \in (0,
1)\), and if the variables \(X\)
to be multivariate Gaussian, the MLE of any coordinate \(\beta_j\) satisfies \[
\frac{\hat{\beta}_j^\mathrm{MLE} - \alpha_\star \beta_j}{\sigma_\star /
\tau_j} \stackrel{d}{\longrightarrow} \mathcal{N}(0, 1),
\] where \(\tau^2_j =
\mathrm{Var}(x_j\,|\,x_{-j})\) is the conditional variance of
\(j\)th variable given all the others
[1]. These parameters \((\alpha_\star,
\sigma_\star)\) can be computed given \(\kappa = p/n\) and \(\gamma = \sqrt{\mathrm{Var}(X^\top
\beta)}\). In this example, \[
\kappa = 0.2, \quad \gamma = 2.18,
\] and the corresponding \((\alpha_\star, \sigma_\star / \tau_p) =
(1.49,5.37)\). We overlay the theoretical density curve of \(\hat{\beta}_p\) in the above histogram
(blue curve), which agree well with the obeserved histogram overall,
except for a few large \(\hat{\beta}_p\) in simulations. In reality,
the parameter \(\gamma\) is typically
unknown, so in the next section we will use `glmhd`

package
to estimate it.

#### Adjusting the MLE with glmhd

The goal of `glmhd`

package is to estimate the inflation
and standard deviation of the MLE more accurately than the classical
theory. We use the *ProbeFrontier* algorithm [2] to estimate the
signal strength parameter \(\gamma\),
and then we estimate the inflation and standard deviation of the MLE.
Now let’s load the library.

We repeat the previous example and compute the MLE. Note that we
specify `x = TRUE`

and `y = TRUE`

here so the
original data matrix is returned in `fit`

.

```
# fit the MLE again
fit <- glm(Y ~ X + 0, family = binomial, x = TRUE, y = TRUE)
```

Now we call the function `adjust_glm`

to compute the
adjusted MLE and standard error. There are three arguments for this
function, but you only need to input the first one, the output
`glm`

object. Below, the second argument and third arguments
are set at their default values, `verbose`

specifies if
progress should be printed, and `echo = TRUE`

means that the
input `glm`

object is also returned in the output.
Setting`verbose = TRUE`

helps to track progress of the code,
and identify issues if the algorithm fails.

`adjusted_fit <- adjust_glm(fit, verbose = FALSE, echo = TRUE)`

You can access the estimated parameters through
`adjusted_fit$param`

. In this vector,
`adjusted_fit$param["alpha_s"]`

is estimated \(\alpha_\star\) and
`adjusted_fit$param["sigma_s"]`

is estimated \(\sqrt{\kappa}\sigma_\star\). For example,
you can compare the estimated inflation 1.416 with the observed value
1.312.

```
adjusted_fit$param["alpha_s"] # Estimated inflation
#> alpha_s
#> 1.42
adjusted_fit$param["sigma_s"] / sqrt(p/n) # estimated sigma_star
#> sigma_s
#> 4.13
```

You can obtain the adjusted MLE \(\hat{\beta}^\mathrm{Adj}_p = \hat{\beta}_p^\mathrm{MLE} / \alpha_\star\) and the adjusted standard error estimate \(\sigma_\star / \tau_j\) as follows.

```
adjusted_fit$coef_adj[p] # Adjusted MLE
#> X200
#> 4.17
adjusted_fit$std_adj[p] # Adjusted std.
#> X200
#> 4.77
```

Next, `summary(adjusted_fit)`

prints the coefficient
table. The first column is the adjusted MLE, the second column is \(\hat{\sigma}_j / \alpha_\star\), because
the adjusted MLE satisfies \[
\frac{\hat{\beta}_j^\mathrm{Adj} - \beta_j}{\hat{\sigma}_j /
\alpha_\star} \approx \mathcal{N}(0, 1),
\] for \(\hat{\sigma}_j = \sigma_\star
/ \tau_j\) the adjusted std. estimate.

```
print(summary(adjusted_fit), max = 40)
#> Call:
#> adjust_glm(glm_output = fit, verbose = FALSE, echo = TRUE)
#>
#> adjusted_mle std z.value p.value
#> X1 3.6837 3.3982 1.08 0.2784
#> X2 -5.0466 3.8409 -1.31 0.1889
#> X3 2.2015 3.7254 0.59 0.5546
#> X4 1.7436 3.8513 0.45 0.6508
#> X5 -3.0104 3.6758 -0.82 0.4128
#> X6 -3.5051 3.8287 -0.92 0.3599
#> X7 6.1912 3.5640 1.74 0.0824 .
#> X8 -6.4551 3.6305 -1.78 0.0754 .
#> [ reached getOption("max.print") -- omitted 192 rows ]
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

#### Hypothesis testing

The `summary`

function prints the adjusted MLE \(\hat{\beta}^\mathrm{Adj}\) and its std.
\(\hat{\sigma} / \alpha_\star\), as
well as a z-statistics \(z =
\hat{\beta}/\hat{\sigma}^\mathrm{MLE}\) and a two-sided p-value
\(\mathrm{P}(|\mathcal{N}(0,1)|\geq
|z|)\). If the variables are multivariate Gaussian and the true
model is logistic, then this p-value is from a Uniform[0,1] distribution
under the null hypothesis that \(\beta_j =
0\).

The following is a simulation to examine whether the calculated p-value is approximately uniform. For simplicity, we consider the first variable, which is a null, and the function is also included at the beginning.

`# pval_adj <- replicate(B <- 100, sample_logistic(n, p, R, adjust = TRUE))`

You can use the likelihood ratio test (LRT) by calling the function
`lrt_glm`

. It requires two arguments, the first is a list of
`glm`

fits and the second is the `param`

output
from `adjust_glm`

. It prints the likelihood ratio statistics
and p-values computed according to the rescaled chi-squared
distribution. Under the null hypothesis \[
H_0:\quad \beta_1 = \ldots = \beta_k = 0,
\] two times the likelihood ratio statistics \(\Lambda\) is asymptotically a re-scaled
chi-squared variable with \(k\) degrees
of freedom \[
2 \Lambda \stackrel{d}{\longrightarrow} \frac{\kappa_\star
\sigma_\star^2}{\lambda_\star}\chi^2_k.
\]

```
f1 <- fit
f2 <- glm(Y ~ X[, -1] + 0, family = binomial, x = TRUE, y = TRUE)
# P-values for the LRT
lrt <- lrt_glm(list(f1, f2), param = adjusted_fit$param)
print(lrt)
#> $models
#> $models[[1]]
#> Y ~ X + 0
#>
#> $models[[2]]
#> Y ~ X[, -1] + 0
#>
#>
#> $anova.tab
#> Resid.Df Resid.Dev Df Deviance p.value
#> 1 801 698 NA NA NA
#> 2 800 697 1 1.43 0.314
#>
#> attr(,"class")
#> [1] "lrt_adj"
```

#### Model with an intercept

The previous example concerns a logistic model without an intercept, you can use the same function call for a model with an intercept. This is a separate section because the theory justification for this algorithm is still under study. The algorithm implemented in this package is based on the following conjecture: \[ \frac{\hat{\beta}_j^\mathrm{MLE} - \alpha_\star \beta_j}{\sigma_\star / \tau_j} \stackrel{d}{\longrightarrow} \mathcal{N}(0, 1), \] and the MLE of the intercept converges to some fixed quantity \[ \hat{\beta}_0 \stackrel{p}{\longrightarrow} b_\star. \] The adjusted MLE is \(\hat{\beta}^\mathrm{Adj} = \hat{\beta} / \alpha_\star\) and adjusted std. is \(\sigma_\star / \tau_j\).

Consider an example where the intercept term \(\beta_0 = - 0.5\). In this example, we can
compute the \((\alpha_\star,
\sigma_\star)\) (they depend on \(\kappa = p/n\), \(\gamma\) and true intercept \(\beta_0\)) and compare them with the
estimates from independent samples. Notice that the input argument
`beta_0`

for `find_param`

function is the absolute
value of \(\beta_0\).

```
gamma <- sqrt(sum((R %*% beta)^2/n)) # signal strength
params <- find_param(kappa = p/n, gamma = gamma, beta0 = 0.5, verbose = FALSE)
params[1] # true alpha_star
#> [1] 1.5
params[3] / sqrt(p/n) # true sigma_star
#> [1] 4.74
```

```
sample_logistic_intercept <- function(n, beta, R){
p <- length(beta)
X <- matrix(rnorm(n * p, 0, 1), n, p) %*% R / sqrt(n)
Y <- rbinom(n, 1, 1 / (1 + exp(- X %*% beta + 0.5)))
fit <- glm(Y ~ X, family = binomial)
# Returns the p-th fitted coefficient and its standard error
c(fit$coef[p+1], summary(fit)$coef[p+1, 2])
}
```

`beta_hat <- replicate(B, sample_logistic_intercept(n, beta, R))`

The observed inflation in 200 samples is 1.483 and observed standard deviation is 5.395. The theoretical density curve of \(\hat{\beta}_p\) is shown in blue, and it agrees well with the histogram of \(\hat{\beta}_p\).

You can use the same function `adjust_glm`

to compute the
adjusted MLE \(\hat{\beta}_j^\mathrm{Adj} =
\hat{\beta}_j^\mathrm{MLE} / \alpha_\star\) and adjusted standard
error \(\sigma_\star / \tau_j\) without
knowing the true parameters.

```
X <- matrix(rnorm(n * p, 0, 1), n, p) %*% R / sqrt(n)
Y <- rbinom(n, 1, 1 / (1 + exp(- X %*% beta + 0.5)))
fit <- glm(Y ~ X, family = binomial, x = TRUE, y = TRUE)
```

```
# Adjust MLE coefficients
adjusted_fit <- adjust_glm(fit, verbose = FALSE)
```

`adjusted_fit$intercept`

returns the estimated intercept,
which you can compare with the MLE -0.959.

```
adjusted_fit$intercept
#> [1] -0.852
```

You can also find the estimated inflation and standard deviation like before.

```
# Estimated inflation and std.
adjusted_fit$param["alpha_s"]
#> alpha_s
#> 1.5
adjusted_fit$param["sigma_s"] / sqrt(p/n)
#> sigma_s
#> 4.76
# Adjusted MLE
adjusted_fit$coef_adj[p]
#> X200
#> 4.02
# Adjusted std.
adjusted_fit$std_adj[p]
#> X200
#> 5.55
```

#### Other link functions

If you are working with a probit model, you can use the
`glm`

function with
`family = binomial(link = "probit")`

, and use the
`glm`

object as input to `adjust_glm`

It
automatically identifies the link function from the `family`

object.

#### References

[1] *A modern maximum-likelihood theory for high-dimensional
logistic regression*, Pragya Sur and Emmanuel J. Candes, Proceedings
of the National Academy of Sciences Jul 2019, 116 (29) 14516-14525

[2] *The Asymptotic Distribution of the MLE in High-dimensional
Logistic Models: Arbitrary Covariance*, Qian Zhao, Pragya Sur and
Emmanuel J. Candes, arXiv:2001.09351