glmnet
Glmnet is a package that fits generalized linear and similar models
via penalized maximum likelihood. The regularization path is computed
for the lasso or elastic net penalty at a grid of values (on the log
scale) for the regularization parameter lambda. The algorithm is
extremely fast, and can exploit sparsity in the input matrix
x
. It fits linear, logistic, multinomial, Poisson, and
Cox regression models. It can also fit multi-response linear regression,
generalized linear models for custom families, and relaxed lasso
regression models. The package includes methods for prediction and
plotting, and functions for cross-validation.
The authors of glmnet are Jerome Friedman, Trevor Hastie, Rob Tibshirani, Balasubramanian Narasimhan, Kenneth Tay and Noah Simon, with contributions from Junyang Qian, and the R package is maintained by Trevor Hastie. A MATLAB version of glmnet is maintained by Junyang Qian, and a Python version by B. Balakumar (although both are a few versions behind).
This vignette describes basic usage of glmnet in R. There are additional vignettes that should be useful:
“GLM family functions in glmnet” describes how to fit custom generalized linear models (GLMs) with the elastic net penalty via the family argument.
“The Relaxed Lasso” describes how to fit relaxed lasso regression models via the relax argument.
glmnet
solves the problem
$$ \min_{\beta_0,\beta} \frac{1}{N} \sum_{i=1}^{N} w_i l(y_i,\beta_0+\beta^T x_i) + \lambda\left[(1-\alpha)\|\beta\|_2^2/2 + \alpha \|\beta\|_1\right], $$
over a grid of values of λ covering the entire range of possible solutions. Here l(yi, ηi) is the negative log-likelihood contribution for observation i; e.g. for the Gaussian case it is $\frac{1}{2}(y_i-\eta_i)^2$. The elastic net penalty is controlled by α, and bridges the gap between lasso regression (α = 1, the default) and ridge regression (α = 0). The tuning parameter λ controls the overall strength of the penalty.
It is known that the ridge penalty shrinks the coefficients of correlated predictors towards each other while the lasso tends to pick one of them and discard the others. The elastic net penalty mixes these two: if predictors are correlated in groups, an α = 0.5 tends to either select or leave out the entire group of features. This is a higher level parameter, and users might pick a value upfront or experiment with a few different values. One use of α is for numerical stability; for example, the elastic net with α = 1 − ϵ for some small ϵ > 0 performs much like the lasso, but removes any degeneracies and wild behavior caused by extreme correlations.
The glmnet
algorithms use cyclical coordinate descent,
which successively optimizes the objective function over each parameter
with others fixed, and cycles repeatedly until convergence. The package
also makes use of the strong rules for efficient restriction of the
active set. Due to highly efficient updates and techniques such as warm
starts and active-set convergence, our algorithms can compute the
solution path very quickly.
The code can handle sparse input-matrix formats, as well as range
constraints on coefficients. The core of glmnet
is a set of
Fortran subroutines, which make for very fast execution.
The theory and algorithms in this implementation are described in Friedman, Hastie, and Tibshirani (2010), Simon et al. (2011), Tibshirani et al. (2012) and Simon, Friedman, and Hastie (2013).
Like many other R packages, the simplest way to obtain
glmnet
is to install it directly from CRAN. Type the
following command in R console:
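A typical invocation looks like the following (the CRAN mirror URL shown is just one possible choice):

```r
# Install glmnet from CRAN; any CRAN mirror may be used for repos
install.packages("glmnet", repos = "https://cran.us.r-project.org")
```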
Users may change the repos
argument depending on their
locations and preferences. Other arguments, such as the directory in which to install the package, can also be changed in the command. For more details,
see help(install.packages)
. Alternatively, users can
download the package source from CRAN and type Unix
commands to install it to the desired location.
The purpose of this section is to give users a general sense of the package. We will briefly go over the main functions, basic operations and outputs. After this section, users may have a better idea of what functions are available, which ones to use, or at least where to seek help.
First, we load the glmnet
package:
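In an R session:

```r
library(glmnet)
```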
The default model used in the package is the Gaussian linear model or “least squares”, which we will demonstrate in this section. We load a set of data created beforehand for illustration:
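Assuming the QuickStartExample data bundled with glmnet, the load step might look like:

```r
# QuickStartExample ships with the glmnet package
data(QuickStartExample)
x <- QuickStartExample$x
y <- QuickStartExample$y
```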
The command loads an input matrix x
and a response
vector y
from this saved R data archive.
We fit the model using the most basic call to
glmnet
.
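For example:

```r
fit <- glmnet(x, y)
```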
fit
is an object of class glmnet
that
contains all the relevant information of the fitted model for further
use. We do not encourage users to extract the components directly.
Instead, various methods are provided for the object such as
plot
, print
, coef
and
predict
that enable us to execute those tasks more
elegantly.
We can visualize the coefficients by executing the plot
method:
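That is:

```r
plot(fit)   # plot(fit, label = TRUE) would annotate each curve
```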
Each curve corresponds to a variable. It shows the path of its
coefficient against the ℓ1-norm of the whole coefficient
vector as λ varies. The axis
above indicates the number of nonzero coefficients at the current λ, which is the effective degrees of
freedom (df) for the lasso. Users may also wish to annotate the
curves: this can be done by setting label = TRUE
in the
plot command.
A summary of the glmnet
path at each step is displayed
if we just enter the object name or use the print
function:
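For example:

```r
print(fit)
```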
##
## Call: glmnet(x = x, y = y)
##
## Df %Dev Lambda
## 1 0 0.00 1.63100
## 2 2 5.53 1.48600
## 3 2 14.59 1.35400
## 4 2 22.11 1.23400
## 5 2 28.36 1.12400
## 6 2 33.54 1.02400
....
It shows from left to right the number of nonzero coefficients
(Df
), the percent (of null) deviance explained
(%dev
) and the value of λ (Lambda
). Although
glmnet
fits the model for 100 values of lambda
by default, it stops early if %dev does not change sufficiently from one lambda to the next (typically near the end of the path). Here we have truncated the printout for brevity.
We can obtain the model coefficients at one or more λ’s within the range of the sequence:
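For instance, coefficients at a single value s = 0.1 (the choice of 0.1 here is an assumption consistent with the output below) can be obtained with:

```r
coef(fit, s = 0.1)
```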
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.150928072
## V1 1.320597195
## V2 .
## V3 0.675110234
## V4 .
## V5 -0.817411518
## V6 0.521436671
## V7 0.004829335
....
(Why s
and not lambda
? In case we want to
allow one to specify the model size in other ways in the future.) Users
can also make predictions at specific λ’s with new input data:
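A sketch of such a call; the new-data matrix nx is an assumption (any 5 x 20 matrix works with this example):

```r
# predictions for 5 new observations at two values of lambda
set.seed(29)
nx <- matrix(rnorm(5 * 20), 5, 20)
predict(fit, newx = nx, s = c(0.1, 0.05))
```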
## s1 s2
## [1,] -4.3067990 -4.5979456
## [2,] -4.1244091 -4.3447727
## [3,] -0.1133939 -0.1859237
## [4,] 3.3458748 3.5270269
## [5,] -1.2366422 -1.2772955
The function glmnet
returns a sequence of models for the
users to choose from. In many cases, users may prefer the software to
select one of them. Cross-validation is perhaps the simplest and most
widely used method for that task. cv.glmnet
is the main
function to do cross-validation here, along with various supporting
methods such as plotting and prediction.
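A basic cross-validated fit looks like:

```r
cvfit <- cv.glmnet(x, y)
```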
cv.glmnet
returns a cv.glmnet
object, a
list with all the ingredients of the cross-validated fit. As with
glmnet
, we do not encourage users to extract the components
directly except for viewing the selected values of λ. The package provides
well-designed functions for potential tasks. For example, we can plot
the object:
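That is:

```r
plot(cvfit)
```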
This plots the cross-validation curve (red dotted line) along with
upper and lower standard deviation curves along the λ sequence (error bars). Two special
values along the λ sequence
are indicated by the vertical dotted lines. lambda.min
is
the value of λ that gives
minimum mean cross-validated error, while lambda.1se
is the
value of λ that gives the most
regularized model such that the cross-validated error is within one
standard error of the minimum.
We can use the following code to get the value of
lambda.min
and the model coefficients at that value of
λ:
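Assuming the cross-validated fit is stored in cvfit:

```r
cvfit$lambda.min
coef(cvfit, s = "lambda.min")
```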
## [1] 0.06284188
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.145832036
## V1 1.340981414
## V2 .
## V3 0.708347140
## V4 .
## V5 -0.848087765
## V6 0.554823782
## V7 0.038519738
....
To get the corresponding values at lambda.1se
, simply
replace lambda.min
with lambda.1se
above, or
omit the s
argument, since lambda.1se
is the
default.
Note that the coefficients are represented in the sparse matrix
format. This is because the solutions along the regularization path are
often sparse, and hence it is more efficient in time and space to use a
sparse format. If you prefer non-sparse format, pipe the output through
as.matrix()
.
Predictions can be made based on the fitted cv.glmnet
object as well. The code below gives predictions for the new input
matrix newx
at lambda.min
:
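A sketch, using the first five rows of x as the new input matrix:

```r
predict(cvfit, newx = x[1:5, ], s = "lambda.min")
```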
## lambda.min
## [1,] -1.3574653
## [2,] 2.5776672
## [3,] 0.5846421
## [4,] 2.0280562
## [5,] 1.5780633
This concludes glmnet
101. With the tools introduced so
far, users are able to fit the entire elastic net family, including
ridge regression, using squared-error loss. There are many more
arguments in the package that give users a great deal of flexibility. To
learn more, move on to later sections.
family = "gaussian" (default)
"gaussian"
is the default family
argument
for the function glmnet
. Suppose we have observations xi ∈ ℝp
and the responses yi ∈ ℝ, i = 1, …, N.
The objective function for the Gaussian family is $$
\min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}}\frac{1}{2N} \sum_{i=1}^N
(y_i -\beta_0-x_i^T \beta)^2+\lambda \left[ (1-\alpha)\|\beta\|_2^2/2 +
\alpha\|\beta\|_1\right],
$$ where λ ≥ 0 is a
complexity parameter and 0 ≤ α ≤ 1 is a compromise between
ridge regression (α = 0) and
lasso regression (α = 1).
glmnet
applies coordinate descent to solve the problem.
Specifically, suppose we have current estimates $\tilde{\beta_0}$ and β̃ℓ ∀ℓ ∈ 1, …, p. By computing the
gradient at βj = β̃j
and simple calculus, the update is $$
\tilde{\beta}_j \leftarrow \frac{S(\frac{1}{N}\sum_{i=1}^N
x_{ij}(y_i-\tilde{y}_i^{(j)}),\lambda \alpha)}{1+\lambda(1-\alpha)},
$$ where ỹi(j) = β̃0 + ∑ℓ ≠ jxiℓβ̃ℓ,
and S(z, γ)
is the soft-thresholding operator with value sign(z)(|z| − γ)+.
This formula above applies when the x
variables are
standardized to have unit variance (the default); it is slightly more
complicated when they are not. Note that for
family = "gaussian"
, glmnet
standardizes y to have unit variance before
computing its lambda
sequence (and then unstandardizes the
resulting coefficients). If you wish to reproduce or compare results
with other software, it is best to supply a standardized y first (using the “1/N” variance formula).
glmnet
provides various arguments for users to customize
the fit: we introduce some commonly used arguments here. (For more
information, type ?glmnet
.)
alpha
is for the elastic net mixing parameter α, with range α ∈ [0, 1]. α = 1 is lasso regression (default)
and α = 0 is ridge
regression.
weights
is for the observation weights, default is 1
for each observation. (Note: glmnet
rescales the weights
internally to sum to N, the sample size.)
nlambda
is the number of λ values in the sequence (default is
100).
lambda
can be provided if the user wants to specify
the lambda sequence, but typical usage is for the program to construct
the lambda sequence on its own. When automatically generated, the λ sequence is determined by
lambda.max
and lambda.min.ratio
. The latter is
the ratio of smallest value of the generated λ sequence (say
lambda.min
) to lambda.max
. The program
generates nlambda
values linear on the log scale from
lambda.max
down to lambda.min
.
lambda.max
is not user-specified but is computed from the
input x and y: it is the smallest value for
lambda
such that all the coefficients are zero. (For alpha = 0 (ridge), lambda.max would be ∞; in this case we pick a value corresponding to a small value of alpha close to zero.)
standardize
is a logical flag for x
variable standardization prior to fitting the model sequence. The
coefficients are always returned on the original scale. Default is
standardize = TRUE
.
As an example, we set α = 0.2 (more like a ridge
regression), and give double weight to the latter half of the
observations. We set nlambda
to 20 so that the model fit is
only computed for 20 values of λ. In practice, we recommend
nlambda
to be 100 (default) or more. In most cases, it does
not come with extra cost because of the warm-starts used in the
algorithm, and for nonlinear models leads to better convergence
properties.
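A call consistent with this description; the weights vector wts constructed here assumes 100 observations, as in the quick-start data:

```r
# double weight for the second half of the observations
wts <- c(rep(1, 50), rep(2, 50))
fit <- glmnet(x, y, alpha = 0.2, weights = wts, nlambda = 20)
```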
We can then print the glmnet
object:
##
## Call: glmnet(x = x, y = y, weights = wts, alpha = 0.2, nlambda = 20)
##
## Df %Dev Lambda
## 1 0 0.00 7.9390
## 2 4 17.89 4.8890
## 3 7 44.45 3.0110
## 4 7 65.67 1.8540
## 5 8 78.50 1.1420
## 6 9 85.39 0.7033
## 7 10 88.67 0.4331
## 8 11 90.25 0.2667
## 9 14 91.01 0.1643
## 10 17 91.38 0.1012
## 11 17 91.54 0.0623
## 12 17 91.60 0.0384
## 13 19 91.63 0.0236
## 14 20 91.64 0.0146
## 15 20 91.64 0.0090
## 16 20 91.65 0.0055
## 17 20 91.65 0.0034
This displays the call that produced the object fit
and
a three-column matrix with columns Df
(the number of
nonzero coefficients), %dev
(the percent deviance
explained) and Lambda
(the corresponding value of λ). (The digits
argument can be used to specify significant digits in the printout.)
Here the actual number of λ’s is less than that specified in
the call. This is because of the algorithm’s stopping criteria.
According to the default internal settings, the computations stop if
either the fractional change in deviance down the path is less than
10−5 or the fraction of
explained deviance reaches 0.999. From
the last few lines of the output, we see the fraction of deviance does
not change much and therefore the computation ends before all 20
models are fit. The internal parameters governing the stopping criteria
can be changed. For details, see the Appendix section or type
help(glmnet.control)
.
Predicting and plotting with glmnet objects
We can extract the coefficients and make predictions for a
glmnet
object at certain values of λ. Two commonly used arguments
are:
s
for specifying the value(s) of λ at which to extract
coefficients/predictions.
exact
for indicating whether the exact values of
coefficients are desired or not. If exact = TRUE
and
predictions are to be made at values of s
not included in
the original fit, these values of s
are merged with
object$lambda
and the model is refit before predictions are
made. If exact = FALSE
(default), then the
predict
function uses linear interpolation to make
predictions for values of s
that do not coincide with
lambdas used in the fitting algorithm.
Here is a simple example illustrating the use of both these function arguments:
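The FALSE below presumably comes from checking whether 0.5 appears in the fitted lambda sequence; a sketch of such a check:

```r
fit <- glmnet(x, y)
any(fit$lambda == 0.5)  # is s = 0.5 in the fitted lambda sequence?
```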
## [1] FALSE
coef.apprx <- coef(fit, s = 0.5, exact = FALSE)
coef.exact <- coef(fit, s = 0.5, exact = TRUE, x=x, y=y)
cbind2(coef.exact[which(coef.exact != 0)],
coef.apprx[which(coef.apprx != 0)])
## [,1] [,2]
## [1,] 0.2613110 0.2613110
## [2,] 1.0055470 1.0055473
## [3,] 0.2677140 0.2677134
## [4,] -0.4476485 -0.4476475
## [5,] 0.2379287 0.2379283
## [6,] -0.8230862 -0.8230865
## [7,] -0.5553678 -0.5553675
The left and right columns show the coefficients for
exact = TRUE
and exact = FALSE
respectively.
(For brevity we only show the non-zero coefficients.) We see from the
above that 0.5 is not in the sequence and hence there are some
small differences in coefficient values. Linear interpolation is usually
accurate enough if there are no special requirements. Notice that with
exact = TRUE
we have to supply by named argument any data
that was used in creating the original fit, in this case x
and y
.
Users can make predictions from the fitted glmnet
object. In addition to the arguments in coef
, the primary
argument is newx
, a matrix of new values for x
at which predictions are desired. The type
argument allows
users to choose the type of prediction returned:
“link” returns the fitted values (i.e. β̂0 + xiTβ̂)
“response” gives the same output as “link” for “gaussian” family.
“coefficients” returns the model coefficients.
“nonzero” returns a list of the indices of the nonzero coefficients for each value of s.
For example, the following code gives the fitted values for the first 5 observations at λ = 0.05:
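A sketch of such a call, using the first five rows of x as the new data:

```r
predict(fit, newx = x[1:5, ], type = "response", s = 0.05)
```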
## s1
## [1,] -1.3362652
## [2,] 2.5894245
## [3,] 0.5872868
## [4,] 2.0977222
## [5,] 1.6436280
If multiple values of s
are supplied, a matrix of
predictions is produced. If no value of s
is supplied, a
matrix of predictions is returned, with columns corresponding to all the
lambdas used in the fit.
We can plot the fitted object as in the Quick Start section. Here we
walk through more arguments for the plot
function. The
xvar
argument allows users to decide what is plotted on the
x
-axis. xvar
allows three measures: “norm” for
the ℓ1-norm of the
coefficients (default), “lambda” for the log-lambda value and “dev” for
%deviance explained. Users can also label the curves with the variable
index numbers simply by setting label = TRUE
.
For example, let’s plot fit
against the log-lambda value
and with each curve labeled:
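That is:

```r
plot(fit, xvar = "lambda", label = TRUE)
```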
Now when we plot against %deviance we get a very different picture. This is the percent deviance explained on the training data, and is a measure of the complexity of the model. We see that toward the end of the path, %deviance does not change much but the coefficients are “blowing up” a bit. This enables us to focus attention on the parts of the fit that matter. This is especially true for other models, such as logistic regression.
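The corresponding call is:

```r
plot(fit, xvar = "dev", label = TRUE)
```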
K-fold cross-validation can be performed using the
cv.glmnet
function. In addition to all the
glmnet
parameters, cv.glmnet
has its special
parameters including nfolds
(the number of folds),
foldid
(user-supplied folds), and
type.measure
(the loss used for cross-validation):
“deviance” or “mse” for squared loss, and “mae” for mean absolute error.
As an example, we run 20-fold cross-validation based on the mean squared error criterion (the default for the “gaussian” family). Printing the resulting object gives some basic information on the cross-validation performed:
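Such a call looks like (cvfit is the name assumed for the returned object):

```r
cvfit <- cv.glmnet(x, y, type.measure = "mse", nfolds = 20)
```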
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", nfolds = 20)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.07569 34 1.063 0.1135 9
## 1se 0.14517 27 1.164 0.1297 8
cv.glmnet
also supports parallel computing. To make it
work, users must register a parallel backend beforehand. We give a simple example
of comparison here. Unfortunately, the package doMC
is not
available on Windows platforms (it is on others), so we cannot run the
code here, but we present timing information recorded during one of our
test runs.
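A sketch of such a comparison on a platform where doMC is available (the core count is arbitrary):

```r
library(doMC)
registerDoMC(cores = 2)                          # register a parallel backend first
system.time(cv.glmnet(x, y))                     # sequential
system.time(cv.glmnet(x, y, parallel = TRUE))    # parallel
```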
## user system elapsed
## 2.440 0.080 2.518
## user system elapsed
## 2.450 0.157 1.567
As suggested from the above, parallel computing can significantly speed up the computation process especially for large-scale problems.
The coef
and predict
methods for
cv.glmnet
objects are similar to those for a
glmnet
object, except that two special strings are also
supported by s
(the values of λ requested):
“lambda.min”: the λ at which the smallest MSE is achieved.
“lambda.1se”: the largest λ at which the MSE is within one standard error of the smallest MSE (default).
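For example, with cvfit as above and the first five rows of x serving as new data:

```r
cvfit$lambda.min
predict(cvfit, newx = x[1:5, ], s = "lambda.min")
coef(cvfit, s = "lambda.min")
```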
## [1] 0.07569327
## lambda.min
## [1,] -1.3638848
## [2,] 2.5713428
## [3,] 0.5729785
## [4,] 1.9881422
## [5,] 1.5179882
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.14867414
## V1 1.33377821
## V2 .
## V3 0.69787701
## V4 .
## V5 -0.83726751
## V6 0.54334327
## V7 0.02668633
....
Users can explicitly control the fold that each observation is
assigned to via the foldid
argument. This is useful, for
example, in using cross-validation to select a value for α:
foldid <- sample(1:10, size = length(y), replace = TRUE)
cv1 <- cv.glmnet(x, y, foldid = foldid, alpha = 1)
cv.5 <- cv.glmnet(x, y, foldid = foldid, alpha = 0.5)
cv0 <- cv.glmnet(x, y, foldid = foldid, alpha = 0)
There are no built-in plot functions to put them all on the same plot, so we are on our own here:
par(mfrow = c(2,2))
plot(cv1); plot(cv.5); plot(cv0)
plot(log(cv1$lambda) , cv1$cvm , pch = 19, col = "red",
xlab = "log(Lambda)", ylab = cv1$name)
points(log(cv.5$lambda), cv.5$cvm, pch = 19, col = "grey")
points(log(cv0$lambda) , cv0$cvm , pch = 19, col = "blue")
legend("topleft", legend = c("alpha= 1", "alpha= .5", "alpha 0"),
pch = 19, col = c("red","grey","blue"))
We see that the lasso (alpha=1
) does about the best
here. We also see that the range of lambdas used differs with
alpha
.
In this section we briefly describe some other useful arguments when
calling glmnet
: upper.limits
,
lower.limits
, penalty.factor
,
exclude
and intercept
.
Suppose we want to fit our model but limit the coefficients to be
bigger than -0.7 and less than 0.5. This can be achieved by specifying
the upper.limits
and lower.limits
arguments:
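A sketch of such a fit (the object name tfit is arbitrary):

```r
tfit <- glmnet(x, y, lower.limits = -0.7, upper.limits = 0.5)
plot(tfit)
```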
Often we want the coefficients to be positive: to do so, we just need
to specify lower.limits = 0
. (Note, the lower limit must be
no bigger than zero, and the upper limit no smaller than zero.) These
bounds can be a vector, with different values for each coefficient. If
given as a scalar, the same number gets recycled for all.
The penalty.factor
argument allows users to apply
separate penalty factors to each coefficient. This is very useful when
we have prior knowledge or preference over the variables. Specifically,
if vj
denotes the penalty factor for the jth variable, the penalty term
becomes $$
\lambda \sum_{j=1}^p \boldsymbol{v_j} P_\alpha(\beta_j) = \lambda
\sum_{j=1}^p \boldsymbol{v_j} \left[ (1-\alpha)\frac{1}{2} \beta_j^2 +
\alpha |\beta_j| \right].
$$
The default is 1 for each coefficient, i.e. coefficients are
penalized equally. Note that any variable with
penalty.factor
equal to zero is not penalized at all! This
is useful in the case where some variables are always to be included
unpenalized in the model, such as the demographic variables sex and age
in medical studies. Note the penalty factors are internally rescaled to
sum to nvars
, the number of variables in the given
x
matrix.
Here is an example where we set the penalty factors for variables 1, 3 and 5 to be zero:
p.fac <- rep(1, 20)
p.fac[c(1, 3, 5)] <- 0
pfit <- glmnet(x, y, penalty.factor = p.fac)
plot(pfit, label = TRUE)
We see from the labels that the three variables with zero penalty factors always stay in the model, while the others follow typical regularization paths and are eventually shrunk to zero.
exclude
allows one to block certain variables from being in the model at all. Of course, one could simply subset these out of
x
, but sometimes exclude
is more useful, since
it returns a full vector of coefficients, just with the excluded ones
set to zero.
The intercept
argument allows the user to decide if an
intercept should be included in the model or not (it is never
penalized). The default is intercept = TRUE
. If
intercept = FALSE
the intercept is forced to be zero.
family = "mgaussian" (multi-response)
The multi-response Gaussian family is useful when there are a number
of (correlated) responses, also known as the “multi-task learning”
problem. Here, a variable is either included in the model for all the
responses, or excluded for all the responses. Most of the arguments for
this family are the same as that for family = "gaussian"
,
so we focus on the differences with the single response model.
As the name suggests, the response y is not a vector but a matrix of quantitative responses. As a result, the coefficients at each value of lambda are also a matrix.
glmnet
solves the problem $$
\min_{(\beta_0, \beta) \in \mathbb{R}^{(p+1)\times K}}\frac{1}{2N}
\sum_{i=1}^N \|y_i -\beta_0-\beta^T x_i\|^2_F+\lambda \left[
(1-\alpha)\|\beta\|_F^2/2 + \alpha\sum_{j=1}^p\|\beta_j\|_2\right].
$$ Here βj is the jth row of the p × K coefficient matrix
β, and we replace the absolute
penalty on each single coefficient by a group-lasso penalty on each
coefficient K-vector βj for a single
predictor (i.e. column of the x
matrix). The group lasso
penalty behaves like the lasso, but on the whole group of coefficients
for each response: they are either all zero, or else none are zero, but
are shrunk by an amount depending on λ.
We use a set of data generated beforehand for illustration. We fit a
regularized multi-response Gaussian model to the data, with an object
mfit
returned.
data(MultiGaussianExample)
x <- MultiGaussianExample$x
y <- MultiGaussianExample$y
mfit <- glmnet(x, y, family = "mgaussian")
The standardize.response
argument is only for
mgaussian
family. If
standardize.response = TRUE
, the response variables are
standardized (default is FALSE
).
As before, we can use the plot
method to visualize the
coefficients:
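For example:

```r
plot(mfit, xvar = "lambda", label = TRUE, type.coef = "2norm")
```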
Note that we set type.coef = "2norm"
. Under this
setting, a single curve is plotted per variable, with value equal to the
ℓ2 norm of the variable’s
coefficient vector. The default setting is
type.coef = "coef"
, where a coefficient plot is created for
each response (multiple figures). xvar
and
label
are two other arguments which have the same
functionality as in the single-response case.
We can extract the coefficients and make predictions at requested
values of λ by using the
coef
and predict
methods respectively, as
before. Here is an example of a predict
call:
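A sketch, predicting for the first five observations at two values of λ (the choice of s values is an assumption consistent with the two slices shown below):

```r
predict(mfit, newx = x[1:5, ], s = c(0.1, 0.01))
```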
## , , 1
##
## y1 y2 y3 y4
## [1,] -4.7106263 -1.1634574 0.6027634 3.740989
## [2,] 4.1301735 -3.0507968 -1.2122630 4.970141
## [3,] 3.1595229 -0.5759621 0.2607981 2.053976
## [4,] 0.6459242 2.1205605 -0.2252050 3.146286
## [5,] -1.1791890 0.1056262 -7.3352965 3.248370
##
## , , 2
##
## y1 y2 y3 y4
## [1,] -4.6415158 -1.2290282 0.6118289 3.779521
## [2,] 4.4712843 -3.2529658 -1.2572583 5.266039
## [3,] 3.4735228 -0.6929231 0.4684037 2.055574
## [4,] 0.7353311 2.2965083 -0.2190297 2.989371
## [5,] -1.2759930 0.2892536 -7.8259206 3.205211
The prediction result is saved in a three-dimensional array with the first two dimensions being the prediction matrix for each response variable and the third corresponding to the response variables.
family = "binomial"
Logistic regression is a widely-used model when the response is binary. Suppose the response variable takes values in 𝒢 = {1, 2}. Denote yi = I(gi = 1). We model
$$\mbox{Pr}(G=2|X=x)=\frac{e^{\beta_0+\beta^Tx}}{1+e^{\beta_0+\beta^Tx}},$$
which can be written in the following form:
$$\log\frac{\mbox{Pr}(G=2|X=x)}{\mbox{Pr}(G=1|X=x)}=\beta_0+\beta^Tx,$$ the so-called “logistic” or log-odds transformation.
The objective function for logistic regression is the penalized negative binomial log-likelihood, and is $$ \min_{(\beta_0, \beta) \in \mathbb{R}^{p+1}} -\left[\frac{1}{N} \sum_{i=1}^N y_i \cdot (\beta_0 + x_i^T \beta) - \log (1+e^{(\beta_0+x_i^T \beta)})\right] + \lambda \big[ (1-\alpha)\|\beta\|_2^2/2 + \alpha\|\beta\|_1\big]. $$ Logistic regression is often plagued with degeneracies when p > N and exhibits wild behavior even when N is close to p; the elastic net penalty alleviates these issues, and regularizes and selects variables as well.
We use a “proximal Newton” algorithm for optimizing this criterion. This makes repeated use of a quadratic approximation to the log-likelihood, and then weighted coordinate descent on the resulting penalized weighted least-squares problem. These constitute an outer and inner loop, also known as iteratively reweighted penalized least squares.
For illustration purposes, we load the pre-generated input matrix
x
and the response vector y
from the data
file. The input matrix x is
the same as for other families. For binomial logistic regression, the
response variable y should be
either a binary vector, a factor with two levels, or a two-column matrix
of counts or proportions. The latter is useful for grouped binomial
data, or in applications where we have “soft” class membership, such as
occurs in the EM algorithm.
Other optional arguments of glmnet
for binomial
regression are almost the same as those for the Gaussian family. Don't forget to set the family option to “binomial”:
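Assuming the BinomialExample data bundled with glmnet, the load and fit might look like:

```r
data(BinomialExample)
x <- BinomialExample$x
y <- BinomialExample$y
fit <- glmnet(x, y, family = "binomial")
```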
As before, we can print and plot the fitted object, extract the
coefficients at specific λ’s
and also make predictions. For plotting, the optional arguments such as
xvar
and label
work in the same way as for
family = "gaussian"
. Prediction is a little different for
family = "binomial"
, mainly in the function argument
type
:
“link” gives the linear predictors.
“response” gives the fitted probabilities.
“class” produces the class label corresponding to the maximum probability.
As with family = "gaussian"
, “coefficients” computes the
coefficients at values of s
and “nonzero” returns a list of
the indices of the nonzero coefficients for each value of
s
. Note that the results (“link”, “response”,
“coefficients”, “nonzero”) are returned only for the class corresponding
to the second level of the factor response.
In the following example, we predict the class labels at λ = 0.05 and λ = 0.01.
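A sketch, assuming the binomial fit above is stored in fit and using the first five rows of x as new data:

```r
predict(fit, newx = x[1:5, ], type = "class", s = c(0.05, 0.01))
```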
## s1 s2
## [1,] "0" "0"
## [2,] "1" "1"
## [3,] "1" "1"
## [4,] "0" "0"
## [5,] "1" "1"
For logistic regression, cv.glmnet
has similar arguments
and usage as in the Gaussian case. nfolds, weights,
, weights
,
lambda
, parallel
are all available to users.
There are some differences in type.measure
: “deviance” and
“mse” do not both mean squared loss. Rather,
“mse” uses squared loss.
“deviance” uses actual deviance.
“mae” uses mean absolute error.
“class” gives misclassification error.
“auc” (for two-class logistic regression ONLY) gives area under the ROC curve.
For example, the code below uses misclassification error as the criterion for 10-fold cross-validation:
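Such a call might be:

```r
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class")
```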
As before, we can plot the object and show the optimal values of λ.
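For example:

```r
plot(cvfit)
cvfit$lambda.min
cvfit$lambda.1se
```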
## [1] 0.02140756
## [1] 0.04945423
coef
and predict
for the
cv.glmnet
object for family = "binomial"
are
similar to the Gaussian case and we omit the details.
Like other generalized linear models, glmnet
allows for
an “offset”. This is a fixed vector of N numbers that is added into the
linear predictor. For example, you may have fitted some other logistic
regression using other variables (and data), and now you want to see if
the present variables can add further predictive power. To do this, you
can use the predicted logit from the other model as an offset in the
glmnet
call. Offsets are also useful in Poisson models,
which we discuss later.
family = "multinomial"
The multinomial model extends the binomial when the number of classes is more than two. Suppose the response variable has K levels ${\cal G}=\{1,2,\ldots,K\}$. Here we model $$\mbox{Pr}(G=k|X=x)=\frac{e^{\beta_{0k}+\beta_k^Tx}}{\sum_{\ell=1}^Ke^{\beta_{0\ell}+\beta_\ell^Tx}}.$$ There is a linear predictor for each class!
Let Y be the N × K indicator response matrix, with elements yiℓ = I(gi = ℓ). Then the elastic net penalized negative log-likelihood function becomes $$ \ell(\{\beta_{0k},\beta_{k}\}_1^K) = -\left[\frac{1}{N} \sum_{i=1}^N \Big(\sum_{k=1}^K y_{ik} (\beta_{0k} + x_i^T \beta_k)- \log \big(\sum_{\ell=1}^K e^{\beta_{0\ell}+x_i^T \beta_\ell}\big)\Big)\right] +\lambda \left[ (1-\alpha)\|\beta\|_F^2/2 + \alpha\sum_{j=1}^p\|\beta_j\|_q\right]. $$ Here we really abuse notation! β is a p × K matrix of coefficients. βk refers to the kth column (for outcome category k), and βj the jth row (vector of K coefficients for variable j). The last penalty term is ∥βj∥q. We support two options for q: q ∈ {1, 2}. When q = 1, this is a lasso penalty on each of the parameters. When q = 2, this is a grouped-lasso penalty on all the K coefficients for a particular variable, which makes them all be zero or nonzero together.
The standard Newton algorithm can be tedious here. Instead, for q = 1 we use a so-called partial Newton algorithm by making a partial quadratic approximation to the log-likelihood, allowing only (β0k, βk) to vary for a single class at a time. For each value of λ, we first cycle over all classes indexed by k, computing each time a partial quadratic approximation about the parameters of the current class. Then, the inner procedure is almost the same as for the binomial case. When q = 2, we use a different approach that we will not explain here.
For the family = "multinomial"
case, usage is similar to
that for family = "binomial"
. In this section we describe
the differences. First, we load a set of generated data:
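Assuming the MultinomialExample data bundled with glmnet:

```r
data(MultinomialExample)
x <- MultinomialExample$x
y <- MultinomialExample$y
```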
The response variable can be a nc >= 2
level factor,
or an nc
-column matrix of counts or proportions. Internally
glmnet will make the rows of this matrix sum to 1, and absorb the total
mass into the weight for that observation. offset
should be
a nobs x nc
matrix if one is provided.
A special option for multinomial regression is
type.multinomial
, which allows the usage of a grouped lasso
penalty (q = 2) if
type.multinomial = "grouped"
. The default is
type.multinomial = "ungrouped"
(q = 1).
fit <- glmnet(x, y, family = "multinomial", type.multinomial = "grouped")
plot(fit, xvar = "lambda", label = TRUE, type.coef = "2norm")
For the plot
method, the function arguments are
xvar
, label
and type.coef
, in
addition to other ordinary graphical parameters. xvar
and
label
are the same as other families while
type.coef
is only for multinomial regression and
multi-response Gaussian model. It can produce a figure of coefficients
for each response variable if type.coef = "coef"
or a
figure showing the ℓ2-norm
in one figure if type.coef = "2norm"
.
We can also do cross-validation and plot the returned object. Note
that although type.multinomial
is not a named argument in
cv.glmnet
, in fact any argument that can be passed to
glmnet
is valid in the argument list of
cv.glmnet
. Such arguments are passed via the
...
argument directly to the calls to glmnet
inside the cv.glmnet
function.
Users may wish to predict at the optimally selected λ:
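A sketch of these two steps, assuming `x` and `y` hold the multinomial example data described above; the choice of `s = "lambda.min"` and `type = "class"` mirrors the class predictions shown below:

```r
# arguments such as type.multinomial are passed via '...' to glmnet
cvfit <- cv.glmnet(x, y, family = "multinomial", type.multinomial = "grouped")
plot(cvfit)
# class predictions at the lambda minimizing the CV error
predict(cvfit, newx = x[1:10, ], s = "lambda.min", type = "class")
```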
## 1
## [1,] "3"
## [2,] "2"
## [3,] "2"
## [4,] "3"
## [5,] "1"
## [6,] "3"
## [7,] "3"
## [8,] "1"
## [9,] "1"
## [10,] "2"
family = "poisson"
Poisson regression is used to model count data under the assumption of Poisson error, or otherwise non-negative data where the mean and variance are proportional. Like the Gaussian and binomial models, the Poisson distribution is a member of the exponential family of distributions. We usually model its positive mean on the log scale: log μ(x) = β0 + β′x.
The log-likelihood for observations {xi, yi}1N is given by $$ l(\beta|X, Y) = \sum_{i=1}^N \left(y_i (\beta_0+\beta^T x_i) - e^{\beta_0+\beta^Tx_i}\right). $$ As before, we optimize the penalized log-likelihood: $$ \min_{\beta_0,\beta} -\frac1N l(\beta|X, Y) + \lambda \left((1-\alpha) \sum_{j=1}^p \beta_j^2/2 +\alpha \sum_{j=1}^p |\beta_j|\right). $$
glmnet
uses an outer Newton loop and an inner weighted
least-squares loop (as in logistic regression) to optimize this
criterion.
First, we load a pre-generated set of Poisson data:
We apply the function glmnet
with
family = "poisson"
:
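A sketch of the loading and fitting steps (the dataset name `PoissonExample` is the one shipped with glmnet):

```r
library(glmnet)
data(PoissonExample)
x <- PoissonExample$x
y <- PoissonExample$y
fit <- glmnet(x, y, family = "poisson")
```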
The optional input arguments of glmnet
for
"poisson"
family are similar to those for other
families.
offset
is a particularly useful argument for Poisson
models. When dealing with rate data in Poisson models, the counts
collected are often based on different exposures such as length of time
observed, area and years. A Poisson rate μ(x) is relative to a unit
exposure time, so if an observation yi was exposed
for Ei
units of time, then the expected count would be Eiμ(x),
and the log mean would be log (Ei) + log (μ(x)).
In a case like this, we would supply an offset log (Ei) for
each observation. Hence offset
is a vector of length N that is included in the linear
predictor. (Warning: if offset
is supplied in
glmnet
, offsets must also be supplied to
predict
via the newoffset
argument to make
reasonable predictions.)
Again, we plot the coefficients to have a first sense of the result.
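Assuming `fit` is the Poisson fit described above, a sketch:

```r
plot(fit, xvar = "lambda", label = TRUE)  # coefficient paths against log(lambda)
```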
As before, we can extract the coefficients and make predictions at
certain λ’s using
coef
and predict
respectively. The optional
input arguments are similar to those for other families. For the
predict
method, the argument type
has the same
meaning as that for family = "binomial"
, except that
“response” gives the fitted mean (rather than fitted probabilities in
the binomial case). For example, we can do the following:
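A sketch of such calls, continuing with the Poisson fit above; the particular `s` values are illustrative assumptions (the two columns in the prediction output below correspond to two lambda values):

```r
coef(fit, s = 1)                                       # coefficients at lambda = 1
predict(fit, newx = x[1:5, ], type = "response", s = c(1, 0.1))  # fitted means
```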
## 21 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.61123371
## V1 0.45819758
## V2 -0.77060709
## V3 1.34015128
## V4 0.04350500
....
## s1 s2
## [1,] 2.4944232 4.4263365
## [2,] 10.3513120 11.0586174
## [3,] 0.1179704 0.1781626
## [4,] 0.9713412 1.6828778
## [5,] 1.1133472 1.9934537
We may also use cross-validation to find the optimal λ’s and thus make inferences.
Options are almost the same as for the Gaussian family, except for
type.measure
:
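For example (a sketch, reusing the Poisson `x` and `y` from above):

```r
cvfit <- cv.glmnet(x, y, family = "poisson")
plot(cvfit)
```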
family = "cox"
The Cox proportional hazards model is commonly used for the study of
the relationship between predictor variables and survival time. We have
another vignette (“Regularized Cox
Regression”) dedicated solely to fitting regularized Cox models with
the glmnet
package; please consult that vignette for
details.
family = family()
Since version 4.0, glmnet
has the facility to fit any
GLM family by specifying a family
object, as used by
stats::glm
. For these more general families, the outer
Newton loop is performed in R, while the inner elastic-net loop is
performed in Fortran, for each value of lambda. The price for this
generality is a small hit in speed. For details, see the vignette “GLM
family
functions in glmnet
”.
Once we have fit a series of models using glmnet
, we
often assess their performance on a set of evaluation or test data. We
usually go through the process of building a prediction matrix, deciding
on the performance measure, and computing these measures for a series of
values for lambda
(and gamma
for relaxed
fits). glmnet
provides three functions
(assess.glmnet
, roc.glmnet
and
confusion.glmnet
) that make these tasks easier.
The function assess.glmnet
computes the same performance
measures produced by cv.glmnet
, but on a validation or test
dataset.
data(BinomialExample)
x <- BinomialExample$x
y <- BinomialExample$y
itrain <- 1:70
fit <- glmnet(x[itrain, ], y[itrain], family = "binomial", nlambda = 5)
assess.glmnet(fit, newx = x[-itrain, ], newy = y[-itrain])
## $deviance
## s0 s1 s2 s3 s4
## 1.3877348 0.8612331 1.7948439 3.0335905 3.6930687
## attr(,"measure")
## [1] "Binomial Deviance"
##
## $class
## s0 s1 s2 s3 s4
## 0.4666667 0.1666667 0.2333333 0.1666667 0.1666667
## attr(,"measure")
## [1] "Misclassification Error"
##
## $auc
## [1] 0.5000000 0.9062500 0.8482143 0.8169643 0.8303571
## attr(,"measure")
## [1] "AUC"
##
## $mse
## s0 s1 s2 s3 s4
## 0.5006803 0.2529304 0.3633411 0.3514574 0.3500440
## attr(,"measure")
## [1] "Mean-Squared Error"
##
## $mae
## s0 s1 s2 s3 s4
## 0.9904762 0.5114320 0.4597047 0.3928790 0.3767338
## attr(,"measure")
## [1] "Mean Absolute Error"
This produces a list with all the measures suitable for a
binomial model, computed for the entire sequence of lambdas in the fit
object. Here the function identifies the model family from the
fit
object.
A second use case builds the prediction matrix before calling
assess.glmnet
:
pred <- predict(fit, newx = x[-itrain, ])
assess.glmnet(pred, newy = y[-itrain], family = "binomial")
Here we have to provide the family
as an argument; the
results (not shown) are the same. Users can see the various measures
suitable for each family via
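These can be listed with the `glmnet.measures()` function:

```r
library(glmnet)
glmnet.measures()
```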
## $gaussian
## [1] "mse" "mae"
##
## $binomial
## [1] "deviance" "class" "auc" "mse" "mae"
##
## $poisson
## [1] "deviance" "mse" "mae"
##
## $cox
## [1] "deviance" "C"
##
## $multinomial
## [1] "deviance" "class" "mse" "mae"
##
## $mgaussian
## [1] "mse" "mae"
##
## $GLM
## [1] "deviance" "mse" "mae"
assess.glmnet
can also take the result of
cv.glmnet
as input. In this case the predictions are made
at the optimal values for the parameter(s).
cfit <- cv.glmnet(x[itrain, ], y[itrain], family = "binomial", nlambda = 30)
assess.glmnet(cfit, newx = x[-itrain, ], newy = y[-itrain])
## $deviance
## lambda.1se
## 0.9482246
## attr(,"measure")
## [1] "Binomial Deviance"
##
## $class
## lambda.1se
## 0.2
## attr(,"measure")
## [1] "Misclassification Error"
....
This uses the default value of s = "lambda.1se"
, just
like predict
would have done. Users can provide additional
arguments that get passed on to predict
. For example, the
code below shows the performance measures for
s = "lambda.min"
:
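That is, using the `cfit`, `x`, `y` and `itrain` objects from above:

```r
assess.glmnet(cfit, newx = x[-itrain, ], newy = y[-itrain], s = "lambda.min")
```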
## $deviance
## lambda.min
## 0.8561849
## attr(,"measure")
## [1] "Binomial Deviance"
##
## $class
## lambda.min
## 0.1666667
## attr(,"measure")
## [1] "Misclassification Error"
....
One interesting use case for assess.glmnet
is to get the
results of cross-validation using other measures. By specifying
keep = TRUE
in the cv.glmnet
call, a matrix of
prevalidated predictions is stored in the returned output as the
fit.preval
component. We can then use this component in the
call to assess.glmnet
:
cfit <- cv.glmnet(x, y, family = "binomial", keep = TRUE, nlambda = 30)
assess.glmnet(cfit$fit.preval, newy = y, family = "binomial")
## $deviance
## s0 s1 s2 s3 s4 s5 s6 s7
## 1.4131504 1.2947431 1.1887748 1.0940954 0.9903013 0.8999068 0.8407559 0.8075691
## s8 s9 s10 s11 s12 s13 s14 s15
## 0.8053265 0.8302531 0.8783749 0.9490040 1.0501903 1.1733004 1.2947184 1.4200727
## s16 s17 s18 s19 s20 s21 s22 s23
## 1.5596069 1.7098302 1.8594572 2.0060912 2.1283083 2.2510820 2.3757275 2.4799198
## s24 s25 s26 s27 s28 s29
## 2.5765861 2.6589164 2.7143188 2.7634920 2.8119805 2.8404461
## attr(,"measure")
## [1] "Binomial Deviance"
....
Users can verify that the first measure here deviance
is
identical to the component cvm
on the cfit
object.
In the special case of binomial models, users often would like to see
the ROC curve for validation or test data. Here the function
roc.glmnet
provides the goodies. Its first argument is as
in assess.glmnet
. Here we illustrate one use case, using
the prevalidated CV fit.
cfit <- cv.glmnet(x, y, family = "binomial", type.measure = "auc",
keep = TRUE)
rocs <- roc.glmnet(cfit$fit.preval, newy = y)
roc.glmnet
returns a list of cross-validated ROC data,
one for each model along the path. The code below demonstrates how one
can plot the output. The first line identifies the lambda
value giving the best area under the curve (AUC). Then we plot all the
ROC curves in grey and the “winner” in red.
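One way to produce such a plot, using the `cfit` and `rocs` objects above (the `index` component of a `cv.glmnet` fit records the position of `lambda.min` along the path):

```r
best <- cfit$index["min", ]                   # index of the lambda with best CV AUC
plot(rocs[[best]], type = "l")                # the "winning" ROC curve
invisible(sapply(rocs, lines, col = "grey"))  # overlay all CV ROC curves in grey
lines(rocs[[best]], lwd = 2, col = "red")     # re-draw the winner in red
```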
For binomial and multinomial models, we often wish to examine the
classification performance on new data. The function
confusion.glmnet
will do that for us.
data(MultinomialExample)
x <- MultinomialExample$x
y <- MultinomialExample$y
set.seed(101)
itrain <- sample(1:500, 400, replace = FALSE)
cfit <- cv.glmnet(x[itrain, ], y[itrain], family = "multinomial")
cnf <- confusion.glmnet(cfit, newx = x[-itrain, ], newy = y[-itrain])
confusion.glmnet
produces a table of class
“confusion.table” which inherits from class “table”, and we also provide
a print
method for it.
## True
## Predicted 1 2 3 Total
## 1 13 6 4 23
## 2 7 25 5 37
## 3 4 3 33 40
## Total 24 34 42 100
##
## Percent Correct: 0.71
The first argument to confusion.glmnet
should be a
glmnet
or cv.glmnet
object (from which
predictions can be made), or a matrix/array of predictions, such as the
kept "fit.preval"
component in the output of a
cv.glmnet
call with keep = TRUE
. When a
matrix/array of predictions is provided, we need to specify the
family
option, otherwise confusion can exist
between “binomial” and “multinomial” prediction matrices.
When predictions for more than one model in the path are provided,
confusion.glmnet
returns a list of confusion tables. For
example, the prevalidated predictions from cv.glmnet
are
for the whole lambda
path, and so we are returned a list of
confusion tables. In the code below, we identify and print the one
achieving the smallest classification error.
cfit <- cv.glmnet(x, y, family = "multinomial", type = "class", keep = TRUE)
cnf <- confusion.glmnet(cfit$fit.preval, newy = y, family = "multinomial")
best <- cfit$index["min",]
print(cnf[[best]])
## True
## Predicted 1 2 3 Total
## 1 76 22 14 112
## 2 39 129 23 191
## 3 27 23 147 197
## Total 142 174 184 500
##
## Percent Correct: 0.704
The exclude
argument to glmnet
accepts a
vector of indices, indicating which variables should be excluded from
the fit, i.e. get zeros for their coefficients. From version 4.1-2, the
exclude
argument can accept a function. The idea
is that variables can be filtered based on some of their properties
(e.g. too sparse) before any models are fit. When performing
cross-validation (CV), this filtering should be done separately inside
each fold of cv.glmnet
.
Here is a typical filter function: it excludes variables that are more than 80% sparse.
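Such a filter can be written as follows (the name `filter` matches its use in the code below):

```r
# exclude variables whose column is more than 80% zeros
filter <- function(x, ...) which(colMeans(x == 0) > 0.8)
```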
This function gets invoked inside the call to glmnet
,
and uses the supplied x
to generate the indices.
We give a simple example using this:
set.seed(101)
n <- 500; p <- 50
x <- matrix(rnorm(n * p), n, p)
x[sample(seq(length(x)), 4 * n * p / 5)] <- 0
y <- rnorm(n) + x %*% (rnorm(p) / 5) > 0
excl <- filter(x)
print(excl)
## [1] 4 5 6 7 8 9 11 15 22 23 24 25 26 27 28 29 33 35 36 40 42 46
fit.orig <- glmnet(x, y, family = "binomial", exclude = excl)
fit.new <- glmnet(x, y, family = "binomial", exclude = filter)
all.equal(fit.orig, fit.new)
## [1] "Component \"call\": target, current do not match when deparsed"
The fits are the same (the call
component is all that
differs).
The nice thing with this approach is that cv.glmnet
does
the right thing: it applies the filter function separately to each
training-fold x
, and hence accounts for any bias that may
be incurred by the filtering.
The most general form of the filter function is
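that is, a sketch of the signature (the body should return the indices of the variables to exclude):

```r
filter <- function(x, y, weights, ...) {
  # compute and return an integer vector of column indices to exclude
}
```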
Yes, users can use y
as well in their filtering
criterion, and the observation weights if that makes sense. As we did
above, some of the arguments can be omitted, but the ...
must always be there. We will give some useful examples of filter
functions a bit later.
An important point to note is that any other arguments in a user’s
filter function are ignored. So, what should we do if we want one
or more additional parameters to guide the filtering, such as the choice
of 0.8
in our filter function above? The safest way to
achieve this is to write a filter generator function that
produces a filter function as its output. We give an example:
sparsity <- function(fraction = 0.7) {
function(x, ...) which(colMeans(x == 0) > fraction)
}
sparsity(0.5)
## function (x, ...)
## which(colMeans(x == 0) > fraction)
## <environment: 0x5593391b71c0>
Here, sparsity()
is a function that produces a function.
The argument fraction
is found in the environment of that
function, and hence is built in. We can now run
cv.glmnet
as follows:
foldid <- sample(rep(1:10,length.out = length(y)))
cvfit.filt1 <- cv.glmnet(x, y, family = "binomial", foldid = foldid,
exclude = filter)
cvfit.filt2 <- cv.glmnet(x, y, family = "binomial", foldid = foldid,
exclude = sparsity(0.8))
all.equal(cvfit.filt1, cvfit.filt2)
## [1] "Component \"call\": target, current do not match when deparsed"
## [2] "Component \"glmnet.fit\": Component \"call\": target, current do not match when deparsed"
These agree in the important parts.
We end this section with some useful filtering functions. In
genomics, where the x
matrix can be very wide, we often
filter features based on variance. We will use expression arrays as an
example. All else being equal, we expect the variance of the gene
expression of a gene to be higher if there is something going
on across the samples, and we guess that that something is
related to the response. Here is a function that efficiently computes
the column-wise variance for a wide matrix, followed by a filter
function that uses it.
uvar <- function(x, means = FALSE) {
# if means = TRUE, the means and variances are returned,
# otherwise just the variances
m <- colMeans(x)
n <- nrow(x)
x <- x - outer(rep(1,n),m)
v <- colSums(x^2) / (n - 1)
if (means) list(mean = m, var = v) else v
}
vfilter <- function(q = 0.3) {
function(x,...) {
v <- uvar(x)
which(v < quantile(v, q))
}
}
Here, our filter function vfilter()
is a function
generator as discussed above, and will exclude the fraction
q
of variables with the lowest variance.
We can also use y
as part of our filtering. This final
example builds on the example above. Here the response y
is
binary, taking two values. We want to filter out variables with low
values of the two-sample t-statistic with groups defined by
y
. Again, we present an efficient function for computing
the t-statistics, and then the filter. (Notice that for convenience, our
function uvar
above can optionally return a mean vector and
a variance vector.)
ut.test <- function(x, y, s0 = 0) {
ni <- table(y); n <- sum(ni)
if(length(ni) != 2) stop("Only two-sample t-test here")
index <- seq(n)
mv <- tapply(index, y, function(i, x) uvar(x[i, ], means = TRUE), x = x)
ss <- ((ni[1] - 1) * mv[[1]]$var + (ni[2] - 1) * mv[[2]]$var)
sd <- sqrt(ss * (1 / ni[[1]] + 1 / ni[[2]]) / (n - 2))
numer <- mv[[1]]$mean - mv[[2]]$mean
numer / (sd + s0)
}
tfilter <- function(q = 0.3, s0 = 0) {
function(x, y, ...) {
tstats <- ut.test(x, y, s0 = s0)
which(tstats < quantile(tstats, q))
}
}
One might think that unsupervised filtering using just x
before fitting the model is fair game, and perhaps need not be accounted
for in cross-validation. This is an interesting issue, and one can argue
that there could be some bias if it was not accounted for. But filtering
using x
and y
definitely introduces
potential bias. However, it can still give good results, as long as it
is correctly accounted for when running CV.
In this section, we describe other features in the
glmnet
package that might be of interest to users.
Our package supports sparse input matrices, which allow the efficient
storage and operation of large matrices having only a few nonzero
entries. The usage of sparse matrices (inherits from class
"sparseMatrix"
as in the Matrix
package) in
glmnet
is the same as if a regular matrix is provided.
We load a set of sample data created beforehand. It loads
x
, a 100*20 sparse input matrix and y
, the
response vector.
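A sketch of that step, assuming the `SparseExample` dataset shipped with glmnet:

```r
library(glmnet)
data(SparseExample)
x <- SparseExample$x
y <- SparseExample$y
class(x)
```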
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
Users can create a sparse matrix with the function
sparseMatrix
by providing the locations and values of the
nonzero entries. Alternatively, the Matrix
function from
the Matrix
package can also be used to construct a sparse
matrix by setting sparse = TRUE
, but this defeats the
purpose somewhat if the matrix is large.
We can fit the model the same way as before.
We can also do the cross-validation and plot the resulting object.
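Both steps are sketched below, reusing the sparse `x` and `y` just loaded:

```r
fit <- glmnet(x, y)      # same call as with a dense matrix
cvfit <- cv.glmnet(x, y)
plot(cvfit)
```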
The usage of other functions is similar, and we do not expand on it here.
Note that sparse matrices can also be used for newx
, the
new input matrix in the predict
function. For example,
i <- sample(1:5, size = 25, replace = TRUE)
j <- sample(1:20, size = 25, replace = TRUE)
x <- rnorm(25)
nx <- sparseMatrix(i = i, j = j, x = x, dims = c(5, 20))
predict(cvfit, newx = nx, s = "lambda.min")
## lambda.min
## [1,] 2.3731134
## [2,] -0.1100765
## [3,] -3.0124618
## [4,] 0.3655119
## [5,] -0.4475473
The glmnet
package includes a function
bigGlm
for fitting a single unpenalized
generalized linear model (GLM), but allowing all the options of
glmnet
. In other words, the user can set coefficient upper
and/or lower bounds, and can provide the x
matrix in sparse
matrix format. This is not too much more than fitting a model with a
single value of lambda = 0
(with some protection from edge
cases). predict
and print
methods can be
called on the output.
data(BinomialExample)
x <- BinomialExample$x
y <- BinomialExample$y
fit <- bigGlm(x, y, family = "binomial", lower.limits = -1)
print(fit)
##
## Call: bigGlm(x = x, y = y, family = "binomial", lower.limits = -1)
##
## Df %Dev Lambda
## 1 30 77.57 0
x from mixed variables and/or missing data
The glmnet
package includes a function
makeX
that makes it easy to create the model matrix
x
needed as input to glmnet
. It takes as input
a data frame, which can contain vectors, matrices and factors. Some of
the features are:
The sparse option returns a matrix in column-sparse format. This is useful if the data are large, and factors have many levels.
Both a training and a test data frame can be supplied, via the arguments train and test. This ensures the factor levels correspond, and also imputes missing data in the test data from means in the training data.
Our first example demonstrates how makeX
works with
factors:
set.seed(101)
X <- matrix(rnorm(5), nrow = 5)
X2 <- sample(letters[1:3], 5, replace = TRUE)
X3 <- sample(LETTERS[1:3], 5, replace = TRUE)
df <- data.frame(X, X2, X3)
makeX(df)
## X X2a X2b X2c X3B X3C
## 1 -0.3260365 0 0 1 0 1
## 2 0.5524619 0 1 0 0 1
## 3 -0.6749438 0 0 1 1 0
## 4 0.2143595 0 0 1 0 1
## 5 0.3107692 1 0 0 0 1
Include the option sparse = TRUE
if a sparse output is
desired:
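For instance, with the data frame `df` from above:

```r
makeX(df, sparse = TRUE)
```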
## 5 x 6 sparse Matrix of class "dgCMatrix"
## X X2a X2b X2c X3B X3C
## 1 -0.3260365 . . 1 . 1
## 2 0.5524619 . 1 . . 1
## 3 -0.6749438 . . 1 1 .
## 4 0.2143595 . . 1 . 1
## 5 0.3107692 1 . . . 1
Next, let us add some missing values to our data matrix. By default,
makeX
leaves NA
s as is:
Xn <- X ; Xn[3,1] <- NA
X2n <- X2; X2n[1] <- NA
X3n <- X3; X3n[5] <- NA
dfn <- data.frame(Xn, X2n, X3n)
dfn
## Xn X2n X3n
## 1 -0.3260365 <NA> C
## 2 0.5524619 b C
## 3 NA c B
## 4 0.2143595 c C
## 5 0.3107692 a <NA>
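The NA-preserving output below comes from calling `makeX` on this frame:

```r
makeX(dfn)
```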
## Xn X2na X2nb X2nc X3nB X3nC
## 1 -0.3260365 NA NA NA 0 1
## 2 0.5524619 0 1 0 0 1
## 3 NA 0 0 1 1 0
## 4 0.2143595 0 0 1 0 1
## 5 0.3107692 1 0 0 NA NA
We can impute the missing values with column means by passing the
option na.impute = TRUE
:
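For instance (adding `sparse = TRUE` as well, matching the sparse output shown below):

```r
makeX(dfn, na.impute = TRUE, sparse = TRUE)
```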
## 5 x 6 sparse Matrix of class "dgCMatrix"
## Xn X2na X2nb X2nc X3nB X3nC
## 1 -0.3260365 0.25 0.25 0.5 . 1.00
## 2 0.5524619 . 1.00 . . 1.00
## 3 0.1878885 . . 1.0 1.00 .
## 4 0.2143595 . . 1.0 . 1.00
## 5 0.3107692 1.00 . . 0.25 0.75
Finally if a test set is available as well, both the training and
test sets can be passed to makeX
at the same time so that
all the levels of factors present in the training and test sets will be
represented correctly in the output matrix. In the example below, the
third column of the training set only contains “B” and “C” while that of
the test set only contains “A” and “C”. By passing both data sets to
makeX
at the same time, this third column is correctly
expanded into 3 feature columns for both the training and test sets.
set.seed(102)
X <- matrix(rnorm(5), nrow = 5)
X2 <- sample(letters[1:3], 5, replace = TRUE)
X3 <- sample(LETTERS[1:3], 5, replace = TRUE)
Xn <- X ; Xn[5,1] <- NA
X2n <- X2; X2n[1] <- NA
X3n <- X3; X3n[2] <- NA
dftn <- data.frame(Xn, X2n, X3n)
dftn
## Xn X2n X3n
## 1 0.1805229 <NA> C
## 2 0.7847340 c <NA>
## 3 -1.3531646 c C
## 4 1.9832982 a A
## 5 NA c A
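The combined call is sketched below, with `dfn` as the training frame and `dftn` as the test frame:

```r
makeX(dfn, dftn, sparse = TRUE)
```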
## $x
## 5 x 7 sparse Matrix of class "dgCMatrix"
## Xn X2na X2nb X2nc X3nA X3nB X3nC
## 1 -0.3260365 NA NA NA . . 1
## 2 0.5524619 . 1 . . . 1
## 3 NA . . 1 . 1 .
## 4 0.2143595 . . 1 . . 1
## 5 0.3107692 1 . . NA NA NA
##
## $xtest
## 5 x 7 sparse Matrix of class "dgCMatrix"
## Xn X2na X2nb X2nc X3nA X3nB X3nC
## 6 0.1805229 NA NA NA . . 1
## 7 0.7847340 . . 1 NA NA NA
## 8 -1.3531646 . . 1 . . 1
## 9 1.9832982 1 . . 1 . .
## 10 NA . . 1 1 . .
Ever run a job on a big dataset, and wonder how long it will take?
glmnet
and cv.glmnet
come equipped with a
progress bar, which can be displayed by passing
trace.it = TRUE
to these functions.
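For example (a sketch; `x` and `y` can be any data accepted by glmnet):

```r
fit <- glmnet(x, y, trace.it = TRUE)
```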
##
|================================== |65%
This display changes in place as the fit is produced. The progress
bar is also very helpful with cv.glmnet
:
##
Training
|=============================================| 100%
Fold: 1/10
|=============================================| 100%
Fold: 2/10
|=============================================| 100%
Fold: 3/10
|============================= | 70%
Tracing of the folds works a little differently when distributed computing is used.
If the user wants glmnet
and cv.glmnet
to
always print the progress bar, this can be achieved (for a session) via
a call to glmnet.control
with the itrace
argument:
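The session-wide setting is made as follows:

```r
glmnet.control(itrace = 1)
```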
To reset it, one makes a similar call and sets
itrace = 0
.
Glmnet uses a convergence criterion that focuses not on coefficient change but rather the impact of the change on the fitted values, and hence the loss part of the objective. The net result is a weighted norm of the coefficient change vector.
For Gaussian models it uses the following. Suppose observation i has weight wi. Let vj be the (weighted) sum-of-squares for variable xj: $$v_j=\sum_{i=1}^Nw_ix_{ij}^2.$$ If there is an intercept in the model, these xj will be centered by the weighted mean, and hence this would be a weighted variance. After β̂jo has been updated to β̂jn, we compute Δj = vj(β̂jo − β̂jn)2. After a complete cycle of coordinate descent, we look at Δmax = maxjΔj. Why this measure? We can write $$\Delta_j=\frac1N\sum_{i=1}^N w_i(x_{ij}\hat\beta_j^o-x_{ij}\hat\beta_j^n)^2,$$ which measures the weighted sum of squares of changes in fitted values for this term. This measures the impact of the change in this coefficient on the fit. If the largest such change is negligible, we stop.
For logistic regression and other non-Gaussian models it is similar for the inner loop, only now the weights for each observation are more complex. For example, for logistic regression the weights are those that arise from the current Newton step, i.e. wi* = wip̂i(1 − p̂i), where the p̂i’s are the fitted probabilities as we entered the current inner loop. The intuition is the same: it measures the impact of the coefficient change on the current weighted least squares loss, or quadratic approximation to the log-likelihood loss.
What about outer-loop convergence?
If the argument family
was a character string, we
use the same measure, except now β̂o is the
coefficient vector before we entered this inner loop, and β̂n the converged
solution for this inner loop. Hence if this Newton step had no impact,
we declare outer-loop convergence.
If the argument family
was a class “family” object,
outer-loop convergence is determined by the change in the objective
function value. If the fractional change in the objective function value
is less than the epsnr
control parameter, we declare
outer-loop convergence. epsnr
can be changed via a call to
glmnet.control
.
Our package has a set of internal parameters which control some aspects of the computation of the path. The factory default settings are expected to serve in most cases, and users do not need to make changes unless there are special requirements.
There are several parameters that users can change:
fdev
- minimum fractional change in deviance for
stopping path; factory default = 1.0e-5.
devmax
- maximum fraction of explained deviance for
stopping path; factory default = 0.999.
eps
- minimum value of lambda.min.ratio
(see glmnet
documentation); factory default=
1.0e-6.
big
- large floating point number; factory default =
9.9e35. Inf in definition of upper.limits
is set to
big
.
mnlam
- minimum number of path points (lambda
values) allowed; factory default = 5.
pmin
- minimum null probability for any class;
factory default = 1.0e-5.
exmx
- maximum allowed exponent; factory default =
250.0.
prec
- convergence threshold for multi-response
bounds adjustment solution; factory default = 1.0e-10.
mxit
- maximum iterations for multi-response bounds
adjustment solution; factory default = 100.
epsnr
: convergence threshold for the iteratively
reweighted least squares loop (see “The family
Argument for
glmnet
” vignette); factory default = 1e-08.
mxitnr
: maximum iterations for the iteratively
reweighted least squares loop for each value of λ (see “The family
Argument for glmnet
” vignette); factory default =
25.
factory
- If TRUE
, reset all the
parameters to the factory default; default is
FALSE
.
We illustrate how to change these control parameters through an
example. Note that any changes made hold for the duration of the R
session, or unless they are changed by the user with a subsequent call
to glmnet.control
.
data(QuickStartExample)
x <- QuickStartExample$x
y <- QuickStartExample$y
fit <- glmnet(x, y)
length(fit$lambda) # number of lambda values fit
## [1] 67
We can change the minimum fractional change in deviance for stopping
path and compare the results. By setting fdev
to be larger
than the default, we see that the computation stopped earlier in the
path.
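For example (the particular value `fdev = 0.1` is an illustrative assumption; any value well above the 1e-5 default shortens the path):

```r
glmnet.control(fdev = 0.1)
fit <- glmnet(x, y)
length(fit$lambda)  # number of lambda values fit
```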
## [1] 11
Users can reset to the default settings with the following code:
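That is:

```r
glmnet.control(factory = TRUE)
```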
To view current settings, call glmnet.control
without
any arguments:
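```r
glmnet.control()
```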
## $fdev
## [1] 1e-05
##
## $eps
## [1] 1e-06
##
## $big
## [1] 9.9e+35
....
Some may want to use glmnet
to solve the lasso or
elastic net problem at a single λ. We compare here the solution by
glmnet
with other packages (such as CVX), and also as an
illustration of parameter settings in this situation.
(Warning: Though such problems can be solved by
glmnet
, it is not recommended and is not
the spirit of the package. glmnet
fits the
entire solution path for the lasso or elastic net
problems efficiently with various techniques such as using warm starts
and strong rules. Those advantages will disappear if the λ sequence is forced to be only one
value.)
We illustrate with a typical example in linear models for the purpose of comparison. Given X and Y, we want to find β such that $$\min_\beta \|Y - X\beta\|_2^2 + \lambda_0\|\beta\|_1,$$ where, say, λ0 = 8.
We first solve this using glmnet
. Notice that there is
no intercept term in the objective function, and the columns of X are not necessarily standardized.
Corresponding parameters have to be set to make it work correctly. In
addition, there is a 1/(2n)
factor before the quadratic term by default, so we need to adjust λ accordingly. For the purpose of
comparison, we set thresh = 1e-20
. However, this is not
necessary in many practical applications.
np <- dim(x); n <- np[1]; p <-np[2]
fit <- glmnet(x, y, intercept = F, standardize = F,
lambda = 8 / (2 * n), thresh = 1e-20)
We then extract the coefficients (with no intercept):
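Dropping the intercept row from the coefficient matrix of the `fit` above:

```r
beta_glmnet <- as.matrix(coef(fit)[-1, ])
```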
Alternatively, a more stable and strongly
recommended way to perform this task is to first fit the entire
lasso or elastic net path without specifying lambda
, but
then provide the requested λ0 to a
predict
call to extract the corresponding coefficients.
(Remember to set exact = TRUE
in the predict
call to get the exact solution. Otherwise, it will be approximated by
linear interpolation.)
fit <- glmnet(x, y, intercept = F, standardize = F, thresh = 1e-20)
beta_glmnet <- as.matrix(predict(fit, s = 8 / (2 * n),
type = "coefficients",
exact = TRUE, x = x, y = y)[-1,])
Next, we use CVX, a general convex optimization solver, to solve this specific lasso problem. CVX is implemented in the CVXR package on CRAN.
library(CVXR)
beta <- Variable(p)
loss <- sum((y-x%*%beta)^2)/(2*n)
lassoPenalty <- function(beta,lambda)lambda*p_norm(beta,1)
obj <- loss + lassoPenalty(beta, lambda = 8/(2*n))
prob <- Problem(Minimize(obj))
result <- solve(prob)
beta_CVX <- result$getValue(beta)
For convenience, the results were saved in
CVXResult.RData
, and we simply load in the results.
Finally, we solve the same problem with the lars
package:
library(lars)
fit_lars <- lars(x, y, type = "lasso", intercept = F, normalize = F)
beta_lars <- predict(fit_lars, s = 8 / 2, type = "coefficients",
mode = "lambda")$coefficients
The results are listed below up to 6 decimal digits (due to convergence thresholds). We see that all three packages give the same result.
cmp <- round(cbind(beta_glmnet, beta_lars, beta_CVX), digits = 6)
colnames(cmp) <- c("beta_glmnet", "beta_lars", "beta_CVX")
cmp
## beta_glmnet beta_lars beta_CVX
## V1 1.389118 1.389118 1.389118
## V2 0.007991 0.007991 0.007991
## V3 0.731234 0.731234 0.731234
## V4 0.031119 0.031119 0.031119
## V5 -0.866793 -0.866793 -0.866793
## V6 0.564867 0.564867 0.564867
## V7 0.069678 0.069678 0.069678
## V8 0.358346 0.358346 0.358346
## V9 0.000000 0.000000 0.000000
## V10 0.070565 0.070565 0.070565
## V11 0.173464 0.173464 0.173464
## V12 -0.027472 -0.027472 -0.027472
## V13 -0.017960 -0.017960 -0.017960
## V14 -1.138053 -1.138053 -1.138053
## V15 -0.070990 -0.070990 -0.070990
## V16 0.000000 0.000000 0.000000
## V17 0.000000 0.000000 0.000000
## V18 0.000000 0.000000 0.000000
## V19 0.000000 0.000000 0.000000
## V20 -1.097528 -1.097528 -1.097528