High-performance and user-friendly GLMs with glum 3

Generalized linear models (GLMs) are interpretable, relatively quick to train, and specifying them helps the modeler understand the main effects in the data. This makes them a popular choice today to complement other machine-learning approaches.

Our initial goal in building glum was to offer the community an efficient, feature-rich, and Python-first GLM library with a scikit-learn-style API. The new release focuses on making glum’s strengths more accessible to the user. One could say that glum just got a little friendlier.

In the following, we will guide you through the key user-facing change in the new release: a formula interface. If glum is new to you, we also recommend checking out the getting started guide.

A brief intro to formulas in glum

Wilkinson-style notation, also known as formulas, has become the de facto standard for specifying linear models in a wide range of statistical software. Since they were first implemented in S1 and subsequently popularized by R, formulas have found their way into the interfaces of many packages involving linear models across various languages. We are excited that glum is now part of this group! What’s more, we implemented formulas in a very glum-like manner, with a focus on performance, especially for sparse and categorical data.2 Our formula interface is based on the fantastic formulaic library.

The idea behind formulas is to replace certain pre-processing steps, usually performed with pandas, polars, or scikit-learn transformers, with a domain-specific language for specifying linear models. At their most basic, the dependent and independent variables are separated by a tilde (~), and the independent variables are combined with plus signs (+). However, formulas are capable of much more. Consider the following example:3

formula = (
    "Miles_per_Gallon ~ {np.log(2.20462 * Weight_in_lbs)} "
    "+ bs(Horsepower, 3) + C(Cylinders) * Origin"
)
model = glum.GeneralizedLinearRegressor(
    formula=formula, drop_first=True, fit_intercept=False, alpha=0.01
)
model.fit(cars[:380])

There is quite a lot going on here, so let us highlight the main features:

  • Code in {} is evaluated as a Python expression and not according to the rules of the formula grammar. This allows for rather complex pre-processing steps.
  • C denotes that a term should be treated as categorical, even if it is of numeric type.
  • * outside of curly brackets means that both variables and their interaction should be included in the model. With numerics, an interaction is simply multiplication, but with categoricals, it consists of all combinations of their levels.
  • bs creates a basis spline. Check out formulaic’s documentation for more helpful built-in transformations.
  • When used with the formula interface, drop_first=True goes beyond simply dropping the first level of each categorical variable: it ensures structural full-rankness.
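To build some intuition for what * does, here is a toy sketch, using plain pandas rather than glum’s actual machinery, of how a term like C(Cylinders) * Origin expands into main effects plus all pairwise interaction columns (the toy column names are ours, not glum’s):

```python
import pandas as pd

toy = pd.DataFrame(
    {"cyl": [4, 6, 4, 8], "origin": ["USA", "Europe", "USA", "Japan"]}
)

# main effects: one dummy column per non-reference level
cyl_d = pd.get_dummies(
    toy["cyl"].astype("category"), prefix="cyl", drop_first=True, dtype=float
)
orig_d = pd.get_dummies(
    toy["origin"], prefix="origin", drop_first=True, dtype=float
)

# the interaction consists of every pairwise product of the dummy columns
inter = pd.DataFrame(
    {f"{a}:{b}": cyl_d[a] * orig_d[b] for a in cyl_d for b in orig_d}
)

# "*" includes both the main effects and the interaction
X = pd.concat([cyl_d, orig_d, inter], axis=1)
print(X.shape)  # 4 rows, 2 + 2 + 4 = 8 columns
```

With three levels on each side, the main effects contribute two columns each (one level is the reference), and the interaction contributes the remaining two-by-two products.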

The other half of the magic happens at prediction time. One can simply use the fitted model to get predictions for unseen data, as expected:

model.predict(cars[380:])

Because the model includes the pre-processing steps in the formula, there is no need to do anything else when using predict. Furthermore, stateful transforms, such as bs, center, or C, store the relevant state at training time and reuse it at prediction time. For instance, in the current example, bs will place the spline’s knots based on the training data, and C will encode according to the categories encountered at training time.
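The idea behind stateful transforms can be sketched in a few lines. The class below is purely illustrative (formulaic and glum handle this for you); it shows why a transform like center must remember its training state rather than recompute it on new data:

```python
import numpy as np


class Center:
    """Illustrative stateful transform: learns the mean at training time."""

    def fit(self, x):
        # state captured once, from the training data
        self.mean_ = float(np.mean(x))
        return self

    def transform(self, x):
        # prediction-time data is shifted by the *training* mean,
        # not by its own mean
        return np.asarray(x) - self.mean_


center = Center().fit([1.0, 2.0, 3.0, 4.0])  # training mean = 2.5
print(center.transform([10.0, 20.0]))        # [ 7.5 17.5]
```

If transform instead re-centered by the new data’s own mean, train-time and prediction-time features would be on different scales, and predictions would be silently wrong.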

Efficient categorical interactions

Besides efficient optimization, glum’s excellent performance stems from its handling of mixed-sparsity data (i.e. dense, sparse, and categorical columns). Notably, glum can estimate a model with categorical variables as if they were dummy encoded, without ever materializing the expanded dummy matrix.

The formula interface was created with this in mind. In particular, variable interactions preserve sparsity (the interaction of a dense and a sparse column will always be sparse), and glum uses this fact to save memory. Even more distinctive is how categorical interactions are handled. Glum exploits the observation that the interaction of any two categoricals can be represented as a single new categorical. This makes it possible to include interactions with high-cardinality categoricals in models without worrying about the memory implications. As an example, the following snippet will most likely run out of memory on your computer:

df = pd.DataFrame(
    {
        "c1": pd.Categorical(np.random.randint(low=1, high=200, size=1_000_000)),
        "c2": pd.Categorical(np.random.randint(low=1, high=200, size=1_000_000)),
        "y": np.random.rand(1_000_000),
    }
)

X = formulaic.model_matrix("c1 : c2", data=df)
big_model_ext = glum.GeneralizedLinearRegressor(alpha=1).fit(X=X, y=df["y"])

On the other hand, glum should have no problem fitting the following:

big_model = glum.GeneralizedLinearRegressor(formula="y ~ c1 : c2", alpha=1).fit(df)

This is because glum’s formula interface is not simply some thin wrapper that delegates the model matrix creation to formulaic and uses the result for model fitting. Instead, we created a custom formula materializer for tabmat (glum’s matrix backend), which efficiently handles interactions between dense, sparse, and categorical columns.
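The underlying observation is easy to demonstrate with plain pandas (a toy sketch, not glum’s or tabmat’s internal implementation): the interaction of two categoricals with k and m levels is just one new categorical with up to k·m levels, built directly from the integer code arrays:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
c1 = pd.Categorical(rng.integers(1, 200, size=1_000_000))
c2 = pd.Categorical(rng.integers(1, 200, size=1_000_000))

# combine the two integer code arrays into a single code array
codes = c1.codes.astype(np.int64) * len(c2.categories) + c2.codes
interaction = pd.Categorical.from_codes(
    codes,
    categories=[f"{a}:{b}" for a in c1.categories for b in c2.categories],
)

# one categorical column instead of tens of thousands of dummy columns
print(interaction.categories.size)  # 199 * 199 = 39601
```

The result is a single column of small integer codes, so memory usage stays proportional to the number of rows rather than to the number of level combinations.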

The new release also enhances categorical handling by providing flexible options for managing missing values and customizing feature names.

Formulas work well with other new features

The formula interface is also designed to work well with other recent additions to glum: Wald tests and term names. In the case of the former, the linear hypotheses can be specified as formulas, making tests more flexible and convenient:

unregularized_model = glum.GeneralizedLinearRegressor(
    formula="Miles_per_Gallon ~ Weight_in_lbs + bs(Horsepower, 3)",
    drop_first=True,
    fit_intercept=True,
    alpha=0,
).fit(cars)
unregularized_model.wald_test(
    cars, formula="`bs(Horsepower, 3)[1]` = 2 * `bs(Horsepower, 3)[2]`"
)

The new release also brings term names, which aggregate over feature names. When a model is specified via a formula, term names correspond to formula terms separated by + signs.4 For example, the hypothesis that horsepower does not have a statistically significant effect in the example model can be tested like so:

unregularized_model.wald_test(cars, terms=["bs(Horsepower, 3)"], r=[0])

Check out the example notebook for even more information and examples on formulas in glum.

Next steps

If you have made it this far, make sure to give glum a try. We welcome feature requests and bug reports, and we appreciate contributions.

Footnotes

  1. Their actual invention predates any statistical software still in use today: G. N. Wilkinson, C. E. Rogers: Symbolic Description of Factorial Models for Analysis of Variance (1973).

  2. Check out tabmat v4, too, which also gained the ability to create matrices based on formulas.

  3. To execute the following code snippets yourself, first run import formulaic; import glum; import pandas as pd; import numpy as np; from vega_datasets import data; cars = data.cars().dropna().

  4. Interaction via a*b is an exception, as it is automatically expanded to a + b + a:b.