conda support in pre-commit

Developing code involves several tasks that are simple yet repetitive, such as styling your code (we use black) and checking for common issues. These tasks are easy to automate, and we do so with Git's hook system via the pre-commit utility. On every commit, pre-commit applies our code style conventions with black, checks for common (but subtle) issues with flake8, and lets mypy check that our Python type hints are consistent. These checks resolve many issues long before code reaches the CI system and human code review, without putting any additional burden on the developer.

As scientific Python and R developers, we have lots of compiled dependencies and thus prefer to manage our (Python) environments using conda. In addition, much of our development takes place in environments that are not directly connected to the internet. All packages are installed from a local conda mirror, and we try to keep this our sole package mirror. We do not have a local PyPI or CRAN mirror. While this may sound like an impediment, we are very active contributors to conda-forge and can get most packages built and deployed there in less than 4 hours (a massive thanks to the really fast reviewers!).

To be able to use only a single package mirror, all tools must support conda to create environments. Sadly, pre-commit only supported virtualenv and python-venv for creating Python virtual environments, and it had no support for R at all. To work around this problem, we did not let pre-commit automatically create the environments the checks run in. Instead, we set language: system to tell pre-commit to use the first black / mypy / … it finds on the PATH. While this is a simple solution, it loses the ability to ensure that pre-commit runs the correct version of a check. This is especially problematic for tools such as mypy, where different versions report different issues. The only remedy was to make sure that all our code repositories use the same version of each check and that all of them are updated simultaneously. Anyone who has ever managed a large codebase knows that this is quite a job.
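
For illustration, the language: system workaround described above might have looked roughly like the following .pre-commit-config.yaml. This is a minimal sketch rather than our actual configuration: the repo: local entry and the hook ids and names are assumptions chosen for the example.

```yaml
# Sketch of the former workaround: each hook runs whatever is first on PATH.
repos:
  - repo: local
    hooks:
      - id: black          # hypothetical id
        name: black
        entry: black       # resolved via PATH; the version is not pinned
        language: system
        types: [python]
      - id: mypy           # hypothetical id
        name: mypy
        entry: mypy        # a different mypy on PATH may report different issues
        language: system
        types: [python]
```

Nothing in such a file records which version of a tool actually ran, which is exactly the drawback discussed above.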

One option to get out of this situation was to set up a PyPI mirror. This would have enabled us to use pre-commit's language: python facility with all the required features, but it would only have solved the issue for Python checks. For checks in other languages like R, or for other binaries that are available through conda-forge, we would still have been out of luck. In the end, we chose to implement conda as a language type in pre-commit.

Python pre-commit hook with conda

To write a pre-commit hook that is set up using conda, you should have a repository that contains the hook specification file .pre-commit-hooks.yaml and the conda environment definition environment.yml. As an example, a pre-commit hook for mypy would have the following environment.yml:

channels:
  - conda-forge
  - defaults
dependencies:
  - mypy=0.761

The corresponding .pre-commit-hooks.yaml is adapted from the language: python hook:

- id: mypy-conda
  name: mypy-conda
  entry: mypy
  language: conda
  'types': [python]
  args: ["--ignore-missing-imports", "--scripts-are-modules"]
  require_serial: true
  additional_dependencies: []

You can then use this hook by adding the following snippet to your .pre-commit-config.yaml:

 - repo: https://github.com/Quantco/pre-commit-mirrors-mypy
   rev: '0.761'
   hooks:
     - id: mypy-conda

R pre-commit hook with conda

As conda is a general package management system and not bound to Python, we can also use it to write pre-commit hooks in R. While Python tools are commonly available as executables via entry points, most things in the R world are only callable from within R and not directly from the command line. Thus, we need to include a bit of R code in the hook.

First, we declare the dependencies in the environment.yml:

channels:
  - conda-forge
  - defaults
dependencies:
  - r-base=3.6

Additional R packages are readily available on conda-forge with an r- prefix and the CRAN package name in lowercase. For example, if you wanted to write a pre-commit hook that knits Readme.Rmd into Readme.md on each commit that touches Readme.Rmd, you would also add r-knitr to the environment.yml.
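
For the knitr case just mentioned, the environment.yml might look like the following sketch. The package is left unpinned here; whether and how to pin r-knitr is up to you.

```yaml
channels:
  - conda-forge
  - defaults
dependencies:
  - r-base=3.6
  - r-knitr   # CRAN package "knitr": lowercase name with the r- prefix
```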

For the scope of this example, we'll limit ourselves to reimplementing the parsable-R hook from lorenzwalthert/precommit. This hook is originally written as language: script, meaning that it is up to the user to have the dependencies pre-installed on the system. While for this simple case, it would be enough to have any working R installation at all, in other cases, you may want to pin to a specific R version or package version, at which point having proper dependency management courtesy of conda becomes extremely helpful. The hook itself calls a script that loops over all R files and checks them for (in)valid R code.

To use language: conda, we need to combine the script and the pre-commit configuration in .pre-commit-hooks.yaml:

- id: parsable-R-conda
  name: parsable-R-conda
  description: check if a .R file is parsable
  language: conda
  types: [r]
  entry: |
         Rscript -e 'files <- commandArgs(trailingOnly = TRUE)

         out <- lapply(files, function(path) {
           tryCatch(
             parse(path),
             error = function(x) stop("File ", path, " is not parsable", call. = FALSE)
           )
         })'

You can then use this hook in your project in the usual way:

 - repo: https://github.com/some/repo
   rev: '0.1.2'
   hooks:
     - id: parsable-R-conda

This hook will also run successfully if R is not on your PATH and even if R is not installed at all. The hook will create a conda environment that has the necessary dependencies (in this case only base R).

pre-commit mirrors for conda

As the language: conda feature is quite new to pre-commit, we also needed to make the hooks themselves available for conda. For starters, this meant converting the existing Python/virtualenv-based hooks to conda, as explained above in the section "Python pre-commit hook with conda". At the time of writing, we have converted the following hooks:

mypy: https://github.com/Quantco/pre-commit-mirrors-mypy
flake8: https://github.com/Quantco/pre-commit-mirrors-flake8
isort: https://github.com/Quantco/pre-commit-mirrors-isort
pyupgrade: https://github.com/Quantco/pre-commit-mirrors-pyupgrade
black: https://github.com/Quantco/pre-commit-mirrors-black

These repositories contain the same configurations as their virtualenv equivalents but install all dependencies from conda-forge via conda.
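
Put together, a project's .pre-commit-config.yaml using several of these mirrors could look like the sketch below. Only the mypy entry is taken from the example earlier in this post. The other hook ids are assumed to follow the same -conda naming pattern, and the rev values are placeholders, so check each repository's .pre-commit-hooks.yaml and tags before copying this.

```yaml
repos:
  - repo: https://github.com/Quantco/pre-commit-mirrors-black
    rev: '<tag>'             # placeholder: pick a tag from the mirror
    hooks:
      - id: black-conda      # assumed id
  - repo: https://github.com/Quantco/pre-commit-mirrors-flake8
    rev: '<tag>'             # placeholder
    hooks:
      - id: flake8-conda     # assumed id
  - repo: https://github.com/Quantco/pre-commit-mirrors-mypy
    rev: '0.761'             # from the example above
    hooks:
      - id: mypy-conda
```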

Universal pre-commit config behind firewalls

As mentioned above, we use these hooks in firewalled/internet-decoupled environments. Although we cannot reach https://github.com from these environments, we can still use the same .pre-commit-config.yaml that we use where we have full internet access. You keep the https://github.com/ URLs in your configuration and let git rewrite the clone URLs on the fly to a local git repository mirror using:

git config --global url."https://github-mirror.local".insteadOf "https://github.com"

This is not actually related to the conda support, but it is definitely worth a note for all who face the same situation.