Introducing the multiregex library

In this post we discuss how we improved the runtime performance of a text mining step in a machine learning pipeline by a factor of 12. We introduce the multiregex library, which is based on the ideas developed in this post.

At QuantCo, we use machine learning to create value from large amounts of data for our clients. While most of that data is numerical or categorical, some of it is raw, unstructured text. For example, when we build a machine learning model that learns from human behavior, we want the model to be able to learn from notes entered by humans. For privacy reasons, we cannot (and do not want to!) include the full raw text in the training data. Instead, as a preprocessing step, we use text mining to extract the core information from the raw text. This also helps with converting the text to a format that can be fed into tabular models like Random Forests.

Using regular expressions to preprocess unstructured text

In the specific model that we are covering in this post, we use a list of regular expressions to extract relevant information from notes written by humans.

The results of matching those regular expressions are passed to steps later in the model pipeline.

To illustrate, consider the following example of a human-written note, some patterns and their matches on the note, and the model features that are derived from the matches:

raw_text = "Changed value of position X to USD 12.34"
patterns = {
    "position": "pos(?:ition|\.) (\w+)",
    "dollar_amount": "(?:USD|\$) ?(\d+(?:\.\d+))?",
    ... many more ...
}
matches = [
    ("position", "X"),
    ("dollar_amount", "12.34"),
    ...
]
features = {
    "has_X_position": ("position", "X") in matches,
    "sum_dollar_amount": sum(float(amount) for tag, amount in matches if tag == "dollar_amount"),
    ...,
}

As part of our training pipeline, we run these regular expression patterns against all text in the training data. We have a lot of patterns, and even more text in the training data, so we have to execute a lot of pattern evaluations (billions!). At the same time, the actual number of matches per single instance of raw text is low: on average, only 0.1% of expressions match against any given text. We will leverage this fact for optimization later in the post.

It is well known that evaluating regular expressions is not fast, despite decades of research in building optimized regular expression engines.

Indeed, once we had optimized other (slower) parts of the training pipeline, the execution of all those regular expression pattern evaluations remained as the major bottleneck. And so we began to brainstorm how to speed up this part of the training pipeline.

The obvious idea of simply throwing more hardware at the problem did not work here, because we were already running the pipeline in a highly parallelised manner. So we needed to optimize single-threaded performance.

Finding a faster regular expression engine

Maybe there are alternative, faster regular expression engines for Python?

One of the most well-known regular expression engines is Google's re2. In Python, it can be used via the pyre2 binding. Simply replacing

import re

with

import re2 as re

gave a 3x speedup. Not bad for a single-line code change!

Sadly, not all of our patterns are supported by re2, and pyre2 falls back to using Python's re for those patterns. Even worse, re2 silently produces different results for some of the patterns. For example, non-ASCII word boundaries are treated differently from re, and unfortunately most of the text we work with contains a lot of non-ASCII letters such as German umlauts (ä, ö, ü). This ruled out re2 as an alternative.
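
The difference comes down to re2 treating \b as an ASCII-only word boundary, while Python's re uses Unicode word characters by default. The same divergence can be reproduced with re alone by passing the re.ASCII flag; here is a minimal illustration (the example text is made up):

import re

text = "sehr ökonomisch gedacht"

# With Python's default Unicode semantics, "ö" is a word character, so there is
# a word boundary between the space and "ö" and the pattern matches.
re.findall(r"\bökonomisch\b", text)            # ['ökonomisch']

# With ASCII-only word boundaries (which is how re2 treats \b), "ö" is not a
# word character, there is no boundary before it, and the pattern does not match.
re.findall(r"\bökonomisch\b", text, re.ASCII)  # []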

We also tried a couple of other engines, but none provided the combination of good compatibility and better performance that we were looking for.

Disappointed by our search for a fast, powerful regular expression engine, we even considered ditching regular expressions completely and replacing them with an alternative text-matching language that is faster to evaluate. There are a lot of good reasons to keep using regular expressions though:

  • We have hundreds of hand-crafted patterns in the codebase. All of these would need translation to the new language.
  • We are sharing the pattern definitions with software upstream and downstream of our prediction pipeline. Since we cannot assume that this software is written in Python, we need a text-matching language that has good implementations in most major programming languages.
  • Regular expressions, despite their reputation of being a hard-to-read, "write-only" language, are well known by most programmers.

Consequently, we decided that we still want to keep using regular expressions.

But so far we had not exploited the fact that we have very few matches. This made us wonder: is there a way to build a really fast pre-filtering step that rules out some of the pattern candidates without (fully) evaluating them? That pre-filter does not have to be perfect; if it rules out X% of the patterns that "obviously" will not match a particular string, we save X% of compute time on that string. This concept is remarkably similar to Bloom filters. So, how do you build a regular expression Bloom filter?

Regular expression Bloom filters

Taking a look at some of the example patterns given earlier in this post, there seems to be an obvious way to pre-filter pattern candidates: You find "simple" substrings in the pattern that can be evaluated without a regular expression engine, that is, sequences of literal characters:

# This pattern will never match if "pos" is not in the string
"pos(?:ition|\.) (\w+)" -> ["pos"]

# This pattern will never match if neither "USD" nor "$" are in the string
"(?:USD|\$) ?(\d+(?:\.\d+))?" -> ["USD", "$"]

To prototype this idea, we used a subset of the patterns and manually created a list of "pre-matchers" for each pattern. We then added the Bloom filter condition before calling into re.search:

def bloomified_re_search(pattern, pre_matchers, string):
    if any(pre_matcher in string for pre_matcher in pre_matchers):
        # Maybe a match. Let's evaluate the regular expression.
        return re.search(pattern, string)
    else:
        # Definitely not a match.
        return None

This works, and produces a speedup of around 5x. Profiling shows that we eliminated most of the time spent in the Python regular expression engine, but now a lot of time is spent evaluating the pre-matchers.
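
To get a feeling for the effect without the full pipeline, the comparison can be reproduced with a small micro-benchmark along these lines (the patterns and the example note are just illustrative placeholders):

import re
import timeit

# Two of the example patterns from above, each paired with hand-written
# pre-matchers, plus a note that matches none of them (the common case).
patterns = [
    (re.compile(r"pos(?:ition|\.) (\w+)"), ["pos"]),
    (re.compile(r"(?:USD|\$) ?(\d+(?:\.\d+))?"), ["USD", "$"]),
]
text = "Nothing of interest happened today"


def search_plain():
    # Evaluate every regular expression unconditionally.
    return [pattern.search(text) for pattern, _ in patterns]


def search_bloomified():
    # Evaluate a regular expression only if one of its pre-matchers occurs.
    return [
        pattern.search(text)
        for pattern, pre_matchers in patterns
        if any(pre_matcher in text for pre_matcher in pre_matchers)
    ]


print(timeit.timeit(search_plain, number=100_000))
print(timeit.timeit(search_bloomified, number=100_000))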

We thought that it would be rather easy to speed up that part using Cython, and re-implemented the code in Cython in a relatively dumb but very effective fashion: a large cascade of if strstr(...)1 calls, automatically generated from a list of patterns and pre-matchers:

...

cdef extern from "string.h":
    char* strstr(const char *haystack, const char *needle)

...

cdef const char *s_data = my_python_string
pattern_candidates = set()
# Pre-matchers for "pos(?:ition|\.) (\w+)"
if (strstr(s_data, "pos") is not NULL):
    pattern_candidates.add(pattern1)
# Pre-matchers for "(?:USD|\$) ?(\d+(?:\.\d+))?"
if (strstr(s_data, "USD") is not NULL or strstr(s_data, "$") is not NULL):
    pattern_candidates.add(pattern2)
... 100s more ...

...

This solution produced a 10x speedup relative to the original runtime. Nice! With the full list of patterns, the generated Cython file would be a few thousand lines of code and would take around 30 s to compile, but that's fine with us.

At this point, we were quite happy with the solution, apart from two problems: First, we don't want to add a Cython build step to the project. Second, writing all those pre-matchers by hand is tedious and error-prone.

Automatically generating pre-matchers

Let's talk about generating pre-matchers first.

Regular expressions are frequently used for parsing. At the same time, every regular expression engine itself needs a parser to turn patterns into an intermediate representation that can be used for evaluation or compilation. CPython's regular expression parser is hidden in the re._parser module. While it is designed to be used by the internals of CPython's re module, it can also be used to retrieve an Abstract Syntax Tree (AST) from a regular expression pattern:

>>> import re, pprint
>>> re_ast = re._parser.parse(r"pos(?:ition|\.) (\w+)")
>>> pprint.pprint(list(re_ast))
[(LITERAL, 112),    # "p"
 (LITERAL, 111),    # "o"
 (LITERAL, 115),    # "s"
 (BRANCH,           # "(?:ition|\.)"
  ...),
 (LITERAL, 32),     # " "
 (SUBPATTERN,       # "(\w+)"
  ...)]

As we can see, it is quite easy to extract pre-matchers from simple patterns: just find a sequence of LITERALs in their AST. In this particular example, that could be either ["pos"] or [" "]. We prefer ["pos"] over [" "] because it clearly produces fewer false positives. As a general heuristic for choosing between sets of pre-matchers, we can assume that longer strings tend to produce fewer false positives.

For more complicated patterns, constructing pre-matchers is not quite as simple. For example, we need to take into account that patterns may have alternatives (denoted by | in regular expression syntax, and called BRANCH in the AST):

>>> re_ast = re._parser.parse(r"(?:USD|\$) ?(\d+(?:\.\d+))?")
>>> pprint.pprint(list(re_ast))
[(BRANCH,
  (None,
   [[(LITERAL, 85), (LITERAL, 83), (LITERAL, 68)],    # "USD"
   [(LITERAL, 36)]])),                                # "$"
 ...]

In the case of a branch, we have to add both of the alternatives to the list of pre-matchers: the presence of either one means the pattern may match, and the absence of both means it cannot. So in this example, a valid set of pre-matchers is ["USD", "$"].
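
Putting the two cases together, here is a simplified sketch of a pre-matcher generator built on re._parser. It only looks at the top level of the AST and is not multiregex's actual generator, but it reproduces the two examples above:

import re  # re._parser is a CPython internal; we use it here for illustration only

_parser = re._parser


def generate_pre_matchers(pattern):
    """Collect top-level LITERAL runs and literal-only BRANCH alternatives.

    Returns a set of pre-matcher strings, or None if no safe set was found.
    """
    literal_runs, current = [], []
    branch_alternatives = None

    def flush():
        if current:
            literal_runs.append("".join(current))
            current.clear()

    for op, arg in _parser.parse(pattern):
        if op == _parser.LITERAL:
            current.append(chr(arg))
            continue
        flush()
        if op == _parser.BRANCH and branch_alternatives is None:
            # A branch is only usable if every alternative is literal-only;
            # otherwise we cannot turn it into a safe set of pre-matchers.
            alternatives = []
            for subpattern in arg[1]:
                if all(o == _parser.LITERAL for o, _ in subpattern):
                    alternatives.append("".join(chr(a) for _, a in subpattern))
                else:
                    alternatives = None
                    break
            branch_alternatives = alternatives
    flush()

    if literal_runs:
        # Heuristic from above: prefer the longest literal run, since longer
        # strings tend to produce fewer false positives.
        return {max(literal_runs, key=len)}
    if branch_alternatives:
        return set(branch_alternatives)
    return None


generate_pre_matchers(r"pos(?:ition|\.) (\w+)")        # {'pos'}
generate_pre_matchers(r"(?:USD|\$) ?(\d+(?:\.\d+))?")  # {'USD', '$'} (set order may vary)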

Our final pre-matcher generator covers more cases, some of which are specific to the patterns used in our application. We use the CPython re unit tests to evaluate the correctness and completeness of our pre-matcher generator.

The generator is not perfect though: for a handful of patterns we had to manually create pre-matchers, either because the generator was not smart enough to generate any, or because the autogenerated pre-matchers were very weak (correct, but with lots of false positives). As an example, the autogenerated pre-matchers for "A[0-9]" are ["A"], which is a valid set of pre-matchers but produces a lot of false positives. Manually setting ["A0", "A1", ..., "A9"] as the pre-matchers is a lot faster, even though the set of pre-matchers is larger.
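
To illustrate the difference in pre-matcher strength (the example text is made up):

# Pre-matchers for the pattern "A[0-9]".
weak_pre_matchers = ["A"]                               # autogenerated
strong_pre_matchers = [f"A{d}" for d in "0123456789"]   # hand-written

text = "An All-CAPS note without any A followed by a digit"

# The weak pre-matcher flags this text as a candidate (a false positive),
# so the regular expression would still be evaluated ...
any(m in text for m in weak_pre_matchers)    # True
# ... while the stronger pre-matchers rule it out without touching the regex.
any(m in text for m in strong_pre_matchers)  # False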

Aho-Corasick

Earlier in the post we introduced autogenerated Cython code to speed up substring matches. Adding Cython to a project's build and packaging pipeline does not come without cost, though: It involves a compilation step either during packaging or during installation.

Taking a step back, we realised that our generated code essentially performs dictionary matching: it finds which strings from a fixed "dictionary" occur in the input. A proper dictionary-matching algorithm like Aho-Corasick can do exactly that. There are various Python libraries that implement dictionary-matching algorithms; we opted for pyahocorasick.

The way pyahocorasick works is that you first create an automaton with all the strings you are interested in (the "dictionary"):

import ahocorasick

automaton = ahocorasick.Automaton()
automaton.add_word("pos", r"pos(?:ition|\.) (\w+)")
automaton.add_word("USD", r"(?:USD|\$) ?(\d+(?:\.\d+))?")
automaton.add_word("$", r"(?:USD|\$) ?(\d+(?:\.\d+))?")
automaton.make_automaton()

Then you can use the automaton to search for all strings in the dictionary at the same time:

>>> list(automaton.iter("Changed value of position X"))
[(19, 'pos(?:ition|\\.) (\\w+)')]

>>> list(automaton.iter("Changed value of position X to USD 12.34"))
[(19, 'pos(?:ition|\\.) (\\w+)'),
 (33, '(?:USD|\\$) ?(\\d+(?:\\.\\d+))?')]

pyahocorasick saves us the additional Cython dependency and also brings another 20% speedup over the Cython version.
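
For completeness, here is a minimal sketch of how the automaton and the regular expressions fit together into a single search step. This is not multiregex's actual implementation, just the idea in a few lines:

import re

import ahocorasick

# The example patterns from above, each paired with its pre-matchers.
patterns = [
    (re.compile(r"pos(?:ition|\.) (\w+)"), ["pos"]),
    (re.compile(r"(?:USD|\$) ?(\d+(?:\.\d+))?"), ["USD", "$"]),
]

# Build a single automaton over all pre-matchers. Each word maps to the set of
# pattern indices it pre-filters for (several patterns may share a word).
words = {}
for index, (_, pre_matchers) in enumerate(patterns):
    for word in pre_matchers:
        words.setdefault(word, set()).add(index)

automaton = ahocorasick.Automaton()
for word, indices in words.items():
    automaton.add_word(word, indices)
automaton.make_automaton()


def search_all(string):
    """Evaluate only those patterns whose pre-matchers occur in `string`."""
    candidates = set()
    for _end_index, indices in automaton.iter(string):
        candidates |= indices
    return [
        match
        for index in sorted(candidates)
        if (match := patterns[index][0].search(string)) is not None
    ]


search_all("Changed value of position X to USD 12.34")
# => [<re.Match object; span=(17, 27), match='position X'>,
#     <re.Match object; span=(31, 40), match='USD 12.34'>]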

Enter multiregex

Our open-source multiregex library is based on the ideas developed in this post. We extracted it from the particular client project to make it available for other QuantCo projects and everyone else.

We aim to make it a "drop-in" replacement for re as much as possible. It uses CPython's re engine for regular expression matching, re._parser to autogenerate pre-matchers, and pyahocorasick to quickly find match candidates.

Here is a sneak preview of how to use multiregex. For details, please refer to the multiregex documentation.

import multiregex

# Autogenerate pre-matchers and create a pyahocorasick automaton
matcher = multiregex.RegexMatcher(
    [
        r"pos(?:ition|\.) (\w+)",
        r"(?:USD|\$) ?(\d+(?:\.\d+))?"
    ]
)

# Use like multiple calls to `re.search`, just faster!
matcher.search("Changed value of position X to USD 12.34")
# =>
# [
#     (re.compile(r'pos(?:ition|\.) (\w+)', re.UNICODE),       <re.Match object; span=(17, 27), match='position X'>),
#     (re.compile(r'(?:USD|\$) ?(\d+(?:\.\d+))?', re.UNICODE), <re.Match object; span=(31, 40), match='USD 12.34'>)
# ]

Conclusion

In this post we discuss how we use regular expression based text mining to build privacy-preserving machine learning models, and how we sped up that text mining step by an order of magnitude. We introduce "regular expression Bloom filters", which efficiently eliminate patterns that definitely do not match a given string without evaluating the patterns themselves. Regular expression Bloom filters are based on the idea of cheap "pre-matchers" that can be auto-generated from regular expression ASTs. We use the Aho-Corasick algorithm to quickly evaluate those pre-matchers on large amounts of text, and evaluate only the patterns that remain after pre-matching.

Our work on regular expression Bloom filters is open-sourced as the multiregex library.

After building the multiregex library, we realized that Rust's regex crate has this optimization built-in. See the section on Literal optimizations on Andrew Gallant's blog. It might be worthwhile checking the regex crate's performance and compatibility with Python's re module in the future.

Footnotes

  1. strstr on cppreference.com