dataframely — A declarative, 🐻‍❄️-native data frame validation library

At QuantCo, we are constantly trying to improve the quality of our code bases to ensure that they remain easily maintainable. More recently, this has often involved migrating data pipelines from pandas to polars to achieve significant performance gains.

At the end of 2023, we began an effort to modernize a massive legacy codebase in one of our longest-running projects. While doing that, we realized that our existing data frame processing code had a fundamental flaw: column names, data types, value ranges, and other invariants — none of it was obvious just from reading the code.

As a result, the typical approach for understanding a function's behavior involved executing it on client infrastructure — the only place the actual data is available. Then, we would manually step through each pandas transformation to inspect the data before and after every change. Naturally, this is tedious, error-prone, and far from efficient.

Once we'd rewritten a chain of transformations in polars, the absence of static type checking or runtime validation on data frame contents meant that bugs were hard to catch. To ensure correctness, we often had to run our entire pipeline end-to-end on large datasets, which required significant time and compute resources.

Eventually, we realized that we needed a better way to describe, validate and reason about the content of the data frames in our data pipeline. We wanted to make invariants obvious while reading the code and actually enforce these invariants at runtime to ensure correctness.

Data frame validation to the rescue

Data frame validation libraries are a natural solution to this problem. Back in 2023, Python libraries already existed that allowed defining data frame schemas and verifying that data frames comply with these schemas, i.e., fulfill predefined expectations.

In some projects, we had already been using pandera, a widely known open-source library, to validate pandas data frames. Unfortunately, back in 2023, pandera did not have any polars support and a notable polars-native alternative, namely patito, was still in its infancy and could not be considered production-ready.

However, even today, we still encounter several limitations with pandera and patito for our use case. We concluded that these limitations are inherent to the libraries' scope and design and cannot easily be addressed by contributing to the projects, even though we still actively contribute to them (e.g., we maintain the conda-forge feedstock of pandera).

Specifically, pandera and patito are missing support for:

  • validation of interdependent data frames
  • soft validation including introspection of failures
  • test data generation from schemas
  • strict static type checking for data frames

Introducing dataframely: A polars-native data frame validation library

To remedy the shortcomings of these libraries, we developed dataframely. dataframely is a declarative data frame validation library with first-class support for polars data frames. Its purpose is to make data pipelines written in polars (1) more robust by ensuring that data meets expectations and (2) more readable by adding schema information to data frame type hints.

Talk is cheap, so let's have a look at some code examples.

Defining schemas

To get started with dataframely, you first define a schema. At QuantCo, we often deal with insurance claims; for instance, we might create a schema for a data frame containing hospital invoices:

from decimal import Decimal

import dataframely as dy
import polars as pl

class InvoiceSchema(dy.Schema):
    # Column definitions, including per-column constraints
    invoice_id = dy.String(primary_key=True)
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    amount = dy.Decimal(nullable=False, min_exclusive=Decimal(0))

    # Custom rule enforcing an invariant across columns
    @dy.rule()
    def discharge_after_admission() -> pl.Expr:
        return pl.col("discharge_date") >= pl.col("admission_date")

Beyond describing the data frame in terms of its columns and their data types, we can also encode expectations at the column level as well as across columns. For example, we can designate one (or multiple) column(s) as the primary key or define a custom validation rule that acts across columns.

Validating a data frame

Once we've defined a schema, we can pass a pl.DataFrame or pl.LazyFrame into its validate classmethod to validate that the contents match the schema definition. If we want to automatically coerce the column types to the types specified in the schema, we can pass cast=True.

from datetime import date

invoices = pl.DataFrame({
    "invoice_id": ["001", "002", "003"],
    "admission_date": [date(2025, 1, 1), date(2025, 1, 5), date(2025, 1, 1)],
    "discharge_date": [date(2025, 1, 4), date(2025, 1, 7), date(2025, 1, 1)],
    "amount": [1000.0, 200.0, 400.0]
})

validated: dy.DataFrame[InvoiceSchema] = InvoiceSchema.validate(invoices, cast=True)

If any row in invoices is invalid, i.e., any rule defined on individual columns or the entire schema evaluates to False, a validation exception is raised. Otherwise, if all rows in invoices are valid, validate returns a validated data frame of type dy.DataFrame[InvoiceSchema].
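
To illustrate the failure case, here is a minimal sketch in which we deliberately corrupt the data so that the cross-column rule is violated (the bad_invoices frame is our own illustration, and we catch a generic Exception since the concrete exception class is described in the library's documentation):

# Hypothetical example: set every discharge date before its admission date,
# violating the discharge_after_admission rule.
bad_invoices = invoices.with_columns(discharge_date=pl.lit(date(2024, 12, 31)))

try:
    InvoiceSchema.validate(bad_invoices, cast=True)
except Exception as exc:  # dataframely raises a validation error here
    print(f"Validation failed: {exc}")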

Importantly, dy.DataFrame[InvoiceSchema] is a pure typing construct and one still deals with a pl.DataFrame at runtime. This has the benefit that dataframely can be adopted in a gradual fashion where any dy.DataFrame[...] can easily be passed to a method that accepts a pl.DataFrame (and vice versa by using type: ignore comments).

The biggest benefit, however, is that the generic data frame type immediately tells the reader of the code what data they can expect to find in the data frame. This markedly improves the usefulness of mypy for data frame-based code: the type checker can now ensure that a data frame passed to a method fulfills certain preconditions with respect to its contents, without incurring a hidden performance hit at runtime.
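
As a sketch of what this looks like in practice (the invoices_per_admission_date function below is our own illustration, not part of dataframely):

def invoices_per_admission_date(
    invoices: dy.DataFrame[InvoiceSchema],
) -> pl.DataFrame:
    # The annotation tells readers (and mypy) exactly which columns are available.
    return invoices.group_by("admission_date").agg(pl.len().alias("num_invoices"))

# A validated frame type-checks; an arbitrary pl.DataFrame would be flagged by
# mypy unless it is validated first (or silenced with a `type: ignore` comment).
invoices_per_admission_date(validated)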

Validating groups of data frames

Oftentimes, data frames (or rather "tables") are interdependent and proper data validation requires consideration of multiple tables that share a common primary key. dataframely enables users to define "collections" for groups of data frames with validation rules on the collection level. To create a collection, we first introduce a second schema for diagnosis data frames:

class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    diagnosis_code = dy.String(primary_key=True, regex=r"[A-Z][0-9]{2,4}")
    is_main = dy.Bool(nullable=False)

    # Grouped rule: evaluated once per invoice_id
    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis() -> pl.Expr:
        return pl.col("is_main").sum() == 1

We can then create a collection that bundles invoices and the diagnoses that belong to these invoices:

# Introduce a collection for groups of schema-validated data frames
class HospitalClaims(dy.Collection):
    invoices: dy.LazyFrame[InvoiceSchema]
    diagnoses: dy.LazyFrame[DiagnosisSchema]

    @dy.filter()
    def at_least_one_diagnosis_per_invoice(self) -> pl.LazyFrame:
        return self.invoices.join(
            self.diagnoses.select(pl.col("invoice_id").unique()),
            on="invoice_id",
            how="inner",
        )

Notice how we can further define our expectations on the collection contents by adding validation across members of the collection using the @dy.filter decorator.

If we call validate on the collection, a validation exception is raised if any of the input data frames does not satisfy its schema definition or if the collection's filters result in the removal of at least one row from any of the input data frames.
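
As a sketch, and assuming that validate on a collection accepts a mapping from member names to data frames (see the API documentation for the exact signature), this might look as follows with some hypothetical diagnosis data:

diagnoses = pl.DataFrame({
    "invoice_id": ["001", "001", "002", "003"],
    "diagnosis_code": ["A123", "B456", "C789", "D012"],
    "is_main": [True, False, True, True],
})

# Raises a validation exception if any member violates its schema
# or a collection filter would remove at least one row.
claims = HospitalClaims.validate(
    {"invoices": invoices, "diagnoses": diagnoses}, cast=True
)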

Soft-validation and validation failure introspection

While calling validate is useful for ensuring correctness, in production pipelines, we typically do not want to raise an exception at runtime. To this end, dataframely provides the filter method to perform "soft-validation" of schemas and collections. filter returns the rows that pass validation along with a FailureInfo object for inspecting the invalid rows:

good, failure = InvoiceSchema.filter(invoices, cast=True)

# Inspect the reasons for the failed rows
failure.counts()

# Inspect the co-occurrences of validation failures
failure.cooccurrence_counts()

Since filter does not raise an exception, we can safely use it in our production code and log the invalid rows to inspect them later.
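
For instance, here is a minimal sketch of persisting the failed rows for later inspection, assuming that FailureInfo exposes the offending rows via an invalid() accessor (see the API documentation for the exact interface):

# Assumption: invalid() returns the offending rows as a data frame.
invalid_rows = failure.invalid()

# Persist the invalid rows so they can be inspected after the pipeline run.
invalid_rows.write_parquet("invalid_invoices.parquet")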

Additional features

Throughout our journey with dataframely, we realized that defining schemas, and thus encoding expectations on data frame contents, has various benefits beyond running validation. For example, we can automatically derive the SQL schema of a table if we want to write a data frame to a database. Another possibility is to automatically generate sample data for unit testing that adheres to the schema, thus letting test authors focus on test content rather than verbosely creating data frames. To learn about all the possibilities, check out the API documentation.
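
As a minimal sketch of test data generation, assuming that a schema can produce schema-conforming samples via a sample classmethod (see the API documentation for the exact signature):

# Generate schema-conforming sample invoices for a unit test instead of
# hand-writing a data frame.
sample_invoices = InvoiceSchema.sample(num_rows=5)
print(sample_invoices)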

Experiences in practice

Understanding the structure and content of data frames is crucial when working with tabular data — a core requirement for the highly robust data pipelines we aim to build at QuantCo. dataframely has already brought us closer to that goal: today, we are successfully using dataframely in the day-to-day work of multiple teams across several clients for both analytical and production pipelines.

Our data scientists and engineers love dataframely because

  • It improves the legibility, comprehensibility, and robustness of pipeline code
  • It increases code quality and confidence in code correctness through statically typed APIs and contracts
  • It enables code generation from data frame schema definitions (e.g., for SQL operations)
  • It allows introspecting pipeline failures more easily
  • It facilitates unit testing data pipelines through sample test data generation

We're excited to open-source dataframely and share it with the data engineering community. If you're working with complex data pipelines and looking to improve reliability, productivity, and peace of mind, we think you'll love it too.

Check out dataframely on GitHub and let us know what you think!