dataframely — A declarative, 🐻❄️-native data frame validation library
At QuantCo, we are constantly trying to improve the quality of our code bases to ensure that they
remain easily maintainable. More recently, this often involved migrating data pipelines from
pandas to polars in order to achieve significant performance gains.
At the end of 2023, we started undertaking an effort to modernize a massive legacy codebase in one of our longest-running projects. While doing that, we realized that our existing data frame processing code had an integral flaw: column names, data types, value ranges, and other invariants — none of it was obvious just from reading the code.
As a result, the typical approach for understanding a function’s behavior involved executing it on
client infrastructure — the only place the actual data is available. Then, we would manually step
through each pandas transformation to inspect the data before and after every change. Naturally,
this is tedious, error-prone, and far from efficient.
Once we’d rewritten a chain of transformations in polars, the absence of static type checking or
runtime validation on data frame contents meant that bugs were hard to catch. To ensure
correctness, we often had to run our entire pipeline end-to-end on large datasets - which required
significant time and compute resources.
Eventually, we realized that we needed a better way to describe, validate and reason about the content of the data frames in our data pipeline. We wanted to make invariants obvious while reading the code and actually enforce these invariants at runtime to ensure correctness.
Data frame validation to the rescue
A natural solution to this problem is data frame validation libraries. Already back in 2023, Python libraries existed that allowed defining data frame schemas and verifying that data frames comply with these schemas, i.e., fulfill predefined expectations.
In some projects, we had already been using pandera, a
widely known open-source library, to validate pandas data frames. Unfortunately, back in 2023,
pandera did not have any polars support and a notable polars-native alternative, namely
patito, was still in its infancy and could not be considered
production-ready.
However, even today, we still encounter several limitations with pandera and patito for
our use case. We concluded that these limitations are inherent to the libraries' scope and design and cannot
easily be addressed by contributing to these projects - which we still actively do regardless (e.g., we
maintain the conda-forge feedstock of pandera).
Specifically, pandera and patito are missing support for
- validation of interdependent data frames
- soft validation including introspection of failures
- test data generation from schemas
- strict static type checking for data frames
Introducing dataframely: A polars-native data frame validation library
To remedy the shortcomings of these libraries, we developed dataframely. dataframely is a
declarative data frame validation library with first-class support for polars data frames. Its
purpose is to make data pipelines written in polars (1) more robust by ensuring that data
meets expectations and (2) more readable by adding schema information to data frame type hints.
Talk is cheap, so let’s have a look at some code examples.
Defining schemas
To get started with dataframely, you first define a schema. At QuantCo, we are often dealing with
insurance claims — for instance, we might create a schema for a data frame containing hospital
invoices:
from decimal import Decimal

import dataframely as dy
import polars as pl

class InvoiceSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    admission_date = dy.Date(nullable=False)
    discharge_date = dy.Date(nullable=False)
    amount = dy.Decimal(nullable=False, min_exclusive=Decimal(0))

    @dy.rule()
    def discharge_after_admission() -> pl.Expr:
        return pl.col("discharge_date") >= pl.col("admission_date")
While we can describe the data frame in terms of its columns and their data types, we can also encode expectations on the column level as well as across columns. For example, we can designate one (or multiple) column(s) as primary key or define a custom validation rule that acts across columns.
Validating a data frame
Once we’ve defined a schema, we can pass a pl.DataFrame or pl.LazyFrame into its validate
classmethod to validate that the contents match the schema definition. If we want to automatically
coerce the column types to the types specified in the schema, we can pass cast=True.
from datetime import date

invoices = pl.DataFrame({
    "invoice_id": ["001", "002", "003"],
    "admission_date": [date(2025, 1, 1), date(2025, 1, 5), date(2025, 1, 1)],
    "discharge_date": [date(2025, 1, 4), date(2025, 1, 7), date(2025, 1, 1)],
    "amount": [1000.0, 200.0, 400.0],
})
validated: dy.DataFrame[InvoiceSchema] = InvoiceSchema.validate(invoices, cast=True)
If any row in invoices is invalid, i.e., any rule defined on individual columns or the entire
schema evaluates to False, a validation exception is raised. Otherwise, if all rows in invoices
are valid, validate returns a validated data frame of type dy.DataFrame[InvoiceSchema].
Importantly, dy.DataFrame[InvoiceSchema] is a pure typing construct and one still deals with a
pl.DataFrame at runtime. This has the benefit that dataframely can be adopted in a gradual
fashion where any dy.DataFrame[...] can easily be passed to a method that accepts a
pl.DataFrame (and vice versa by using type: ignore comments).
The biggest benefit, however, is that the generic data frame type immediately tells the reader of
the code what data they can expect to find in the data frame. This markedly improves the usefulness
of mypy for data frame-based code: the type checker can now ensure that a data frame passed to a
method fulfills certain preconditions with respect to its contents - without incurring a hidden
performance hit at runtime.
Validating groups of data frames
Oftentimes, data frames (or rather “tables”) are interdependent and proper data validation requires
consideration of multiple tables that share a common primary key. dataframely enables users to
define “collections” for groups of data frames with validation rules on the collection level. To
create a collection, we first introduce a second schema for diagnosis data frames:
class DiagnosisSchema(dy.Schema):
    invoice_id = dy.String(primary_key=True)
    diagnosis_code = dy.String(primary_key=True, regex=r"[A-Z][0-9]{2,4}")
    is_main = dy.Bool(nullable=False)

    @dy.rule(group_by=["invoice_id"])
    def exactly_one_main_diagnosis() -> pl.Expr:
        return pl.col("is_main").sum() == 1
We can then create a collection that bundles invoices and the diagnoses that belong to these invoices:
# Introduce a collection for groups of schema-validated data frames
class HospitalClaims(dy.Collection):
    invoices: dy.LazyFrame[InvoiceSchema]
    diagnoses: dy.LazyFrame[DiagnosisSchema]

    @dy.filter()
    def at_least_one_diagnosis_per_invoice(self) -> pl.LazyFrame:
        return self.invoices.join(
            self.diagnoses.select(pl.col("invoice_id").unique()),
            on="invoice_id",
            how="inner",
        )
Notice how we can further define our expectations on the collection contents by adding validation
across members of the collection using the @dy.filter decorator.
If we call validate on the collection, a validation exception will be raised if any of the input
data frames does not satisfy its schema definition or the filters on the collection result in the
removal of at least one row across any of the input data frames.
Soft-validation and validation failure introspection
While calling validate is useful to ensure correctness, in production pipelines we typically do
not want to raise an exception at runtime. To this end, dataframely provides the filter method
to perform “soft-validation” of schemas and collections. filter returns the rows that pass
validation and an additional FailureInfo object to inspect invalid rows:
good, failure = InvoiceSchema.filter(invoices, cast=True)
# Inspect the reasons for the failed rows
failure.counts()
# Inspect the co-occurrences of validation failures
failure.cooccurrence_counts()
Since filter does not raise an exception, we can safely use it in our production code and log the
invalid rows to inspect them later.
Additional features
Throughout our journey with dataframely, we realized that defining schemas, and thus encoding
expectations on data frame contents, has various benefits beyond running validation. For example, we
can automatically derive the SQL schema of a table if we want to write a data frame to a database.
Another possibility is to automatically generate sample data for unit testing that adheres to the
schema, thus letting test authors focus on test content rather than verbosely creating data frames.
To learn about all the possibilities, check out the
API documentation.
Experiences in practice
Understanding the structure and content of data frames is crucial when working with tabular data —
a core requirement for the highly robust data pipelines we aim to build at QuantCo. dataframely
has already brought us closer to that goal: today, we are successfully using dataframely in the
day-to-day work of multiple teams across several clients for both analytical and production
pipelines.
Our data scientists and engineers love dataframely because
- It improves the legibility, comprehensibility, and robustness of pipeline code
- It increases code quality and confidence in code correctness through statically typed APIs and contracts
- It enables code generation from data frame schema definitions (e.g., for SQL operations)
- It allows introspecting pipeline failures more easily
- It facilitates unit testing data pipelines through sample test data generation
We’re excited to open-source dataframely and share it with the data engineering community. If
you’re working with complex data pipelines and looking to improve reliability, productivity, and
peace of mind, we think you’ll love it too.
Check out dataframely on GitHub and let us know what
you think!