Published on

diffly — A utility package for comparing polars data frames

Authors

Motivation

With the widespread adoption of polars as a high-performance data frame library in Python, we often find ourselves needing to compare two polars data frames to identify differences. In many of our projects at QuantCo, we need to build data pipelines that transform raw data from a data warehouse into a format suitable for training models. While you can write (unit) tests to verify that individual pieces of logic and information are correct, we also often find a need to make the final output of a data pipeline transparent. We refer to this as "snapshot testing on real data".

A helpful framework for data pipelines is to think about the output (tabular data and other artifacts) being a deterministic function of the input (data and code). In other words, the output can change in one of two ways:

  • Code changes: after making changes to the data pipeline code, how does this affect the output given the same input data?
  • Data changes: when running the same data pipeline code on a new state of input data, how does the output change?

Both of these questions are important in a development workflow, as one continuously iterates on the data pipeline while data is accumulating and changing over time. And luckily, both of these questions can be answered by making use of a data frame comparison library: you can compare the output of your data pipeline before and after code changes, or before and after changes to the input data.

Introducing diffly

Today, we're introducing diffly, a polars-native package for comparing tabular data. It enables highly efficient exploration of the differences between two data frames, leveraging polars' query optimizer. By committing to polars, it avoids unnecessary abstractions or compromises. For example, this results in full support for all polars data types, including lists, structs, and enums.

Compared to other open-source packages for comparing tabular data (such as sqlcompyre, tabulardelta, or datacompy), diffly also provides a simple API that can be used to interactively explore differences in data frames on top of inspecting static summary reports. Leveraging a "polars-native API", using diffly feels familiar and interactive queries profit from lazy execution.

Example

Suppose that we want to compare two small data frames left and right:

import polars as pl

left = pl.DataFrame({
    "id": ["a", "b", "c"],
    "value": [1.0, 2.0, 3.0],
})

right = pl.DataFrame({
    "id": ["a", "b", "d"],
    "value": [1.0, 2.5, 4.0],
})

Then, you can use diffly to generate a summary via:

from diffly import compare_frames

comparison = compare_frames(left, right, primary_key="id")

if not comparison.equal():
    summary = comparison.summary(
        top_k_column_changes=1,
        show_sample_primary_key_per_change=True
    )
    print(summary)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
Diffly Summary┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
   Primary key: id

 Schemas
 ▔▔▔▔▔▔▔
   Schemas match exactly (column count: 2).

 Rows
 ▔▔▔▔
   Left count             Right count
       3      (no change)      3

   ┏━┯━┯━┯━┯━┓
-----1  left only   (33.33%)
   ┠─┼─┼─┼─┼─┨╌╌╌┏━┯━┯━┯━┯━┓╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╮
   ┃ │ │ │ │ ┃ = ┃ │ │ │ │ ┃  1  equal       (50.00%)   ┠─┼─┼─┼─┼─┨╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌├╴  2  joined
   ┃ │ │ │ │ ┃ ≠ ┃ │ │ │ │ ┃  1  unequal     (50.00%)   ┗━┷━┷━┷━┷━┛╌╌╌┠─┼─┼─┼─┼─┨╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╯
+++++1  right only  (33.33%)
                 ┗━┷━┷━┷━┷━┛

 Columns
 ▔▔▔▔▔▔▔
   ┌───────┬────────┬───────────────────────────┐
   │ value │ 50.00%2.0 -> 2.5 (1x, e.g. "b")   └───────┴────────┴───────────────────────────┘

In the summary, we can observe that

  • the schemas (column names and their data types) are unchanged
  • the number of rows is unchanged, but, identifying a row by the primary key id, we can see that
    • row a is equal in both data frames,
    • row b is unequal (the value changed from 2.0 to 2.5), and
    • row c is only present in the left data frame while
    • row d is only present in the right data frame
  • the change in row b is caused by value changing from 2.0 to 2.5

Such summaries are helpful to get a quick and visual overview of the similarities and differences between two data frames.

Interactive exploration

While the summary is a great starting point, you often want to analyze the differences in more detail with the goal of understanding how differences are related. For this use case, diffly provides an easy API that lets you explore the differences interactively. Below you can find some examples of how to use this API, but you can find more details in the documentation:

# Get all rows that only exist in the left (right) data frame
unequal_rows = comparison.left_only() # .right_only() for the right data frame

# Among the rows that exist in both data frames, get the fraction of rows that have equal values in the column "value"
fraction_equal = comparison.fraction_same(column="value") # returns 0.5 in our example

# Get the counts of value changes in the column "value"
value_changes = comparison.change_counts(column="value") # returns a data frame with the change from 2.0 to 2.5 of frequency 1 in our example

If the methods do not return a primitive type (like fraction_same() does), they return a polars data frame (like left_only() and change_counts() do), which can be used as a basis for further operations.

Integration into the QuantCo open-source ecosystem

We are excited that diffly already integrates with dataframely, QuantCo's Python package for building robust data pipelines: instead of only comparing a single data frame, you can also compare multiple data frames within a dataframely collection.

Final words

Throughout various projects, we have found that

(1) snapshot testing on real data is really helpful, (2) understanding diffs is crucial, and (3) both can be done with diffly's summary and interactive API, respectively.

We are excited to bring diffly to the open-source ecosystem and look forward to seeing how the community adopts it to improve the robustness of data pipelines. We are also looking forward to contributions to the project, so if you have any ideas or feedback, please feel free to reach out or contribute!