How do I work with multiple batches of data in GE?

In Great Expectations, a “Data Asset” is a logical collection of records, potentially split across different “batches.”

Users usually mean one of three things when they talk about “multibatch” workflows:

  1. Performing an action, such as validation, on batches in serial, for example evaluating each new day’s data against the same expectation suite. We support that use case with a “batch kwargs generator”.

    • Note: we are planning to change several of these names in the near future, including renaming batch kwargs generators to “data connectors”.
  2. Using a collection of batches to compute statistics, for example to set an expectation about how many rows a single future batch should contain (building an anomaly detector). We call that “BatchLooping”.

    • Great Expectations does not currently have any built-in support for building expectations using batch looping, so to get that functionality you have to loop over the data yourself using a batch kwargs generator.
  3. Stating expectations about the relationship between multiple batches of data at the same time, such as expecting that the average number of rows in a collection of batches is in some defined range.

    • Great Expectations does not currently support this use case, but we’re planning to build support in the future.
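
For the second case (batch looping), the manual step usually comes down to collecting one statistic per batch and turning it into bounds. Here is a minimal sketch in plain Python, assuming you have already gathered a row count for each historical batch yourself. The helper name row_count_bounds and the "mean plus or minus three standard deviations" rule are illustrative choices, not something Great Expectations provides.

```python
import math
from statistics import mean, stdev


def row_count_bounds(row_counts, num_stdevs=3.0):
    """Derive (lower, upper) row-count bounds from historical batches.

    Hypothetical helper: uses mean +/- num_stdevs * sample standard
    deviation, which is only one of many possible anomaly rules.
    """
    if len(row_counts) < 2:
        raise ValueError("need at least two batches to estimate a spread")
    center = mean(row_counts)
    spread = stdev(row_counts)
    lower = max(0, math.floor(center - num_stdevs * spread))
    upper = math.ceil(center + num_stdevs * spread)
    return lower, upper


# Row counts collected by looping over past batches (made-up numbers).
historical_counts = [100, 110, 90, 105, 95]
lower, upper = row_count_bounds(historical_counts)
```

The resulting pair could then be used as min_value and max_value for expect_table_row_count_to_be_between on the next batch.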

To use a batch kwargs generator with BigQuery, create batches by following this guide: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_batches/how_to_load_a_database_table_view_or_query_result_as_a_batch.html. Then modify the query_parameters and the target query to loop over the desired days.

Unfortunately, it’s not currently possible to use the QueryBatchKwargsGenerator to identify available partitions, so you need to determine the batches/days you want to loop over outside of Great Expectations.
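
Since the list of days has to come from outside Great Expectations, one way to build it is a small generator that yields one batch kwargs dict per day. This is a rough sketch under a few assumptions: the datasource name, table, and column are made up, and the "query" / "query_parameters" keys with a $day placeholder are assumed to match what the linked guide describes, so check them against your version.

```python
from datetime import date, timedelta


def daily_batch_kwargs(datasource, query, start, end):
    """Yield one batch kwargs dict per day in [start, end], inclusive.

    Hypothetical helper: the dict layout is an assumption based on the
    query-batch guide, not a guaranteed Great Expectations contract.
    """
    day = start
    while day <= end:
        yield {
            "datasource": datasource,
            "query": query,
            "query_parameters": {"day": day.isoformat()},
        }
        day += timedelta(days=1)


# Made-up datasource name, table, and column for illustration only.
QUERY = "SELECT * FROM my_dataset.events WHERE event_date = '$day'"
batches = list(
    daily_batch_kwargs("my_bigquery", QUERY, date(2020, 9, 1), date(2020, 9, 3))
)
```

Each dict in batches would then be passed, together with your expectation suite name, to context.get_batch so the same suite is validated against every day in turn.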
