Expectation which includes two tables

Hello I am trying to make an expectation which inludes two tables and the comparision between two tables. can anybody help me in getting started with it?
For example:
batch.expect_table_row_count_to_equal_other_table(other_table_name, result_format=None, include_config=True, catch_exceptions=None, meta=None)
How to specify my other_table_name in the above expecation using pandas/pyspark.
If not can you specify an example how to compare two tables using an expectation in pandas/pyspark ?
Thank You.

2 Likes

There are two possible solutions:

  1. To validate that the two tables have the same number of rows, you can use the Evaluation Parameters feature. It allows the number of rows in one table to become a parameter in an expectation about of the number of rows in the second table. This how-to guide shows how to use it: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_and_editing_expectations/how_to_create_expectations_that_span_multiple_batches_using_evaluation_parameters.html

  2. We recommend a design pattern we call “check asset” for the more general case of creating expectations across tables: https://docs.greatexpectations.io/en/latest/reference/spare_parts/how_to_use_check_assets.html
    Yes, it is possible to apply it to Pandas or Pyspark data frames/tables. You can join the original tables in your code and create expectations for the new table.

2 Likes

@eugene.mandel - the link on #2 is broken. Can you update please?

1 Like

What is the general approach to deal with data validation in much granular level?

For me, I have a lot of use case need to group by a few keys, then check the aggregate sum. Let say, I have some dynamic checking like, I need to check the last 8 weeks aggregate weekly sales, by product, store in two tables.

I have these needs because we had two pipeline generating data for our machine learning model, one come from historical snapshot table, one come from a live production table. These two tables has identical schema, but they don’t always match exactly to each other. They tend to agree on each other in older data, but larger difference in recent data (data latency etc).

So far, I find GE could be useful for validating some condition to make sure our machine learning pipeline does not mess up, but I have not figured out a nice way to handle these checking in more granular level. GE seems to dealing with data validation mostly on Table/Column level.

I am interested in general, how did people handle these kind of situations.

Looks like this file was deleted by a commit - I’ll see if it was intentional.

Hi @nok,

From your use case, I’m wondering if creating distributional expectations makes sense. If you need to check the last 8 weeks (a new batch) of sales by product, that sounds like a distribution over the product dimension. Are you expecting a certain distribution? Or do you really have granular sums that have to be matched for each partition? If you need to check a distribution matches the same distribution in a different table, then you can use evaluation parameters to accomplish that.
In pseudocode it would look something like this:

suite1 = expect_column_kl_divergence_to_be_less_than(A, None, None)
suite2 = expect_column_kl_divergence_to_be_less_than(B, suite1.kl_divergence.observed_value, threshold)

Then you have to execute suite 1 followed by suite 2 with the same run id.

Should I create expectation for every combinations of dimension that I need to check? Imagine if I have 100 countries, 1000 products, then it’s 100000 expectations

The A or B is refering to a column. Are you suggesting I should create batches that is partition by the original table by products?

Besides, does GE encourage using the yaml config or using python code to build these suites? I am not clear how JSON or YAML can handle these dynamic condition. Or I should only use notebook for these multi-batches operation? these notebook are no longer disposable that a simple JSON could describe.

1 Like

@nok - No, you don’t need to partition the data separately. I was just referring to each bin of a histogram as a partition.

Please read this documentation about distributional expectations. This should help you conceptualize what’s possible with Great Expectations in this use case. https://docs.greatexpectations.io/en/0.12.0/reference/core_concepts/expectations/distributional_expectations.html?highlight=distributional%20expectations

1 Like

Thanks a lot @bhcastleton, I don’t realize there is partition object before, I will look into it.

1 Like