Check nulls in subset of columns

Hi all, I wonder if I can combine batch.expect_column_values_to_not_be_null() with a subset of columns and a threshold like in pandas.

Are you looking for something like this?

batch.expect_column_values_to_not_be_null("my_1st_column", mostly=0.9)
batch.expect_column_values_to_not_be_null("my_2nd_column", mostly=0.8)
batch.expect_column_values_to_not_be_null("my_3rd_column")

This would assert that my_1st_column is null no more than 10% of the time, my_2nd_column is null no more than 20% of the time, and my_3rd_column is never null.

Let’s say I have a table of route requests, with columns including fromStation, fromAddress, and fromGPS. Each request starts with either a station, an address, or a GPS point; the other two columns are null.

So for a daily validation I could estimate the percentage I expect per day for each column, but that may vary. What I really want to ensure is that in each record exactly one column of the subset [fromStation, fromAddress, fromGPS] is not null and the other two are null.

I see—thanks for the clarification.

At the moment, we don’t have a single Expectation for this kind of check.

One path forward would be to define a custom Expectation. For this one, you’d want to use the multicolumn_map_expectation decorator.
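The decorator wiring differs between Great Expectations versions, but the core of such a custom Expectation is just a row-wise map function: given the three columns, return True when exactly one is non-null. A minimal pandas sketch of that logic (column names taken from the example above; the function name is hypothetical):

```python
import pandas as pd

def exactly_one_source_filled(df, column_list):
    """Row-wise check: True when exactly one of the given columns is non-null.

    This is the logic you would place inside a custom Expectation's map
    function (e.g. under the multicolumn_map_expectation decorator); the
    decorator plumbing itself is version-specific and omitted here.
    """
    return df[column_list].notnull().sum(axis=1) == 1

# Example route-request data: each row uses exactly one source column,
# except the last row, which (incorrectly) fills two.
requests = pd.DataFrame({
    "fromStation": ["Central", None, None, "North"],
    "fromAddress": [None, "1 Main St", None, None],
    "fromGPS":     [None, None, "52.37,4.90", "52.0,5.0"],
})

mask = exactly_one_source_filled(
    requests, ["fromStation", "fromAddress", "fromGPS"]
)
print(mask.tolist())  # [True, True, True, False]
```

Rows where the mask is False are exactly the records that violate the "one source, and only one" rule.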

If you do this and like what you end up with, we’d be happy to work with you to make it a PR into the main library.

An alternative would be to use a pattern we call a “check_df”: create an intermediate dataframe (the check_df), then apply a simple Expectation to it. For example, you could count the nulls across the three columns into a single column called “source_null_count” and then run check_df.expect_column_values_to_be_in_set("source_null_count", [2]): with exactly one source filled per row, the other two columns are null, so the null count should always be 2.
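A minimal pandas sketch of building that check_df (column names taken from the example above). Note that when exactly one of the three columns is filled, the null count across them is 2, so 2 is the value to expect:

```python
import pandas as pd

# Example route-request data: each row uses exactly one source column.
requests = pd.DataFrame({
    "fromStation": ["Central", None, None],
    "fromAddress": [None, "1 Main St", None],
    "fromGPS":     [None, None, "52.37,4.90"],
})

source_cols = ["fromStation", "fromAddress", "fromGPS"]

# Build the intermediate check_df: one derived column counting nulls
# across the three source columns.
check_df = pd.DataFrame({
    "source_null_count": requests[source_cols].isnull().sum(axis=1)
})

print(check_df["source_null_count"].tolist())  # [2, 2, 2]

# In Great Expectations you would then wrap check_df as a batch/dataset
# and assert, e.g.:
#   expect_column_values_to_be_in_set("source_null_count", [2])
```

The nice thing about this pattern is that the Expectation itself stays a stock single-column check; all the multicolumn logic lives in the dataframe transformation.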

Thank you very much for the suggestions! I’ll check my schedule tomorrow and see what makes the most sense :slight_smile: .

Sounds doable; I’ll check the dev guidelines. Do you prefer a fork and PR, or a dev branch on the main repo?
