Can Great Expectations separate sql database records that fail tests from my good data?

Great Expectations includes an unexpected_list as part of the results of validation if you request validation results in COMPLETE format. result_format — great_expectations documentation

If you’re running in pandas, you also have the option of getting the unexpected_index_list and using that to bring back the failing records. How can I use the return_format unexpected_index_list to select row from a PandasDataSet

As of GE 0.13.10, if you want to pull all the failing records in a sql table after validation, you can use the following work flow:

  1. Set the result_format to COMPLETE and run validations.
  2. Use the values in the unexpected_list to pass in to a temp table or a text list so you can use in a sql query.
  3. Use the following query to find your failing records:

WHERE ColumnWithFailingValues IN ( {pass in the values from the unexpected_list when result_format is set to COMPLETE} )

  1. Either copy or move these records to another table for triage or for cleaning.

The above work flow might not be optimal if there are many failing records. As of GE 0.13.10, Great Expectations doesn’t have the option to return the primary key or some sort of index on a sql table to return all the failing records. This would be a fantastic community contribution if anyone is interested in building it. What is needed is a configuration option to tell GE to keep track of a key value for every failing record during validation.

search terms: quarantine queues, separate bad records, separate bad data

1 Like