How to compare two tables with the UserConfigurableProfiler

With this workflow, we will use GE’s user_configurable_profiler to gauge whether two tables are identical. The workflow is as follows:

  1. General GE set-up

  2. Creating a suite from the source data

    • Set up a batch for your source data - this is the data that you trust to be correct
    • Instantiate a profiler and run the build_suite method to build a suite from your source data
  3. Validating your migrated data with your suite

    • Set up a batch for your migrated data - this is the data that you would like to test
    • Run a validation on your migrated data using your created suite.
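
Independent of GE, the core idea of this workflow can be sketched with the standard library: collect table-level facts from the trusted source table, then check the migrated table against them. Below is a minimal, hypothetical sketch using sqlite3 (the table names and data are invented for illustration; GE's suite encodes far richer checks than these two):

```python
import sqlite3

# Two in-memory tables standing in for the source and migrated data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_table (id INTEGER, name TEXT);
    INSERT INTO source_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
    CREATE TABLE migrated_table (id INTEGER, name TEXT);
    INSERT INTO migrated_table VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

def table_profile(conn, table):
    """Collect the table-level facts a suite would encode:
    the ordered column list and the row count."""
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return {"columns": columns, "row_count": row_count}

source = table_profile(conn, "source_table")
migrated = table_profile(conn, "migrated_table")

# Roughly analogous to expect_table_columns_to_match_ordered_list
# and expect_table_row_count_to_be_between.
assert migrated["columns"] == source["columns"]
assert migrated["row_count"] == source["row_count"]
```

The profiler automates exactly this pattern: "profile" the trusted table once, then replay the resulting checks against the table you want to test.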

1. General GE set-up

Imports and setting up your data context

import great_expectations as ge
import snowflake
import sqlalchemy as sa

from great_expectations.profile.user_configurable_profiler import UserConfigurableProfiler

# Make sure to update context_root_dir with your own directory
context = ge.data_context.DataContext(
    context_root_dir='/my/context/root/directory/great_expectations'
)

2. Creating a suite from the source data

Set up a batch for your source data - this is the data that you trust to be correct

Create your blank suite

expectation_suite_name = "compare_two_tables"
suite = context.create_expectation_suite( 
    expectation_suite_name, overwrite_existing=True
)

Set up a batch with the source data you will use to create your initial expectations. This batch will consist of data that you trust to be accurate.

batch_kwargs_postgres = {
    "datasource": "my_postgres_db",
    "schema": "public",
    "table": "my_table",
    "data_asset_name": "my_table"
    # If you would like to create a batch from a query rather than a full table, you can use the `query` key
    # demonstrated below instead of the `schema`, `table`, and `data_asset_name` keys.
    # "query": "SELECT * FROM my_schema.my_table WHERE '1988-01-01' <= date AND date < '1989-01-01'",
}

batch_postgres = context.get_batch(batch_kwargs_postgres, suite)
batch_postgres.head()

Instantiate a profiler and run the build_suite method to build a suite from your source data

Next, we will instantiate a profiler object, passing in our batch.

profiler = UserConfigurableProfiler(dataset=batch_postgres)

Now we will call the build_suite method on our instantiated UserConfigurableProfiler to create our suite. This produces the suite and prints each expectation the profiler created, along with the column-type and cardinality information it used to decide which expectations to build for each column.

suite = profiler.build_suite()

The output will look something like this:

Creating an expectation suite with the following expectations:

Table-Level Expectations
expect_table_columns_to_match_ordered_list
expect_table_row_count_to_be_between

Expectations by Column
Column Name: c_acctbal | Column Data Type: FLOAT | Cardinality: UNIQUE
expect_column_max_to_be_between
expect_column_mean_to_be_between
expect_column_median_to_be_between
expect_column_min_to_be_between
expect_column_proportion_of_unique_values_to_be_between
expect_column_quantile_values_to_be_between
expect_column_values_to_be_in_type_list
expect_column_values_to_not_be_null


Column Name: c_address | Column Data Type: STRING | Cardinality: UNIQUE
expect_column_proportion_of_unique_values_to_be_between
expect_column_values_to_be_in_type_list
expect_column_values_to_not_be_null


Column Name: c_comment | Column Data Type: STRING | Cardinality: UNIQUE
expect_column_proportion_of_unique_values_to_be_between
expect_column_values_to_be_in_type_list

3. Validating your migrated data with your suite

Set up a batch for your migrated data - this is the data that you would like to test

Next, we will set up the batch of data that we want to validate. This is the migrated data that we want to ensure is the same as our source data. It is important that we perform this step after creating our expectation suite, because this batch needs to include our newly created suite.

batch_kwargs_snowflake = {
    "datasource": "my_snowflake_db",
    "schema": "public",
    "table": "my_table",
    "data_asset_name": "my_table",
    # If you would like to create a batch from a query rather than a full table, you can use the `query` key
    # demonstrated below instead of the `schema`, `table`, and `data_asset_name` keys.
    # "query": "SELECT * FROM my_schema.my_table WHERE '1988-01-01' <= date AND date < '1989-01-01'",
}

batch_snowflake = context.get_batch(batch_kwargs_snowflake, suite)
batch_snowflake.head()

Now that we have a suite built from our source data and a batch of the migrated data we would like to validate, we can run our validation.

results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch_snowflake])
validation_result_identifier = results.list_validation_result_identifiers()[0]
print(f"The migrated suite passes validation: {results.success}")

The output should look like this:
The migrated suite passes validation: False

Occasionally, suites that look the same may fail due to idiosyncrasies between back-ends. In cases like these, you may want to exclude a particular expectation or a particular column. These configuration options, and others, are described below.
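
Conceptually, these options act as filters over the generated suite. Here is a rough, GE-independent sketch of that filtering — the suite is represented as a plain list of dicts for illustration, not GE's real internal representation:

```python
def filter_expectations(expectations, excluded_expectations=(), ignored_columns=()):
    """Drop expectations whose type is excluded or whose target column is
    ignored, mirroring what excluded_expectations / ignored_columns do."""
    kept = []
    for exp in expectations:
        if exp["expectation_type"] in excluded_expectations:
            continue
        if exp.get("column") in ignored_columns:
            continue
        kept.append(exp)
    return kept

suite_sketch = [
    {"expectation_type": "expect_table_row_count_to_be_between"},
    {"expectation_type": "expect_column_quantile_values_to_be_between", "column": "c_acctbal"},
    {"expectation_type": "expect_column_values_to_not_be_null", "column": "c_comment"},
]

kept = filter_expectations(
    suite_sketch,
    excluded_expectations=["expect_column_quantile_values_to_be_between"],
    ignored_columns=["c_comment"],
)
# Only the table-level row-count expectation survives.
```

In GE itself you do not filter after the fact; you pass these lists to the profiler up front, as shown in the configuration example below.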

If we would like, we can save our expectation suite and build data docs to take a closer look at the created suite.

context.save_expectation_suite(suite, expectation_suite_name)
context.build_data_docs()
context.open_data_docs(validation_result_identifier)

Configuring the user_configurable_profiler

General parameters

The user_configurable_profiler can take a few different parameters to further hone the results. For the purposes of comparing two tables, the most useful of these are:

  • excluded_expectations: Takes a list of expectation names which you want to exclude from the suite. If you find that certain expectation types are problematic across your database back-ends, you can specify an expectation to exclude

  • ignored_columns: Takes a list of columns for which you do not want to build expectations. You may want to use this if you have metadata columns that might legitimately differ between tables, or if certain columns prove problematic across your database back-ends

  • table_expectations_only: Takes a boolean. If True, this will only create table-level expectations (i.e. it ignores all columns). Table-level expectations include expect_table_row_count_to_equal and expect_table_columns_to_match_ordered_list
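
The value_set_threshold parameter (used in the example below) works on an ordered cardinality scale, from "none" up to "unique": value-set expectations are created only for columns whose measured cardinality falls at or below the threshold. A hedged sketch of that thresholding logic — the scale matches the levels GE uses, but the decision function here is illustrative, not GE's actual code:

```python
# Ordered cardinality scale, lowest to highest.
CARDINALITY_SCALE = ["none", "one", "two", "very_few", "few", "many", "very_many", "unique"]

def should_create_value_set(column_cardinality, value_set_threshold):
    """Create expect_column_values_to_be_in_set only for columns whose
    cardinality is at or below the configured threshold."""
    return (CARDINALITY_SCALE.index(column_cardinality)
            <= CARDINALITY_SCALE.index(value_set_threshold))

assert should_create_value_set("few", "many")        # low-cardinality column: yes
assert not should_create_value_set("unique", "few")  # ID-like column: no
```

Setting value_set_threshold="unique", as below, therefore applies value-set expectations to every column.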

If you would like to make use of these parameters, you can specify them when instantiating your profiler. The example below also sets two further parameters: not_null_only (when True, the profiler creates only expect_column_values_to_not_be_null for nullability, rather than switching to expect_column_values_to_be_null for mostly-null columns) and value_set_threshold (the highest cardinality level at which expect_column_values_to_be_in_set expectations are still created).

excluded_expectations = ["expect_column_quantile_values_to_be_between"]
ignored_columns = ['c_comment', 'c_acctbal', 'c_mktsegment', 'c_name', 'c_nationkey', 'c_phone']
not_null_only = True
table_expectations_only = False
value_set_threshold = "unique"

suite = context.create_expectation_suite(
    expectation_suite_name, overwrite_existing=True
)
batch_postgres = context.get_batch(batch_kwargs_postgres, suite)

profiler = UserConfigurableProfiler(
    dataset=batch_postgres,
    excluded_expectations=excluded_expectations,
    ignored_columns=ignored_columns,
    not_null_only=not_null_only,
    table_expectations_only=table_expectations_only,
    value_set_threshold=value_set_threshold
)

suite = profiler.build_suite()

That’s it!
