GE with Databricks Delta

I want to understand how GE integrates with Databricks Delta tables. If you have a sample notebook, it would help me understand.

Rich Louden wrote up a very detailed blog post on using GE with Databricks:

If you want GE to read data from Delta, open your project’s great_expectations.yml config file and add a Batch Kwargs Generator of the class S3GlobReaderBatchKwargsGenerator to your datasource.

The reference for this class has a sample yml config that you can copy and modify.

Then replace reader_method: parquet in the generator’s config with reader_method: delta.
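As a rough illustration of that change, the generator entry can be pictured as the Python dict below, mirroring what would sit under the datasource in great_expectations.yml. The generator name, bucket, asset name, and prefix are hypothetical placeholders. The key names (bucket, reader_method, assets, prefix, directory_assets) are written from memory of the class reference, so check them against the sample config for your GE version. Only the reader_method swap comes from this thread.

```python
import copy

# Hypothetical generator entry, shaped like the YAML under
# datasources -> <your datasource> -> batch_kwargs_generators.
# All names and values here are placeholders, not taken from the thread.
parquet_generator = {
    "my_s3_generator": {
        "class_name": "S3GlobReaderBatchKwargsGenerator",
        "bucket": "my-bucket",
        "reader_method": "parquet",
        "assets": {
            "my_delta_table": {
                "prefix": "path/to/my_delta_table/",
                # Assumption: a Delta table is a directory of files, so the
                # asset is treated as a directory rather than as single keys.
                "directory_assets": True,
            },
        },
    },
}


def use_delta_reader(generators, name):
    """Return a copy of the config with the named generator reading delta."""
    updated = copy.deepcopy(generators)
    updated[name]["reader_method"] = "delta"
    return updated


delta_generator = use_delta_reader(parquet_generator, "my_s3_generator")
```

In the real project you would make the same one-line change directly in the yml file; the dict form is only here to make the before and after easy to compare.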


Some APIs have changed since that blog post was published. If you are using GE 0.9.0 or higher,
please replace the build_expectations method from the post with this:

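from datetime import datetime
from great_expectations.dataset import SparkDFDataset
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

def build_expectations(database, asset_name, context):
  # Suite name: <database>.<table>.<date>_expectations. Assumption: the date stamp is today's date.
  exp_suite_name = database + "." + asset_name + "." + datetime.now().strftime("%d-%m-%Y") + "_expectations"
  # `spark` and `sqlContext` are the globals that Databricks notebooks provide.
  data = spark.table(database + "." + asset_name)
  spark_data = SparkDFDataset(data)
  profile = spark_data.profile(BasicDatasetProfiler)
  context.save_expectation_suite(profile[0], exp_suite_name)
  sqlContext.uncacheTable(database + "." + asset_name)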

The profile method returns a tuple (expectation suite, validation results), so you need to pass the first member of that tuple to save_expectation_suite.

Also, you don’t have to create an empty expectation suite before profiling.
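The suite name in build_expectations combines the database, the table, and a date stamp. Here is a minimal sketch of just that naming step, pulled out so it can be checked without Spark or GE. It assumes the date stamp is meant to be the day the suite is built, formatted as %d-%m-%Y; the helper name suite_name and the sample database and table names are hypothetical.

```python
from datetime import datetime


def suite_name(database, asset_name, when=None):
    """Build '<database>.<asset>.<dd-mm-YYYY>_expectations'."""
    # Assumption: the date stamp defaults to the moment the suite is built.
    when = when or datetime.now()
    return (
        database + "." + asset_name + "."
        + when.strftime("%d-%m-%Y") + "_expectations"
    )


# Fixed date so the result is reproducible.
example = suite_name("sales", "orders", datetime(2020, 3, 5))
```

Passing the date in explicitly, rather than always calling datetime.now() inside the function, also makes it easy to look up the suite saved on a particular day.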

Great post!
Is it correct to assume that

from great_expectations.datasource.generator.databricks_generator import DatabricksTableBatchKwargsGenerator

has become

from great_expectations.datasource.batch_kwargs_generator import DatabricksTableBatchKwargsGenerator

in the latest release of GE?

Thanks a lot!

Almost! Here’s the full path to import.

from great_expectations.datasource.batch_kwargs_generator.databricks_batch_kwargs_generator import DatabricksTableBatchKwargsGenerator
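If a notebook has to run against more than one GE release, one option is to try the import locations in order and use the first that works. The sketch below does this with the standard library's importlib. The helper first_importable is hypothetical, and the candidate list simply repeats module paths quoted in this thread, so adjust it to whatever your installed version actually provides.

```python
import importlib


def first_importable(candidates):
    """Return the first attribute that imports cleanly, or None."""
    for module_path, attr_name in candidates:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            continue  # module missing in this install; try the next path
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    return None


# Module paths as quoted in this thread, newest first.
GENERATOR_PATHS = [
    ("great_expectations.datasource.batch_kwargs_generator",
     "DatabricksTableBatchKwargsGenerator"),
    ("great_expectations.datasource.generator.databricks_generator",
     "DatabricksTableBatchKwargsGenerator"),
]

# None here means no listed path matched the installed GE version.
DatabricksTableBatchKwargsGenerator = first_importable(GENERATOR_PATHS)
```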
