GE with Databricks Delta

I want to understand how GE integrates with Databricks Delta tables. If you have a sample notebook, it would be helpful for understanding.

Rich Louden wrote up a very detailed blog post on using GE with Databricks: https://www.unsupervised-learnings.co.uk/post/setting-your-data-expectations-data-profiling-and-testing-with-the-great-expectations-library/

If you want GE to read data from Delta, open your project’s great_expectations.yml config file and add a Batch Kwargs Generator of class S3GlobReaderBatchKwargsGenerator to your datasource.

The reference for this class has a sample yml config that you can copy and modify.

Then replace reader_method: parquet in the generator’s config with reader_method: delta.
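Putting those steps together, the datasource section of great_expectations.yml would look roughly like this. This is a sketch based on the generator’s reference config: the datasource name, bucket, asset name, and prefix are placeholders you should replace with your own values:

```yaml
datasources:
  my_spark_datasource:          # placeholder datasource name
    class_name: SparkDFDatasource
    batch_kwargs_generators:
      my_delta_generator:       # placeholder generator name
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: my-bucket       # placeholder S3 bucket
        reader_method: delta    # changed from the sample's "parquet"
        assets:
          my_asset:             # placeholder asset name
            prefix: data/my_table/  # placeholder path prefix
```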

NOTE:

Some APIs have changed since that blog post was published. If you are using GE 0.9.0 or higher,
please replace the build_expectations method from the post with this:

from datetime import datetime

from great_expectations.dataset import SparkDFDataset
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# `spark` and `sqlContext` are provided by the Databricks notebook environment
def build_expectations(database, asset_name, context):
    # Name the suite after the table and today's date
    exp_suite_name = f"{database}.{asset_name}.{datetime.today().strftime('%d-%m-%Y')}_expectations"

    # Wrap the Spark table in a GE dataset and profile it
    data = spark.table(f"{database}.{asset_name}")
    spark_data = SparkDFDataset(data)
    profile = spark_data.profile(BasicDatasetProfiler)

    # profile() returns (expectation suite, validation results); save only the suite
    context.save_expectation_suite(profile[0], exp_suite_name)

    # Release the cached table
    sqlContext.uncacheTable(f"{database}.{asset_name}")

The profile method returns a tuple of (expectation suite, validation results), so you need to pass the first member of that tuple to save_expectation_suite.

Also, you don’t have to create an empty expectation suite before profiling.