GE in Databricks

Hi all, I am trying to integrate GE with Databricks following a blog post, and I would like to save our own expectations, rather than the full profiler-generated suite, to fit our needs.
But I am running into trouble when saving the expectation suite:
context.save_expectation_suite(discard_failed_expectations = False)
save_expectation_suite_usage_statistics() got an unexpected keyword argument 'discard_failed_expectations'

Am I missing something? Thank you!

This blog post is indeed awesome!

Looks like a mistake sneaked into this code snippet in a recent edit:

spark_data.expect_column_values_to_be_in_set("DELIVERY_GROUP_LATEST", set(['IP Legacy', 'IP AM SP&C']))


context.save_expectation_suite(discard_failed_expectations = False)

It should be “spark_data.save_expectation_suite(discard_failed_expectations = False)”, not context.
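
That is, the corrected pair of lines reads:

spark_data.expect_column_values_to_be_in_set("DELIVERY_GROUP_LATEST", set(['IP Legacy', 'IP AM SP&C']))
spark_data.save_expectation_suite(discard_failed_expectations = False)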

Thanks @eugene.mandel!
I think things are getting mixed up in my config; here is the error I get:

Unable to save config: filepath or data_context must be available.

I had to change some of the config options in the yml to make it work initially, but I am not sure I have the correct syntax:

    class_name: SparkDFDatasource
        class_name: DatabricksTableBatchKwargsGenerator
      class_name: SparkDFDataset
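
For context, the indentation suggests those three entries nest like this in great_expectations.yml; a sketch with placeholder names (spark_datasource and databricks_tables are mine, not taken from the original config):

datasources:
  spark_datasource:
    class_name: SparkDFDatasource
    batch_kwargs_generators:    # named "generators" in older GE versions
      databricks_tables:
        class_name: DatabricksTableBatchKwargsGenerator
    data_asset_type:
      class_name: SparkDFDataset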

Thanks for the help!

I see - “spark_data.save_expectation_suite” can be called with these arguments if spark_data is associated with a Data Context object. Usually we obtain a data asset (in this case “spark_data”) from the Data Context, but to make it work in this code snippet we can provide a reference to the Data Context when we create the spark_data object:

Instead of:

spark_data = SparkDFDataset(data)

use:

spark_data = SparkDFDataset(data, data_context=context)
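
A minimal end-to-end sketch of that flow, assuming a Spark DataFrame named data and a suite name of my choosing (both placeholders):

from great_expectations.data_context import DataContext
from great_expectations.dataset import SparkDFDataset

# load the project's Data Context (reads great_expectations.yml)
context = DataContext()

# create a suite to hold the hand-written expectations
context.create_expectation_suite("my_suite", overwrite_existing=True)

# wrap the Spark DataFrame and associate it with the Data Context
spark_data = SparkDFDataset(data, data_context=context)

spark_data.expect_column_values_to_be_in_set(
    "DELIVERY_GROUP_LATEST", {"IP Legacy", "IP AM SP&C"}
)

# keep expectations in the suite even if they fail against this batch
spark_data.save_expectation_suite(discard_failed_expectations=False)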

I think I’m almost there…
I now have an expectation json file with only the expectations I specified, but I can’t seem to define the output filename; it’s always saved as default.json.

data ="/xxxx/business_classification_dim_parquet")
data_asset_name = "business_classification_dim_parquet"

context.create_expectation_suite(data_asset_name, overwrite_existing=True)

spark_data = SparkDFDataset(data, data_context=context)

# create expectations:
spark_data.expect_table_columns_to_match_ordered_list(["xxxx", "xxx"])


# save expectations
spark_data.save_expectation_suite(discard_failed_expectations=False)

Thank you so much!

Ok I think I figured it out, this seems to work now:
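
The working snippet itself didn’t survive in this post; a plausible reconstruction, assuming the fix was naming the suite when wrapping the DataFrame (the expectation_suite_name argument is my assumption about what changed, not a quote of the working code):

spark_data = SparkDFDataset(
    data,
    data_context=context,
    expectation_suite_name=data_asset_name,  # assumed fix: name the suite instead of falling back to "default"
)
spark_data.save_expectation_suite(discard_failed_expectations=False)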


Now one last question: is it possible to read Delta tables instead of parquet? I added

reader_method: delta

in the config file, but it doesn’t seem that parquet can be replaced by delta this way.
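
Since the DataFrame is built directly with Spark anyway, one workaround (plain Spark, not GE-specific) is to read the Delta table with Spark itself and wrap the result as before, bypassing reader_method entirely; the path here is a placeholder:

# read the Delta table directly with Spark, then wrap it as usual
data = spark.read.format("delta").load("/xxxx/business_classification_dim_delta")
spark_data = SparkDFDataset(data, data_context=context)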

Thanks again @eugene.mandel !
