How to instantiate a Data Context on a Databricks Spark cluster

This article is for comments to:

https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_on_a_databricks_spark_cluster.html

Please comment +1 if this How-to Guide is important to you.

Currently, the instructions linked above for the EMR Spark cluster are sufficient for Databricks. Please reply here if they do not work for you; if they do, please like this post!
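
For reference, the core of that guide is to build a DataContextConfig in code and pass it to BaseDataContext. A minimal sketch, assuming DBFS-backed filesystem stores (the datasource name and the /dbfs paths here are illustrative, not prescribed):

from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig

project_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_spark_datasource": {  # illustrative name
            "class_name": "SparkDFDatasource",
            "module_name": "great_expectations.datasource",
            "data_asset_type": {
                "class_name": "SparkDFDataset",
                "module_name": "great_expectations.dataset",
            },
        }
    },
    stores={
        "expectations_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "/dbfs/great_expectations/expectations/",
            },
        },
        "validations_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "/dbfs/great_expectations/validations/",
            },
        },
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    expectations_store_name="expectations_store",
    validations_store_name="validations_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    validation_operators={
        "action_list_operator": {
            "class_name": "ActionListValidationOperator",
            "action_list": [
                {
                    "name": "store_validation_result",
                    "action": {"class_name": "StoreValidationResultAction"},
                }
            ],
        }
    },
    data_docs_sites={},
)

context = BaseDataContext(project_config=project_config)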

Thanks for the instructions… a few more details would help 🙂

Hi @mjboothaus! We just released a document on instantiating a data context with Databricks: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_on_a_databricks_spark_cluster.html
Check it out and let us know if it answers your questions! If not, please reply here with questions.

Hi @anthony

Just had a query. I tried to follow the steps mentioned, but I am getting the following error:

ModuleNotFoundError: No module named 'black'

Would you have any guidance related to this?
I am using Databricks Runtime 7 with Spark 3.

Thanks and Regards
Saurav Chakraborty

Hi @Saurav! In the latest release, 0.12.5, we moved black to a dev-only dependency; however, it is still used within Great Expectations, which is why the import fails. Until we are able to release the next version, you can do a pip install black. Apologies for the confusion.
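
If you are working in a notebook on Databricks Runtime 7, one way to apply that workaround is a notebook-scoped install (a cluster-level library install works too):

# Installs black into this notebook's Python environment as a temporary
# workaround until the next Great Expectations release.
%pip install black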

@Saurav - FYI we just released Great Expectations v0.12.6 which adds black back into our requirements.txt file. So you should be able to use this newest version without issue.

It’s working now. Thanks!

I’m trying to populate the “batch_kwargs_generators” field in the example code and I’m getting an error that I find hard to decipher.

Right now, I have the following:

import pyspark

datasource_spark = {
    "data": {  # the datasource name
        "data_asset_type": {
            "class_name": "SparkDFDataset",
            "module_name": "great_expectations.dataset",
        },
        "spark_config": dict(pyspark.SparkConf().getAll()),
        "class_name": "SparkDFDatasource",
        "module_name": "great_expectations.datasource",
        "batch_kwargs_generators": {
            "data": {
                "class_name": "QueryBatchKwargsGenerator",
                "queries": {
                    "testquery": "SELECT * FROM knmi_weather_table",
                },
            }
        },
    },
}

And I’m using this in the DataContextConfig like so:

project_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources=datasource_spark,
    stores={
   .....

I based this approach on the example found here.

Whereas the following query works:

my_batch = context.get_batch(
  batch_kwargs={
    "datasource": "data",
    "query": "SELECT * FROM knmi_weather_table LIMIT 100"
  },
  expectation_suite_name="my_new_suite"
)

Running the command below

my_batch = context.get_batch(
  batch_kwargs={
    "datasource": "data",
  },
  expectation_suite_name="my_new_suite"
)

fails with error “BatchKwargsError: Unrecognized batch_kwargs for spark_source”.

And I’m not sure how to remedy this, or which kwargs are not accepted.

Thanks in advance, and please let me know if I can provide additional info!

Best,

Jasper.

The “query” batch kwarg is the likely cause of the error. Great Expectations validates only DataFrames on Spark. Validating the result sets of queries works only with Datasources of type SqlAlchemyDatasource.
Using a QueryBatchKwargsGenerator won’t work with a Spark Datasource for the same reason.

This notebook shows an example of validating DataFrames in Spark:
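
In outline, the pattern looks like this (a minimal sketch, not the linked notebook; the table and column names are illustrative):

from great_expectations.dataset import SparkDFDataset

# Wrap an existing Spark DataFrame; `spark` is the session Databricks provides.
df = spark.sql("SELECT * FROM knmi_weather_table")
ge_df = SparkDFDataset(df)

# Expectations then run directly against the wrapped DataFrame.
result = ge_df.expect_column_values_to_not_be_null("station_id")  # hypothetical column
print(result.success)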

Hi Eugene,

Thanks for your reply.

I’m not entirely sure I follow you. The ‘query’ option does work. What doesn’t work is trying this approach.

Thanks!

Best,

Jasper.

Jasper, my bad - I gave you a wrong answer. We do support queries in Spark Datasources.

The “BatchKwargsError: Unrecognized batch_kwargs for spark_source” error is raised in SparkDFDatasource, because it cannot find “path”, “query” or “dataset” in the batch_kwargs.

Since you configured a QueryBatchKwargsGenerator to manage your queries, you should use it to generate the batch_kwargs that you need, so the snippet should look like this:

my_batch_kwargs = context.build_batch_kwargs("name_of_my_spark_datasource", "name_of_generator", "name_of_my_query")
my_batch = context.get_batch(
  batch_kwargs=my_batch_kwargs,
  expectation_suite_name="my_new_suite"
)
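
With the configuration you posted earlier in the thread, the names would presumably map like this (the datasource and generator are both keyed "data", and the query was registered as "testquery"):

# Names taken from the datasource config posted above.
my_batch_kwargs = context.build_batch_kwargs("data", "data", "testquery")
my_batch = context.get_batch(
  batch_kwargs=my_batch_kwargs,
  expectation_suite_name="my_new_suite"
)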

Hello,

Can you share example code on how to use Spark SQL in batch_kwargs_generators? Thank you.

@jmp Please see the “Additional notes” section in this guide and the code snippet in my previous comment.
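
For reference, a sketch combining the two, reusing the illustrative names from earlier in this thread ("data" for both the datasource and the generator, "testquery" for the registered Spark SQL query):

# A QueryBatchKwargsGenerator holding a named Spark SQL query, placed in the
# datasource config that goes into DataContextConfig(datasources=...).
datasources = {
    "data": {
        "class_name": "SparkDFDatasource",
        "module_name": "great_expectations.datasource",
        "data_asset_type": {
            "class_name": "SparkDFDataset",
            "module_name": "great_expectations.dataset",
        },
        "batch_kwargs_generators": {
            "data": {
                "class_name": "QueryBatchKwargsGenerator",
                "queries": {
                    "testquery": "SELECT * FROM knmi_weather_table",
                },
            }
        },
    },
}

# After instantiating the context with this datasources config:
batch_kwargs = context.build_batch_kwargs("data", "data", "testquery")
batch = context.get_batch(batch_kwargs=batch_kwargs, expectation_suite_name="my_new_suite")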