Inquiring minds want to know.
Quick copy of my post in the Great Expectations Slack channel:
For people wondering how to use GE with Dask (distributed) in a dockerized setup, here’s a quick write-up on how I did it:
- quick and dirty way to install great expectations on the dask workers:
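
    # Runs once on every worker, including workers that join later.
    def install_ge():
        import os
        os.system("pip install great_expectations")

    # dask_client is an existing dask.distributed.Client
    dask_client.register_worker_callbacks(install_ge)
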
- cast the pandas dataframes underlying the dask dataframe to GE.PandasDataset and run the validation suite

    import pandas as pd
    from great_expectations.dataset import PandasDataset

    def run_suite(data, data_name, suite):
        # Validate one partition (a plain pandas DataFrame) against the suite.
        def run_partition(data_in):
            return pd.Series(PandasDataset(data_in).validate(expectation_suite=suite)["results"])

        results = data.map_partitions(run_partition).persist()
        for result in results.compute():
            # do something with the results, like aggregating them or the like
            pass

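The loop body in `run_suite` is left open. Here is a minimal sketch of one way to fill it in, assuming each per-partition result behaves like a dict with a `success` flag and an `expectation_config` holding `expectation_type` and `kwargs` (check this against the result format of your GE version). The `aggregate_results` helper and the sample data are made up for illustration:

```python
from collections import defaultdict


def aggregate_results(results):
    """Count, per expectation, how many partitions ran it and how many failed.

    Assumption: each result is dict-like with "success" and
    "expectation_config" ("expectation_type" and "kwargs") entries.
    """
    summary = defaultdict(lambda: {"partitions": 0, "failed": 0})
    for result in results:
        config = result["expectation_config"]
        # Key on type plus column so the same expectation on
        # different columns is tracked separately.
        key = (config["expectation_type"], config["kwargs"].get("column"))
        summary[key]["partitions"] += 1
        if not result["success"]:
            summary[key]["failed"] += 1
    # An expectation counts as passed only if no partition failed it.
    return {
        key: dict(counts, success=counts["failed"] == 0)
        for key, counts in summary.items()
    }


# Hypothetical results from two partitions, for illustration only.
sample = [
    {"success": True,
     "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null",
                            "kwargs": {"column": "a"}}},
    {"success": False,
     "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null",
                            "kwargs": {"column": "a"}}},
]
summary = aggregate_results(sample)
```

Treating an expectation as failed when any partition fails is only sensible for row-wise expectations such as null checks.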
More complex expectations won’t really work this way, e.g. ones that compare the relative amounts of things, because each partition is validated on its own and the values might not be spread evenly across partitions.
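To make the problem concrete, here is a small pure-Python sketch (no GE or Dask involved; the helper names, data, and threshold are invented for illustration). A check that at least 40% of values equal "a" holds on the whole column but fails on one of the two partitions:

```python
def fraction_matching(values, target):
    """Share of entries in values equal to target."""
    return sum(1 for v in values if v == target) / len(values)


def passes(values, target, min_fraction):
    """True if at least min_fraction of values equal target."""
    return fraction_matching(values, target) >= min_fraction


# Two partitions of the same column with very different mixes.
partitions = [
    ["a"] * 9 + ["b"],   # mostly "a"
    ["a"] + ["b"] * 9,   # mostly "b"
]
whole_column = [v for part in partitions for v in part]

# Evaluated on the full data versus partition by partition.
global_ok = passes(whole_column, "a", 0.4)
per_partition_ok = [passes(part, "a", 0.4) for part in partitions]
```

For expectations like this, the per-partition verdicts cannot simply be combined. One option is to collect raw counts per partition and evaluate the ratio once on the totals.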