Pros and cons of implementing DQ within the data flow vs. after the data is loaded into the data lake

Hello Team,

I am confused about the implementation of GE. Should it be implemented within the data pipeline (e.g., NiFi / Airflow) before the data is loaded into the target system, or should it be implemented after all the data is loaded into a data lake like Snowflake or S3?

Can you highlight some of the pros and cons of each approach?

Also, will the above approach change for streaming data vs. batch data loads, or will it be the same?



You’re really hitting the mark with good questions today. :slight_smile:

Great Expectations supports both of these patterns, and they’re both widely implemented by teams that have deployed Great Expectations.

Let’s call them the Within-Pipeline versus Within-Store patterns.

Within-pipeline testing

In general, Within-Pipeline will give you more control, since you can do things like:

  1. halt data processing if you discover serious errors during validation
  2. trigger follow-up processing depending on specific validation results (e.g. if aggregate expectations show significant feature drift in an ML model)
  3. directly control the processing of bad rows (e.g. kick off notifications for triage workflows)

It’s also often easier to configure things like run_ids directly within the pipeline, so that your validation records are fully integrated with your pipeline execution.
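To make the Within-Pipeline idea concrete, here's a minimal, schematic sketch of a validation step that halts processing on serious errors and routes bad rows to triage. This is plain Python, not Great Expectations' actual API; every name (`validate_batch`, `run_pipeline_step`, `notify_triage`, the `max_failure_rate` threshold) is an illustrative placeholder for what a real Checkpoint-in-pipeline step would do.

```python
# Schematic Within-Pipeline validation step (illustrative only, not the GE API).
# Validation runs *inside* the pipeline, so it can halt processing, trigger
# follow-up actions, and control what happens to bad rows.

def validate_batch(rows):
    """Split rows into (passed, failed) against a toy expectation:
    every row must have a non-empty 'id' and a non-negative 'amount'."""
    passed, failed = [], []
    for row in rows:
        if row.get("id") and row.get("amount", -1) >= 0:
            passed.append(row)
        else:
            failed.append(row)
    return passed, failed


def notify_triage(bad_rows):
    # Placeholder for a real notification / quarantine workflow.
    print(f"{len(bad_rows)} rows sent to triage")


def run_pipeline_step(rows, max_failure_rate=0.1):
    passed, failed = validate_batch(rows)
    failure_rate = len(failed) / len(rows) if rows else 0.0
    if failure_rate > max_failure_rate:
        # 1. Halt data processing on serious validation errors.
        raise RuntimeError(f"validation failed: {failure_rate:.0%} bad rows")
    if failed:
        # 3. Directly control the processing of bad rows.
        notify_triage(failed)
    # Only validated rows continue downstream.
    return passed
```

The key design point is that the validation result feeds directly back into pipeline control flow, which is exactly what you give up with pure Within-Store testing.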

The main downside that I’ve seen is that it sometimes takes a little more work to configure Within-Pipeline validation, since you have to configure Checkpoints and deploy them throughout your pipeline. This isn’t a huge amount of work (and we’re working to streamline it further), but it can take a little more time to get started.

Within-store testing

The advantages on the Within-Store side are mostly about faster setup and fewer required permissions.

For example, once you’ve configured a SQL datasource, you can iterate over all the important tables and Profile them to generate candidate test suites and descriptive stats. (GE’s Profilers are still pretty rough, but we’re actively working to make them better. In the meantime, you can always extend/improve them yourself.)
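The "iterate over tables and profile them" idea can be sketched in a few lines. This is a deliberately simplistic stand-in for GE's Profilers, using stdlib sqlite3 in place of a real warehouse connection; the `profile_tables` function and the stats it collects are assumptions for illustration.

```python
# Schematic Within-Store profiling sketch (not GE's Profiler): connect to the
# store, enumerate its tables, and gather basic descriptive stats that could
# seed candidate test suites. sqlite3 stands in for a real SQL warehouse.
import sqlite3


def profile_tables(conn):
    """Return {table_name: {"row_count": int, "columns": [names]}}."""
    stats = {}
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    for (table,) in cur.fetchall():
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        row_count = cur.fetchone()[0]
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
        cur.execute(f"PRAGMA table_info({table})")
        columns = [col[1] for col in cur.fetchall()]
        stats[table] = {"row_count": row_count, "columns": columns}
    return stats
```

The point is the low barrier to entry: read access to the store is all you need, with no hooks into pipeline code required.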

Similarly, we sometimes see teams that don’t have access to production pipeline code. It might be managed by a different team, or even a data vendor. In that case, Within-Store testing might be the only feasible option. Or it might be a quick way to set up a proof of concept to convince the upstream team to grant more access and enable a more powerful Within-Pipeline implementation.

To generalize a little bit, I’d say the most common pattern is starting Within-Store, then adding Within-Pipeline Checkpoints over time.

Thanks for the prompt replies. I will get back if I have further questions.
