You’re really hitting the mark with good questions today.
Great Expectations supports both of these patterns, and both are widely used by teams that have deployed it.
Let’s call them the Within-Pipeline versus Within-Store patterns.
In general, Within-Pipeline will give you more control, since you can do things like:
- halt data processing if you discover serious errors in data validation
- trigger followup processing depending on specific validation results (e.g. if aggregate expectations show significant feature drift in an ML model)
- directly control the processing of bad rows (e.g. kick off notifications, etc. for triage workflows)
It’s also often easier to configure things like run_ids directly within the pipeline, so that your validation records are fully integrated with your pipeline execution.
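To make the Within-Pipeline control points concrete, here’s a rough sketch in plain Python. It deliberately does not use the Great Expectations API: `validate_rows`, `run_step`, `ValidationHalt`, the `amount` field, and the 10% threshold are all hypothetical names and values I made up for illustration. In a real deployment, `validate_rows` is where you would run a Checkpoint.

```python
import uuid
from dataclasses import dataclass, field


class ValidationHalt(Exception):
    """Raised to stop the pipeline when validation fails too badly to continue."""


@dataclass
class ValidationResult:
    run_id: str
    success: bool
    bad_rows: list = field(default_factory=list)


def validate_rows(rows, run_id):
    # Hypothetical stand-in for running a Checkpoint against a batch.
    # Here a row is "bad" if its amount is missing or negative.
    bad_rows = [r for r in rows if r.get("amount") is None or r["amount"] < 0]
    return ValidationResult(run_id=run_id, success=not bad_rows, bad_rows=bad_rows)


def run_step(rows, run_id, max_bad_fraction=0.1, notify=print):
    """Validate a batch, then halt, triage, or pass it through."""
    result = validate_rows(rows, run_id)
    if result.success:
        return rows

    bad_fraction = len(result.bad_rows) / len(rows)
    if bad_fraction > max_bad_fraction:
        # Serious failure: stop processing instead of propagating bad data.
        raise ValidationHalt(f"run {run_id}: {bad_fraction:.0%} of rows failed validation")

    # Minor failure: send the bad rows to triage and continue with the rest.
    notify(f"run {run_id}: {len(result.bad_rows)} bad row(s) sent to triage")
    bad_ids = {id(r) for r in result.bad_rows}
    return [r for r in rows if id(r) not in bad_ids]


# The pipeline generates the run_id, so validation records can share it.
run_id = f"nightly-{uuid.uuid4().hex[:8]}"
batch = [{"amount": 10} for _ in range(19)] + [{"amount": -1}]
clean = run_step(batch, run_id)
```

The point is just that the pipeline owns the decision: the same validation result can halt the run, trigger a notification, or filter rows, and the pipeline’s own run_id ties it all together.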
The main downside that I’ve seen is that Within-Pipeline validation takes a little more work to get started, since you have to configure Checkpoints and deploy them throughout your pipeline. This isn’t a huge amount of work (and we’re working to streamline it further).
The advantages on the Within-Store side are mostly about faster setup and fewer required permissions.
For example, once you’ve configured a SQL datasource, you can iterate over all the important tables and Profile them to generate candidate test suites and descriptive stats. (GE’s Profilers are still pretty rough, but we’re actively working to make them better. In the meantime, you can always extend/improve them yourself.)
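As a rough illustration of the "iterate over tables and collect descriptive stats" idea, here’s a minimal sketch that uses only Python’s built-in `sqlite3` module. This is not GE’s Profiler API; `profile_tables` is a hypothetical helper, SQLite stands in for whatever SQL datasource you have, and the stats it collects (row count, nulls, distinct values, min, max) are just an example of what you might turn into candidate expectations.

```python
import sqlite3


def quote_identifier(name):
    # Table and column names cannot be bound as SQL parameters, so quote them.
    return '"' + name.replace('"', '""') + '"'


def profile_tables(conn):
    """Return {table: {column: stats}} for every user table in a SQLite database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    profile = {}
    for table in tables:
        t = quote_identifier(table)
        # Column name is the second field of each PRAGMA table_info row.
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({t})")]
        profile[table] = {}
        for column in columns:
            c = quote_identifier(column)
            rows, nulls, distinct, min_value, max_value = conn.execute(
                f"SELECT COUNT(*), COUNT(*) - COUNT({c}), "
                f"COUNT(DISTINCT {c}), MIN({c}), MAX({c}) FROM {t}"
            ).fetchone()
            profile[table][column] = {
                "rows": rows,
                "nulls": nulls,
                "distinct": distinct,
                "min": min_value,
                "max": max_value,
            }
    return profile


# Usage: build a tiny in-memory database and profile it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, None), (3, 5.0)])
stats = profile_tables(conn)
```

Note that this only needs read access to the store, which is a big part of why the Within-Store pattern is quick to set up.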
Similarly, we sometimes see teams that don’t have access to production pipeline code. It might be managed by a different team, or even a data vendor. In that case, Within-Store testing might be the only feasible option. Or it might be a quick way to set up a proof-of-concept to convince the upstream team to grant more access to enable a more powerful Within-Pipeline implementation.
To generalize a little bit, I’d say that the most common pattern is starting Within-Store, then adding Within-Pipeline Checkpoints over time.