A super-simple alternative introduction to Great Expectations

In a recent Slack conversation, Ian pointed out that the documentation for Great Expectations is kinda overwhelming.

It leans heavily into setting up Great Expectations (GE) in production, which means learning abstractions such as DataContexts and ValidationOperators up front.

However, if you’re just getting your feet wet with the library, that’s a lot to take in. Maybe there’s a simpler way.

Ian threw together two super simple scripts to get started. They only use Expectations and the validate method. Nothing else.

What do you think? Would this be a good way to help people get oriented to Great Expectations at the very beginning?

MVP Great Expectations in pandas:

import great_expectations as ge

# Build up expectations on a sample dataset and save them
train = ge.read_csv("data/npi.csv")
train.expect_column_values_to_not_be_null("NPI")
train.save_expectation_suite("npi_csv_expectations.json")

# Load in a new dataset and test them
test = ge.read_csv("data/npi_new.csv")
validation_results = test.validate(expectation_suite="npi_csv_expectations.json")

if validation_results["success"]:
    print("giddy up!")
else:
    raise Exception("oh shit.")
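
If the check fails, it helps to know which expectation tripped rather than just raising. Here is a rough sketch of a helper for that. It assumes the validation result behaves like a dict with a "results" list, where each entry has "success" and an "expectation_config" holding "expectation_type" and "kwargs". That shape is an assumption about the legacy API and may differ between versions. The function name failed_expectations and the sample dict are made up for illustration; the sample is hand-written, not real library output.

```python
def failed_expectations(validation_results):
    """Return (expectation_type, kwargs) for each expectation that failed.

    Assumes a dict-like result with a "results" list (see note above).
    """
    failures = []
    for result in validation_results.get("results", []):
        if not result["success"]:
            config = result["expectation_config"]
            failures.append((config["expectation_type"], config["kwargs"]))
    return failures


# Hand-written sample in the assumed shape, NOT real output from the library
sample = {
    "success": False,
    "results": [
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "NPI"},
            },
        }
    ],
}

for expectation_type, kwargs in failed_expectations(sample):
    print(f"FAILED: {expectation_type} {kwargs}")
```

In the script above you could call failed_expectations(validation_results) in the else branch and put the list in the exception message.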

MVP Great Expectations for SQLAlchemy:

import os
from great_expectations.dataset import SqlAlchemyDataset
from sqlalchemy import create_engine

# SQLAlchemy 1.4+ only accepts the "postgresql://" scheme, not "postgres://"
db_string = "postgresql://{user}:{password}@{host}:{port}/{dbname}".format(
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    port=os.environ["DB_PORT"],
    dbname=os.environ["DB_DBNAME"],
    host=os.environ["DB_HOST"],
)

db_engine = create_engine(db_string)

# Build up expectations on a table and save them
sql_dataset = SqlAlchemyDataset(table_name='my_table', engine=db_engine)
sql_dataset.expect_column_values_to_not_be_null("id")
sql_dataset.save_expectation_suite("postgres_expectations.json")

# Load in a subset of the table and test it
sql_query = """
    select *
    from my_table
    where created_at between date'2019-11-07' and date'2019-11-08'
"""
new_sql_dataset = SqlAlchemyDataset(custom_sql=sql_query, engine=db_engine)
validation_results = new_sql_dataset.validate(
    expectation_suite="postgres_expectations.json"
)

if validation_results["success"]:
    ...