I am just getting started with GE and was wondering whether there are elegant ways of validating wide-form data (20k columns), or whether it should be transformed to long form. I want to avoid moving to long form because some properties are lost there. The data frame is samples x gene measures.
Thanks for posting the question here! I don’t think we have a go-to answer for this yet, but here are my thoughts. I would consider three things here: 1) ease of creating expectations, 2) performance when validating data, and 3) limiting output to relevant data.
I’m assuming you have different types of data in each column, usually something like numeric types, strings, booleans, etc. The automated profiler that the great_expectations suite scaffold command uses should, in theory, be able to provide reasonable expectations for data of any size, though I’m not sure how well it would handle such a large number of columns. Plus, it might not necessarily generate exactly the expectations you want. But if you haven’t tried using suite scaffold yet, I’d say give it a shot!
Another option is to generate expectations programmatically by looping over your columns and creating a default set of expectations for each one, depending on its data type and/or naming convention. Something like (forgive my hacky pseudo-code):
    import pandas as pd

    batch = ...  # create a batch (assumed here to be a Pandas-backed GE dataset)
    for c in batch.columns:
        if pd.api.types.is_integer_dtype(batch[c]):
            # bound integer columns by the range observed in this batch
            x = int(batch[c].min())
            y = int(batch[c].max())
            batch.expect_column_values_to_be_between(column=c, min_value=x, max_value=y)
        if c.endswith("_flag"):
            # assume we have columns named "abc_flag" that are coded 1/0 or Y/N
            # create some expectations for a boolean flag, e.g. expect_column_values_to_be_in_set
            ...
This basically means that you need to have a good grasp of what expectations you’d want for what data type or naming convention. As a heads up, we’re working on making the profiler much better at doing these kinds of things!
Assuming you have 20k columns with a few expectations each, you might want to think about creating multiple expectation suites and then running validation in parallel for efficiency.
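To make the "multiple suites in parallel" idea a bit more concrete, here is a minimal sketch using only the Python standard library. It splits the column names into groups and hands each group to a worker. Note that validate_chunk is a hypothetical placeholder, not a GE function: you would replace its body with whatever call validates one suite against your batch. The gene_N column names and the chunk size of 1000 are also just assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_columns(columns, chunk_size):
    """Split a list of column names into consecutive groups of at most chunk_size."""
    return [columns[i:i + chunk_size] for i in range(0, len(columns), chunk_size)]


def validate_chunk(chunk):
    """Hypothetical placeholder: validate the suite that covers `chunk`.

    Replace the body with your own validation call. Here it only echoes
    the columns it was given so that the sketch runs on its own.
    """
    return {"columns": chunk, "success": True}


def validate_in_parallel(columns, chunk_size=1000, max_workers=4):
    """Run validate_chunk over every column group using a thread pool."""
    chunks = chunk_columns(columns, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input chunks
        return list(pool.map(validate_chunk, chunks))


# Assumed column names, standing in for 20k gene measures
columns = [f"gene_{i}" for i in range(20_000)]
results = validate_in_parallel(columns, chunk_size=1000)
```

Threads are used here only to keep the sketch short. If validation turns out to be CPU-bound, a process pool or separate jobs per suite may work better, so treat this as a starting point rather than a recommendation.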
In terms of limiting the amount of output to look at, one option would be to split your expectations into separate suites. E.g. you have one “important” suite for expectations on columns that you want to look at on a regular basis, and a “less important” suite for columns you want to run validation on but not necessarily look at. In addition, Data Docs has a button on the left that lets the user toggle the validation results to only show failed validations. I’m not aware of a way to do that programmatically yet!
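Outside of Data Docs, one possible workaround is to filter the validation result yourself. The sketch below assumes the result has been turned into a plain dictionary with a "results" list whose entries each carry a "success" flag and an "expectation_config". That layout is an assumption on my part, and the example dictionary is hand-written for illustration, not real output, so please check it against what your GE version actually returns.

```python
def failed_results(validation_result):
    """Return only the per-expectation entries whose "success" flag is not True."""
    return [
        r for r in validation_result.get("results", [])
        if not r.get("success", False)
    ]


# Hand-written example mimicking the assumed layout (not real GE output)
example = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "gene_1"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": "gene_2", "min_value": 0, "max_value": 10},
            },
        },
    ],
}

# List the column and expectation type for each failed expectation
for r in failed_results(example):
    cfg = r["expectation_config"]
    print(cfg["kwargs"]["column"], cfg["expectation_type"])
```

With 20k columns, a short list like this is usually easier to scan than the full result, and it can be combined with the "important" / "less important" suite split described above.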
Let me know if this points you in the right direction. Also, would love to hear from other users who deal with this kind of data.