I am just getting started with GE and was wondering whether there are elegant ways of validating wide-form data (20k columns), or whether it should be transformed to long-form. I want to avoid moving it to long-form because some properties are lost in that shape. The data frame is samples x gene measures.
Thanks for posting the question here! I don’t think we have a go-to answer for this yet, but here are my thoughts. I would consider three things here: 1) ease of creating expectations, 2) performance when validating data, and 3) limiting output to relevant data.
I’m assuming you have different types of data in each column, usually something like numeric types, strings, booleans, etc. The automated profiler that the great_expectations suite scaffold command uses should in theory be able to produce reasonable expectations for data of any size, though I’m not sure how well it would handle such a large number of columns. Plus, it might not necessarily generate exactly the expectations you want. But if you haven’t tried suite scaffold yet, I’d say give it a shot!
Another option is to generate expectations programmatically by looping over your columns and creating a default set of expectations based on the data type and/or naming convention of each column. Something like (forgive my hacky pseudo-code):
    batch = ...  # create a batch
    for c in batch.columns:
        if batch[c].dtype.kind == "i":  # integer column
            x = batch[c].min()
            y = batch[c].max()
            batch.expect_column_values_to_be_between(column=c, min_value=x, max_value=y)
        if c.endswith("_flag"):
            # assume we have columns named "abc_flag" that are coded 1/0 or Y/N;
            # create some expectations for a boolean flag, e.g.
            batch.expect_column_values_to_be_in_set(column=c, value_set=[0, 1, "Y", "N"])
This basically means that you need to have a good grasp of what expectations you’d want for what data type or naming convention. As a heads up, we’re working on making the profiler much better at doing these kinds of things!
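To make the loop above a bit more concrete, here is a minimal, self-contained sketch that builds expectation configurations as plain dicts from a pandas DataFrame. The function name `build_expectations` and the dict layout are my own for illustration; in a real project you would feed these into your suite via whatever GE API version you are on.

```python
import pandas as pd

def build_expectations(df):
    """Loop over columns and emit a default expectation config per
    data type / naming convention (sketch, not a GE API call)."""
    expectations = []
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            # numeric column: bound values by the observed min/max
            expectations.append({
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {
                    "column": col,
                    "min_value": int(df[col].min()),
                    "max_value": int(df[col].max()),
                },
            })
        if col.endswith("_flag"):
            # flag column: restrict values to the observed set
            expectations.append({
                "expectation_type": "expect_column_values_to_be_in_set",
                "kwargs": {
                    "column": col,
                    "value_set": sorted(df[col].unique().tolist()),
                },
            })
    return expectations

# tiny example frame standing in for the 20k-column data
df = pd.DataFrame({"gene_a": [1, 5, 3], "qc_flag": ["Y", "N", "Y"]})
for e in build_expectations(df):
    print(e["expectation_type"], e["kwargs"]["column"])
```

The advantage of generating plain configs first is that you can inspect or count them before registering anything, which matters when a loop over 20k columns can easily produce more expectations than you intended.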
Assuming you have 20k columns with a few expectations each, you might want to think about creating multiple expectation suites and then running validation in parallel for efficiency.
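One way to sketch that split-and-parallelize idea: chunk the column list, treat each chunk as its own suite, and validate the suites concurrently. `validate_suite` below is a hypothetical placeholder for whatever actually runs validation in your setup (e.g. a GE checkpoint); the chunking and the thread pool are the point.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_columns(columns, chunk_size):
    """Yield successive chunks of the column list, one per suite."""
    for i in range(0, len(columns), chunk_size):
        yield columns[i:i + chunk_size]

def validate_suite(cols):
    # placeholder: run validation for one suite and return a summary
    return {"columns": len(cols), "success": True}

columns = [f"gene_{i}" for i in range(20_000)]
suites = list(chunk_columns(columns, 1_000))  # 20 suites of 1k columns each

# validate suites in parallel with a small worker pool
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(validate_suite, suites))
```

Whether threads or processes win depends on where your validation spends its time; if it is mostly pandas work holding the GIL, a `ProcessPoolExecutor` may be the better fit.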
In terms of limiting the amount of output to look at, one option would be to split your expectations into separate suites. E.g. you have one “important” suite for expectations on columns that you want to look at on a regular basis, and a “less important” suite for columns you want to run validation on but not necessarily look at. In addition, Data Docs has a button on the left that lets the user toggle the validation results to only show failed validations. I’m not aware of a way to do that programmatically yet!
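A simple way to drive that two-suite split is to partition the column list by a naming convention before generating either suite. The `_qc`/`_flag` suffixes below are assumed markers for illustration; substitute whatever convention identifies your must-watch columns.

```python
# Hypothetical split into an "important" suite you review regularly
# and a "less_important" suite you validate but rarely inspect.
important_markers = ("_qc", "_flag")  # assumed naming convention

columns = ["gene_a", "gene_b", "batch_flag", "sample_qc"]
suites = {"important": [], "less_important": []}
for col in columns:
    # str.endswith accepts a tuple of suffixes
    key = "important" if col.endswith(important_markers) else "less_important"
    suites[key].append(col)

print(suites)
```

Each of the two lists can then feed the programmatic expectation-generation loop from earlier, producing one suite per list.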
Let me know if this points you in the right direction. Also, would love to hear from other users who deal with this kind of data.
Thanks for the pointers. They are incredibly helpful. I am almost able to run the expectations. It would be great if output for a particular kind of expectation could be grouped. E.g., I want to do a type check on the columns; currently this generates 20k separate expectations and the results become unmanageable to navigate. If I could have just one grouped expectation for checking nulls, that would work out.
These would make Data Docs useful again if I use solution 1. Every check generates an individual expectation per column, so with just 5 checks per column I get 100k expectations, which makes the output too large. I cannot even click "failed only"; the browser crashes before I am able to do it.
That’s a great point @aborah - Data Docs is currently not really designed to handle very large numbers of Expectations in a Suite. I can imagine something like grouping all expectations by type and only showing them for failed columns, perhaps (which would still be an issue if a lot of them fail). Feel free to file a GitHub issue with a design proposal for this!