For example, say I have 5 columns [a, b, c, d, e] that are always present in a dataset, but sometimes there are additional columns too, so the column list could look like [a, b, c, d, e, x, y, z] or like [a, b, c, d, e, t, s, r]. Is there a way to validate that the first 5 columns are always there without worrying about the additional columns?
As of v0.13.2, we don’t have functionality in expect_table_columns_to_match_ordered_list that would allow you to ignore columns after the ordered list. So the workaround would be to create an expect_column_to_exist expectation for each predefined column, setting column_index to the position where that column is expected. This will check that the first 5 columns are there in the expected order and will ignore the rest. This workaround won’t be ideal for huge tables with hundreds of columns. A contribution would be welcome to add a configuration option to the expect_table_columns_to_match_ordered_list expectation so that it can ignore additional columns.
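To avoid writing the per-column expectations out by hand, they can be generated in a loop. Here is a minimal sketch in plain Python that builds one configuration per column in the expectation_type/kwargs shape used in expectation suite JSON. The helper name build_prefix_expectations is made up for illustration, and the sketch assumes column_index is zero-based, so please check that against the docs for your version.

```python
# Hypothetical helper (not part of Great Expectations): builds one
# expect_column_to_exist configuration per required column.
def build_prefix_expectations(required_columns):
    return [
        {
            "expectation_type": "expect_column_to_exist",
            # Assumption: column_index is zero-based, so the first
            # required column is expected at position 0.
            "kwargs": {"column": name, "column_index": index},
        }
        for index, name in enumerate(required_columns)
    ]


expectations = build_prefix_expectations(["a", "b", "c", "d", "e"])

# Each dict could then be added to a suite; how you add it depends on
# your Great Expectations version and setup.
for config in expectations:
    print(config)
```

Because only the first five positions are pinned, any trailing columns such as x, y, z are not looked at.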
@bhcastleton - it’s not exactly the same, but you could use expect_table_columns_to_match_set(column_set=[a, b, c, d, e], exact_match=False) to make sure that the columns [a, b, c, d, e] exist. By setting exact_match=False, this expectation will still return success=True if there are extra columns in the dataset that are not in the column_set.
The difference between this and what you are describing is that this method does not check order of these columns.
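To make the difference concrete, here is a small plain-Python model of the behaviour described above. It is only a sketch of the semantics, not the library's implementation, and the function name columns_match_set is made up for illustration.

```python
# Plain-Python model of the set-based check described above.
# Not Great Expectations code; the function name is hypothetical.
def columns_match_set(actual_columns, column_set, exact_match=True):
    actual = set(actual_columns)
    expected = set(column_set)
    if exact_match:
        # Every column must match, with no extras and none missing.
        return actual == expected
    # Extra columns are allowed; only the expected ones must be present.
    return expected.issubset(actual)


required = ["a", "b", "c", "d", "e"]

# Extra trailing columns are ignored when exact_match=False.
with_extras = columns_match_set(
    ["a", "b", "c", "d", "e", "x", "y", "z"], required, exact_match=False
)

# Order is not checked: a shuffled table still passes.
shuffled = columns_match_set(
    ["e", "d", "c", "b", "a", "t"], required, exact_match=False
)

# A missing required column fails.
missing = columns_match_set(["a", "b", "c", "d", "x"], required, exact_match=False)

print(with_extras, shuffled, missing)
```

So if the position of the first five columns matters, the set-based check alone is not enough and a per-column index check is still needed.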
Nice @anthony! Yes, that solution might work better in cases where there are lots of columns. Great idea.
Thanks for the suggestions. I played around a bit and submitted a PR: https://github.com/great-expectations/great_expectations/pull/2200
I mostly did this as an exercise to see if I could get to grips with parts of the codebase, so I won’t be offended if it needs work.