Redshift Performance Issues

This post is adapted from our GitHub issue: Redshift Performance Issues #898

This issue captures discussion and evaluation of performance issues related to using Redshift with Great Expectations.

Currently, there are three broad known issues around Redshift integration:

  1. Introspection performance.
    • The SqlAlchemyDataset obtains metadata about the columns of the data it represents. The sqlalchemy_redshift driver obtains and caches metadata for all columns when asked for any, which adds significant wasted overhead.
    • The TableGenerator obtains metadata about all tables and views in the database to which it is connected. As above, the sqlalchemy_redshift driver obtains and caches much more data than is required.
  2. Type awareness.
    • The SqlAlchemyDataset uses the database driver module to check type names in the expect_column_values_to_be_of_type and expect_column_values_to_be_in_type_list expectations. The sqlalchemy_redshift driver exports those types only in versions after 0.7.5; it also may not suppress PostgreSQL types that are not actually supported in Redshift (e.g. inet).
  3. Approximation.
    • Redshift does not support the SQL percentile_disc function, but does offer an alternative, approximate percentile_disc. Currently, build_continuous_partition_object uses the approximate version when allow_relative_error is set to True; however, the resulting error may not be desirable or clear to all users.
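
To make the approximation trade-off concrete, here is a minimal sketch of how the choice between the exact and approximate percentile could be made explicit. It is not the Great Expectations implementation: the helper percentile_disc_sql and its parameters are hypothetical names used only for illustration, and the column name is assumed to be a trusted identifier. The only grounded pieces are the allow_relative_error flag described above and Redshift's APPROXIMATE PERCENTILE_DISC syntax.

```python
def percentile_disc_sql(column, quantile, dialect_name, allow_relative_error=False):
    """Build a discrete-percentile SQL expression (hypothetical helper).

    Assumption: `column` is a trusted identifier; a real implementation
    would quote it through the SQL toolkit instead of formatting a string.
    """
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")

    exact = f"PERCENTILE_DISC({quantile}) WITHIN GROUP (ORDER BY {column})"

    # Other dialects are assumed here to support the exact form.
    if dialect_name != "redshift":
        return exact

    # On Redshift only the approximate form is available, so require the
    # caller to opt in rather than silently returning an inexact result.
    if not allow_relative_error:
        raise ValueError(
            "Redshift only offers an approximate percentile_disc; "
            "pass allow_relative_error=True to accept the relative error."
        )
    return "APPROXIMATE " + exact


if __name__ == "__main__":
    print(percentile_disc_sql("price", 0.5, "redshift", allow_relative_error=True))
```

Raising an error when the flag is not set is one possible design choice; it trades convenience for making the loss of exactness visible to the user.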

All issues are tracked on our GitHub Issues page. Feel free to contribute to the discussion here: Redshift Performance Issues #898