This post is adapted from our GitHub issue page, seen here: Redshift Performance Issues #898
This issue captures discussion and evaluation of performance issues related to using Redshift with Great Expectations.
Currently, there are three broad known issues around Redshift integration:
- Introspection performance.
  - SqlAlchemyDataset obtains metadata regarding columns for the data it represents. The sqlalchemy_redshift driver obtains and caches all column data when asked for any, which causes significant (wasted) overhead.
  - TableGenerator obtains metadata regarding all tables and views for the database to which it is connected. As above, the sqlalchemy_redshift driver obtains and caches much more data than is required.
- Type awareness.
  - SqlAlchemyDataset uses the database driver module to check type names in expect_column_values_to_be_in_type_list expectations. The sqlalchemy_redshift driver does not export those types until after version 0.7.5; it also may not suppress PostgreSQL types that are not actually supported in Redshift (e.g. inet).
- Redshift does not support the SQL percentile_disc function, but does offer an alternative, approximate percentile_disc. Currently, build_continuous_partition_object will use the approximate version when allow_relative_error is set to True; however, the resulting error may not be desirable or clear to all users.
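To make the percentile trade-off in the last item concrete, here is a minimal sketch of how an opt-in to the approximate function might look. The helper name `percentile_disc_sql` and its parameters are hypothetical and are not part of Great Expectations. The `APPROXIMATE PERCENTILE_DISC ... WITHIN GROUP` form is Redshift's syntax for the approximate aggregate. The sketch assumes, as described above, that no exact version is available, so it fails loudly rather than approximating silently.

```python
def percentile_disc_sql(column, quantile, allow_relative_error=False):
    """Build a Redshift SQL expression for a discrete percentile of `column`.

    Hypothetical helper for illustration only; not Great Expectations code.
    """
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")

    if not allow_relative_error:
        # Assumption from the discussion above: Redshift has no exact
        # percentile_disc, so refuse instead of approximating silently.
        raise ValueError(
            "Exact percentile_disc is not available on Redshift; "
            "pass allow_relative_error=True to accept an approximation."
        )

    # Quote the identifier and escape any embedded double quotes.
    quoted = '"' + column.replace('"', '""') + '"'
    return (
        f"APPROXIMATE PERCENTILE_DISC({quantile}) "
        f"WITHIN GROUP (ORDER BY {quoted})"
    )


if __name__ == "__main__":
    print(percentile_disc_sql("price", 0.5, allow_relative_error=True))
```

Requiring the caller to pass `allow_relative_error=True` explicitly is one way to make the approximation visible to users, which is the concern raised above.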
All issues are handled on our GitHub Issues page. Feel free to contribute to this discussion here: Redshift Performance Issues #898