Page History
The ETL pathway (extraction, transformation & loading) can take up a notable amount of time for large datasets, which can sometimes lead to bottlenecks in the sense that e.g. a nightly loading of data is no longer possible. There are several possible optimizations which are not specific to i2b2, but have long been established in the classic data warehouse field.
Disabling Index Maintenance during Loading
As mentioned in the query optimization page, the i2b2 database schema contains several indexes to speed up query performance. When data is being loaded into indexed tables, the databases has to simultaneously update or extend both the actual table row data as well as the related index structures, in order to maintain index consistency at all times. Depending on the number of indexes this can incur a notable performance impact due to concurrent write operations across different areas of the physical hard drives. This effect becomes especially large when tables are not incrementally extended (like in a transactional setting), but truncated and fully loaded (like in a typical data warehouse setting).
...
Info |
---|
The script should be invoked with the credentials of the schema containing the OBSERVATION_FACT table Please note that the script already contains both the optimized indexes as well as the LOCAL option for partitioning, which may need to be removed depending on availability and use of the partitioning feature. |
Extending the COMMIT interval
Modern relational databases typically provide transactions to bundle write operations (INSERT, UPDATE) which are temporarily collected and then stored together when a "COMMIT" command is issued. Besides ensuring data consistency, this approach also allows the database to more efficiently store larger amounts of data at one time when compared to individually writing rows onto the disk.
...
Note |
---|
Please note that the database will store all statements running up to the next commit in a temporary area (e.g. the TEMP tablespace), which needs to be adequately sized to avoid tablespace overflows. |
Using a dedicated bulk loader
Many databases provide a dedicated "bulk loader" tool which is optimized for loading large amounts of source data "en bloc" into a table, e.g. Oracle SQL*Loader. Such loaders often provide additional performance boosts through bypassing transaction- and consistency mechanisms (e.g. "DIRECT PATH" and "UNRECOVERABLE" options), which effectively stream data directly onto the disk. In a data warehouse setting, these kinds of optimizations are generally considered safe to use as the database is not used in a transactional setting and write operations are centrally controlled by an ETL pathway. Use of a bulk loader with enabled optimizations can lead to notable performance improvements of up to 100x when compared to individual INSERT statements.
Bulk loaders usually require CSV source files containing the raw data to be loaded and a separate controlfile control file containing all required information about the destination table and columns as well as optimization options to be applied.
...
Info |
---|
It should be noted that bulk loaders are database specific. If compatibility across several databases is required, it may be necessary to either create several adapted versions of the ETL pathway (affecting only the "final" step of actually loading generated CSV files) or to not use a bulk loader. E.g., the IDRT pathways stick to JDBC components to achieve greater cross-database compatibility) |
Partially Loading Fact Data
Truncating and fully reloading fact table data is standard practice in data warehousing. However, for large tables this approach may not be feasible e.g. due to loading duration exceeding the available downtime of the warehouse.
Delta updates are an alternative approach which relies on timestamps or update flags in the source data to selectively insert new rows or update existing rows in the destination table. Depending on the volume of data changed between loading intervals this approach can notably reduce loading duration. However, a reliable source of timestamps/flags is required, which may be difficult to prove (is the source application reliably maintaining all creation or update timestamps with all possible transcationstransactions?). Alternatively, trigger-based mechanisms are possible to reliably carry over INSERTs or UPDATEs from a source system into the data warehouse. Delta updates are highly specific to the local setting and relevant source data, so their implementation details are out of scope of this guide.
...