IDRT - Integrated Data Repository Toolkit
Space shortcuts
Space Tools
IDRT - Integrated Data Repository Toolkit IDRT

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

In order to validate the optimizations described in the page on query optimization, several tests were carried out applying individual as well as combined measures. The methods and results are described on this page both to give an estimate of the performance benefits that can be achieved as well as allow other groups to repeat these measurements in their own setting.

Validation Environment

Validation was carried out using a large dataset (19GB CSV clinical raw data, resulting in 63GB after loading, including indexes). To eliminate side-effects due to caching mechanisms both of the underlying operating systems as well as the database, a virtualized environment (Oracle VirtualBox) was use on a standalone PC without concurrently running applications. Separate virtual machines (VMs) were created for the database (Oracle 11g) and the i2b2 application server. Snapshots were taken after loading, and after each measurement the VMs were returned to those snapshots before applying the next optimization method or combination thereof.

For each measurement, queries were manually constructed in the i2b2 webclientWeb Client, started, and the run time displayed by the application noted. For each query and optimization method, 2 separate measurements were carried out (returning to the snapshots in between), with the average of both measurements taken as the final value. The queries were constructed to reflect the impact of the optimization methods (e.g. one query combined rare and common data elements to show the effect of query strategy optimization, another query combined low- and high volume data elements to show the effect of partitioning; another query contained modifiers to show the effect of index optimization).

Validation Results

Query strategy optimization (through population of the C_TOTALNUM column) yielded performance improvements of 3x for one query, but led to unchanged or degraded performance in 3 other queries. This effect was surprising and may be related to the assumption that the data element frequencies measured on the overall dataset may no longer be accurate for consecutive query steps, which apply only to chosen subsets of the population.

...

The combination of all optimizations yielded performance gains of up to 78x across all queries. The slightly diminished yield compared to 80x with partitioning may be due to negative effects of the query strategy optimization.

Conclusions

Index optimization yielded the highest performance gains, which can be achieved with little change to the basic setup of i2b2: only one index needs to be replaced, and data does not need to be reloaded. Also, there is no enterprise or partioning partitioning feature license required for index optimization. Partitioning may provide further performance gains with larger datasets, which was not verified in our project.

IDRT - Integrated Data Repository Toolkit IDRT