Universal vs project-focused
One of the basic considerations when starting to design an i2b2 project ontology is the primary target audience.
Depth of the navigation
In most cases, it is advisable to keep the original depth of navigation. Depending on the source format of the data, however, this can lead to very deep hierarchies. An ODM import, for example, yields seven levels of hierarchy for a clinical trial:
- Study
  - MetaDataVersions
    - Events
      - Forms
        - ItemGroups
          - Items
            - CodeLists
Some levels can simply be deleted: if there is only one MetaDataVersion or only one event, for example, that level can be omitted.
The first level "Ontology" can also be renamed to something more expressive.
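Removing levels that hold only a single entry can be sketched as follows. This is a minimal illustration, not part of i2b2 or any ODM importer: the function name `drop_constant_levels` and the representation of each concept as a tuple of level names are assumptions made for the example.

```python
def drop_constant_levels(paths, keep=(0,)):
    """Remove hierarchy positions at which every path carries the same name.

    `paths` is a list of tuples of level names (hypothetical representation).
    Positions listed in `keep` are never removed; by default the root stays.
    """
    if not paths:
        return []
    # Only positions present in every path can be compared.
    depth = min(len(p) for p in paths)
    constant = {
        i for i in range(depth)
        if i not in keep and len({p[i] for p in paths}) == 1
    }
    return [
        tuple(name for i, name in enumerate(p) if i not in constant)
        for p in paths
    ]


# Example: one MetaDataVersion and one event, so both levels disappear.
paths = [
    ("Study", "MetaDataVersion 1", "Visit 1", "Demographics", "Sex"),
    ("Study", "MetaDataVersion 1", "Visit 1", "Demographics", "Birth year"),
    ("Study", "MetaDataVersion 1", "Visit 1", "Vitals", "Weight"),
]
for path in drop_constant_levels(paths):
    print(path)
```

The root level is kept here so that it can still be renamed to something more expressive afterwards.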
Another idea is to abandon hierarchy levels that were useful for data collection but not for presentation. On the other hand, no more than 250 concepts should be listed under a single i2b2 folder, for reasons of usability.
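When flattening a hierarchy, it may help to check whether any folder ends up with too many direct children. The sketch below assumes, purely for illustration, that each concept is available as a tuple of level names; the function name `oversized_folders` is hypothetical, and the limit of 250 is taken from the guideline above.

```python
from collections import defaultdict

MAX_CHILDREN = 250  # usability limit suggested in the text


def oversized_folders(paths, limit=MAX_CHILDREN):
    """Return {folder: number of direct children} for folders above the limit."""
    children = defaultdict(set)
    for path in paths:
        # Register every element of the path as a child of its parent folder.
        for i in range(1, len(path)):
            children[path[:i]].add(path[i])
    return {
        parent: len(kids)
        for parent, kids in children.items()
        if len(kids) > limit
    }


# Example: 300 codes directly under one folder exceed the limit.
paths = [("Diagnoses", f"Code {n}") for n in range(300)]
paths.append(("Labs", "Glucose"))
print(oversized_folders(paths))
```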
Splitting large value sets without natural hierarchy
A concept might have a large number of possible values without a normative hierarchy. Examples are code systems like zip codes, genetic information, or costs for billing. In this case, it is not feasible to represent every possible code: 1 Euro, 2 Euro, 3 Euro, ...
The basic idea is to find an artificial, yet reasonable, substitute ontology. A possible solution is shown in the Boston demo data: not every plausible patient age is coded directly below age; instead, there is an intermediate level for every decade (0-9 years, 10-19 years, ...). This way, one can select a whole range of ages with one click.
Splitting ages into decades might be suboptimal for some use cases in clinical research. For instance, when recruiting participants for clinical trials, inclusion criteria hardly ever match decades. In most scenarios, it would be better to have categories like 0-17 years (stricter regulations for trials with minors), 18-80 years, and 81-130 years (special screening for the elderly). Categories should have subcategories where appropriate: 18-80 years might be further split into 18-49, 50-65, and 66-80 years.
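The age categories suggested above can be written down as a small lookup table. The following is only a sketch: the table `AGE_CATEGORIES`, the function `age_path`, and the label format are assumptions for illustration, and the boundaries are simply the ones named in the text.

```python
# (label, lowest age, highest age, optional subcategories)
AGE_CATEGORIES = [
    ("0-17 years", 0, 17, []),
    ("18-80 years", 18, 80, [
        ("18-49 years", 18, 49),
        ("50-65 years", 50, 65),
        ("66-80 years", 66, 80),
    ]),
    ("81-130 years", 81, 130, []),
]


def age_path(age):
    """Return the list of folder labels under which a single age is filed."""
    for label, low, high, subcategories in AGE_CATEGORIES:
        if low <= age <= high:
            path = ["Age", label]
            for sub_label, sub_low, sub_high in subcategories:
                if sub_low <= age <= sub_high:
                    path.append(sub_label)
                    break
            path.append(f"{age} years")
            return path
    raise ValueError(f"age {age} is outside the covered range 0-130")


print(age_path(42))
print(age_path(10))
```

Keeping the boundaries in one table makes it easy to adapt them to the inclusion criteria of a specific project.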
Costs could have categories at the level of ten thousands, thousands, hundreds, and so on. Alphanumeric codes could have categories defined by their first character (0-9, A-Z). Postal codes could have states and counties as classifying attributes.
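Two of these artificial hierarchies can be sketched in a few lines. The function names `cost_path` and `code_path`, the folder labels, and the restriction to non-negative whole-euro amounts are assumptions made for this example.

```python
def cost_path(amount_eur):
    """File a whole-euro amount under ten-thousands, thousands and hundreds."""
    if amount_eur < 0:
        raise ValueError("amount must be non-negative")
    path = ["Costs"]
    for size in (10_000, 1_000, 100):
        low = (amount_eur // size) * size  # lower bound of the bucket
        path.append(f"{low}-{low + size - 1} Euro")
    path.append(f"{amount_eur} Euro")
    return path


def code_path(code):
    """File an alphanumeric code under its first character (0-9, A-Z)."""
    if not code:
        raise ValueError("code must not be empty")
    return ["Codes", code[0].upper(), code]


print(cost_path(12345))
print(code_path("e11.9"))
```

Each intermediate folder holds at most ten subfolders or one hundred leaf values, which stays well below the limit of 250 concepts per folder.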