Skip to content

Ontology Integration in RDBMS

Damion Dooley edited this page Jul 4, 2018 · 15 revisions

An ontology can be implemented within or alongside a traditional relational database management system (RDBMS) to varying degrees. At a minimum, use of ontology ids provides disambiguation about the content of a record and offers standards-driven choices; at the other extreme a full ontology representation implementation directly enables federated data querying according to Semantic Web standards. Lesser implementations require extra work to enable ontology-driven querying but may be satisfactory for data exchange.

The approaches below all depend on the reliable mapping of database content to ontology term identifiers available online as Uniform Resource Identifiers (URIs, like http://purl.obolibrary.org/obo/HP_0012735. This can be shortened with the use of namespace prefixes, e.g. "HP:0012735" which data export and reporting functions can expand back to a term's full URI. Note that the JSON-LD data format is encouraged because it is suited to prefix documentation - see compact IRIs. OBOFoundry maintains a list of its ontology prefixes in JSON-LD here.

Tagging or embedding

Ontology terms can be placed directly within words or phrases of text they have been matched to. Such tagged text can be placed directly within database fields, and can be extracted for search indexing or display hyperlinking; some care is required to ensure text expressions don't conflict with the tagging mechanism.

BioCompute objects (https://hive.biochemistry.gwu.edu/htscsrs/biocompute) illustrate this, e.g. https://hive.biochemistry.gwu.edu/tst/biocompute/view/1298/) where ontology terms are imbedded in brackets in the name field:

{
"object_id": 1298, 
"type_id": 381, 
"name": "Identification of recombinant antihemophilic factor VII [UniProt:P00451] inhibitor SNPs [SO:0000694] in human [taxID:9606] blood [UBERON:0000178] extracted from patients with hemophilia A [DOID:12134]", 
"title": "Identification of recombinant antihemophilic factor", 
"version": "1.0", 
... 

Lookup tables

A database field that holds controlled vocabulary items may be provided with a lookup table of values - names of countries, disease symptoms etc. These lists can be restructured to use key-value pairs instead, where an ontology identifier is a key for each textual value. (Lookup tables may already have numeric keys which can be converted to ontology URI's.)

This can be enhanced for multilingual access by including a language type column. Lookup tables can be driven or augmented by online ontology-sourced multilingual label, synonym and definition content obtained from sources like OLS, Ontobee, Ontofox, and BioPortal.

Example database transformation

table_symptom

patient id symptom date
123 cough 2017/11/20

table_symptom_lookup (used in data entry menus)

label
cough
fever
headache

becomes ...

table_symptom

patient id symptom date
123 HP:0012735 2017/11/20

table_symptom_lookup

id label
HP:0012735 cough
HP:0001945 fever
HP:0002315 headache

If convenient, consider merging all term lookup terms into a single manageable "table_lookup" with id, label, and list name columns. As well, this lookup table could potentially have synonyms and term definitions updated periodically from an ontology lookup service. Note though that often such definitions are geared to the logical definition of a term rather than common use, so a further "user interface definition" that you curate may be appropriate too.

id label language list_name synonyms definition ui_definition
HP:0012735 cough en symptom A sudden, audible expulsion of air from the lungs through a partially closed glottis, preceded by inhalation. A sudden, audible expulsion of air from the lungs.
HP:0003326 myalgia en symptom muscle pain Pain in muscle.

ontofetch.py

We have an ontofetch.py script listed in the scripts/ folder of this repository which will process terms of a given OBOFoundry RDF/XML-formatted ontology (by file or URL address) into json and tabular data output for use in software application categorical variable selection lists for example. Assuming that the ontology's terms are positioned under owl:Thing, this script fetches term id, parent_id, label, definition, deprecated status, term replaced_by id (if any), and a number of other fields, for use in software applications and databases. It is a good place to start to see what an ontology provides for raw terminology in a basic flat format. In the future we will extend this script to include synchronization of a local vocabulary lookup table that would hold terms from more than one source ontology. A discussion piece on basic ontology-driven vocabulary-software is available at: https://docs.google.com/document/d/1H8ch2PX-YzCw1IYv5gXPqA-Qqjc9jb_WggzzpaskkjU/edit?usp=sharing

Record key-value pairs

Contextual data fields may even be distinguished by ontology identifiers which can act as keys for stored values associated with a given record. This works well with semi-structured data, and can be applied to table column names too. In the basic JSON example below, "symptom" has identifier OGMS:0000020, "date of birth" has identifier EFO:0004950

{
    'patient_case_id': 123, 
    'OGMS:0000020': ['HP:0001945', 'HP:0012735'],
    'EFO:0004950': '1998-03-01',
...

The JSON-LD format

Using the Semantic-web friendly JSON-LD format enables a simpler translation of content to "triple store" datasets, but only if additional semantics are taken care of. For example, the OGMS:0000020 "symptom" entity referenced above is defined as a quality of a patient, whereas to support conversion of JSON-LD object content directly to subject-predicate-object triples, an object property like "has quality" or "has symptom" would be a better choice. JSON-LD enables the definition of object properties and attributes. (JSON-LD even allows one to use a "@language" attribute to qualify a text field value as being in a particular language in a way very similar to how RDF allows annotation of @lang).

A simple JSON-LD example leaning heavily on schema.org: https://cedric-dumont.com/2014/12/13/using-ontologies-and-json-ld-to-describe-a-product/. An example showing a simple "toy" ontology itself expressed in JSON-LD: https://gist.github.com/stain/7690362. A paper on using a scientific measurement oriented ontology to describe data points and datasets: SciData: a data model and ontology for semantic representation of scientific data - see figures 14/15/16 for a simplified representation.

Auxiliary records

A given database record can be associated with an auxiliary data structure that captures the ontology terms used to describe that record. This hybrid system allows a traditional RDBM system to be augmented with ontology driven information without having to change the RDBM component design.

The example above pertains to this situation when a traditional relational database record exists for each patient_case_id.

Graph-based database content

The traditional relational database structure is replaced with Resource Description Framework (RDF) "triple store" graph content in which each table, linked record and field are restructured as a graph of nodes, links, and associated values. When an internet-accessible "SPARQL endpoint" is included, this makes federated data querying possible.

An example triple store database engine is https://www.blazegraph.com/ which powers Wikidata.

Hybrid approach

A hybrid solution involves a combination of the above approaches, for example, by compiling a graph-based database on a periodic basis from the content of auxiliary records.

Accessing relational database content as RDF

There are various platforms for providing relational database content in RDF format. A Postgress table field can be an array of values, e.g. a record may have fields and values [name:dan, parents:{ruby,bob}, children:{mary,lou,sandra}] which makes for easier conversion to triples. R2RML is a standard for mapping relational databases to RDF, with various implementations, e.g. r2rml-parser. Ontop provides SPARQL access to a SQL database via configurable R2RML mapping files that convert SPARQL to SQL queries on-the-fly. D2RQ is a tool for accessing relational databases as read-only RDF Graphs. Conversion of tabular data to RDF can also be achieved with Karma.

General design notes

Note that as one tries to mirror the structural component of ontologies (enabled by one's choice of relations that link data together), differences in approach between ontology communities become apparent. There are a variety of upper level ontologies to choose from, each with their own fundamental relations, and other nuances, such as having separate terms for phenomena like 'temperature' and datums like 'temperature datum'.

Ontology-driven data projects usually adopt a philosophy of encouraging either a minimal set of relation types, or a greater number of nuanced relation types. A schema with few relations (e.g. just "subclass of, has part, has quality") is easier to know how to query, but entities have to be represented by more compound parts. On the other hand, allowing a greater number of nuanced relations lessens the number of compound components that an entity needs, but the unique schema relations then must be learned in order to query. For example a person can have a "has birthday" object property which can hold a datetime value. Alternately a person can be associated with a "birth event" which is a subclass of an "Event" class (as are "death event, wedding event, 1st tooth out event etc."), and which has a "has value" data property that holds a datetime value. Each approach entails different logical expressivity or reasoning capability, and may require different data structure conversion steps when integrating with 3rd party systems.