Define JSON-Schema for the OBO YAML metadata format #663

cmungall · 2018-07-02T22:59:33Z

We should use JSON-Schema to define what is permissible for the yaml metadata files. We would run a schema validator in Travis-CI. This may still be augmented by procedural checks (e.g. check the tracker URL is not a 404) but the core structure can be verified.

If someone wants to speak up for any alternatives:

kwalify @kltm @dougl1sqrd
ShEx @balhoff (this would leverage the existing conversion to JSON-LD)
??

kltm · 2018-07-02T23:05:00Z

kwalify

pros:
- standard part of ubuntu-based distros
- easy-to-use syntax
- already used by various GO projects
- easy to use with travis
cons:
- not a standard
- can be hard to get a hold of without fiddling with ruby gems outside of ubuntu/debian
- fairly simple, not going much beyond regexps and set inclusion

jamesaoverton · 2018-07-03T01:02:03Z

We use kwalify for the PURL system. I was initially quite happy with it, but this nasty issue has undermined my confidence: OBOFoundry/purl.obolibrary.org#290

kltm · 2018-07-03T01:15:01Z

@jamesaoverton If I'm reading correctly, I think there may be an easy workaround (OBOFoundry/purl.obolibrary.org#290 (comment))

beckyjackson · 2018-08-21T20:32:38Z

Hi @cmungall - I started a first-draft of the actual JSON-schema based on @rvita's initial work on the ontology metadata. I wrote a quick python script to implement it, but I wasn't sure how you wanted to configure this with the other tests that are already being run on Travis.

Also, I wanted to get some feedback on validation of the license field. We could enforce this validation by mapping the URLs of the licenses to their appropriate labels, but that would require maintaining a list of all the licenses. It would be nice, though, to enforce the official label: some groups use https://creativecommons.org/licenses/by/4.0/ with the label CC-BY and others with the label CC-BY 4, whereas the official label on the Creative Commons site is CC BY 4.0. If we do this, there will be many violations of this rule, and I'm not sure that's something we want to enforce. Otherwise, we could just validate that the URL portion of the license is in URI format and that the label exists.

cmungall · 2018-08-21T21:20:51Z

I think using a python lib for the json schema validation is a good idea.

We could change the license field to take a URI rather than object as parameter. Some of the initial modeling decisions were driven by what made things easy in the UI within the jekyll framework.

OBOFoundry.github.io/_layouts/ontology_detail.html

Lines 125 to 143 in 27c66d3

    
    <dt>License</dt>
    <dd>
      {% if page.license %}
      <a href="{{page.license.url}}">{{page.license.label}}</a>
      {% if page.license.alert %}
      <div class="alert alert-danger" role="alert">
        <span class="glyphicon glyphicon-exclamation-sign" aria-hidden="true"></span>
        <span class="sr-only">Warning:</span>
        {{page.license.alert}}
      </div>
      {% endif %}
      {% else %}
      <div class="alert alert-danger" role="alert">
        <span class="glyphicon glyphicon-exclamation-sign" aria-hidden="true"></span>
        <span class="sr-only">Warning:</span>
        Not entered
      </div>
      {% endif %}
    </dd>

OBOFoundry.github.io/_includes/ontology_table.html

Lines 25 to 27 in d774c1e

    
    {% if ont.license %}
    <a href="{{ont.license.url}}"><img width="100px" src="{{ont.license.logo}}" alt="{{ont.license.label}}"/></a>
    {% endif %}

But this was probably a poor decision. We could make a one-time change and synchronize it with a UI change in the jekyll templates. To avoid repetitive code, we could have a yaml object in _config_header.yml that maps URIs to short labels and logos, and use that in the code above.
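
A rough sketch of the lookup idea, written in Python only for illustration (the real version would be a yaml object consumed by the jekyll templates). The logo paths and function names are hypothetical; the two labels are the ones Creative Commons uses for these URIs.

```python
# Hypothetical lookup from license URI to display metadata.
# The logo paths are placeholders, not files known to exist in the repo.
LICENSES = {
    "https://creativecommons.org/licenses/by/4.0/": {
        "label": "CC BY 4.0",
        "logo": "/images/cc-by.png",  # placeholder path
    },
    "https://creativecommons.org/publicdomain/zero/1.0/": {
        "label": "CC0 1.0",
        "logo": "/images/cc-zero.png",  # placeholder path
    },
}


def license_display(uri):
    """Return the label/logo record for a license URI, or None if unknown."""
    return LICENSES.get(uri)


def render_license_link(uri):
    """Build the license link; unknown URIs fall back to the URI as link text."""
    info = license_display(uri)
    label = info["label"] if info else uri
    return '<a href="{}">{}</a>'.format(uri, label)
```

With this shape, a metadata file would only need to carry the URI, and the label shown on the site would always be the one from the table.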

jamesaoverton · 2018-08-22T12:54:39Z

I think that license should be a controlled vocabulary field with enumerated values: CC0, CC-BY, each with a few versions. It's easier for humans to read/write a label. We can be sticklers about which labels are accepted.

There are some exceptions that we're discussing and have to make some decisions about. They will need URIs and labels for now.

cmungall · 2018-09-05T01:51:51Z

So it looks like we are baking the license choices into the JSON schema. I can see some advantages to this, but it feels like we are conflating syntactic/structural schema conformance with principle conformance.

Some reasons we may want to separate:

- we may want to grandfather in some ontologies that have other licenses
- obsolete ontologies may not have compatible licenses

Is it possible to include simple logic in a JSON schema? I would be in favor of restricting licenses at the schema level if it were possible to say something like IF obo_conformant=true THEN license in [CC-BY, CC-0]

This also goes for other kinds of principle conformance checks - e.g. checking the code has documented users.

Also I just remembered that over a year ago I wrote this:
https://github.com/OBOFoundry/OBOFoundry.github.io/blob/master/util/auto-foundry-check.py

To address the auto-review proposed in #288

This is largely supplanted by the json-schema in #710. However, it also checks the usage field and reports if the ontology has no documented usages. I feel this is something we want to check at the principle-conformance level rather than structure-conformance level, so some additional dumb python on top of the schema checking may be useful
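
As a minimal sketch of the kind of "dumb python" layered on top of schema checking: the function below only reports entries with no documented usages. It is not the logic of auto-foundry-check.py; the key names `id` and `usages` and the function name are assumptions made for illustration.

```python
def report_missing_usages(entries):
    """Return the ids of entries that document no usages.

    `entries` is a list of dicts, one per ontology, as they might look
    after loading the yaml metadata. The "id" and "usages" keys are
    assumed names, not confirmed by this thread.
    """
    missing = []
    for entry in entries:
        # A missing key and an empty list both count as "no documented usages".
        if not entry.get("usages"):
            missing.append(entry.get("id", "<no id>"))
    return missing
```

Because this returns a report rather than raising, it could be run as a warning step that does not fail the build, keeping principle conformance separate from structural validation.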

beckyjackson · 2018-09-05T13:37:02Z

I'm not set on the licenses in the schema, though I like James's idea of enforcing which labels are accepted. JSON Schema includes if/then statements, so we could do something like that. I like that this solves the issue of non-conforming ontologies while still enforcing the CC-BY/CC-0 licenses going forward (as James said, enforcing the accepted labels).
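
To make the if/then idea concrete, here is a sketch of what such a rule might look like, written as a Python dict. The `obo_conformant` flag is the hypothetical tag from the comment above, not an existing field. The `check` function is a deliberately tiny stand-in that understands only the keywords used here (`const`, `enum`, `required`, `properties`, `if`, `then`); a real setup would hand the same schema to a proper JSON Schema validator library instead.

```python
# Conditional rule: IF obo_conformant is true THEN license.label must be
# one of the accepted values. "required" inside "if" matters: without it,
# an entry lacking the flag would satisfy "if" vacuously.
LICENSE_RULE = {
    "if": {
        "properties": {"obo_conformant": {"const": True}},
        "required": ["obo_conformant"],
    },
    "then": {
        "required": ["license"],
        "properties": {
            "license": {
                "required": ["label"],
                "properties": {"label": {"enum": ["CC-BY", "CC-0"]}},
            }
        },
    },
}


def check(instance, schema):
    """Evaluate a small subset of JSON Schema keywords (illustration only)."""
    if "const" in schema and instance != schema["const"]:
        return False
    if "enum" in schema and instance not in schema["enum"]:
        return False
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                return False
        # "properties" constrains only the keys that are actually present.
        for key, subschema in schema.get("properties", {}).items():
            if key in instance and not check(instance[key], subschema):
                return False
    # "then" is applied only when the instance satisfies "if".
    if "if" in schema and check(instance, schema["if"]):
        if "then" in schema and not check(instance, schema["then"]):
            return False
    return True
```

Entries that omit the flag or set it to false pass regardless of license, which is how grandfathered and obsolete ontologies would be let through under this approach.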

kltm · 2018-09-05T20:16:12Z

This may be of interest:
https://github.com/reusabledata/reusabledata/blob/master/scripts/source.schema.yaml#L132
For other projects, we've used SPDX (plus a couple of custom terms), but I'd be interested if there were other standards here.

beckyjackson · 2018-09-17T13:01:57Z

As we've been reviewing how this test is working out, we realized that there are some ontologies that may not conform to all the checks but that we still want to "grandfather" in.

One solution to this is to have different validation levels, as a tag in the metadata like is_obsolete. @jamesaoverton and I were thinking of using level 1 and level 2, where 1 is the "lite" version of the schema and 2 is the "full" version. Any grandfathered ontologies could be marked with validation_level: 1 and any ontology without a tag will default to level 2. Going forward, all new ontologies would need to meet the level 2 validation.

With the numbers, it would also be easy to add in increasingly strict validation by adding another level.

The full validation would be the current schema, plus a required github tag for the contact. The lite (or basic, whatever we want to call it) validation would not require contact.github and would not check for allowed values for licenses, only that license.url and license.label exist.

This is kind of similar to what @cmungall was saying with the obo_conformant tag, but this is just marking the "grandfathered" ontologies.
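
The two-level proposal could be sketched roughly as below. This uses hand-written checks rather than the actual schema files, and the accepted label set and function name are illustrative assumptions; only `validation_level`, `license.url`, `license.label`, and `contact.github` come from the description above.

```python
ALLOWED_LICENSE_LABELS = {"CC-BY", "CC-0"}  # illustrative set, not the real list


def validate_entry(entry):
    """Return a list of error strings for one ontology metadata entry."""
    # Entries without a tag default to the full (level 2) validation.
    level = entry.get("validation_level", 2)
    errors = []
    license_info = entry.get("license") or {}

    # Level 1 ("lite"): license.url and license.label only need to exist.
    for key in ("url", "label"):
        if not license_info.get(key):
            errors.append("license.{} is missing".format(key))

    if level >= 2:
        # Level 2 ("full"): restrict the label and require contact.github.
        label = license_info.get("label")
        if label and label not in ALLOWED_LICENSE_LABELS:
            errors.append("license.label {!r} is not an accepted value".format(label))
        if not (entry.get("contact") or {}).get("github"):
            errors.append("contact.github is missing")
    return errors
```

Using `level >= 2` rather than `level == 2` means a stricter level 3 could be added later by appending another block, in line with the idea of increasingly strict levels.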

cmungall · 2020-02-03T21:52:05Z

Do we plan to extend the schema files to cover the entire contents of the yaml? Currently there are mini-schemas for a subset of fields, which is a great start, but bad metadata can still sneak through, e.g. #1107. I can help with this; just checking that this was the plan.

jamesaoverton · 2020-02-04T14:42:35Z

My assumption was that our metadata is open-ended/open-world, like most semantic web stuff. So certain fields must conform to the schema, but we ignore stuff that we don't recognize. Only recognized fields will end up on the HTML pages, but I guess everything is in the RDF/Turtle/JSONLD versions.

It should be easy enough to enforce a whitelist of fields, if we want. There are advantages to that.
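
A whitelist check could be as small as the sketch below. In JSON Schema terms this corresponds to setting `additionalProperties` to false on the top-level object. The field set here is illustrative, not the full list of registry fields.

```python
# Illustrative whitelist; the real list would come from the schema files.
KNOWN_FIELDS = {"id", "title", "license", "contact", "is_obsolete", "activity_status"}


def unknown_fields(entry, known=KNOWN_FIELDS):
    """Return the keys of `entry` that are not on the whitelist, sorted."""
    return sorted(set(entry) - set(known))
```

One practical benefit is that misspelled keys get reported instead of being silently ignored.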

cmungall · 2020-02-04T15:16:12Z

For the OBO-managed metadata I think being stricter is better. It's easy for us to extend the schema if we need new fields. FWIW, on other projects we have been moving towards stricter closed ShEx schemas, as the open-ended nature bites us when we try to build robust software around our semantic web infrastructure... but ymmv

jamesaoverton · 2020-02-05T15:27:42Z

Ok, that's fine with me.

jamesaoverton · 2020-02-28T12:55:49Z

On #789 @althonos noticed that I dropped the activity_status field, which had been on all entries. But util/validate-metadata.py just quits once it sees is_obsolete: true, so this was not detected. The script should run some of the schemas on obsolete ontologies. It would be nice to cleanly connect this to #1126.

jamesaoverton · 2020-02-28T13:00:33Z

@althonos also pointed me to his Rust library for working with OBO registry entries.

Hey @althonos, is this a complete list of all the fields used across all OBO registry entries?

https://github.com/althonos/obofoundry.rs/blob/master/src/lib.rs#L140

jamesaoverton · 2020-02-28T13:33:43Z

Somewhat related to validation: I'd kinda like to lint the registry files for consistent formatting and key order.
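
For key order, a lint could be sketched as follows, assuming the yaml loader hands back keys in file order. The canonical order and function names are hypothetical.

```python
def expected_order(keys, canonical):
    """Sort keys by their position in `canonical`; unknown keys go last, alphabetically."""
    rank = {key: index for index, key in enumerate(canonical)}
    return sorted(keys, key=lambda k: (rank.get(k, len(canonical)), k))


def is_consistently_ordered(keys, canonical):
    """True if `keys` already appear in the expected order."""
    return list(keys) == expected_order(keys, canonical)
```

For example, with a canonical order of `["id", "title", "contact", "license"]`, a file whose keys start with `license` followed by `id` would be flagged.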

althonos · 2020-02-28T13:43:21Z

Hey @althonos, is this a complete list of all the fields used across all OBO registry entries?

Yes, I extracted that as-is from the ontologies.jsonld, since the library I used for parsing reports any unknown field, so I did the following loop to come to that list:

- run the parser on http://www.obofoundry.org/registry/ontologies.jsonld
- add any unknown missing field to the list, making them mandatory
- make any known missing field optional
- repeat

I have a CI setup which reruns on http://www.obofoundry.org/registry/ontologies.jsonld every day, so that I can experimentally check if something changed in the schema, but until #789 it was quite stable.
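
The loop described above should end up at roughly this result: a field is mandatory if every entry has it and optional otherwise. Here is a Python sketch of that end state, computed in one pass rather than by rerunning a parser. It is not the Rust library's code, and the function name is made up.

```python
def classify_fields(entries):
    """Split observed fields into (mandatory, optional), both sorted.

    A field is mandatory if it appears in every entry, optional if it
    appears in at least one entry but not all of them.
    """
    entries = list(entries)
    all_fields = set()
    for entry in entries:
        all_fields.update(entry)
    mandatory = {field for field in all_fields if all(field in e for e in entries)}
    return sorted(mandatory), sorted(all_fields - mandatory)
```

Note that this only describes what the data currently contains; a field that is mandatory here may simply be one that nobody has omitted yet.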

jamesaoverton · 2020-02-28T13:50:46Z

Thanks @althonos, that's useful information.

I just regenerated the ontologies.* files, so your next CI run should pass.

jamesaoverton · 2020-02-28T14:47:51Z

Adding to my comment above #663 (comment) and #1126 (comment)

I'd like the JSON schema files to include some sort of 'applies_to' field that will let us figure out when that schema applies to an ontology.

I'm torn between forcing a simple ordered list (foundry, active, inactive, orphaned, obsolete) and allowing some sort of mix-and-match. This is connected to long-term plans about using the dashboard information to classify ontologies.
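
A sketch of the mix-and-match variant, in which each schema lists the statuses it applies to. The schema records, the `status_of` helper, and the default of "active" for entries without an activity_status are all assumptions for illustration.

```python
# Hypothetical schema records; a real 'applies_to' would live in the schema files.
SCHEMAS = [
    {"name": "license", "applies_to": ["active", "inactive", "orphaned"]},
    {"name": "activity_status", "applies_to": ["active", "inactive", "orphaned", "obsolete"]},
]


def status_of(entry):
    """Derive a single status for an entry; is_obsolete takes precedence."""
    if entry.get("is_obsolete"):
        return "obsolete"
    # Assumption: entries without activity_status are treated as active.
    return entry.get("activity_status", "active")


def schemas_for(entry, schemas=SCHEMAS):
    """Return the names of the schemas that apply to this entry."""
    status = status_of(entry)
    return [schema["name"] for schema in schemas if status in schema["applies_to"]]
```

With this arrangement, obsolete ontologies would still be checked against the schemas that list "obsolete", instead of being skipped entirely.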

cmungall · 2020-02-28T21:56:42Z

My inclination would be to keep foundry status orthogonal to the other categories.

nlharris · 2022-05-24T00:43:47Z

just wondering if there is any change in status here

matentzn · 2022-05-24T05:01:45Z

I feel the essence of this ticket has been achieved, and new issues should get new tickets. Closing this now.

jamesaoverton assigned beckyjackson Aug 16, 2018

beckyjackson mentioned this issue Aug 31, 2018

Add metadata validation test #710

Closed

jamesaoverton mentioned this issue Feb 28, 2020

obsolete mirnao #789

Merged

nlharris added the "ontology metadata" label (Issues related to ontology metadata) Jul 14, 2020

matentzn closed this as completed May 24, 2022
