Define JSON-Schema for the OBO YAML metadata format #663

Closed
cmungall opened this issue Jul 2, 2018 · 24 comments
Labels
ontology metadata Issues related to ontology metadata

Comments

@cmungall
Contributor

cmungall commented Jul 2, 2018

We should use JSON-Schema to define what is permissible for the yaml metadata files. We would run a schema validator in Travis-CI. This may still be augmented by procedural checks (e.g. check the tracker URL is not a 404) but the core structure can be verified.

If someone wants to speak up for any alternatives:

  • kwalify @kltm @dougl1sqrd
  • ShEx @balhoff (this would leverage the existing conversion to JSON-LD)
  • ??
@kltm
Contributor

kltm commented Jul 2, 2018

kwalify

  • pros:
    • standard part of ubuntu-based distros
    • easy-to-use syntax
    • already used by various GO projects
    • easy to use with travis
  • cons:
    • not a standard
    • can be hard to get a hold of without fiddling with ruby gems outside of ubuntu/debian
    • fairly simple, not going much beyond regexps and set inclusion

@jamesaoverton
Member

We use kwalify for the PURL system. I was initially quite happy with it, but this nasty issue has undermined my confidence: OBOFoundry/purl.obolibrary.org#290

@kltm
Contributor

kltm commented Jul 3, 2018

@jamesaoverton If I'm reading correctly, I think there may be an easy workaround (OBOFoundry/purl.obolibrary.org#290 (comment))

@beckyjackson
Contributor

Hi @cmungall - I started a first-draft of the actual JSON-schema based on @rvita's initial work on the ontology metadata. I wrote a quick python script to implement it, but I wasn't sure how you wanted to configure this with the other tests that are already being run on Travis.

Also, I wanted to get some feedback on validation of the license field. We could enforce this validation by mapping the URLs of the licenses to their appropriate labels, but that would require maintaining a list of all the licenses. It would be nice, though, to enforce the official label: some groups use https://creativecommons.org/licenses/by/4.0/ with the label CC-BY and others with the label CC-BY 4, whereas the official label on the Creative Commons site is CC BY 4.0. If we do this, there will be many violations of this rule and I'm not sure if that's something we want to enforce... Otherwise, we could just validate that the URL portion of the license is in URI format and that the label exists.
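
A minimal sketch of the two options described in this comment, using only the Python standard library. The function names and the `OFFICIAL_LABELS` table are hypothetical, and the table holds only the one URL/label pair quoted above; a real list would have to be maintained by the project.

```python
from urllib.parse import urlparse

# Hypothetical lookup table: license URL -> official label.
# Only the pair quoted in the comment above is included.
OFFICIAL_LABELS = {
    "https://creativecommons.org/licenses/by/4.0/": "CC BY 4.0",
}


def check_license_lenient(license_obj):
    """Only require an http(s) URI in `url` and a non-empty `label`."""
    problems = []
    url = license_obj.get("url")
    parsed = urlparse(url) if isinstance(url, str) else None
    if parsed is None or parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("license.url is not an http(s) URI")
    if not license_obj.get("label"):
        problems.append("license.label is missing")
    return problems


def check_license_strict(license_obj):
    """Lenient checks, plus: a known URL must carry its official label."""
    problems = check_license_lenient(license_obj)
    expected = OFFICIAL_LABELS.get(license_obj.get("url"))
    if expected is not None and license_obj.get("label") != expected:
        problems.append(f"license.label should be {expected!r}")
    return problems
```

With this sketch, an entry labelled CC-BY for the URL above passes the lenient check but is reported by the strict one, which is the kind of violation the comment expects to be common.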

@cmungall
Contributor Author

I think using a python lib for the json schema validation is a good idea.

We could change the license field to take a URI rather than object as parameter. Some of the initial modeling decisions were driven by what made things easy in the UI within the jekyll framework.

<dt>License</dt>
<dd>
{% if page.license %}
<a href="{{page.license.url}}">{{page.license.label}}</a>
{% if page.license.alert %}
<div class="alert alert-danger" role="alert">
<span class="glyphicon glyphicon-exclamation-sign" aria-hidden="true"></span>
<span class="sr-only">Warning:</span>
{{page.license.alert}}
</div>
{% endif %}
{% else %}
<div class="alert alert-danger" role="alert">
<span class="glyphicon glyphicon-exclamation-sign" aria-hidden="true"></span>
<span class="sr-only">Warning:</span>
Not entered
</div>
{% endif %}
</dd>

{% if ont.license %}
<a href="{{ont.license.url}}"><img width="100px" src="{{ont.license.logo}}" alt="{{ont.license.label}}"/></a>
{% endif %}

But this was probably a poor decision. We could make a one-time change and synchronize it with a UI change in the Jekyll templates. To avoid repetitive code, we could have a YAML object in _config_header.yml that maps URIs to short labels and logos, and use that in the code above.

@jamesaoverton
Member

I think that license should be a controlled vocabulary field with enumerated values: CC0, CC-BY, each with a few versions. It's easier for humans to read/write a label. We can be sticklers about which labels are accepted.

There are some exceptions that we're discussing and have to make some decisions about. They will need URIs and labels for now.

@cmungall
Contributor Author

cmungall commented Sep 5, 2018

So it looks like we are baking the license choices into the JSON schema. I can see some advantages to this, but it feels like we are conflating syntactic/structural schema conformance with principle conformance.

Some reasons we may want to separate:

  • we may want to grandfather in some ontologies that have other licenses
  • obsolete ontologies may not have compatible licenses

Is it possible to include simple logic in a JSON schema? I would be in favor of restricting licenses at the schema level if it were possible to say something like IF obo_conformant=true THEN license in [CC-BY, CC-0].

This also goes for other kinds of principle conformance checks - e.g. checking the code has documented users.

Also I just remembered that over a year ago I wrote this:
https://github.com/OBOFoundry/OBOFoundry.github.io/blob/master/util/auto-foundry-check.py

To address the auto-review proposed in #288

This is largely supplanted by the json-schema in #710. However, it also checks the usage field and reports if the ontology has no documented usages. I feel this is something we want to check at the principle-conformance level rather than structure-conformance level, so some additional dumb python on top of the schema checking may be useful
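
As a rough illustration of the "additional dumb python" layer, the sketch below reports entries with no documented usages. It is not the logic of auto-foundry-check.py; the function name and the `id` and `usages` keys are assumptions made for the example.

```python
def report_missing_usages(ontologies):
    """Return the ids of registry entries that document no usages.

    `ontologies` is a list of dicts, one per registry entry, as they
    might look after loading the YAML metadata. The `id` and `usages`
    keys are assumed names.
    """
    missing = []
    for ont in ontologies:
        # A missing key and an empty list are both treated as "no usages".
        if not ont.get("usages"):
            missing.append(ont.get("id", "<no id>"))
    return missing
```

A check like this would run after the schema validation, so that structural errors and principle-level warnings can be reported separately.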

@beckyjackson
Contributor

I'm not set on the licenses in the schema - I like James's idea of enforcing what labels are accepted, though. JSON schema includes if/then statements, so we could do something like that. I like that this handles the non-conforming ontologies while still enforcing the CC-BY/CC-0 licenses going forward (as James said, by enforcing which labels are accepted).
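
To make the if/then idea concrete, here is a sketch of such a rule written as a Python dict, using the `if` and `then` keywords from JSON Schema draft-07. The `obo_conformant` flag is the hypothetical one from the earlier comment, and `license` is simplified to a plain label. The `is_valid` function is only a tiny stand-in that understands the handful of keywords used here, so the example runs without extra packages; a real setup would hand the schema to a proper validator library.

```python
# Schema fragment as a Python dict. `obo_conformant` is a hypothetical flag.
SCHEMA = {
    "if": {
        "properties": {"obo_conformant": {"const": True}},
        "required": ["obo_conformant"],
    },
    "then": {
        "properties": {"license": {"enum": ["CC-BY", "CC-0"]}},
        "required": ["license"],
    },
}


def is_valid(instance, schema):
    """Stand-in evaluator: const, enum, required, properties, if/then only."""
    if "const" in schema and instance != schema["const"]:
        return False
    if "enum" in schema and instance not in schema["enum"]:
        return False
    if isinstance(instance, dict):
        if any(key not in instance for key in schema.get("required", [])):
            return False
        for key, subschema in schema.get("properties", {}).items():
            if key in instance and not is_valid(instance[key], subschema):
                return False
    # The `then` branch is only enforced when the `if` branch matches.
    if "if" in schema and is_valid(instance, schema["if"]):
        if not is_valid(instance, schema.get("then", {})):
            return False
    return True
```

An entry without the flag, or with it set to false, is not constrained by the license list, which is how grandfathered or obsolete ontologies would pass.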

@kltm
Contributor

kltm commented Sep 5, 2018

This may be of interest:
https://github.com/reusabledata/reusabledata/blob/master/scripts/source.schema.yaml#L132
For other projects, we've used SPDX (plus a couple of custom terms), but I'd be interested if there were other standards here.

@beckyjackson
Contributor

As we've been reviewing how this test is working out, we realized that there are some ontologies that may not necessarily conform to all the checks but that we want to "grandfather" in.

One solution to this is to have different validation levels, as a tag in the metadata like is_obsolete. @jamesaoverton and I were thinking of using level 1 and level 2, where 1 is the "lite" version of the schema and 2 is the "full" version. Any grandfathered ontologies could be marked with validation_level: 1 and any ontology without a tag will default to level 2. Going forward, all new ontologies would need to meet the level 2 validation.

With the numbers, it would also be easy to add in increasingly strict validation by adding another level.

The full validation would be the current schema plus a required github tag for the contact. The lite (or basic, whatever we want to call it) validation would not require contact.github and would not check for allowed license values, only that license.url and license.label exist.

This is kind of similar to what @cmungall was saying with the obo_conformant tag, but this is just marking the "grandfathered" ontologies.
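
The two levels could be sketched roughly as follows. This is plain Python rather than the actual schema files, the function names are made up for the example, and `ALLOWED_LICENSE_LABELS` holds placeholder values rather than the project's real list.

```python
# Placeholder values, not the project's real list of accepted labels.
ALLOWED_LICENSE_LABELS = {"CC-BY", "CC-0"}


def validation_level(ontology):
    """Entries without a validation_level tag default to the full level, 2."""
    return ontology.get("validation_level", 2)


def validate(ontology):
    """Return a list of problems for one registry entry at its declared level."""
    problems = []
    license_obj = ontology.get("license") or {}
    # Level 1 ("lite"): only require that license.url and license.label exist.
    for field in ("url", "label"):
        if not license_obj.get(field):
            problems.append(f"license.{field} is missing")
    if validation_level(ontology) >= 2:
        # Level 2 ("full"): also restrict the label and require contact.github.
        label = license_obj.get("label")
        if label and label not in ALLOWED_LICENSE_LABELS:
            problems.append(f"license.label {label!r} is not allowed")
        if not (ontology.get("contact") or {}).get("github"):
            problems.append("contact.github is missing")
    return problems
```

Because the comparison is `>= 2`, a stricter level 3 could later be added as another block without touching the existing ones.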

@cmungall
Contributor Author

cmungall commented Feb 3, 2020

Do we plan to extend the schema files to cover the entire contents of the YAML? Currently there are mini-schemas for a subset of fields, which is a great start, but bad metadata can still sneak through, e.g. #1107. I can help with this; just checking that this was the plan.

@jamesaoverton
Member

My assumption was that our metadata is open-ended/open-world, like most semantic web stuff. So certain fields must conform to the schema, but we ignore stuff that we don't recognize. Only recognized fields will end up on the HTML pages, but I guess everything is in the RDF/Turtle/JSONLD versions.

It should be easy enough to enforce a whitelist of fields, if we want. There are advantages to that.
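
A whitelist check can be as small as a set difference. In the sketch below, `KNOWN_FIELDS` is an illustrative subset (the `id` and `title` names are assumptions, not the full registry field list) and the function name is hypothetical. In a JSON schema, the comparable mechanism is setting `additionalProperties` to false next to the `properties` list.

```python
# Illustrative subset only; not the complete list of registry fields.
KNOWN_FIELDS = {"id", "title", "license", "contact", "is_obsolete", "activity_status"}


def unknown_fields(entry, known=KNOWN_FIELDS):
    """Return the top-level keys of `entry` that are not on the whitelist."""
    return sorted(set(entry) - set(known))
```

One advantage of the closed-world approach is that misspelled keys (for example `licence`) are reported instead of being silently ignored.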

@cmungall
Contributor Author

cmungall commented Feb 4, 2020 via email

@jamesaoverton
Member

Ok, that's fine with me.

@jamesaoverton
Member

On #789 @althonos noticed that I dropped the activity_status field, which had been on all entries. But util/validate-metadata.py just quits once it sees is_obsolete: true, so this was not detected. The script should run some of the schemas on obsolete ontologies. It would be nice to cleanly connect this to #1126.

@jamesaoverton
Member

@althonos also pointed me to his Rust library for working with OBO registry entries.

Hey @althonos, is this a complete list of all the fields used across all OBO registry entries?

https://github.com/althonos/obofoundry.rs/blob/master/src/lib.rs#L140

@jamesaoverton
Member

Somewhat related to validation: I'd kinda like to lint the registry files for consistent formatting and key order.
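
A key-order lint could look something like the sketch below. `PREFERRED_ORDER` and the function name are hypothetical, and the check assumes the entry was loaded into a dict that preserves the order of keys in the file (Python dicts keep insertion order from 3.7 on).

```python
# Hypothetical canonical order; keys not listed here are ignored by the lint.
PREFERRED_ORDER = ["id", "title", "contact", "license", "activity_status"]


def keys_in_order(entry, preferred=PREFERRED_ORDER):
    """True if the known keys of `entry` appear in the preferred order."""
    rank = {key: position for position, key in enumerate(preferred)}
    ranks = [rank[key] for key in entry if key in rank]
    return ranks == sorted(ranks)
```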

@althonos
Member

Hey @althonos, is this a complete list of all the fields used across all OBO registry entries?

Yes, I extracted that as-is from the ontologies.jsonld, since the library I used for parsing reports any unknown field, so I did the following loop to come to that list:

I have a CI setup which reruns on http://www.obofoundry.org/registry/ontologies.jsonld every day, so that I can experimentally check if something changed in the schema, but until #789 it was quite stable.

@jamesaoverton
Member

Thanks @althonos, that's useful information.

I just regenerated the ontologies.* files, so your next CI run should pass.

@jamesaoverton
Member

Adding to my comment above #663 (comment) and #1126 (comment)

I'd like the JSON schema files to include some sort of 'applies_to' field that will let us figure out when that schema applies to an ontology.

I'm torn between forcing a simple ordered list (foundry, active, inactive, orphaned, obsolete) and allowing some sort of mix-and-match. This is connected to long-term plans about using the dashboard information to classify ontologies.
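
The two alternatives might be sketched as below. The semantics of the ordered variant are only one possible reading (a schema tagged with a status applies to that status and to every status listed before it), and both function names are made up for the example.

```python
# Status names in the order given in the comment above.
STATUS_ORDER = ["foundry", "active", "inactive", "orphaned", "obsolete"]


def applies_ordered(applies_to, status):
    """Ordered-list reading: `applies_to` is a single status name.

    Assumed semantics: the schema applies to that status and to every
    status that comes earlier in STATUS_ORDER.
    """
    return STATUS_ORDER.index(status) <= STATUS_ORDER.index(applies_to)


def applies_mixed(applies_to, status):
    """Mix-and-match reading: `applies_to` is an explicit list of statuses."""
    return status in applies_to
```

Under the ordered reading, a schema tagged `obsolete` would run on every ontology, which would also cover the case where obsolete entries currently skip validation entirely.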

@cmungall
Contributor Author

cmungall commented Feb 28, 2020 via email

@nlharris nlharris added the ontology metadata Issues related to ontology metadata label Jul 14, 2020
@nlharris
Contributor

just wondering if there is any change in status here

@matentzn
Contributor

I feel the essence of this ticket has been achieved and new issues should get new tickets. Closing this now.

7 participants