Support conversion of JATS format into DoclingDocument #893

Closed
ceberam opened this issue Feb 5, 2025 · 1 comment · Fixed by #967
Labels: enhancement (New feature or request), xml (issue related to supported schema-specific XML formats)

Comments

ceberam commented Feb 5, 2025

Requested feature

The Journal Article Tag Suite (JATS) is a common XML format that publishers and archives use to exchange journal content. JATS provides a set of XML elements and attributes for describing the textual and graphical content of journal articles, as well as some non-article material such as letters, editorials, and book and product reviews.

Several publishers and archives distribute documents in structured XML according to JATS, including PubMed Central, preprint repositories such as bioRxiv and medRxiv, and PLOS journals.

Currently, docling supports the conversion of PubMed Central articles, as described in the Supported formats section, but the backend may need to be refactored to generalize to other JATS articles and to the current version of the standard, 1.4.

This feature request is about extending docling conversion to any structured document in JATS format, for instance by generalizing the current backend for PubMed Central documents.
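
For illustration, the kind of publisher-agnostic traversal this implies can be sketched with Python's standard library. This is only a minimal sketch, not docling's backend: the sample document and the helper name `outline_jats` are hypothetical, and just a handful of JATS elements (`article-title`, `abstract`, `sec`, `title`) are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written JATS fragment (not taken from a real article).
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>A sample article</article-title>
      </title-group>
      <abstract><p>Short abstract.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>First paragraph.</p>
    </sec>
  </body>
</article>"""


def outline_jats(xml_text: str) -> dict:
    """Collect the title, abstract, and top-level section titles (hypothetical helper)."""
    root = ET.fromstring(xml_text)  # root is the <article> element
    title = root.find("./front/article-meta/title-group/article-title")
    abstract = root.find("./front/article-meta/abstract")
    return {
        "title": "".join(title.itertext()).strip() if title is not None else "",
        # Collapse whitespace so nested <p> elements read as one string.
        "abstract": (
            " ".join("".join(abstract.itertext()).split())
            if abstract is not None
            else ""
        ),
        "sections": [
            "".join(t.itertext()).strip() for t in root.findall("./body/sec/title")
        ],
    }


result = outline_jats(SAMPLE_JATS)
print(result)
```

Because the paths rely only on element names defined by JATS, the same traversal should apply to any source that follows the tag suite, which is the generalization being requested here.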

Alternatives

Since a JATS parsing implementation already exists in docling, there is no alternative with lower effort.

ceberam added the enhancement label Feb 5, 2025
ceberam self-assigned this Feb 5, 2025
ceberam added the xml label Feb 5, 2025

ceberam commented Feb 10, 2025

Some enhancements/fixes will also need to be addressed:

  • Support lists
  • Respect reading order
  • Add a space when removing line breaks
  • Include other sections (back matter)
  • Separate authors and affiliations, as they are typically rendered in PMC
  • Support equations in blocks and in tables
  • Fix missing text in table captions
  • Handle elements that may appear in different parts of the document (e.g., ref-list)
  • Fix parsing errors (e.g., in citations when the etal tag is present)
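
Two of the items above (adding a space when removing line breaks, and citations containing an `etal` tag) can be sketched as follows. This is a minimal illustration using only the standard library, not the actual fix: the helper names `flatten_text` and `citation_authors` and both XML fragments are hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten_text(element: ET.Element) -> str:
    """Join all text in an element, turning line breaks into single spaces."""
    # str.split() with no arguments splits on any whitespace run (including
    # newlines), so re-joining with " " never glues two words together.
    return " ".join("".join(element.itertext()).split())


def citation_authors(person_group: ET.Element) -> list[str]:
    """List author names; an <etal/> child becomes a trailing "et al."."""
    authors = []
    for child in person_group:
        if child.tag == "name":
            surname = child.findtext("surname", default="")
            given = child.findtext("given-names", default="")
            authors.append(f"{surname} {given}".strip())
        elif child.tag == "etal":
            # <etal/> carries no name, so it is handled explicitly
            # instead of being parsed like a <name> element.
            authors.append("et al.")
    return authors


# Hypothetical fragments for illustration only.
paragraph = ET.fromstring("<p>Results were\nconsistent across\n  all samples.</p>")
group = ET.fromstring(
    '<person-group person-group-type="author">'
    "<name><surname>Doe</surname><given-names>J</given-names></name>"
    "<etal/>"
    "</person-group>"
)

print(flatten_text(paragraph))
print(citation_authors(group))
```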
