The key component in the resource ingestion pipeline is PDF parsing. There are plenty of libraries one could use to implement this including:
This is an exploratory issue to kick off investigations into the different libraries. Feel free to create sub-issues exploring specific libraries. Be mindful of the specific JSON fields and metadata we may want to include (corresponding to the Twiga database schemas), such as page numbers and chapter numbers. Intelligent chunking strategies are also a very relevant topic to explore.
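To make the metadata discussion concrete, here is a minimal sketch of what a chunk record could look like. The field names (`page_number`, `chapter_number`, `chunk_index`) and the `chunk_pages` helper are hypothetical and are not taken from the Twiga schemas; the chunking rule (pack paragraphs up to a character budget, never crossing a page) is just one simple baseline, not a recommended strategy.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    # Hypothetical fields; align these with the real Twiga schema.
    text: str
    page_number: int
    chapter_number: Optional[int]
    chunk_index: int


def chunk_pages(
    pages: List[Tuple[int, str]],
    max_chars: int = 500,
    chapter_number: Optional[int] = None,
) -> List[Chunk]:
    """Pack paragraphs into chunks of at most ~max_chars, one page at a time.

    `pages` is a list of (page_number, extracted_text) pairs, as any PDF
    parser could produce. A single paragraph longer than max_chars is kept
    whole rather than being cut mid-sentence.
    """
    chunks: List[Chunk] = []
    for page_number, text in pages:
        buffer = ""
        for para in (p.strip() for p in text.split("\n\n")):
            if not para:
                continue
            candidate = f"{buffer}\n\n{para}" if buffer else para
            if buffer and len(candidate) > max_chars:
                # Budget exceeded: flush what we have and start a new chunk.
                chunks.append(Chunk(buffer, page_number, chapter_number, len(chunks)))
                buffer = para
            else:
                buffer = candidate
        if buffer:
            chunks.append(Chunk(buffer, page_number, chapter_number, len(chunks)))
    return chunks


def to_json(chunks: List[Chunk]) -> str:
    """Serialize chunks to a JSON array ready for ingestion."""
    return json.dumps([asdict(c) for c in chunks], ensure_ascii=False, indent=2)
```

Keeping chunks inside page boundaries makes `page_number` unambiguous, at the cost of splitting paragraphs that run across pages; a sub-issue could explore chapter-aware or semantic alternatives.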
Helpful resources & inspiration:
Sample PDFs:
I recommend testing your PDF parser with TIE resources, e.g. the Geography Form Two textbook. For faster development, you may want to trim the PDF down to roughly 10-20 pages.
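One possible way to cut a small development sample out of a textbook is sketched below. It assumes the third-party `pypdf` package is installed (any library with page-level access would do), and the file names in the usage comment are placeholders, not real paths in the repository.

```python
def clamp_page_range(total_pages: int, start: int, count: int) -> range:
    """Return the 0-based page indices to keep, clamped to the document."""
    if total_pages <= 0 or count <= 0:
        raise ValueError("total_pages and count must be positive")
    start = max(0, min(start, total_pages - 1))
    stop = min(start + count, total_pages)
    return range(start, stop)


def extract_pages(src_path: str, dst_path: str, start: int = 0, count: int = 15) -> int:
    """Copy `count` pages beginning at 0-based index `start` into a new PDF.

    Returns the number of pages written.
    """
    # Imported lazily so the helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    indices = clamp_page_range(len(reader.pages), start, count)
    for i in indices:
        writer.add_page(reader.pages[i])
    with open(dst_path, "wb") as f:
        writer.write(f)
    return len(indices)


# Example (placeholder file names):
# extract_pages("geography_form_two.pdf", "geography_sample.pdf", start=20, count=15)
```

Picking a range that starts a few pages into a chapter, rather than at the front matter, tends to give the parser more representative content (headings, figures, body text) to work with.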