try unstructuredloader to parse a geography textbook #4

Open
jurmy24 opened this issue Feb 12, 2025 · 2 comments · May be fixed by #8
Labels
good first issue Good for newcomers

Comments

@jurmy24
Member

jurmy24 commented Feb 12, 2025

Steps:

  1. Create a pdf-loader folder in the repo and add an unstructured_loader.py file
  2. Write the file, following unstructured's docs, so that it takes a PDF as input and outputs a markdown document (if that's the format they suggest)

Sidenote: you should create a folder called data/ in your repo and store the actual PDF there so that the code can retrieve it. You should then save the result as a markdown document in the same folder. If git starts tracking it, add data/ to the .gitignore. It might already be there, though.
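
As a rough starting point, the steps above could be sketched as follows. This is only a sketch under assumptions, not the approach the unstructured docs necessarily recommend: it assumes `partition_pdf` from `unstructured.partition.pdf` and the `category` and `text` attributes on the returned elements. The heading level, the list handling, and the names `elements_to_markdown`, `pdf_to_markdown`, and `DATA_DIR` are made up for illustration.

```python
from pathlib import Path

# Hypothetical location: the git-ignored data/ folder from the sidenote.
DATA_DIR = Path("data")


def elements_to_markdown(elements):
    """Render (category, text) pairs as a simple markdown string.

    The mapping (Title -> heading, ListItem -> bullet, everything else ->
    paragraph) is an illustrative assumption, not an unstructured feature.
    """
    blocks = []
    for category, text in elements:
        text = text.strip()
        if not text:
            continue  # skip elements with no visible text
        if category == "Title":
            blocks.append(f"## {text}")
        elif category == "ListItem":
            blocks.append(f"- {text}")
        else:
            blocks.append(text)
    return "\n\n".join(blocks) + "\n"


def pdf_to_markdown(pdf_path, md_path):
    """Partition a PDF with unstructured and write the result as markdown."""
    # Imported lazily so the helper above can be used without unstructured.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=str(pdf_path))
    markdown = elements_to_markdown((el.category, el.text) for el in elements)
    Path(md_path).write_text(markdown, encoding="utf-8")
    return markdown
```

Calling `pdf_to_markdown(DATA_DIR / "textbook.pdf", DATA_DIR / "textbook.md")` would then read the PDF from data/ and write the markdown next to it; the file names here are placeholders.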

@jurmy24 added the good first issue label on Feb 12, 2025
@jurmy24
Member Author

jurmy24 commented Feb 12, 2025

Note: this doesn't interfere with any other code, so you can work freely.

@ghoullishly

@alvaro-mazcu and I started working on this today. We parsed the first chapter of the geography book and started looking into what kind of metadata we can use. We also tested parsing a math paper to explore formula parsing.

@ghoullishly ghoullishly linked a pull request Mar 5, 2025 that will close this issue

3 participants