Add binary content extraction #74

Merged
navarone-feekery merged 8 commits into main from navarone/add-binary-content-extraction
Aug 21, 2024

@navarone-feekery (Collaborator) commented Aug 8, 2024

Add binary content extraction support.
When Crawler encounters a binary content file (such as a PDF), it downloads the file and creates an ES doc for ingestion.
The binary content of the file is added to the doc's _attachment field as a base64-encoded string. The actual decoding is done by ES pipelines, which take the decoded value and add it to the body field on the doc. The _attachment field is then removed.
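
The flow described above can be sketched roughly as follows. This is a minimal Python illustration, not Crawler's actual implementation: the helper name `build_binary_doc`, the `url` and `content_type` fields, and the `_extracted_attachment` scratch field are assumptions made for the example; only `_attachment` and `body` come from the description. The pipeline body uses Elasticsearch's `attachment`, `set`, and `remove` ingest processors as one plausible way to perform the decoding step.

```python
import base64


def build_binary_doc(url: str, content: bytes, mime_type: str) -> dict:
    """Build a doc that carries the raw file bytes as a base64 string."""
    return {
        "url": url,                 # hypothetical field name
        "content_type": mime_type,  # hypothetical field name
        "_attachment": base64.b64encode(content).decode("ascii"),
    }


# Hypothetical ingest pipeline body: extract text from _attachment,
# copy it into body, then drop the temporary fields.
EXAMPLE_PIPELINE = {
    "description": "Sketch: move extracted attachment text into body",
    "processors": [
        {
            "attachment": {
                "field": "_attachment",
                "target_field": "_extracted_attachment",  # assumed scratch field
                "ignore_missing": True,
            }
        },
        {
            "set": {
                "field": "body",
                "copy_from": "_extracted_attachment.content",
                "ignore_empty_value": True,
            }
        },
        {
            "remove": {
                "field": ["_attachment", "_extracted_attachment"],
                "ignore_missing": True,
            }
        },
    ],
}

doc = build_binary_doc(
    "https://example.com/report.pdf", b"%PDF-1.7 ...", "application/pdf"
)
```

Keeping the extraction in a pipeline means the crawler side only has to encode and ship bytes; the pipeline actually used by Crawler may differ from this sketch.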

Binary content extraction can be enabled or disabled through the following configuration fields:

# This example config allows PDFs and MS Word docs to be extracted

content_extraction_enabled: true
content_extraction_mime_types:
  - application/pdf
  - application/msword
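
As a rough illustration of how these two fields might gate extraction, here is a small Python sketch. The function name `should_extract` and the normalization of the Content-Type value (dropping parameters, lowercasing) are assumptions for the example, not a description of Crawler's internals; the config is represented as a plain dict that mirrors the YAML keys above.

```python
def should_extract(config: dict, content_type_header: str) -> bool:
    """Return True if a response with this Content-Type should be extracted."""
    if not config.get("content_extraction_enabled", False):
        return False
    # Drop parameters such as "; charset=binary" and normalize case.
    mime_type = content_type_header.split(";", 1)[0].strip().lower()
    allowed = {m.lower() for m in (config.get("content_extraction_mime_types") or [])}
    return mime_type in allowed


example_config = {
    "content_extraction_enabled": True,
    "content_extraction_mime_types": ["application/pdf", "application/msword"],
}

print(should_extract(example_config, "application/pdf"))  # True
print(should_extract(example_config, "image/png"))        # False
```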

This feature additionally closes #71

Checklists

Pre-Review Checklist

  • This PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check crawler.yml.example and elasticsearch.yml.example)
  • This PR has a meaningful title
  • This PR links to all relevant GitHub issues that it fixes or partially addresses
    • If there is no GitHub issue, please create it. Each PR should have a link to an issue
  • This PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v0.1.0)
  • Considered corresponding documentation changes
  • Ran make notice if any dependencies have been added

Release Note

Crawler can now extract binary content from files it encounters during a crawl.

@navarone-feekery requested a review from a team August 8, 2024 17:06
@navarone-feekery marked this pull request as draft August 8, 2024 17:06
@navarone-feekery marked this pull request as ready for review August 9, 2024 11:28
@seanstory (Member) left a comment

looking great. Good tests as always. :)

@navarone-feekery requested review from seanstory and a team August 20, 2024 13:29
@navarone-feekery enabled auto-merge (squash) August 21, 2024 18:27
@navarone-feekery merged commit 5a04a00 into main Aug 21, 2024
2 checks passed
@navarone-feekery deleted the navarone/add-binary-content-extraction branch August 21, 2024 18:31

Successfully merging this pull request may close these issues.

Enable quick storage of PDF file size and name using the web crawler
2 participants