Add binary content extraction #74

navarone-feekery · 2024-08-08T17:06:33Z

Add binary content extraction support.
When Crawler encounters a binary content file (such as a PDF), it downloads the file and creates an ES doc for ingestion.
The binary content of the file is added to the _attachment field as a base64-encoded string. The actual decoding is done by ES pipelines, which take the decoded value and add it to the body field on the doc. The _attachment field is then removed.
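
To make the flow above concrete, here is a minimal sketch of the crawler-side encoding step. The helper name `build_binary_doc` and the `url` field are hypothetical and not taken from this PR. `ASSUMED_PIPELINE` is only a guess at what an ingest pipeline with this behaviour could look like, built from the standard Elasticsearch `attachment`, `set`, and `remove` processors; the pipeline actually used with Crawler may differ.

```ruby
require 'base64'

# Hypothetical helper (not from this PR): wraps downloaded bytes in a doc hash.
def build_binary_doc(url, raw_bytes)
  {
    'url' => url,
    # strict_encode64 emits no line breaks, so the value is one base64 string
    '_attachment' => Base64.strict_encode64(raw_bytes)
  }
end

# Assumed pipeline shape, for illustration only: extract text from the
# attachment, copy it into body, then drop the temporary fields.
ASSUMED_PIPELINE = {
  'processors' => [
    { 'attachment' => { 'field' => '_attachment', 'target_field' => '_extracted' } },
    { 'set' => { 'field' => 'body', 'copy_from' => '_extracted.content' } },
    { 'remove' => { 'field' => %w[_attachment _extracted] } }
  ]
}.freeze

doc = build_binary_doc('https://example.com/report.pdf', '%PDF-1.7 sample bytes'.b)
puts doc['_attachment']
```

The crawler only has to ship the encoded bytes; text extraction stays on the Elasticsearch side, which is why the doc carries `_attachment` instead of a ready-made `body`.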

This can be enabled or disabled through the following configuration fields:

# This example config will allow PDFs and MS Word docs to be extracted

content_extraction_enabled: true
content_extraction_mime_types:
  - application/pdf
  - application/msword
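
As a rough illustration of how these two settings could gate extraction, the sketch below loads the example config and checks a response's content type against the allow-list. The predicate `extractable?` and its normalisation of the `Content-Type` value are assumptions made for this example, not the logic implemented in this PR.

```ruby
require 'yaml'

CONFIG_YAML = <<~YAML
  content_extraction_enabled: true
  content_extraction_mime_types:
    - application/pdf
    - application/msword
YAML

# Hypothetical predicate (not from this PR): true when extraction is enabled
# and the MIME type is on the configured allow-list.
def extractable?(config, content_type)
  return false unless config['content_extraction_enabled']

  # Drop parameters such as "; charset=binary" and normalise case
  mime = content_type.to_s.split(';').first.to_s.strip.downcase
  Array(config['content_extraction_mime_types']).include?(mime)
end

config = YAML.safe_load(CONFIG_YAML)
puts extractable?(config, 'application/pdf')
puts extractable?(config, 'image/png')
```

With the example config, a PDF response would pass the check, while an image or any other unlisted type would be skipped.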

This feature additionally closes #71

Checklists

Pre-Review Checklist

This PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check crawler.yml.example and elasticsearch.yml.example)
This PR has a meaningful title
This PR links to all relevant GitHub issues that it fixes or partially addresses
- If there is no GitHub issue, please create it. Each PR should have a link to an issue
This PR has a thorough description
Covered the changes with automated tests
Tested the changes locally
Added a label for each target release version (example: v0.1.0)
Considered corresponding documentation changes
Ran make notice if any dependencies have been added

Release Note

Crawler can now extract binary content from files it encounters during a crawl.

seanstory

looking great. Good tests as always. :)

lib/crawler/data/crawl_result/content_extractable_file.rb

lib/crawler/document_mapper.rb

lib/crawler/http_executor.rb

Add binary content extraction

30ccd5d

navarone-feekery added v0.2.0 release_note labels Aug 8, 2024

navarone-feekery requested a review from a team August 8, 2024 17:06

navarone-feekery marked this pull request as draft August 8, 2024 17:06

navarone-feekery added 5 commits August 9, 2024 10:46

Fix lint

9e0cf5f

Refactor document_mapper and add tests

15082f1

Add tests for document mapper

fbcab86

Remove unused code

101a64d

Fix hash key typing for docs

b5eecd7

navarone-feekery marked this pull request as ready for review August 9, 2024 11:28

seanstory reviewed Aug 9, 2024

lib/crawler/data/crawl_result/content_extractable_file.rb (outdated)

lib/crawler/document_mapper.rb

lib/crawler/http_executor.rb

Add more binary content fields

b47dddb

navarone-feekery requested review from seanstory and a team August 20, 2024 13:29

seanstory approved these changes Aug 21, 2024

Merge branch 'main' into navarone/add-binary-content-extraction

8b8f4bf

navarone-feekery enabled auto-merge (squash) August 21, 2024 18:27

navarone-feekery merged commit 5a04a00 into main Aug 21, 2024
2 checks passed

navarone-feekery deleted the navarone/add-binary-content-extraction branch August 21, 2024 18:31
