Allow for full HTML extraction #204

Merged: 4 commits, Feb 6, 2025

Conversation

navarone-feekery
Collaborator

Closes #198

Enable full HTML extraction through a config option.
The logic for this was already present in the Crawler code; only two pieces were missing: a configuration option, and including the full HTML in the JSON document that is ingested into Elasticsearch (ES).

Changes

  • Add configuration full_html_extraction_enabled which defaults to false
  • If enabled, include the full HTML in the crawl result doc under field full_html
  • If disabled, don't include the field full_html at all
  • (Bonus) add purge_crawl_enabled to example YAML file as it was missing
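
To illustrate the enabled/disabled behaviour listed above, here is a minimal sketch in Python. It is not the Crawler's actual code: the function `build_crawl_doc`, its parameters, and the `url` and `title` fields are hypothetical. Only the option name `full_html_extraction_enabled`, its default of false, and the output field `full_html` come from this PR.

```python
from typing import Any


def build_crawl_doc(
    url: str,
    title: str,
    raw_html: str,
    config: dict[str, Any],
) -> dict[str, Any]:
    """Build a crawl result document (hypothetical helper, for illustration only)."""
    # Fields that are always present in this sketch; the real document
    # schema is not described in this PR.
    doc: dict[str, Any] = {"url": url, "title": title}

    # The option defaults to false, so a missing key is treated as disabled.
    if config.get("full_html_extraction_enabled", False):
        # Enabled: include the full page HTML under `full_html`.
        doc["full_html"] = raw_html

    # Disabled: the `full_html` key is omitted entirely, not set to null.
    return doc
```

With an empty config the returned document has no `full_html` key at all; with `{"full_html_extraction_enabled": True}` it carries the unmodified HTML string. Omitting the key rather than sending an empty value keeps the ingested documents unchanged for anyone who does not opt in.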

@navarone-feekery navarone-feekery requested a review from a team as a code owner February 6, 2025 10:24
@navarone-feekery navarone-feekery changed the title Navarone/ingest raw html Allow for full HTML extraction Feb 6, 2025
@navarone-feekery navarone-feekery enabled auto-merge (squash) February 6, 2025 10:43
@navarone-feekery navarone-feekery merged commit 281068e into main Feb 6, 2025
2 checks passed
@navarone-feekery navarone-feekery deleted the navarone/ingest-raw-html branch February 6, 2025 10:48
github-actions bot pushed a commit that referenced this pull request Feb 6, 2025

github-actions bot commented Feb 6, 2025

💚 Backport PR(s) successfully created

Branch 0.2: #208

This backport PR will be merged automatically after passing CI.

navarone-feekery added a commit that referenced this pull request Feb 6, 2025
Backports the following commits to 0.2:
 - Allow for full HTML extraction (#204)

Co-authored-by: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>
Successfully merging this pull request may close these issues.

Allow raw HTML to be ingested