Allow for full HTML extraction #204

navarone-feekery · 2025-02-06T10:24:22Z

Closes #198

Enable full HTML extraction through a config option.
The logic for this was already present in the Crawler code; only two pieces were missing: a configuration option, and including the full HTML in the JSON that is ingested into ES.

Changes

- Add configuration `full_html_extraction_enabled` which defaults to `false`
- If enabled, include the full HTML in the crawl result doc under field `full_html`
- If disabled, don't include the field `full_html` at all
- (Bonus) add `purge_crawl_enabled` to example YAML file as it was missing
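
The enabled/disabled behaviour described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Crawler's actual code: the helper `build_crawl_doc`, the `url` field, and the plain-dict config are assumptions made for the example. Only the names `full_html_extraction_enabled` and `full_html` come from this PR.

```python
from typing import Any


def build_crawl_doc(url: str, html: str, config: dict[str, Any]) -> dict[str, Any]:
    """Build a crawl result document (hypothetical helper, for illustration only)."""
    doc: dict[str, Any] = {"url": url}
    # The option defaults to false, so a missing key is treated as "disabled".
    if config.get("full_html_extraction_enabled", False):
        doc["full_html"] = html
    # When disabled, the key is left out entirely rather than set to None or "".
    return doc


page = "<html><body><h1>Hello</h1></body></html>"
enabled_doc = build_crawl_doc(
    "https://example.com", page, {"full_html_extraction_enabled": True}
)
default_doc = build_crawl_doc("https://example.com", page, {})
```

Omitting the key when the option is off, instead of writing an empty value, keeps the ingested documents unchanged for anyone who never sets the option.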


github-actions · 2025-02-06T10:49:11Z

💚 Backport PR(s) successfully created

| Status | Branch | Result |
|--------|--------|--------|
| ✅ | 0.2 | #208 |

This backport PR will be merged automatically after passing CI.

Backports the following commits to 0.2:
- Allow for full HTML extraction (#204)

Co-authored-by: Navarone Feekery <13634519+navarone-feekery@users.noreply.github.com>

navarone-feekery added 3 commits February 6, 2025 11:17

Enable full HTML extraction

2c48a78

Rename config field

042a419

Fix lint

6ca2a52

navarone-feekery added auto-backport v0.2.1 labels Feb 6, 2025

navarone-feekery requested a review from a team as a code owner February 6, 2025 10:24

navarone-feekery changed the title ~~Navarone/ingest raw html~~ Allow for full HTML extraction Feb 6, 2025

artem-shelkovnikov approved these changes Feb 6, 2025


Merge branch 'main' into navarone/ingest-raw-html

42e2a3b

navarone-feekery enabled auto-merge (squash) February 6, 2025 10:43

navarone-feekery merged commit 281068e into main Feb 6, 2025
2 checks passed

navarone-feekery deleted the navarone/ingest-raw-html branch February 6, 2025 10:48

github-actions bot mentioned this pull request Feb 6, 2025

[0.2] Allow for full HTML extraction (#204) #208

Merged
