Add purge crawl feature #65
Conversation
few minor nits and questions, but this looks great.
@@ -38,6 +38,7 @@
  let(:sink) { Crawler::OutputSink::Mock.new(crawl_config) }
  let(:crawl_queue) { Crawler::Data::UrlQueue::MemoryOnly.new(crawl_config) }
  let(:seen_urls) { Crawler::Data::SeenUrls.new }
+ let(:crawl_result) { FactoryBot.build(:html_crawl_result, content: '<p>BOO!</p>') }
👻 !
Sometimes I wonder if you leave things like this in here just to see if we're actually reading/reviewing closely. :)
I'm definitely not that sly, they're just easter eggs so I can laugh when I have to fix a test in a year :)
Summary
Add the Purge Crawl feature, which allows Crawler to delete outdated docs from the index.
Purge crawls can only run if the output sink is `elasticsearch`. The feature can be disabled via the config yaml file by setting `purge_crawl_enabled: false`.
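A config excerpt like the following would keep the Elasticsearch sink but opt out of purge crawls. This is only a hedged sketch of a crawler yaml file; any key other than `purge_crawl_enabled` is shown purely for context and is not introduced by this PR:

```yaml
# Illustrative excerpt of a crawler yaml config (not the PR's own example file).
output_sink: elasticsearch
purge_crawl_enabled: false   # skip the purge crawl after the primary crawl finishes
```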
Purge crawls are performed after the primary crawl is finished. Crawler will perform a `_search` query against the ES index to find all documents that have a `last_crawled_at` earlier than the primary crawl's start time.

Crawler then sends all of these docs to another "purge" crawl. During this crawl, no links are extracted. The only intention is to determine whether the pages that these docs represent still exist and are accessible. Pages that return a non-200 response, or are blocked by existing crawl rules, can be deleted.
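As a rough illustration of the kind of lookup described above, the sketch below uses the elasticsearch-ruby client to find stale documents. The index name, connection details, and `url` source field are assumptions for illustration, not the PR's actual implementation:

```ruby
require 'elasticsearch'
require 'time'

# Illustrative only: connection settings and index name are assumptions.
client = Elasticsearch::Client.new(url: 'http://localhost:9200')
index_name = 'search-my-crawl'

# The primary crawl's start time; docs last crawled before this are stale.
crawl_start_time = Time.now.utc - 3600

# _search for documents whose last_crawled_at predates the primary crawl.
response = client.search(
  index: index_name,
  body: {
    query: {
      range: {
        last_crawled_at: { lt: crawl_start_time.iso8601 }
      }
    }
  }
)

# Candidates that would be fed into the follow-up "purge" crawl.
stale_urls = response['hits']['hits'].map { |hit| hit['_source']['url'] }
```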
Crawler will also re-index the docs it finds during the purge crawl, to ensure any changes are reflected and `last_crawled_at` is up-to-date.

When the purge crawl is complete, any docs that are remaining are deleted from the index. This is done in one attempt using `delete_by_query`.
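A similarly hedged sketch of that final cleanup step: assuming re-indexed pages received a fresh `last_crawled_at`, the docs still carrying an older timestamp can be removed in a single `delete_by_query` call. Client, index name, and the exact deletion criteria are again assumptions:

```ruby
require 'elasticsearch'
require 'time'

# Illustrative only: names and deletion criteria are assumptions.
client = Elasticsearch::Client.new(url: 'http://localhost:9200')
index_name = 'search-my-crawl'
crawl_start_time = Time.now.utc - 3600 # when the primary crawl started

# Anything re-indexed during the purge crawl now has a fresh last_crawled_at,
# so the remaining stale docs can be removed in one request.
client.delete_by_query(
  index: index_name,
  body: {
    query: {
      range: {
        last_crawled_at: { lt: crawl_start_time.iso8601 }
      }
    }
  }
)
```

Doing the deletion as one `delete_by_query` request keeps the cleanup to a single round trip rather than deleting the leftover documents one by one.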
I will add documentation in another PR.
Checklists
Pre-Review Checklist
- Example config files (crawler.yml.example and elasticsearch.yml.example)
- Version (v0.1.0)

Release Note
Crawler will now purge outdated documents from the index at the end of the crawl. This is enabled by default. You can disable it by adding `purge_crawl_enabled: false` to the crawler's yaml config file.