Add purge crawl feature #65

Merged
merged 18 commits into from
Aug 22, 2024

navarone-feekery
Collaborator

@navarone-feekery navarone-feekery commented Jul 26, 2024

Summary

Add the Purge Crawl feature, which allows Crawler to delete outdated docs from the index.

Purge crawls can only run when the output sink is elasticsearch. They are enabled by default and can be disabled by setting purge_crawl_enabled: false in the config yaml file.
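
As a rough sketch of the gating rule above: the helper name `purge_crawl_allowed?`, the plain-hash config, and the `output_sink` key are assumptions made for illustration and are not taken from the Crawler codebase. Only `purge_crawl_enabled` and the elasticsearch requirement come from this description.

```ruby
# Hypothetical helper, not the actual Crawler implementation.
# `config` is assumed to be the parsed yaml config as a plain Hash.
def purge_crawl_allowed?(config)
  # Assumption: a missing key means "enabled", since the feature is on by default.
  enabled = config.fetch('purge_crawl_enabled', true)

  # Purge crawls need an index to query, so only the elasticsearch sink qualifies.
  enabled && config['output_sink'].to_s == 'elasticsearch'
end

purge_crawl_allowed?({ 'output_sink' => 'elasticsearch' })
# => true
purge_crawl_allowed?({ 'output_sink' => 'elasticsearch', 'purge_crawl_enabled' => false })
# => false
```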

Purge crawls are performed after the primary crawl is finished. Crawler will perform a _search query against the ES index to find all documents that have a last_crawled_at earlier than the primary crawl's start time.
Crawler then sends all of these docs to another "purge" crawl. During this crawl, no links are extracted. The only intention is to determine if the pages that these docs represent still exist and are accessible. Pages that return a non-200 response, or are blocked by existing crawl rules, can be deleted.
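
The _search step described above could use a request body along these lines. This is a minimal sketch: the helper name `stale_docs_query` is hypothetical, and the exact query Crawler sends may differ. Only the `last_crawled_at` field and the "earlier than the primary crawl's start time" condition come from this description.

```ruby
require 'time'

# Hypothetical helper: builds an Elasticsearch range query that matches
# documents last crawled before the primary crawl started.
def stale_docs_query(crawl_start_time)
  {
    query: {
      range: {
        # `lt` (strictly less than) excludes docs indexed during the primary crawl.
        last_crawled_at: { lt: crawl_start_time.utc.iso8601 }
      }
    }
  }
end

body = stale_docs_query(Time.utc(2024, 8, 22, 0, 28, 0))
# => { query: { range: { last_crawled_at: { lt: "2024-08-22T00:28:00Z" } } } }
```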

Crawler will also re-index the docs it finds during the purge crawl, to ensure any changes are reflected and last_crawled_at is up-to-date.

When the purge crawl is complete, any remaining docs are deleted from the index. This is done in one attempt using delete_by_query.
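
To make the overall flow easier to follow, here is an in-memory simulation of the steps above. Everything in it is hypothetical: the index is a plain Hash of URL to last_crawled_at, the `fetch_status` and `allowed` lambdas stand in for HTTP requests and crawl rules, and the final cleanup assumes "remaining" docs are those whose last_crawled_at is still older than the crawl start. The real feature talks to Elasticsearch and uses delete_by_query for that last step.

```ruby
# Hypothetical simulation of a purge crawl; not the Crawler implementation.
def purge_crawl(index, crawl_start, fetch_status:, allowed:, now:)
  # 1. Find docs the primary crawl did not touch.
  stale_urls = index.select { |_url, crawled_at| crawled_at < crawl_start }.keys

  # 2. Revisit each stale page (no link extraction) and re-index the ones
  #    that are still allowed by crawl rules and still return 200.
  stale_urls.each do |url|
    next unless allowed.call(url) && fetch_status.call(url) == 200

    index[url] = now
  end

  # 3. Anything still older than the crawl start was not re-indexed,
  #    so remove it in a single pass.
  purged = index.select { |_url, crawled_at| crawled_at < crawl_start }.keys
  purged.each { |url| index.delete(url) }
  purged
end

crawl_start = Time.utc(2024, 8, 22)
index = {
  'https://example.com/' => crawl_start + 60,           # seen by the primary crawl
  'https://example.com/old' => crawl_start - 3600,      # stale, now returns 404
  'https://example.com/kept' => crawl_start - 3600,     # stale, still returns 200
  'https://example.com/private' => crawl_start - 3600   # stale, blocked by a crawl rule
}
statuses = { 'https://example.com/old' => 404 }

purged = purge_crawl(
  index, crawl_start,
  fetch_status: ->(url) { statuses.fetch(url, 200) },
  allowed: ->(url) { !url.end_with?('/private') },
  now: crawl_start + 120
)
# purged => ["https://example.com/old", "https://example.com/private"]
```

After the call, `index` holds only the home page and the re-indexed `/kept` page, whose timestamp has been refreshed.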

I will add documentation in another PR.

Checklists

Pre-Review Checklist

  • This PR does NOT contain credentials of any kind, such as API keys or username/passwords (double check crawler.yml.example and elasticsearch.yml.example)
  • This PR has a meaningful title
  • This PR links to all relevant GitHub issues that it fixes or partially addresses
    • If there is no GitHub issue, please create it. Each PR should have a link to an issue
  • This PR has a thorough description
  • Covered the changes with automated tests
  • Tested the changes locally
  • Added a label for each target release version (example: v0.1.0)

Release Note

Crawler will now purge outdated documents from the index at the end of the crawl. This is enabled by default. You can disable this by adding purge_crawl_enabled: false to the crawler's yaml config file.

@navarone-feekery navarone-feekery force-pushed the navarone/add-purge-docs branch 2 times, most recently from f0f4b44 to e067a77 Compare August 7, 2024 09:38
@navarone-feekery navarone-feekery force-pushed the navarone/add-purge-docs branch from e067a77 to c009ddf Compare August 7, 2024 09:38
@navarone-feekery navarone-feekery changed the title Add docs purge phase Add purge crawl feature Aug 8, 2024
@navarone-feekery navarone-feekery marked this pull request as ready for review August 8, 2024 11:25
@navarone-feekery navarone-feekery requested a review from a team August 8, 2024 11:25
@navarone-feekery

This comment was marked as resolved.

Member

@seanstory seanstory left a comment

few minor nits and questions, but this looks great.

@navarone-feekery navarone-feekery requested review from seanstory and a team August 21, 2024 13:19
@@ -38,6 +38,7 @@
let(:sink) { Crawler::OutputSink::Mock.new(crawl_config) }
let(:crawl_queue) { Crawler::Data::UrlQueue::MemoryOnly.new(crawl_config) }
let(:seen_urls) { Crawler::Data::SeenUrls.new }
let(:crawl_result) { FactoryBot.build(:html_crawl_result, content: '<p>BOO!</p>') }
Member

👻 !

Member

Sometimes I wonder if you leave things like this in here just to see if we're actually reading/reviewing closely. :)

Collaborator Author

I'm definitely not that sly, they're just easter eggs so I can laugh when I have to fix a test in a year :)

@navarone-feekery navarone-feekery enabled auto-merge (squash) August 22, 2024 00:28
@navarone-feekery navarone-feekery merged commit 28ed2ec into main Aug 22, 2024
2 checks passed
@navarone-feekery navarone-feekery deleted the navarone/add-purge-docs branch August 22, 2024 00:33
2 participants