Add purge crawl feature #65
Conversation
few minor nits and questions, but this looks great.
@@ -38,6 +38,7 @@
  let(:sink) { Crawler::OutputSink::Mock.new(crawl_config) }
  let(:crawl_queue) { Crawler::Data::UrlQueue::MemoryOnly.new(crawl_config) }
  let(:seen_urls) { Crawler::Data::SeenUrls.new }
+ let(:crawl_result) { FactoryBot.build(:html_crawl_result, content: '<p>BOO!</p>') }
👻 !
Sometimes I wonder if you leave things like this in here just to see if we're actually reading/reviewing closely. :)
I'm definitely not that sly, they're just easter eggs so I can laugh when I have to fix a test in a year :)
Summary
Add the Purge Crawl feature, which allows Crawler to delete outdated docs from the index.
Purge crawls can only run if the output sink is `elasticsearch`. The feature can be disabled via the config yaml file by setting `purge_crawl_enabled: false`.
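A config excerpt like the following would keep the Elasticsearch sink but opt out of purge crawls. This is only a hedged sketch of a crawler yaml file; any key other than `purge_crawl_enabled` is shown purely for context and is not introduced by this PR:

```yaml
# Illustrative excerpt of a crawler yaml config (not the PR's own example file).
output_sink: elasticsearch
purge_crawl_enabled: false   # skip the purge crawl after the primary crawl finishes
```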
Purge crawls are performed after the primary crawl is finished. Crawler will perform a `_search` query against the ES index to find all documents that have a `last_crawled_at` earlier than the primary crawl's start time.

Crawler then sends all of these docs to another "purge" crawl. During this crawl, no links are extracted. The only intention is to determine whether the pages that these docs represent still exist and are accessible. Pages that return a non-200 response, or are blocked by existing crawl rules, can be deleted.
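As a rough illustration of the kind of lookup described above, the sketch below uses the elasticsearch-ruby client to find stale documents. The index name, connection details, and `url` source field are assumptions for illustration, not the PR's actual implementation:

```ruby
require 'elasticsearch'
require 'time'

# Illustrative only: connection settings and index name are assumptions.
client = Elasticsearch::Client.new(url: 'http://localhost:9200')
index_name = 'search-my-crawl'

# The primary crawl's start time; docs last crawled before this are stale.
crawl_start_time = Time.now.utc - 3600

# _search for documents whose last_crawled_at predates the primary crawl.
response = client.search(
  index: index_name,
  body: {
    query: {
      range: {
        last_crawled_at: { lt: crawl_start_time.iso8601 }
      }
    }
  }
)

# Candidates that would be fed into the follow-up "purge" crawl.
stale_urls = response['hits']['hits'].map { |hit| hit['_source']['url'] }
```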
Crawler will also re-index the docs it finds during the purge crawl, to ensure any changes are reflected and `last_crawled_at` is up-to-date.

When the purge crawl is complete, any docs that are remaining are deleted from the index. This is done in one attempt using `delete_by_query`.
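A similarly hedged sketch of that final cleanup step: assuming re-indexed pages received a fresh `last_crawled_at`, the docs still carrying an older timestamp can be removed in a single `delete_by_query` call. Client, index name, and the exact deletion criteria are again assumptions:

```ruby
require 'elasticsearch'
require 'time'

# Illustrative only: names and deletion criteria are assumptions.
client = Elasticsearch::Client.new(url: 'http://localhost:9200')
index_name = 'search-my-crawl'
crawl_start_time = Time.now.utc - 3600 # when the primary crawl started

# Anything re-indexed during the purge crawl now has a fresh last_crawled_at,
# so the remaining stale docs can be removed in one request.
client.delete_by_query(
  index: index_name,
  body: {
    query: {
      range: {
        last_crawled_at: { lt: crawl_start_time.iso8601 }
      }
    }
  }
)
```

Doing the deletion as one `delete_by_query` request keeps the cleanup to a single round trip rather than deleting the leftover documents one by one.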
I will add documentation in another PR.
Checklists
Pre-Review Checklist
- Example config files (crawler.yml.example and elasticsearch.yml.example)
- Version (v0.1.0)

Release Note
Crawler will now purge outdated documents from the index at the end of the crawl. This is enabled by default. You can disable it by adding `purge_crawl_enabled: false` to the crawler's yaml config file.