Lock bulk queue while processing indexing request #45
Conversation
I'm not a ruby expert but left some comments :)
@jedrazb @artem-shelkovnikov thanks for the review, I switched to using
lib/crawler/coordinator.rb
Outdated
unless outcome.is_a?(Hash)
  error = "Expected to return an outcome object from the sink, returned #{outcome.inspect} instead"
  raise ArgumentError, error

Timeout.timeout(SINK_LOCK_TIMEOUT) do
Is it a library function, or a built-in Ruby one?
Is this one thread-safe?
Asking because of these scary articles:
Also - how does `SINK_LOCK_TIMEOUT` interact with retries for the sink? Can it time out before the sink finishes retrying?
`Timeout` is built-in Ruby and it appears it isn't thread-safe there 🤦🏻 sorry, I never thought something built-in could be so bad; it seems everywhere recommends not to use it.
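For anyone following along, the concern is roughly this (a toy illustration, not code from the PR; `do_flush` is a placeholder):

```ruby
require 'timeout'

# Stand-in for a slow bulk flush; not from the actual codebase.
def do_flush
  sleep 10
end

mutex = Mutex.new

# Timeout.timeout runs a watcher thread and raises Timeout::Error into the
# calling thread asynchronously, so the interrupt can land at any point inside
# the block - including while the mutex is held and shared state is half-updated.
Timeout.timeout(5) do
  mutex.synchronize { do_flush }
end
```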
> Also - how does `SINK_LOCK_TIMEOUT` interact with retries for the sink? Can it time out before the sink finishes retrying?

There's no interaction, it's just a flat timeout.
There are two retries that would happen in the sink:

- First is acquiring the lock, which isn't really a timeout; `mutex.synchronize` just waits for the lock to be lifted.
- Second is the flush (ES request) retries. In a worst-case scenario of 4 attempts each timing out at 10 seconds, it would take around 54 seconds (10 + 12 + 14 + 18) to release the lock. If the flush fails, the entire payload will be dropped, so the next executor that acquires the lock won't reattempt this request; it'll add its crawl result to the now-empty queue. This means the executors waiting for the lock for these 54-ish seconds should clear up quickly afterwards.
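Spelled out, the worst case above is just this arithmetic (illustrative only; the 10-second request cap and the 2/4/8-second backoff are assumptions taken from the numbers in this comment):

```ruby
# Each attempt is capped at 10 seconds; backoff sleeps happen between attempts.
attempt_cost   = 10           # seconds per timed-out bulk request
backoff_sleeps = [0, 2, 4, 8] # no sleep before the first attempt

total = backoff_sleeps.map { |sleep_time| sleep_time + attempt_cost }.sum
puts total # => 54 (10 + 12 + 14 + 18)
```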
If we remove `Timeout`, I think this could be reworked to use a `mutex.try_lock` > `mutex.unlock` block instead of `mutex.synchronize`. If the lock can't be acquired, it sleeps for 1 second, up to a max of maybe 120 retries (so approx. two minutes), before throwing an error. How does that sound?
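In code, the idea would be something like this sketch (not the code in this PR; `MAX_LOCK_RETRIES`, `with_sink_lock`, and the error message are assumed names for illustration):

```ruby
MAX_LOCK_RETRIES = 120 # ~2 minutes at 1 second per attempt

def with_sink_lock(mutex)
  attempts = 0
  # Keep attempting to take the lock; each failed attempt sleeps for a second.
  until mutex.try_lock
    attempts += 1
    raise 'Could not acquire the sink lock in time' if attempts >= MAX_LOCK_RETRIES

    sleep(1)
  end

  begin
    yield
  ensure
    mutex.unlock
  end
end
```

An executor would then wrap its queue write in `with_sink_lock(mutex) { ... }` instead of calling `mutex.synchronize`.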
Oh it happens :)
I think indeed using a mutex is best. A question that came to mind while reading through it - what happens if Elasticsearch is overloaded and starts throttling?

So you have 10 threads that are getting page content and throwing it into a single sink. The sink starts to slow down, but do the threads keep extracting content? What happens if your timeout is 120 seconds but the sink took 118 seconds to unlock - will all the threads waiting for it be able to write into it anyway?
> So you have 10 threads that are getting page content and throwing it into a single sink. The sink starts to slow down, but do the threads keep extracting content?

No, content extraction should pause during this for all threads. The executors idle while waiting for the lock.
> What happens if your timeout is 120 seconds but the sink took 118 seconds to unlock - will all the threads waiting for it be able to write into it anyway?

If the remaining executors can write everything in 2 seconds, yes. The write is usually very fast if there's no flush required, so I think most would go through in this case, but there's always a possibility of some being dropped.
I think that's unavoidable, and as long as it's logged clearly users can do something about it (e.g. look into why Elasticsearch is throttling the Crawler).
@artem-shelkovnikov I've changed things a bit so `Timeout` isn't used. I thought it made more sense to go by lock acquisition attempts instead (this also means that if a thread acquires the lock late and takes a long time itself, it won't be unnecessarily terminated early).
I've taken a look and haven't found anything broken with this approach. I'm also not super good with concurrent operations, so some stuff might go south regardless; this is something that's just worth testing at some point (like ingesting a large site into a super small Elasticsearch instance and observing). That can of course be done later, when you feel it's the right moment!
Closes #42
The coordinator currently doesn't care about the state of the bulk queue, and will add a crawl result to the queue pool whenever it has finished crawling a page.
This is not thread-safe: multiple threads attempting to add to the queue can overwrite each other and potentially lose data.
This has become more noticeable since retries with exponential backoff were implemented for ES indexing.
This change adds a mutex lock to the `write` method in the `Elasticsearch` sink. If `write` is called, the lock is enabled, and nothing can be added to the pool until the lock is lifted.

This effectively pauses crawl requests when ES indexing is overloaded. This should only impact performance if an ES instance is unhealthy or there are network issues. I think this is acceptable, as the user should investigate these things anyway.
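Conceptually, the sink-side locking looks like the sketch below (illustrative only, not the exact code in this change; the class shape and the `@queue`, `flush`, and `threshold_reached?` names are assumptions, and the actual change acquires the lock via retried `try_lock` attempts as discussed above):

```ruby
# Sketch of serializing access to the bulk queue inside the sink's write path.
class ElasticsearchSinkSketch
  def initialize
    @lock  = Mutex.new
    @queue = []
  end

  # Only one executor can add to the bulk queue (and trigger a flush) at a time,
  # so concurrent writes can no longer clobber each other's queue state.
  def write(crawl_result)
    @lock.synchronize do
      @queue << crawl_result
      flush if threshold_reached?
    end
  end

  private

  def threshold_reached?
    @queue.size >= 100 # assumed flush threshold for illustration
  end

  def flush
    # Send @queue to Elasticsearch as a bulk request (with retries), then clear it.
    @queue.clear
  end
end
```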
Pre-Review Checklist
(Checklist items reference `crawler.yml.example`, `elasticsearch.yml.example`, and `v0.1.0`.)