
fix: operator downtime #1073

Conversation

@MarcosNicolau (Collaborator) commented Sep 24, 2024

Changes
This PR adds a way for the operator to recover tasks missed during downtime. The way it works now is:

  1. We store the latest block at which a batch finished processing.
  2. On startup, we read that block and query the avs_reader for all tasks that have not been responded to, starting from that block minus 100 (this accounts for the case where a task at a higher block number finishes processing before an earlier one, since batches can be processed in parallel). This is to be improved with an algorithm by @Oppen (a sketch of this flow is shown after the list).
  3. We process the batches as usual.
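A minimal Go sketch of this recovery flow. The names (`lastProcessedBatchFilepath`, `GetNotRespondedTasksFrom`, `processBatch`) and the JSON layout are illustrative assumptions, not necessarily the identifiers used in this PR:

```go
package operator

import (
	"encoding/json"
	"os"
)

// Go back this many blocks to cover batches that finished processing out of order.
const blockSafetyMargin = 100

type Task struct {
	TaskCreatedBlock uint64
	// ... batch merkle root, proof data, etc.
}

// AvsReader stands in for the avs_reader used to fetch tasks without a response.
type AvsReader interface {
	GetNotRespondedTasksFrom(fromBlock uint64) ([]Task, error)
}

type Operator struct {
	avsReader                  AvsReader
	lastProcessedBatchFilepath string
}

// recoverMissedTasks re-processes every batch that was not responded to
// while the operator was down.
func (o *Operator) recoverMissedTasks() error {
	// 1. Read the latest block at which a batch finished processing.
	data, err := os.ReadFile(o.lastProcessedBatchFilepath)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // first run: nothing to recover
		}
		return err
	}
	var state struct {
		BlockNumber uint64 `json:"block_number"` // assumed field name
	}
	if err := json.Unmarshal(data, &state); err != nil {
		return err
	}

	// 2. Query the avs_reader from that block minus a safety margin.
	fromBlock := uint64(0)
	if state.BlockNumber > blockSafetyMargin {
		fromBlock = state.BlockNumber - blockSafetyMargin
	}
	tasks, err := o.avsReader.GetNotRespondedTasksFrom(fromBlock)
	if err != nil {
		return err
	}

	// 3. Process the recovered batches as usual.
	for _, task := range tasks {
		o.processBatch(task)
	}
	return nil
}

func (o *Operator) processBatch(task Task) {
	// Download, verify and sign the batch as in the normal flow (omitted here).
}
```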

A new check is also added in the avs_subscriber `get_latest_task` interval so that already-responded batches are not sent again. This was especially a problem at startup: the operator would always run the latest task, no matter how old it was.
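Continuing the sketch above, the subscriber-side check might look roughly like this (`getLatestTask` and `isTaskResponded` are assumed stand-ins for whatever calls the subscriber actually makes):

```go
// Sketch only: skip tasks that already have an on-chain response before
// handing them to the operator. This avoids re-running an old "latest task"
// right after startup.
func forwardLatestTask(
	getLatestTask func() (Task, error),
	isTaskResponded func(Task) (bool, error),
	out chan<- Task,
) error {
	task, err := getLatestTask()
	if err != nil {
		return err
	}
	responded, err := isTaskResponded(task)
	if err != nil {
		return err
	}
	if responded {
		return nil // batch already responded to: nothing to do
	}
	out <- task
	return nil
}
```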

Test

To make sure everything works, we can run the following test:

  1. Set up a local testnet as usual: `make anvil_start_with_block`, `make aggregator_start`, `make batcher_start_local`.
  2. Now, for the operators, start four different ones:
    • 1: `make operator_register_and_start`
    • 2: `make operator_register_and_start CONFIG_FILE=config-files/config-operator-1.yml`
    • 3: `make operator_register_and_start CONFIG_FILE=config-files/config-operator-2.yml`
    • 4: `make operator_register_and_start CONFIG_FILE=config-files/config-operator-3.yml`
  3. Send some proofs to the batcher to make sure everything is working: `make batcher_send_plonk_bls12_381_burst`.
  4. Stop operators 2, 3, and 4.
  5. Send more proofs again: `make batcher_send_risc0_burst` (it is better if you send many batches).
  6. You should see that the aggregator is waiting for more verifications before sending the response.
  7. Start operators 2 and 3 again:
    • `make operator_start CONFIG_FILE=config-files/config-operator-1.yml`
    • `make operator_start CONFIG_FILE=config-files/config-operator-2.yml`
  8. You should see that quorum is reached and the aggregator sends the response.
  9. Start operator 4: `make operator_start CONFIG_FILE=config-files/config-operator-3.yml`.
  10. You should see that it does not re-run the tasks, since the batches have already been responded to.

Deploy

Warning

This PR adds the following parameter to the operator configuration: `last_processed_batch_filepath: '<>'`
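For reference, a hedged sketch of how the operator could write that file after each batch, reusing the `encoding/json` and `os` imports from the sketch above (the JSON layout and helper name are assumptions; the real format may differ):

```go
// saveLastProcessedBlock persists the latest block at which a batch finished
// processing to the file configured by last_processed_batch_filepath.
func saveLastProcessedBlock(path string, blockNumber uint64) error {
	state := struct {
		BlockNumber uint64 `json:"block_number"` // assumed field name
	}{BlockNumber: blockNumber}
	data, err := json.Marshal(state)
	if err != nil {
		return err
	}
	// Written after each batch finishes, so the file always holds the latest
	// block at which a batch completed.
	return os.WriteFile(path, data, 0o644)
}
```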

Advances on #962
Closes #978

@MarcosNicolau marked this pull request as ready for review September 25, 2024 14:51
@MarcosNicolau self-assigned this Sep 25, 2024
@MarcosNicolau requested a review from Oppen September 26, 2024 18:08
@MarcosNicolau requested a review from Oppen September 26, 2024 20:48
@MarcosNicolau requested a review from uri-99 October 1, 2024 21:02
@uri-99 (Contributor) left a comment

Final detail and I think it's ready to approve.

@uri-99 (Contributor) commented Oct 2, 2024

Also add the new .json file to .gitignore, to avoid pushing files with random values to the repo.

@Oppen (Collaborator) commented Oct 3, 2024

OK, seems to work here.

@uri-99 merged commit d814f2b into staging Oct 3, 2024
2 checks passed
@uri-99 deleted the 962-fixoperator-aggregator-aggregator-or-operator-downtime-can-break-eventual-consistency branch October 3, 2024 20:50
PatStiles pushed a commit that referenced this pull request Oct 7, 2024
Co-authored-by: Urix <43704209+uri-99@users.noreply.github.com>