
peer, main, netsync, blockchain: parallel block downloads #2226

Open · kcalvinalvin wants to merge 11 commits into master from 2024-04-01-parallel-ibd
Conversation

kcalvinalvin
Collaborator

@kcalvinalvin kcalvinalvin commented Aug 7, 2024

This PR modifies netsync.Manager so that all the header-first blocks before the last checkpoint are downloaded out of order by utilizing query.WorkManager from neutrino.

Gonna put it in draft for now as testing is sort of difficult and I'm not convinced it's downloading blocks faster for mainnet. In my testing it works just fine on testnet, but mainnet seems to be slow when downloading blocks. Still identifying where the bottleneck is and will make adjustments accordingly.

If anyone else would like to give this a try, please let me know if you see speed-ups or slowdowns from this PR.

Parallel block download architecture

Block Download Window

We mainly follow the "block download window" model that Bitcoin Core uses. This model
allows blocks to be downloaded out of order inside a given window.

Given the blocks below, from b01 to b15, if the window size is 4, the download sequence
might look like this:

     window
       |
 -------------
 |           |
b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12 b13 b14 b15

Any of these blocks can be downloaded in parallel, and the requests are queued to multiple
peers. Once b01 is downloaded, the block window moves forward, resulting in something that looks
like this:

         window
           |
     -------------
     |           |
b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12 b13 b14 b15

If b02 was already downloaded before b01, then the block window shifts by 2,
resulting in something that looks like this:

             window
               |
         -------------
         |           |
b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12 b13 b14 b15

The parameter used in the code:

maxBlockDownloadWindow = 1024
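
For illustration, here is a minimal sketch of how such a window could advance as blocks arrive out of order. The type and method names are hypothetical, not the PR's actual code (which drives this through netsync.Manager and query.WorkManager):

```go
package blockwindow

const maxBlockDownloadWindow = 1024

// downloadWindow tracks which heights inside the window have arrived but
// haven't been processed yet.
type downloadWindow struct {
	start      int32              // height of the first block not yet processed
	downloaded map[int32]struct{} // heights received out of order
}

// inWindow reports whether a block at the given height may be requested now.
func (w *downloadWindow) inWindow(height int32) bool {
	return height >= w.start && height < w.start+maxBlockDownloadWindow
}

// markDownloaded records a received block and slides the window past every
// consecutive block starting at w.start, returning the heights that are now
// ready to be processed in order.
func (w *downloadWindow) markDownloaded(height int32) []int32 {
	w.downloaded[height] = struct{}{}

	var ready []int32
	for {
		if _, ok := w.downloaded[w.start]; !ok {
			break
		}
		ready = append(ready, w.start)
		delete(w.downloaded, w.start)
		w.start++ // e.g. receiving b01 after b02 advances the window by 2
	}
	return ready
}
```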

GetData size per peer

When requesting blocks from peers, we'll use a getdata message, which can carry a maximum of 50,000 inventory entries.
Asking a single peer for more blocks reduces latency since we only send one getdata message
for many blocks. However, it also means a single slow peer can hold us back more.

By playing around with the parameter, I settled on a maximum of 32 blocks per peer. This results in an insignificant slowdown
for a single fast peer while still being small enough that a single slow peer can't slow us down too much.

The parameter used is:

maxInFlightBlocksPerPeer = 32
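
As a rough illustration of the batching, here is a hedged sketch that splits block hashes into getdata messages of at most 32 entries each. The helper name is hypothetical; the PR routes requests through query.WorkManager rather than building getdata messages directly like this:

```go
package blockfetch

import (
	"github.com/btcsuite/btcd/chaincfg/chainhash"
	"github.com/btcsuite/btcd/wire"
)

const maxInFlightBlocksPerPeer = 32

// batchGetData splits the given block hashes into getdata messages that each
// request at most maxInFlightBlocksPerPeer blocks, so that no single peer is
// asked for too large a slice of the window at once.
func batchGetData(hashes []chainhash.Hash) []*wire.MsgGetData {
	var msgs []*wire.MsgGetData
	for len(hashes) > 0 {
		n := maxInFlightBlocksPerPeer
		if len(hashes) < n {
			n = len(hashes)
		}

		msg := wire.NewMsgGetData()
		for i := 0; i < n; i++ {
			// Each entry is an inventory vector identifying a block by hash.
			// The error can only trip past 50,000 entries, far above 32.
			msg.AddInvVect(wire.NewInvVect(wire.InvTypeBlock, &hashes[i]))
		}

		msgs = append(msgs, msg)
		hashes = hashes[n:]
	}
	return msgs
}
```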

Peer selection

Currently there's no explicit peer selection logic in btcd itself. There's a PR on the neutrino side that allows
btcd to disconnect from peers when they are not able to deliver a block before the timeout of 2 seconds:
lightninglabs/neutrino#308

I've tested different types of peer selection methods such as:

1: disconnecting peers when they're timing out.
2: disconnecting peers when they're slower than a certain speed.
3: a comparative system where the peer that delivers the block slowest 2 consecutive times in a given window is disconnected.

and tried these with various tweaks as well, but the underlying problem seems to be that we're just not connecting to
fast enough peers. A rough sketch of option 1 follows below.
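
The sketch below shows what option 1 boils down to. The peer interface here is hypothetical, and the actual timeout/disconnect handling is done on the neutrino work manager side (lightninglabs/neutrino#308):

```go
package peersel

import "time"

// blockTimeout mirrors the 2-second delivery timeout mentioned above.
const blockTimeout = 2 * time.Second

// peerConn is a stand-in for whatever peer abstraction is used.
type peerConn interface {
	// QueueBlockRequest asks the peer for a block and signals on done when
	// the block arrives.
	QueueBlockRequest(hash [32]byte, done chan<- struct{})
	Disconnect()
}

// fetchOrDisconnect requests a block and drops the peer if it cannot deliver
// within blockTimeout. It returns true if the block arrived in time.
func fetchOrDisconnect(p peerConn, hash [32]byte) bool {
	done := make(chan struct{}, 1)
	p.QueueBlockRequest(hash, done)

	select {
	case <-done:
		return true
	case <-time.After(blockTimeout):
		p.Disconnect()
		return false
	}
}
```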

The peer selection logic only eliminates slow peers. If we're connected to 8 (the default number of peers) slow peers,
then IBD is slow no matter what. I thought this was a problem with btcd itself, but testing against the latest version of Bitcoin
Core showed that it's just as slow.

My conclusion

I believe the current PR that I have is the simplest way of achieving IBD with parallel block downloads.
Since it's not any slower than the latest version of Bitcoin Core, I think it's worthwhile to merge what
we have now and make changes to the peer selection logic on the connmanager side in follow-up PRs.

@coveralls

coveralls commented Aug 7, 2024

Pull Request Test Coverage Report for Build 12081011051

Details

  • 13 of 426 (3.05%) changed or added relevant lines in 6 files are covered.
  • 14 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.4%) to 56.866%

Changes Missing Coverage        Covered Lines   Changed/Added Lines   %
peer/peer.go                    2               18                    11.11%
server.go                       0               18                    0.0%
blockchain/accept.go            10              33                    30.3%
netsync/blocklogger.go          0               54                    0.0%
netsync/manager.go              0               302                   0.0%

Files with Coverage Reduction   New Missed Lines   %
peer/peer.go                    5                  73.37%
netsync/manager.go              9                  0.0%

Totals Coverage Status
Change from base Build 11941199528: -0.4%
Covered Lines: 29895
Relevant Lines: 52571

💛 - Coveralls

@kcalvinalvin
Collaborator Author

Seems like the slowdowns are coming from a single peer that's slowing down the block processing. peerWorkManager having the capability to completely disconnect a peer would help with this.

@Roasbeef
Member

> Seems like the slowdowns are coming from a single peer that's slowing down the block processing. peerWorkManager having the capability to completely disconnect a peer would help with this.

Interesting, is the issue that the single peer is assigned blocks uniformly and is always the last one to send? (Haven't looked at the PR in detail yet.) I have a tracking issue in the neutrino repo for adding things like dynamic tuning, better work assignment, and also work stealing. With work stealing, the faster peers would steal block requests from the work queue of the slow peer, so at worst we're only as slow as our fastest peer.

@kcalvinalvin kcalvinalvin force-pushed the 2024-04-01-parallel-ibd branch 2 times, most recently from c51d31a to 2cbb562 on September 6, 2024 05:51
@kcalvinalvin kcalvinalvin marked this pull request as ready for review September 9, 2024 23:41
@saubyk
Collaborator

saubyk commented Oct 3, 2024

cc: @Crypt-iQ @ProofOfKeags for review

@kcalvinalvin kcalvinalvin force-pushed the 2024-04-01-parallel-ibd branch 2 times, most recently from 8f28947 to 06209a4 on November 29, 2024 07:56
@kcalvinalvin kcalvinalvin force-pushed the 2024-04-01-parallel-ibd branch from 06209a4 to d5b78aa on December 10, 2024 04:18
candidate

Since we can use all the peers we could get for ibd, don't add peers
that are not sync candidates when we're still not current.
query.Peer is used for downloading blocks out of order during headers
first download.  Methods SubscribeRecvMsg() and OnDisconnect() are added
to abide by the interface.
ConnectedPeers returns all the currently connected peers.  This is used
to provide the query.WorkManager with all the currently connected peers
from the netsync package.
handleBlockMsg used to check that the block header is valid and then
process blocks as they came in.  It's now refactored so that it also
handles blocks that arrive out of order.  Previously, handleBlockMsg
would mark an out-of-order block as an orphan; it's now refactored to
handle those cases.

Whenever a block that's not the next from the chain tip is received,
it's now temporarily stored in memory until the next block from the
chain tip is received.  And then all the blocks that are in sequence are
processed.
peerDisconnectMsg is added so that we can access the peerStates map and
disconnect peers with just a string of their address without risking a
concurrent access of the map.
checkpointedBlocksQuery is a helper to create []*query.Request which can
be passed off to query.WorkManager to query multiple peers for
wire.Messages.  This is useful for downloading blocks out of order from
multiple peers during ibd.
peerSubscription is added to Manager, which allows subscribers to
receive peers through a channel whenever the Manager becomes aware of a
newly connected peer.  This is useful to alert query.WorkManager that a
newly connected peer is eligible to download blocks from.
ConnectedPeers returns all the currently connected peers, plus any
newly connected peer through the returned channel.  This method is
required by query.WorkManager as it needs to receive peers that it can
request blocks from.
The blocks that were requested from headers are now sent over to
query.WorkManager, where it will rank peers based on their speed and
request blocks from them accordingly.  This allows for quicker block
downloads as:
1: WorkManager will prioritize faster peers.
2: WorkManager is able to request blocks from multiple peers.
Resetting the requestedBlocks state in headersFirst is problematic since
we may be banning peers that are still good.
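
To illustrate the ConnectedPeers/peerSubscription mechanism described in the commits above, here is a hedged sketch of the general pattern. The names and channel semantics are illustrative only, not the PR's exact code or the query.WorkManager interface:

```go
package netsketch

import "sync"

// Peer is a placeholder for the peer type handed to the work manager.
type Peer struct {
	Addr string
}

// peerNotifier fans out newly connected peers to subscribers such as the
// work manager.
type peerNotifier struct {
	mtx   sync.Mutex
	peers []*Peer
	subs  []chan *Peer
}

// ConnectedPeers returns the peers connected right now plus a channel on
// which every subsequently connected peer will be delivered.
func (n *peerNotifier) ConnectedPeers() ([]*Peer, <-chan *Peer) {
	n.mtx.Lock()
	defer n.mtx.Unlock()

	current := make([]*Peer, len(n.peers))
	copy(current, n.peers)

	ch := make(chan *Peer, 8) // buffered so a slow subscriber doesn't block us
	n.subs = append(n.subs, ch)
	return current, ch
}

// onNewPeer records a freshly connected peer and notifies every subscriber.
func (n *peerNotifier) onNewPeer(p *Peer) {
	n.mtx.Lock()
	defer n.mtx.Unlock()

	n.peers = append(n.peers, p)
	for _, ch := range n.subs {
		select {
		case ch <- p:
		default: // drop the notification rather than block the caller
		}
	}
}
```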
@kcalvinalvin
Collaborator Author

Pushed all the code that I have and updated the original comment with extra information

@saubyk saubyk added this to the v0.25 milestone Feb 18, 2025