
Excessive memory consumption when syncing a long way up to the canonical head #3207


Open · Tracked by #3222
mjfh opened this issue Apr 10, 2025 · 22 comments
Labels: EL, Sync (Prevents or affects sync with Ethereum network)

Comments

@mjfh
Contributor

mjfh commented Apr 10, 2025

Since PR #3191, the Nimbus EL has an annoying memory problem in the FC module: the syncer no longer updates base while importing blocks. This happens at least when the syncer has to catch up over a long distance.

Previously, there was a kludge related to the syncer which used the forkChoice() function to update base.

Now base can only be updated when the CL triggers a forkChoiceUpdated, which has no effect if the update is out of scope for the FC module -- which in turn happens when syncing from an old or pristine database state.

In fact, this leads to a situation similar to the one when mainnet was unable to finalise transactions globally.
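
For illustration, a minimal sketch (in Nim) of the gating just described -- all type and proc names here are made up for the sketch, not the actual FC module API:

```nim
import std/tables

type
  ForkChoiceRef = ref object
    base: uint64                        # oldest block number FC keeps in memory
    hashToNumber: Table[string, uint64] # headers currently within FC scope

proc forkChoiceUpdated(fc: ForkChoiceRef; finalizedHash: string) =
  # base can only advance if the finalized hash resolves to a header
  # inside the FC module's scope; while syncing from an old or pristine
  # database the hash is unknown here, so base never moves and imported
  # blocks accumulate in memory.
  if finalizedHash in fc.hashToNumber:
    let number = fc.hashToNumber[finalizedHash]
    if number > fc.base:
      fc.base = number                  # memory below base can be released
```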

For the attached screenshot, I ran the syncer overnight (with the CL turned off) and saw the following memory usage in the morning:

  • 78.9GiB virtual (from metrics screen)
  • 41.4GiB physical (from metrics screen)
  • 22GiB extra swap space freed after stopping the process

It seems a big machine can handle the situation to an extent, but execution throughput decreases.

[Screenshot: memory usage metrics]

@mjfh added the Sync and EL labels Apr 10, 2025
@KolbyML

KolbyML commented Apr 10, 2025

@mjfh is #3202 the PR you meant to link?

@mjfh
Contributor Author

mjfh commented Apr 10, 2025

Oops, that was the wrong one -- lol
Thanks for noticing

@mjfh
Contributor Author

mjfh commented Apr 10, 2025

It was somehow related to issue #3202 :)

@jangko
Contributor

jangko commented Apr 22, 2025

fixed by #3204

@jangko jangko closed this as completed Apr 22, 2025
@jangko jangko reopened this Apr 22, 2025
@jangko
Contributor

jangko commented Apr 22, 2025

Looks like the problem is not thoroughly cured. Needs more investigation.

@tersec mentioned this issue Apr 22, 2025
@advaita-saha
Contributor

> Looks like the problem is not thoroughly cured. Needs more investigation.

Has the memory usage improved compared to before, or is it exactly the same as before?

@jangko
Contributor

jangko commented Apr 24, 2025

When syncing with hoodi from an empty database, initially everything looks OK and the base can move forward (with #3237 applied).

But when the gap is wider, the base stops moving. I'm not sure why the CL suddenly requests sync from a head < 10K, then jumps to beyond 200K.

The problem is that the syncer downloads forward from the known FC base (even though each segment request is in reverse), while FC expects the syncer to download backward from head.

Of course, the FC expectation is not satisfied by the syncer, because the finalized hash (pendingFCU) is not resolved into latestFinalizedNumber.
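
Roughly, the mismatch looks like this self-contained toy (every name and number here is illustrative):

```nim
import std/options

var
  base = 1000'u64                        # known FC base
  target = 200000'u64                    # session target
  latestFinalizedNumber = none(uint64)   # pendingFCU never resolves

proc importBlock(n: uint64) = discard    # stand-in for a body import

var n = base + 1
while n <= target:
  importBlock(n)                         # syncer moves forward from base...
  # ...but base only moves once the finalized hash resolves to a number,
  # which never happens here, so everything imported stays in memory:
  if latestFinalizedNumber.isSome and latestFinalizedNumber.get > base:
    base = latestFinalizedNumber.get
  inc n
```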

IIRC from the Discord discussion, we agreed the syncer has two phases:

  • Download headers from head down to the known base and put them into a cache. Probably also start another new session if the CL requests a new target.
  • Then download the blocks (bodies) forward and import them into FC.

That is how I assumed the syncer works, but it looks like that is not the case (see the sketch below).
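
A rough sketch of that assumed two-phase flow -- placeholder names, not the actual syncer API:

```nim
type Header = object
  number: int

proc fetchHeader(n: int): Header = Header(number: n)  # stub
proc cacheHeader(h: Header) = discard                 # stub: header cache
proc importBody(n: int) = discard                     # stub: FC import

proc runSession(base, target: int) =
  # Phase 1: headers backward, from target down to base+1, into the cache.
  for n in countdown(target, base + 1):
    cacheHeader(fetchHeader(n))
  # Phase 2: bodies forward, importing into FC as we go.
  for n in (base + 1) .. target:
    importBody(n)
```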

@mjfh
Contributor Author

mjfh commented Apr 24, 2025

> When syncing with hoodi from an empty database, initially everything looks OK and the base can move forward (with #3237 applied).

I observed the same in general, although there was an outlier on hoodi when the CL was not fully in sync.

@mjfh
Contributor Author

mjfh commented Apr 24, 2025

> [..]
> IIRC from the Discord discussion, we agreed the syncer has two phases:
>
> * Download headers from head down to the known base and put them into a cache. Probably also start another new session if the CL requests a new target.
>
> * Then download the blocks (bodies) forward and import them into FC.
>
> That is how I assumed the syncer works, but it looks like that is not the case.

That is exactly how it works apart from the fact that the CL cannot start a new syncer session while the current one is running.

@jangko
Contributor

jangko commented Apr 25, 2025

Sync session 1
base=5324 head=5535 target=8539
download headers 8539..5536
resolved fin = 8468
download bodies 5536..8539

Sync session 2
base=8320 head=8539 target=9227
download headers 9227..8540
resolved fin = 9141
download bodies 8540..9227

Sync session 3
base=8988 head=9227 target=9531
download headers 9531..9228
resolved fin = 9437
download bodies 9228..9531

Sync session 4
base=9292 head=9531 target=259894
download headers 259894..9532
resolved fin = 9894    # <------------- ????????
download bodies 9532..... way past resolved fin; base is not moving anymore during this session's lifetime

EL=nimbus
CL=nimbus

Both FC and the syncer expect the CL to give a finalized hash near the target, not near the head.

The above sync sessions happened while syncing with hoodi. The question is: why does the CL send a finalized hash far from the target? Given this, the syncer cannot just ignore the finalized block if the CL behaves like this.

@tersec
Contributor

tersec commented Apr 25, 2025

Do you have the actual fCUs the CL is sending?

@mjfh
Contributor Author

mjfh commented Apr 25, 2025

The body download starts with a block number whose header has a parent on the FC module -- no finalised header is involved here. In practice, this first block number is often the largest such number (unless some RPCs squeezed in).

This state (that the collected chain has a parent on the FC module) is signalled by the header cache module.

My take was that the syncer should (and does) neither know nor care about the finalised hash and its block header resolution.
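
As a hedged illustration of that start condition (hypothetical names, not the header cache module's API):

```nim
import std/options

type Header = object
  number: int
  parentHash: string

proc firstBodyNumber(cached: seq[Header];
                     fcHasHeader: proc(h: string): bool): Option[int] =
  # Body download starts at a block whose header has a parent on the FC
  # module; in practice the largest such number. No finalised hash involved.
  var best = none(int)
  for h in cached:
    if fcHasHeader(h.parentHash) and (best.isNone or h.number > best.get):
      best = some(h.number)
  best
```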

@tersec
Contributor

tersec commented Apr 25, 2025

To add: in general, the CLs will send fCUs corresponding to whatever they think the current (head, safe, finalized) EL blocks are. They don't, per se, have a notion of "target".

@mjfh
Contributor Author

mjfh commented Apr 25, 2025

> To add: in general, the CLs will send fCUs corresponding to whatever they think the current (head, safe, finalized) EL blocks are. They don't, per se, have a notion of "target".

The name target is used in syncer logging to tell a sort of comprehensive story. It is the local target the syncer attempts to reach.

@tersec
Contributor

tersec commented Apr 25, 2025

Yeah, I understand. But in general, in a well-functioning network, the (head, safe, finalized) epochs in the fCU are usually (not always) (n, n-1, n-2).

Is that being seen here?

@jangko
Contributor

jangko commented Apr 27, 2025

Here is what is being seen:

H=Head, B=Base, F=Finalized

A few early/short sessions:
B......F.H # F is near H

Then the CL will start a very long session:
B..F..............................H # F is near B

During this long session, the CL gradually updates F forward in random steps.
The steps are small, for example:
B=50K, H=270K, F=52K, steps: 27....54

Then, around F=77K, the CL stops updating F.

If the CL kept updating F, we could formulate a strategy. But because it stops updating, the excessive memory consumption will always recur.

I don't know how other CLs behave.

@tersec
Contributor

tersec commented Apr 27, 2025

If you look at the CL logs (e.g., look at the nimbus-eth2 Slot start logs to compare the head and finalized epochs), is F lagging H there too?

@jangko
Contributor

jangko commented Apr 28, 2025

> is F lagging H there too?

INF 2025-04-28 07:32:24.047+07:00 Slot start topics="beacnde" sync="15h01m (25.97%) 4.0891slots/s (DDPQDQDDDP:77631)/opt - lc: e81c4219:298910" finalized=2424:3abe601d delay=56ms444us575ns slot=298912 epoch=9341 peers=16 head=87c988c0:77642

Looks like the finalized hash the CL sends depends on the progress of the CL sync: CL-F epoch=2424, CL-H epoch=9341, progress=25.97%.

If the CL sees the EL already synced past its own progress, it will stop sending new H and F.

I want to propose changes to the EL syncer.
Instead of downloading headers interleaved with block bodies, we separate the syncer into two parts (a rough sketch follows the list):

  1. Syncer-H: responsible for downloading block headers backward; it can start a new session without waiting for block bodies.
  2. Syncer-B: responsible for downloading block bodies forward after F is resolved. This syncer will download until F, then stop. If a new F is resolved, it will resume downloading until this new F. Repeat this while the distance between H and F > D.
  3. If the distance is small enough, download bodies until H.
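
A hedged sketch of the proposed gating -- the names and the exact rule are my reading of the proposal above, not an implementation:

```nim
import std/options

const D = 8192                      # open question below: constant or dynamic?

proc importBody(n: int) = discard   # stub: body download + FC import

proc syncerB(base: var int; head: int; finalized: Option[int]) =
  # No finality yet: download nothing, keep waiting for a resolved F.
  if finalized.isNone:
    return
  # Download forward only up to F; if the remaining distance to head is
  # small enough, go all the way to head (step 3 above).
  let stopAt = if head - finalized.get <= D: head else: finalized.get
  for n in (base + 1) .. stopAt:
    importBody(n)
  base = stopAt
```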

The reason for this complication is to keep the CL sending new F without the EL progressing too far beyond the CL's progress percentage.

But there is one problem: should D be calculated dynamically, or should it be a constant?

  • If calculated, based on what?
  • If it is a constant, what is the value?
  • Or can we remove this completely, and the CL will still think we are in sync?

Note:

  • There are no changes to what the syncer should know. It merely does what the CL tells it to do: download blocks.
  • But how F is resolved involves both FC and the HeaderChainCache (HCC). The header chain stored in the database can be modified slightly to also store a hash-to-number mapping every time a new header is stored (see the sketch below).
  • Should we also integrate FC with HCC, so that the one responsible for resolving F is still FC?
  • FC is still the one who decides when to move the base forward.
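
And a tiny sketch of the hash-to-number note above (hypothetical structure, not the actual HCC schema):

```nim
import std/[options, tables]

type HeaderChainCache = object
  hashToNumber: Table[string, int]  # written alongside each stored header

proc storeHeader(hcc: var HeaderChainCache; hash: string; number: int) =
  # ...persist the header itself, then record hash -> number so a
  # finalized hash from the CL can be resolved without a chain scan.
  hcc.hashToNumber[hash] = number

proc resolveFinalized(hcc: HeaderChainCache; hash: string): Option[int] =
  # Resolving F is then a single lookup.
  if hash in hcc.hashToNumber:
    some(hcc.hashToNumber[hash])
  else:
    none(int)
```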

@arnetheduck
Member

arnetheduck commented Apr 28, 2025

> finalized=2424:3abe601d

this is the epoch number, i.e. 2424*32 = slot 77568, and the head in this log is at 77642 -- there is no (significant) gap.

the epoch=9341 in the log is the "wall clock", while head is how far the CL has synced.

@arnetheduck
Member

> then stop

this is not where the issue lies, generally -- i.e. something else is preventing finalized from being updated. There's no reason for the CL to "hold back" finalized updates. But more broadly, the proposed algorithm wouldn't work when the chain is not finalizing: without finality, the gap between H and F is expected to grow (and we'll solve that by keeping block bodies on disk also for non-final blocks).

@tersec
Contributor

tersec commented Apr 28, 2025

It's because of the LC -- the LC gets head but (correctly) doesn't update finalized. This isn't a bug, it's by design. The EL sync should handle it properly.

@jangko
Contributor

jangko commented Apr 28, 2025

> the epoch=9341 in the log is the "wall clock", while head is how far the CL has synced.

That is where the problem is. The CL sends an FCU to the EL with:

  • H: a block hash from epoch 9341, far from both the CL's "synced head" and F.
  • F: a block hash from epoch 2424, near the CL's "synced head".

And this creates a huge gap in the EL. The EL knows nothing about the CL's "synced head".

The algorithm will work. If there is no finality, the "B-Syncer" will do nothing; it will keep waiting for a valid F from the CL.
