Remove provider checks at startup #5337

mangas · 2024-04-11T09:09:13Z

New chain now created with ChainIdentifier::default() on start
Adapter genesis+net_version now checked by ProviderManager
Adapter checks will be disabled temporarily after an intermittent failure
Adapter won't be returned by the ProviderManager if the ident returned is not the expected value.
BlockIngestor now takes a ChainClient and tries to get an adapter on each poll, so it no longer stops if the first poll fails
node/src/main.rs refactored so that most of the init code is automatically executed by the new Networks type.
graphman run now uses the same init code as the node, so it should be able to run against any chain
EthereumNetworks type removed
FirehoseNetworks type removed
New chains require a working adapter for creation

Future improvements:

Cache genesis once it's set
Cache chain ident once it's set
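One possible reading of the on-demand checks described above, as a minimal sketch rather than the actual ProviderManager code. The `Status`, `IdentCheck`, `Adapter` and `usable` names are hypothetical, and treating an adapter as unavailable during the backoff window is an assumption:

```rust
use std::time::{Duration, Instant};

/// Outcome of asking a provider for its chain identity (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum IdentCheck {
    /// The provider answered with this identifier.
    Reported(String),
    /// The provider could not be reached; the failure may be transient.
    Unreachable,
}

/// Validation state kept per adapter (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Unchecked,
    Valid,
    /// The provider reported an unexpected identifier; never hand it out.
    Mismatch,
    /// After a transient failure, checks are skipped until this instant.
    Backoff(Instant),
}

pub struct Adapter {
    pub name: String,
    pub status: Status,
}

/// Decide whether `adapter` may be handed to a caller. `check` is only
/// invoked when the adapter has no usable cached verdict.
pub fn usable(
    adapter: &mut Adapter,
    expected: &str,
    now: Instant,
    backoff: Duration,
    check: impl FnOnce() -> IdentCheck,
) -> bool {
    match adapter.status {
        Status::Valid => return true,
        Status::Mismatch => return false,
        // Still inside the backoff window: skip the check entirely.
        Status::Backoff(until) if now < until => return false,
        _ => {}
    }
    adapter.status = match check() {
        IdentCheck::Reported(id) if id == expected => Status::Valid,
        IdentCheck::Reported(_) => Status::Mismatch,
        IdentCheck::Unreachable => Status::Backoff(now + backoff),
    };
    adapter.status == Status::Valid
}
```

Because the verdict is stored on the adapter, a provider that is down at startup does not block the process; it is simply re-checked once the backoff has elapsed.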

zorancv · 2024-05-31T17:22:53Z

graph/src/components/adapter.rs

+    }
+
+    /// get_all will trigger the verification of the endpoints for the provided chain_id, hence the
+    /// async. If this is undesirable, check `get_all_verified` as an alternatives that does not


Spelling: get_all_unverified

zorancv · 2024-06-02T19:00:36Z

server/index-node/src/resolver.rs

@@ -593,7 +593,7 @@ impl<S: Store> IndexNodeResolver<S> {
            }
            BlockchainKind::Starknet => {
                let unvalidated_subgraph_manifest =
-                    UnvalidatedSubgraphManifest::<graph_chain_substreams::Chain>::resolve(
+                    UnvalidatedSubgraphManifest::<graph_chain_starknet::Chain>::resolve(


zorancv

Nice that you did refactoring along with functional improvements.

incrypto32 · 2024-06-03T15:35:48Z

node/src/main.rs

+                        .cleanup_ethereum_shallow_blocks(eth_network_names)
+                        .unwrap();
+                }
+                None => todo!(),


Will this ever be reached? Is unreachable!() more suitable here?

yep good catch I've updated and added an explanation
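For readers comparing the two macros: both panic when reached, but `unreachable!` states that the branch is believed impossible and can carry the reason, while `todo!` signals unfinished code. A small sketch, where the function name and the invariant in the message are hypothetical and not taken from the PR:

```rust
/// Hypothetical stand-in for the `match` in the diff above.
pub fn cleanup_target(network_names: Option<Vec<String>>) -> Vec<String> {
    match network_names {
        Some(names) => names,
        // Documents the assumed invariant instead of looking like a
        // forgotten implementation.
        None => unreachable!("network names are always set before cleanup runs"),
    }
}
```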

node/src/network_setup.rs

incrypto32 · 2024-06-06T11:34:00Z

chain/ethereum/src/network.rs

+        #[cfg(debug_assertions)]
+        call_only_adapters.iter().for_each(|a| {
+            a.is_call_only();
+        });


Any specific reason we need this here?

just debug time sanity check

chain/ethereum/src/network.rs

lutter · 2024-06-07T18:23:48Z

chain/ethereum/src/ethereum_adapter.rs

+    ) -> Result<(), Error> {
+        if call_only {
+            warn!(logger, "Call only providers not supported"; "provider" => provider);
+            return Err(anyhow!("Call only providers not supported"));


It would be nicer to put the provider into the error, too; it'll save some hunting through logs if this ever happens

This code hasn't been added in this PR, I need to check why this difference is showing, might have been part of a rebase
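A sketch of what the suggestion could look like, using only the standard library instead of `anyhow` so that it stands alone. The `CallOnlyNotSupported` type and `check_provider` function are hypothetical:

```rust
use std::fmt;

/// Hypothetical error type that records which provider was rejected.
#[derive(Debug, PartialEq)]
pub struct CallOnlyNotSupported {
    pub provider: String,
}

impl fmt::Display for CallOnlyNotSupported {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "Call only providers not supported (provider: {})",
            self.provider
        )
    }
}

impl std::error::Error for CallOnlyNotSupported {}

/// Reject call-only providers, naming the provider in the error itself so
/// that it shows up wherever the error is eventually reported.
pub fn check_provider(provider: &str, call_only: bool) -> Result<(), CallOnlyNotSupported> {
    if call_only {
        return Err(CallOnlyNotSupported {
            provider: provider.to_string(),
        });
    }
    Ok(())
}
```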

lutter · 2024-06-07T18:25:30Z

chain/ethereum/src/ingestor.rs


        // Get chain head ptr from store
        let head_block_ptr_opt = self.chain_store.cheap_clone().chain_head_ptr().await?;

        // To check if there is a new block or not, fetch only the block header since that's cheaper
        // than the full block. This is worthwhile because most of the time there won't be a new
        // block, as we expect the poll interval to be much shorter than the block time.
-        let latest_block = self.latest_block().await?;
+        let latest_block = self.latest_block(&logger, &eth_adapter).await?;


Very minor, but logger is already a &Logger, so no need to pass &logger here

lutter · 2024-06-07T18:57:02Z

chain/ethereum/src/ingestor.rs

+            .cheapest()
+            .await
+            .ok_or_else(|| graph::anyhow::anyhow!("unable to get eth adapter"))
+    }


How stable is this selection of an adapter? Meaning: if everything is working fine, will we be always using the same adapter? This is important since switching between providers can produce pretty confusing results if one of them is lagging behind another (i.e., the chain head could be jumping back and forth) or if they are on different branches.

It also feels a little weird that most of the methods now take an eth_adapter argument, but the ingestor also has a way to select one. That might be an indication that there is some abstraction missing for adapter selection.

do_poll will loop inside IIRC and will only exit if there was an error, so this will only be called on a retry. Before, if the adapter was broken it would never be replaced. Where are these methods that take an eth_adapter? They are probably called after selection; most places that do adapter selection should have a ChainClient IIRC, and then inner methods just get the simpler eth_adapter so they don't need to deal with errors. That's the general pattern; this was one of the few places where this behavior hadn't been implemented.

I meant methods like latest_block or ingest_block just preceding this.

sorry I'm not sure I follow, once the ingestor fetches the adapter it will re-use the same one until there is an error, in which case we would get a new one which goes through the selection that's already implemented everywhere else. Did I miss some part of it?
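The behavior described in this reply (keep one adapter until a poll fails, then go through selection again) can be sketched as follows. This is an illustration only; `run_ingestor`, `select` and `poll` are hypothetical names, and adapters are reduced to plain strings:

```rust
/// Run up to `max_polls` polls and return the adapter used for each one.
/// `select` stands in for adapter selection, `poll` for a single poll.
pub fn run_ingestor(
    mut select: impl FnMut() -> Option<String>,
    mut poll: impl FnMut(&str) -> Result<(), String>,
    max_polls: usize,
) -> Vec<String> {
    let mut used = Vec::new();
    let mut current: Option<String> = None;
    for _ in 0..max_polls {
        // Selection only happens when there is no adapter in hand.
        if current.is_none() {
            current = select();
        }
        let Some(adapter) = current.clone() else {
            // Nothing available right now; try again on the next iteration.
            continue;
        };
        used.push(adapter.clone());
        if poll(&adapter).is_err() {
            // Drop the adapter so the next iteration re-selects.
            current = None;
        }
    }
    used
}
```

Under this scheme the chain head can only switch providers after an error, which is the stability property the question above is about.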

lutter · 2024-06-07T19:07:57Z

chain/ethereum/src/network.rs

 impl EthereumNetworkAdapters {
-    pub fn new(retest_percent: Option<f64>) -> Self {
+    pub fn empty() -> Self {


This should be #[cfg(debug_assertions)] as it's only used in tests, and having a scarier name like empty_for_testing might be good, too, to assure the casual reader that this isn't used in production code.

Had to revert to just renaming it, because the release-mode build step on CI builds the test crate for some reason
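Some background on why the attribute and a release-mode test build do not mix: an item gated on `#[cfg(debug_assertions)]` is compiled out whenever debug assertions are off, which is the default for release builds, so tests built in release mode cannot refer to it. A sketch of the rename-only alternative, with a hypothetical `Adapters` type standing in for `EthereumNetworkAdapters`:

```rust
/// Hypothetical stand-in for the adapter collection.
pub struct Adapters {
    names: Vec<String>,
}

impl Adapters {
    /// Intended for tests only. The name carries the warning because a
    /// `debug_assertions` gate would remove this constructor from
    /// release-mode builds, including release-mode test builds.
    pub fn empty_for_testing() -> Self {
        Adapters { names: Vec::new() }
    }

    pub fn len(&self) -> usize {
        self.names.len()
    }
}
```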

lutter · 2024-06-07T19:16:45Z

chain/ethereum/src/network.rs

+            .unwrap_or_default();
+
+        Self::available_with_capabilities(all, required_capabilities)
+    }


It seems to me that these two methods and available_with_capabilities should be methods on the ProviderManager.

What should those methods do for firehose or substreams?

lutter · 2024-06-07T21:22:50Z

chain/ethereum/src/network.rs

+            .manager
+            .get_all(&self.chain_id)
+            .await
+            .unwrap_or_default();


Is the intent behind the .unwrap_or_default() that we'll return an empty iterator? It might be better if this returned Result<impl Iterator<Item = &EthereumNetworkAdapter> + '_, SomeError> to force the caller to handle that condition. If not, the method should at least have a comment on the circumstances under which this returns an empty iterator.

added a comment
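To make the trade-off concrete, here is a sketch of both shapes side by side. `Manager`, `LookupError`, `all_lenient` and `all_strict` are hypothetical and much simpler than the real types; adapters are plain strings:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum LookupError {
    UnknownChain(String),
}

/// Hypothetical stand-in for the provider manager.
pub struct Manager {
    chains: HashMap<String, Vec<String>>,
}

impl Manager {
    pub fn new(chains: HashMap<String, Vec<String>>) -> Self {
        Manager { chains }
    }

    pub fn get_all(&self, chain_id: &str) -> Result<&[String], LookupError> {
        self.chains
            .get(chain_id)
            .map(|adapters| adapters.as_slice())
            .ok_or_else(|| LookupError::UnknownChain(chain_id.to_string()))
    }
}

/// Lenient shape: an error is indistinguishable from "no adapters".
pub fn all_lenient<'a>(m: &'a Manager, chain: &str) -> impl Iterator<Item = &'a String> + 'a {
    m.get_all(chain).unwrap_or_default().iter()
}

/// Strict shape: the caller has to decide what an error means.
pub fn all_strict<'a>(
    m: &'a Manager,
    chain: &str,
) -> Result<impl Iterator<Item = &'a String> + 'a, LookupError> {
    Ok(m.get_all(chain)?.iter())
}
```

With the lenient shape, a lookup failure and a chain that truly has no adapters both yield an empty iterator, which is why a comment on the empty case is the minimum.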

lutter · 2024-06-07T21:23:05Z

chain/ethereum/src/network.rs

+        let all = self
+            .manager
+            .get_all_unverified(&self.chain_id)
+            .unwrap_or_default();


Same comment on the unwrap_or_default as above

lutter · 2024-06-07T21:38:50Z

graph/src/firehose/endpoints.rs

+            }
+            _ => AvailableCapacity::High,
+        }
+    }


Why does this discretize the SubgraphLimit? I would have thought you want to select the provider with the biggest available capacity and/or the lowest usage. But it seems this method first classifies providers into 3 groups, and presumably something then picks one from one of the groups.

This code wasn't modified by this, it was just moved, probably because of a rebase and I don't really want to open the scope of the review to that code since it's not part of the PR

lutter · 2024-06-07T21:42:44Z

graph/src/firehose/endpoints.rs

+            .map_or(Ok(None), |key| {
+                key.parse::<MetadataValue<Ascii>>().map(Some)
+            })
+            .expect("Firehose key is invalid");


All these expect and panics will cause a configuration error to stop graph-node from starting, and AFAICT, we don't even have tooling to check that part of the configuration. So a typo like url = "htps://some.where.com/" will make it impossible to start graph-node. These issues should result in an error, and that part of the config should be ignored, but the rest of graph-node should start up successfully.

same as the previous comment, this wasn't changed, this code has been reviewed elsewhere and hasn't been modified

I didn't realize that this code was moved over. I went through the PR to look at places where it can produce a panic, as we've had several instances now where panics can cause huge operational issues and we need to avoid them as much as possible.
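A sketch of the non-panicking alternative asked for here: validate each provider entry, collect the failures as errors, and keep going with the rest. The names are hypothetical and the URL check is deliberately simplistic:

```rust
#[derive(Debug, PartialEq)]
pub struct Provider {
    pub name: String,
    pub url: String,
}

/// Hypothetical validation: only accept http(s) URLs.
pub fn parse_provider(name: &str, url: &str) -> Result<Provider, String> {
    if url.starts_with("http://") || url.starts_with("https://") {
        Ok(Provider {
            name: name.to_string(),
            url: url.to_string(),
        })
    } else {
        Err(format!("provider {name}: invalid url {url:?}"))
    }
}

/// Keep the valid providers and report the broken ones instead of
/// aborting on the first bad entry.
pub fn load_providers(raw: &[(&str, &str)]) -> (Vec<Provider>, Vec<String>) {
    let mut providers = Vec::new();
    let mut errors = Vec::new();
    for (name, url) in raw {
        match parse_provider(name, url) {
            Ok(p) => providers.push(p),
            Err(e) => errors.push(e),
        }
    }
    (providers, errors)
}
```

The caller would log the collected errors and start up with whatever providers remain.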

lutter · 2024-06-07T21:47:01Z

node/src/chain.rs

+                (k, BlockchainKind::Substreams) => k,
+                (k, _) => k,
+            })
+            .expect("each chain should have at least 1 adapter");


It would be better to log an error and ignore that chain instead of aborting startup of the entire process.

If this is a valid state of configuration where an empty section can be provided then sure, I would also like to know what we should do with that setup.

Do you have a suggestion for the error message?

If reading the config file validates that there is at least one provider, and otherwise logs an error and removes the chain from the config, it would be ok to leave the expect there with a message like "validation should have checked we have at least one provider". Otherwise, the error should be something like "Chain {name} does not have any providers configured. Ignoring it".
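The validation step proposed here could look roughly like the sketch below. The function name and the map-of-strings configuration shape are assumptions; only the warning text comes from the comment above:

```rust
use std::collections::BTreeMap;

/// Remove chains without providers and return one warning per removed
/// chain, so that later code can rely on "at least one provider".
pub fn drop_chains_without_providers(
    chains: BTreeMap<String, Vec<String>>,
) -> (BTreeMap<String, Vec<String>>, Vec<String>) {
    let mut warnings = Vec::new();
    let kept: BTreeMap<String, Vec<String>> = chains
        .into_iter()
        .filter(|(name, providers)| {
            if providers.is_empty() {
                warnings.push(format!(
                    "Chain {name} does not have any providers configured. Ignoring it"
                ));
                false
            } else {
                true
            }
        })
        .collect();
    (kept, warnings)
}
```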

lutter · 2024-06-07T21:54:27Z

> I agree we can have a similar approach for those and just remove them from the specific implementations. I don't think I will try to tackle that on this PR but happy to chase that separately 😛

When do you think that will happen? We also still have this issue outstanding. So I am a bit hesitant to approve this also on the promise of a future improvement.

mangas · 2024-06-08T10:53:45Z

> I agree we can have a similar approach for those and just remove them from the specific implementations. I don't think I will try to tackle that on this PR but happy to chase that separately 😛

> When do you think that will happen? We also still have this issue outstanding. So I am a bit hesitant to approve this also on the promise of a future improvement.

The way I see it, we can either learn to live with these "out of scope" improvements in a way that will be addressed somewhere in the future, since they are, after all, out of the scope of this particular PR, or we need to essentially accept that whenever we write any PR we will need to fix the world in one go. I think it's already a really bad practice to have a PR as large as this as the standard and would prefer to avoid it when possible.

I also don't think whoever works on a specific piece should be "owning" any future improvements since most of those require time that someone needs to spend but anyone can really spend that time.

In this case if you are unwilling to accept the compromise then I guess I would like to understand what you are willing to accept from this PR so that I can get the needed functionality merged in.

lutter · 2024-06-10T17:46:35Z

> I agree we can have a similar approach for those and just remove them from the specific implementations. I don't think I will try to tackle that on this PR but happy to chase that separately 😛

> When do you think that will happen? We also still have this issue outstanding. So I am a bit hesitant to approve this also on the promise of a future improvement.

> The way I see it, we can either learn to live with these "out of scope" improvements in a way that will be addressed somewhere in the future, since they are, after all, out of the scope of this particular PR, or we need to essentially accept that whenever we write any PR we will need to fix the world in one go. I think it's already a really bad practice to have a PR as large as this as the standard and would prefer to avoid it when possible.

I agree that such a large PR is bad practice, and it would have been better to split it into multiple PRs focused on specific aspects (like 'restructure code, no functional change', 'Delay provider checking'), especially when the individual commits are not really reviewable by themselves and the commit messages give no hint at what the commit tries to achieve (a message of 'fix tests' does not give me confidence that the implementation followed a plan).

Fixing 'out of scope' improvements is part and parcel of working with a large, mature codebase; there will always be things we understand better now than we did when the code was first written. The work for aggregations for example was 70-80% just refactoring how we deal with the subgraph schema to bring some sanity to that code; the actual functionality was a relatively small amount of the overall work.

> I also don't think whoever works on a specific piece should be "owning" any future improvements since most of those require time that someone needs to spend but anyone can really spend that time.

I think this is a little facile, and sounds a lot like "I don't want to deal with it, somebody else should do it", especially in light of PR 4916 which copy-pasted some very intricate code and which I approved with the understanding that you would fix that - it's been 6 months.

The attitude of "let's commit this and we'll fix issues with it later (never)" is what makes the code around ingestion from providers so complex: the code wasn't in the greatest shape when we had just RPC, we then shoehorned Firehose and substreams on top of it, and now how any of that works is incredibly hard to follow from the code as it is scattered across a million places. This code needs an owner who has a keen interest in improving it - that requires rethinking how we do that, writing up a plan, and actually doing it. Without somebody who understands this code at a deep level, fixing things in isolation will only make matters worse.

> In this case if you are unwilling to accept the compromise then I guess I would like to understand what you are willing to accept from this PR so that I can get the needed functionality merged in.

At a minimum, please file tickets for the various issues I pointed out (like crashing on misconfiguration). We also need to figure out who will own the whole front side of the house so that we can make progress on improving this code.

lutter

Approving this with the understanding that we will schedule a more thorough revamp of the whole ingestion side of graph-node soon.

mangas · 2024-06-18T13:27:36Z

> Approving this with the understanding that we will schedule a more thorough revamp of the whole ingestion side of graph-node soon.

Turned both cards into issues so I can link them here:

#5499
#5500

They are, however, still not prioritised

#3937

graphprotocol#3937

mangas marked this pull request as draft April 11, 2024 09:09

mangas changed the title ~~Remove provider checks at startup~~ Remove provider checks at startup [WIP] Apr 11, 2024

mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 4d95fd8 to 64963c5 Compare April 15, 2024 16:20

mangas force-pushed the filipe/remove-start-provider-checks branch 2 times, most recently from b486f8a to 1c43770 Compare April 22, 2024 13:23

mangas force-pushed the filipe/remove-start-provider-checks branch 8 times, most recently from 9ad4249 to d6c76f8 Compare May 13, 2024 10:41

mangas marked this pull request as ready for review May 13, 2024 15:11

mangas force-pushed the filipe/remove-start-provider-checks branch from f15c7dd to 002def6 Compare May 13, 2024 15:20

mangas changed the title ~~Remove provider checks at startup [WIP]~~ Remove provider checks at startup May 13, 2024

mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from cdfca0f to c4886a3 Compare May 14, 2024 09:32

mangas requested a review from lutter May 14, 2024 09:35

mangas force-pushed the filipe/remove-start-provider-checks branch from c4886a3 to c9784a8 Compare May 14, 2024 15:11

fordN requested review from incrypto32 and zorancv and removed request for lutter May 14, 2024 15:56

mangas force-pushed the filipe/remove-start-provider-checks branch from c828c7d to d3a0e68 Compare May 16, 2024 10:30

fordN requested review from lutter and removed request for incrypto32 May 16, 2024 15:41

mangas force-pushed the filipe/remove-start-provider-checks branch from 808fd97 to ec0306a Compare May 17, 2024 14:38

zorancv reviewed May 31, 2024

zorancv reviewed Jun 2, 2024

zorancv approved these changes Jun 3, 2024

incrypto32 reviewed Jun 3, 2024

mangas force-pushed the filipe/remove-start-provider-checks branch from 380de39 to 738af4b Compare June 6, 2024 08:13

incrypto32 approved these changes Jun 6, 2024

lutter requested changes Jun 7, 2024

mangas force-pushed the filipe/remove-start-provider-checks branch from 9549e19 to a568ed5 Compare June 10, 2024 10:27

mangas requested a review from lutter June 10, 2024 10:59

mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 06523cf to 7f9388f Compare June 13, 2024 08:25

lutter approved these changes Jun 13, 2024

mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from 492c073 to 6428635 Compare June 18, 2024 13:23

mangas force-pushed the filipe/remove-start-provider-checks branch 3 times, most recently from b9b59ec to b2c51cf Compare June 19, 2024 10:13

Remove provider checks at startup

0d321df

#3937

mangas force-pushed the filipe/remove-start-provider-checks branch from b2c51cf to 0d321df Compare June 19, 2024 11:50

mangas merged commit 9b9b8f9 into master Jun 19, 2024
7 checks passed

mangas deleted the filipe/remove-start-provider-checks branch June 19, 2024 12:01

azf20 mentioned this pull request Jun 20, 2024

[Bug] graph-node stops retrying when block ingestion stops due to RPC issues #5313

Open

3 tasks

YaroShkvorets mentioned this pull request Jul 17, 2024

Fix genesis block fetching for substreams #5548

Merged

codebuster22 pushed a commit to chain-labs/ag-graph-node that referenced this pull request Jul 23, 2024

Remove provider checks at startup (graphprotocol#5337)

a1e34a3

graphprotocol#3937
