-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove provider checks at startup #5337
Conversation
4d95fd8
to
64963c5
Compare
b486f8a
to
1c43770
Compare
9ad4249
to
d6c76f8
Compare
f15c7dd
to
002def6
Compare
cdfca0f
to
c4886a3
Compare
c4886a3
to
c9784a8
Compare
c828c7d
to
d3a0e68
Compare
808fd97
to
ec0306a
Compare
graph/src/components/adapter.rs
Outdated
} | ||
|
||
/// get_all will trigger the verification of the endpoints for the provided chain_id, hence the | ||
/// async. If this is undesirable, check `get_all_verified` as an alternatives that does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling: get_all_unverified
@@ -593,7 +593,7 @@ impl<S: Store> IndexNodeResolver<S> { | |||
} | |||
BlockchainKind::Starknet => { | |||
let unvalidated_subgraph_manifest = | |||
UnvalidatedSubgraphManifest::<graph_chain_substreams::Chain>::resolve( | |||
UnvalidatedSubgraphManifest::<graph_chain_starknet::Chain>::resolve( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice that you did refactoring along with functional improvements.
node/src/main.rs
Outdated
.cleanup_ethereum_shallow_blocks(eth_network_names) | ||
.unwrap(); | ||
} | ||
None => todo!(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this ever be reached? Is unreachable!() more suitable here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep good catch I've updated and added an explanation
380de39
to
738af4b
Compare
#[cfg(debug_assertions)] | ||
call_only_adapters.iter().for_each(|a| { | ||
a.is_call_only(); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any specific reason we need this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just debug time sanity check
) -> Result<(), Error> { | ||
if call_only { | ||
warn!(logger, "Call only providers not supported"; "provider" => provider); | ||
return Err(anyhow!("Call only providers not supported")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nicer to put the provider
into the error, too; it'll save some hunting through logs if this ever happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code hasn't been added in this PR, I need to check why this difference is showing, might have been part of a rebase
chain/ethereum/src/ingestor.rs
Outdated
|
||
// Get chain head ptr from store | ||
let head_block_ptr_opt = self.chain_store.cheap_clone().chain_head_ptr().await?; | ||
|
||
// To check if there is a new block or not, fetch only the block header since that's cheaper | ||
// than the full block. This is worthwhile because most of the time there won't be a new | ||
// block, as we expect the poll interval to be much shorter than the block time. | ||
let latest_block = self.latest_block().await?; | ||
let latest_block = self.latest_block(&logger, ð_adapter).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor, but logger
is already a &Logger
, so no need to pass &logger
here
.cheapest() | ||
.await | ||
.ok_or_else(|| graph::anyhow::anyhow!("unable to get eth adapter")) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How stable is this selection of an adapter? Meaning: if everything is working fine, will we be always using the same adapter? This is important since switching between providers can produce pretty confusing results if one of them is lagging behind another (i.e., the chain head could be jumping back and forth) or if they are on different branches.
It also feels a little weird that most of the methods now take an eth_adapter
argument, but the ingestor also has a way to select one. That might be an indication that there is some abstraction missing for adapter selection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do_pool will loop inside IIRC and will only exit if there was an error, So if this will only be called on a retry. Before, if the adapter was broken it would never be replaced. Where are these methods taken an eth_adapter? They are probably called after selection, most place that do adapter selection should have a ChainClient IIRC and then inner methods just get the simpler eth_adapter so they don't need to deal with error. That's the general pattern, this was one of the few places where this behavior hadn't been implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant methods like latest_block
or ingest_block
just preceding this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I'm not sure I follow, once the ingestor fetches the adapter it will re-use the same one until there is an error, in which case we would get a new one which goes through the selection that's already implemented everywhere else. Did I miss some part of it?
chain/ethereum/src/network.rs
Outdated
impl EthereumNetworkAdapters { | ||
pub fn new(retest_percent: Option<f64>) -> Self { | ||
pub fn empty() -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be #[cfg(debug_assertions)]
as it's only used in tests, and having a scarier name like empty_for_testing
might be good, too, to assure the casual reader that this isn't used in production code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.unwrap_or_default(); | ||
|
||
Self::available_with_capabilities(all, required_capabilities) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that these two methods and available_with_capabilities
should be methods on the ProviderManager
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should those methods do for firehose or substreams?
.manager | ||
.get_all(&self.chain_id) | ||
.await | ||
.unwrap_or_default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the intent behind the .unwrap_or_default()
that we'll return an empty iterator? It might be better if this returned Result<impl Iterator<Item = &EthereumNetworkAdapter> + '_, SomeError>
to force the caller to handle that condition. If not, the method should at least have a comment on the circumstances under which this returns an empty iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment
let all = self | ||
.manager | ||
.get_all_unverified(&self.chain_id) | ||
.unwrap_or_default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment on the unwrap_or_default
as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
} | ||
_ => AvailableCapacity::High, | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this discretize the SubgraphLimit
? I would have thought you want to select the provider with the biggest available capacity and/or the lowest usage. But it seems this message first classifies providers into 3 groups, and presumably something than picks one from one of the groups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code wasn't modified by this, it was just moved, probably because of a rebase and I don't really want to open the scope of the review to that code since it's not part of the PR
.map_or(Ok(None), |key| { | ||
key.parse::<MetadataValue<Ascii>>().map(Some) | ||
}) | ||
.expect("Firehose key is invalid"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these expect and panics will cause a configuration error to stop graph-node
from starting, and AFAICT, we don't even have tooling to check that part of the configuration. So a typo like url = "htps://some.where.com/"
will make it impossible to start graph-node
. These issues should result in an error, and that part of the config should be ignored, but the rest of graph-node
should start up successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as the previous comment, this wasn't changed, this code has been reviewed elsewhere and hasn't been modified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't realize that this code was moved over. I went through the PR to look at places where it can produce a panic, as we've had several instances now where panics can cause huge operational issues and we need to avoid them as much as possible.
node/src/chain.rs
Outdated
(k, BlockchainKind::Substreams) => k, | ||
(k, _) => k, | ||
}) | ||
.expect("each chain should have at least 1 adapter"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to log an error and ignore that chain instead of aborting startup of the entire process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a valid state of configuration where an empty section can be provided then sure, I would also like to know what we should do with that setup.
Do you have a suggestion for the error message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If reading the config file validates that there is at least one provider, logs an error and removes the chain from the config otherwise, it would be ok to leave the expect
there to say something like validation should have checked we have at least one provider
. Otherwise, the error should be something like Chain {name} does not have any providers configured. Ignoring it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
When do you think that will happen? We also still have this issue outstanding. So I am a bit hesitant to approve this also on the promise of a future improvement. |
The way I see it, we can either learn to live with these "out of scope" improvements in a way that will be addressed somewhere in the future, since they are, after all, out of the scope of this particular PR or we need to essentially accept that whenever we write any PR we will need to fix the world in one go. I think it's already a really bad practice to have a PR large as this as the standard and would prefer to avoid it when possible. I also don't think whoever works on a specific piece should be "owning" any future improvements since most of those require time that someone needs to spend but anyone can really spend that time. In this case if you are unwilling to accept the compromise then I guess I would like to understand what you are willing to accept from this PR so that I can get the needed functionality merged in. |
9549e19
to
a568ed5
Compare
I agree that such a large PR is bad practice, and it would have been better to split it into multiple PRs focused on specific aspects (like 'restructure code, no functional change', 'Delay provider checking') especially when the individual commits are not really reviewable by themselves and the commit messages give no hint at what the commit tries to achieve (a message of 'fix tests' does not give me confidence that the implementation followed a plan) Fixing 'out of scope' improvements is part and parcel of working with a large, mature codebase; there will always be things we understand better now than we did when the code was first written. The work for aggregations for example was 70-80% just refactoring how we deal with the subgraph schema to bring some sanity to that code; the actual functionality was a relatively small amount of the overall work.
I think this is a little facile, and sounds a lot like "I don't want to deal with it, somebody else should do it", especially in light of PR 4916 which copy-pasted some very intricate code and which I approved with the understanding that you would fix that - it's been 6 months. The attitude of "let's commit this and we'll fix issues with it later (never)" is what makes the code around ingestion from providers so complex: the code wasn't in the greatest shape when we had just RPC, we then shoehorned Firehose and substreams on top of it, and now how any of that works is incredibly hard to follow from the code as it is scattered across a million places. This code needs an owner who has a keen interest in improving it - that requires rethinking how we do that, writing up a plan, and actually doing it. Without somebody who understands this code at a deep level, fixing things in isolation will only make matters worse.
At a minimum, please file tickets for the various issues I pointed out (like crashing on misconfiguration) We also need to figure out who will own the whole front side of the house so that we can make progress on improving this code. |
06523cf
to
7f9388f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving this with the understanding that we will schedule a more thorough revamp of the whole ingestion side of graph-node
soon.
492c073
to
6428635
Compare
b9b59ec
to
b2c51cf
Compare
b2c51cf
to
0d321df
Compare
Fixes #3937
ChainIdentifier::default()
on startProviderManager
ProviderManager
if the ident returned is not the expected value.ChainClient
and tries to get an adapter with each poll, this means it won't stop if the first poll failsNetworks
type.Future improvements: