-
Notifications
You must be signed in to change notification settings - Fork 25.2k
ESQL: Add documents_found
and values_loaded
#125631
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Adds the `found_documents` and `values_loaded` fields to the tasks and profile for ESQL: ``` "drivers" : [ { "description" : "data", "cluster_name" : "runTask", "node_name" : "runTask-0", "start_millis" : 1742923173077, "stop_millis" : 1742923173087, "took_nanos" : 9557014, "cpu_nanos" : 9091340, "documents_found" : 5, <---- THESE "values_loaded" : 15, <---- THESE "iterations" : 6, ... ```
``` { "took" : 194, "is_partial" : false, "documents_found" : 100000, "values_loaded" : 200000, "columns" : [ { "name" : "a", "type" : "long" }, { "name" : "b", "type" : "long" } ], "values" : [[10, 1]] } ```
Pinging @elastic/es-analytical-engine (Team:Analytics) |
Pinging @elastic/kibana-esql (ES|QL-ui) |
Hi @nik9000, I've created a changelog YAML for you. |
I've stuck the |
I believe I can get a |
Sorry for the ping @giladgal. Wrong Gal. |
I'll need to update the docs in this PR! |
@GalLalouche, this doesn't expose the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with a few small nitpicks and suggestions.
} | ||
|
||
Status(StreamInput in) throws IOException { | ||
super(in); | ||
readersBuilt = in.readOrderedMap(StreamInput::readString, StreamInput::readVInt); | ||
if (in.getTransportVersion().onOrAfter(ESQL_VALUES_LOADED)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using ?:
here (I Know you don't like it, but it seems so obvious here :) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better here, yeah. Sometimes in the deserialization the ternary makes it unreadable. But here it's good.
* The number of documents found by this driver. | ||
*/ | ||
public long documentsFound() { | ||
long documentsFound = completedOperators.stream().mapToLong(OperatorStatus::documentsFound).sum(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider just inlining this. The streams can also be combined:
return Stream.concat(completedOperators.stream(), activeOperators.stream()).mapToLong(OperatorStatus::documentsFound).sum()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose that's better. I had thought about it and didn't like how it looked in my mind but when you type it it looks fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by comment - how often will this method be called? @idegtiarenko found several instances of streams showing up in the profiler vs regular for/int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
every minute or so.
@@ -83,7 +83,11 @@ public int getPositionCount() { | |||
|
|||
@Override | |||
public int getTotalValueCount() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be changed to long
, or are we not worried about overflows here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overflow would be a bug in planning. That'd be us using billions of values in a single block....
/** | ||
* Information returned when one of more {@link Driver}s is completed. | ||
* @param documentsFound The number of documents found by all lucene queries performed by these drivers. | ||
* @param valuesLoaded The number of values loaded from lucene for all drivers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you go into more details as to how valuesLoaded
differs from documentsFound
? Maybe it's obvious and I just don't have the context, but it's unclear to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valuesLoaded
are the field values - so the value for @timestamp
or foo
or whatever. And it's the number of them. So [1, 2]
is two values and null
is zero values.
...ugin/esql/compute/src/main/java/org/elasticsearch/compute/operator/DriverCompletionInfo.java
Show resolved
Hide resolved
* Are the {@code documents_found} and {@code values_loaded} fields available | ||
* in the response and profile? | ||
*/ | ||
DOCUMENTS_FOUND, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be named something more inclusive to values_loaded
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be. I can make the transport version the same and have it mention both.
default long valuesLoaded() { | ||
return 0; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this change, but I think it is worth revisiting operator status.
Currently this is interface with implementation per each operator. It is used to expose some diagnostics attributes outside.
I believe we should replace it with a simple map (or maybe a wrapper on top of the map).
Common schema could be achieved with some sort of status builder (something like status().withPagesProcessed(..).withRowReceived(..).withRowsEmitted(..).build()
)
This way we would not need to have a custom implementation per operator, could have common equals/hashCode and toString logic. With map structure we will not need a new transport version when adding/renaming/removing infos.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall 👍 from me.
Left suggestion for status implementation that we should discuss separately.
The 9.0 bwc tests failed because of the 9.0 release. |
This adds
documents_found
andvalues_loaded
to the to the ESQL response:These are cheap enough to collect that we can do it for every query and
return it with every response. It's small, but it still gives you a
reasonable sense of how much work Elasticsearch had to go through to
perform the query.
I've also added these two fields to the driver profile and task status:
These are at a high level and should be easy to reason about. We'd like to
extract this into a "show me how difficult this running query is" API one
day. But today, just plumbing it into the debugging output is good.
Any
Operator
can claim to "find documents" or "load values" by overridinga method on its
Operator.Status
implementation:In this PR all of the
LuceneOperator
s declare that eachposition
theyemit is a "document found" and the
ValuesSourceValuesSourceReaderOperator
says each value it makes is a "value loaded". That's pretty pretty much
true. The
LuceneCountOperator
andLuceneMinMaxOperator
sort of pretendthat the count/min/max that they emit is a "document" - but that's good
enough to give you a sense of what's going on. It's like document.