Skip to content

Optimized text for full unicode and some escape sequences #129169

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 12, 2025

Conversation

jordan-powers
Copy link
Contributor

Follow-up to #126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters.

Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only support the first case--removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be more optimized than full String decoding, but it won't be as fast as non-backslashed strings where we can directly reference the input bytes.

Relates to #129072.

@jordan-powers jordan-powers self-assigned this Jun 9, 2025
@jordan-powers jordan-powers requested a review from a team as a code owner June 9, 2025 23:39
@jordan-powers jordan-powers added >non-issue :Core/Infra/Core Core issues without another label auto-backport Automatically create backport pull requests when merged v8.19.0 v9.1.0 labels Jun 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Jun 9, 2025
@jordan-powers jordan-powers changed the title Optimized text full unicode Optimized text for full unicode and some escape sequences Jun 9, 2025
Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I think a review from @elastic/es-core-infra is also required here.


public class ESUTF8StreamJsonParser extends UTF8StreamJsonParser {
protected int stringEnd = -1;
protected int stringLength;

private final List<Integer> backslashes = new ArrayList<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use Lucene's IntArrayList here, so that primitive ints can be collected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, Lucene isn't available in libs/x-content/impl

Copy link
Contributor

@alexey-ivanov-es alexey-ivanov-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

If we completely consume the input buffer before finding a quote
character ending the string, we should return null and fall back to
jackson's native getValueAsString() which has logic to load more data
into the input buffer.
@jordan-powers jordan-powers merged commit 96300a9 into elastic:main Jun 12, 2025
18 checks passed
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.19

elasticsearchmachine pushed a commit that referenced this pull request Jun 12, 2025
…129360)

Follow-up to #126492 to apply the json parsing optimization to strings
containing unicode characters and some backslash-escaped characters.

Supporting backslash-escaped strings is tricky as it requires modifying the
string. There are two types of modification: some just remove the backslash
(e.g. \", \\), and some replace the whole escape sequence with a new
character (e.g. \n, \r, \u00e5). In this implementation, the optimization
only supports the first case--removing the backslash. This is done by
making a copy of the data, skipping the backslash. It should still be more
optimized than full String decoding, but it won't be as fast as 
non-backslashed strings where we can directly reference the input bytes.

Relates to #129072.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged :Core/Infra/Core Core issues without another label >non-issue Team:Core/Infra Meta label for core/infra team v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants