Optimized text for full unicode and some escape sequences #129169

jordan-powers · 2025-06-09T23:39:47Z

Follow-up to #126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters.

Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case: removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be faster than full String decoding, but it won't be as fast as non-backslashed strings, where we can directly reference the input bytes.
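As a rough illustration of the "remove the backslash" case, here is a minimal sketch of the idea, not the code from this PR. The class `EscapeSketch` and the method `unescapeSimple` are hypothetical names, and the sketch assumes the whole string is already in one byte array. It records backslash positions while scanning, gives up (returns null) on any escape that needs a replacement character, and then copies the bytes around the recorded positions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; names and structure are assumptions, not the PR's code. */
final class EscapeSketch {
    /**
     * Scans the UTF-8 bytes of a JSON string body starting at {@code start}
     * (the byte just after the opening quote). Returns the bytes with simple
     * escapes resolved, or null when the caller should use the slow path.
     */
    static byte[] unescapeSimple(byte[] input, int start) {
        List<Integer> backslashes = new ArrayList<>(); // positions of backslashes to drop
        int i = start;
        while (i < input.length && input[i] != '"') {
            if (input[i] == '\\') {
                if (i + 1 >= input.length) {
                    return null; // escape sequence is cut off
                }
                byte next = input[i + 1];
                if (next != '"' && next != '\\') {
                    return null; // \n, \r, \u00e5 and friends need real decoding
                }
                backslashes.add(i);
                i += 2; // skip the escaped character as well
            } else {
                i++; // multi-byte UTF-8 bytes are never equal to '"' or '\\'
            }
        }
        if (i >= input.length) {
            return null; // no closing quote found in this array
        }
        int end = i;
        byte[] out = new byte[end - start - backslashes.size()];
        int src = start;
        int dst = 0;
        for (int pos : backslashes) {
            int len = pos - src; // bytes between the previous backslash and this one
            System.arraycopy(input, src, out, dst, len);
            dst += len;
            src = pos + 1; // resume copying just past the backslash
        }
        System.arraycopy(input, src, out, dst, end - src);
        return out;
    }
}
```

The copy is what makes this path slower than the unescaped case, where a view over the original bytes would be enough.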

Relates to #129072.

elasticsearchmachine · 2025-06-09T23:40:11Z

Pinging @elastic/es-core-infra (Team:Core/Infra)


martijnvg

LGTM

I think a review from @elastic/es-core-infra is also required here.

martijnvg · 2025-06-10T13:27:00Z

...tent/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java


 public class ESUTF8StreamJsonParser extends UTF8StreamJsonParser {
    protected int stringEnd = -1;
+    protected int stringLength;
+
+    private final List<Integer> backslashes = new ArrayList<>();


Maybe use Lucene's IntArrayList here, so that primitive ints can be collected?

Unfortunately, Lucene isn't available in libs/x-content/impl
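For context on what a primitive collection would buy here, the following is a hypothetical hand-rolled alternative that needs no Lucene dependency. It is not what the PR does (the diff above keeps `List<Integer>`), and the class name `IntList` is an assumption made for illustration. It avoids boxing each backslash position and can be cleared and reused across strings.

```java
import java.util.Arrays;

/** Hypothetical minimal growable int list; not part of the PR. */
final class IntList {
    private int[] values = new int[8];
    private int size;

    /** Appends a value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return values[index];
    }

    int size() {
        return size;
    }

    /** Resets the list for reuse without releasing the backing array. */
    void clear() {
        size = 0;
    }
}
```

Whether the saved allocations matter would need measuring; strings with many escapes are presumably the uncommon case.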

alexey-ivanov-es

LGTM


elasticsearchmachine · 2025-06-12T16:56:32Z

💚 Backport successful

Status	Branch	Result
✅	8.19


jordan-powers added 4 commits June 9, 2025 12:22

Add support for some escape sequences in optimizedText

1ff4df4

Extend json parser randomized testing to include escape sequences

86f9bd2

Add full unicode support to optimizedText

8513148

Use single arraylist instance to track backslashes

a31506d

jordan-powers requested review from martijnvg and ldematte June 9, 2025 23:39

jordan-powers self-assigned this Jun 9, 2025

jordan-powers requested a review from a team as a code owner June 9, 2025 23:39

jordan-powers added the >non-issue, :Core/Infra/Core, auto-backport, v8.19.0, and v9.1.0 labels Jun 9, 2025

elasticsearchmachine added the Team:Core/Infra label Jun 9, 2025

jordan-powers changed the title ~~Optimized text full unicode~~ Optimized text for full unicode and some escape sequences Jun 9, 2025

Merge remote-tracking branch 'upstream/main' into optimized-text-full…

e544f9d

…-unicode

jordan-powers mentioned this pull request Jun 6, 2025

Skip redundant UTF8 to UTF16 conversion follow-ups #129072

Open

7 tasks

martijnvg approved these changes Jun 10, 2025

martijnvg reviewed Jun 10, 2025

alexey-ivanov-es approved these changes Jun 12, 2025

jordan-powers added 2 commits June 12, 2025 07:58

Give up if input buffer is fully consumed

f47400d

If we completely consume the input buffer before finding a quote character ending the string, we should return null and fall back to jackson's native getValueAsString() which has logic to load more data into the input buffer.
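The give-up rule can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `FallbackSketch`, `findStringEnd`, and `text` are hypothetical names, escapes are ignored for brevity, the `slowPath` supplier stands in for Jackson's `getValueAsString()`, and the fast path builds a `String` only to keep the example small.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

/** Hypothetical sketch of the fallback; not the PR's code. */
final class FallbackSketch {
    /**
     * Returns the index of the closing quote in buf[ptr, end), or -1 when the
     * loaded bytes run out first. {@code end} marks the limit of valid data,
     * which may be smaller than buf.length.
     */
    static int findStringEnd(byte[] buf, int ptr, int end) {
        for (int i = ptr; i < end; i++) {
            if (buf[i] == '"') {
                return i;
            }
        }
        return -1;
    }

    /** Uses the fast path when the string is fully loaded, else defers to the slow path. */
    static String text(byte[] buf, int ptr, int end, Supplier<String> slowPath) {
        int close = findStringEnd(buf, ptr, end);
        if (close < 0) {
            // The rest of the string is not in the buffer yet; only the slow
            // path knows how to load more input.
            return slowPath.get();
        }
        return new String(buf, ptr, close - ptr, StandardCharsets.UTF_8);
    }
}
```

The key point is that the scan is bounded by the end of the loaded data rather than by the array length, so stale bytes past that limit are never mistaken for part of the string.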

Merge remote-tracking branch 'upstream/main' into optimized-text-full…

9ac140b

…-unicode

jordan-powers merged commit 96300a9 into elastic:main Jun 12, 2025
18 checks passed

jordan-powers mentioned this pull request Jun 12, 2025

[8.19] Optimized text for full unicode and some escape sequences (#129169) #129360

Merged
