Optimized text for full unicode and some escape sequences #129169
Pinging @elastic/es-core-infra (Team:Core/Infra)
LGTM
I think a review from @elastic/es-core-infra is also required here.
public class ESUTF8StreamJsonParser extends UTF8StreamJsonParser {
    protected int stringEnd = -1;
    protected int stringLength;

    private final List<Integer> backslashes = new ArrayList<>();
Maybe use Lucene's IntArrayList here, so that primitive ints can be collected?
Unfortunately, Lucene isn't available in libs/x-content/impl
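For illustration only, one way to collect primitive ints without a Lucene dependency is a small growable array. This is a hypothetical sketch; the `IntList` class and its methods are assumptions for this example and are not part of the PR, which keeps `List<Integer>`.

```java
import java.util.Arrays;

/** Hypothetical minimal growable list of primitive ints (not Lucene's IntArrayList). */
final class IntList {
    private int[] values = new int[8];
    private int size;

    /** Appends a value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
    }

    /** Returns the value at the given index, with bounds checking against the logical size. */
    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return values[index];
    }

    int size() {
        return size;
    }

    /** Resets the list for reuse without releasing the backing array. */
    void clear() {
        size = 0;
    }
}
```

Reusing the backing array through `clear()` would avoid both boxing and reallocation across parsed strings, at the cost of maintaining one more small class.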
LGTM
If we completely consume the input buffer before finding the quote character that ends the string, we should return null and fall back to Jackson's native getValueAsString(), which has logic to load more data into the input buffer.
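The fallback described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `QuoteScan`, `tryFastPath`, and `read` are hypothetical names, the slow path is modeled as a `Supplier`, and the scan assumes the string contains no backslash escapes.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;

/** Hypothetical sketch of a fast path that gives up when the buffer runs out. */
final class QuoteScan {

    /**
     * Looks for the closing quote in buf[start, end).
     * Returns the string contents if the quote is found, or null if the
     * buffer is exhausted first. Assumes no backslash escapes in the input.
     */
    static String tryFastPath(byte[] buf, int start, int end) {
        for (int i = start; i < end; i++) {
            if (buf[i] == '"') {
                return new String(buf, start, i - start, StandardCharsets.UTF_8);
            }
        }
        return null; // no closing quote in the bytes currently buffered
    }

    /** Uses the fast path when possible, otherwise defers to the slow path. */
    static String read(byte[] buf, int start, int end, Supplier<String> slowPath) {
        String fast = tryFastPath(buf, start, end);
        return fast != null ? fast : slowPath.get();
    }
}
```

In the real parser the slow path would be the parser's own string decoding, which can refill the input buffer; the supplier here only stands in for that.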
💚 Backport successful
Follow-up to #126492 to apply the JSON parsing optimization to strings containing unicode characters and some backslash-escaped characters.
Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case: removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be cheaper than full String decoding, but it won't be as fast as non-backslashed strings, where we can directly reference the input bytes.
Relates to #129072.
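A minimal sketch of the approach described above, under stated assumptions: the class and method names (`EscapeSkippingScan`, `scan`, `copyWithout`) are hypothetical, the result is returned as a plain byte array, and returning null stands for "fall back to the regular decoding path". The byte-wise scan is safe for multi-byte UTF-8 because every byte of a multi-byte sequence is 0x80 or above, so it can never equal a quote or a backslash.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: find the end of a JSON string and drop simple backslashes. */
final class EscapeSkippingScan {

    /**
     * Scans buf[start, end) for the closing quote. Returns the string's UTF-8
     * bytes with the backslash removed from each \" and \\ escape, or null if
     * an unsupported escape is seen or the buffer ends before the closing quote.
     */
    static byte[] scan(byte[] buf, int start, int end) {
        List<Integer> backslashes = new ArrayList<>();
        int i = start;
        while (i < end) {
            byte b = buf[i];
            if (b == '"') {
                return copyWithout(buf, start, i, backslashes);
            }
            if (b == '\\') {
                if (i + 1 >= end) {
                    return null; // escape is cut off by the end of the buffer
                }
                byte next = buf[i + 1];
                if (next != '"' && next != '\\') {
                    return null; // e.g. \n, \r, \u00e5: needs real decoding
                }
                backslashes.add(i);
                i += 2; // skip the backslash and the escaped character
            } else {
                i++;
            }
        }
        return null; // no closing quote in the buffered bytes
    }

    /** Copies buf[start, stringEnd), omitting the bytes at the recorded backslash positions. */
    private static byte[] copyWithout(byte[] buf, int start, int stringEnd, List<Integer> backslashes) {
        byte[] out = new byte[stringEnd - start - backslashes.size()];
        int src = start;
        int dst = 0;
        for (int pos : backslashes) {
            int len = pos - src;
            System.arraycopy(buf, src, out, dst, len);
            dst += len;
            src = pos + 1; // drop the backslash, keep the character after it
        }
        System.arraycopy(buf, src, out, dst, stringEnd - src);
        return out;
    }
}
```

When no backslashes are recorded, this sketch still copies the bytes; a real implementation could instead reference the input buffer directly in that case, which is the faster path the description mentions.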