Skip UTF8 to UTF16 conversion during document indexing #126492

jordan-powers · 2025-04-08T20:33:33Z

Initial implementation of avoiding the UTF8 to UTF16 back to UTF8 conversion that currently happens when parsing JSON.

Currently, the optimization only applies for ASCII-only strings that do not contain escaped characters (f.e. \").

Also, the optimization is currently only implemented for keyword fields.

The core logic to return the underlying bytes is implemented in ESUTF8StreamJsonParser. The other "ES" versions of Jackson classes are just the glue logic required to make Jackson create instances of ESUTF8StreamJsonParser instead of the superclass UTF8StreamJsonParser.

I ran the prototype in esbench using the logging-indexing challenge on a standard-mode index, and I got these results:

Baseline:

|                                                         Metric |                                   Task |           Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------:|----------------:|-------:|
|                     Cumulative indexing time of primary shards |                                        |   855.769       |    min |
|             Min cumulative indexing time across primary shards |                                        |     3.16602     |    min |
|          Median cumulative indexing time across primary shards |                                        |    11.4853      |    min |
|             Max cumulative indexing time across primary shards |                                        |   142.816       |    min |
|            Cumulative indexing throttle time of primary shards |                                        |     0           |    min |
|    Min cumulative indexing throttle time across primary shards |                                        |     0           |    min |
| Median cumulative indexing throttle time across primary shards |                                        |     0           |    min |
|    Max cumulative indexing throttle time across primary shards |                                        |     0           |    min |
|                        Cumulative merge time of primary shards |                                        |   610.819       |    min |
|                       Cumulative merge count of primary shards |                                        |  1018           |        |
|                Min cumulative merge time across primary shards |                                        |     0.395583    |    min |
|             Median cumulative merge time across primary shards |                                        |     3.37862     |    min |
|                Max cumulative merge time across primary shards |                                        |   142.732       |    min |
|               Cumulative merge throttle time of primary shards |                                        |   282.793       |    min |
|       Min cumulative merge throttle time across primary shards |                                        |     0.14925     |    min |
|    Median cumulative merge throttle time across primary shards |                                        |     1.39443     |    min |
|       Max cumulative merge throttle time across primary shards |                                        |    63.0728      |    min |
|                      Cumulative refresh time of primary shards |                                        |    58.5789      |    min |
|                     Cumulative refresh count of primary shards |                                        | 19551           |        |
|              Min cumulative refresh time across primary shards |                                        |     0.0121333   |    min |
|           Median cumulative refresh time across primary shards |                                        |     0.0492833   |    min |
|              Max cumulative refresh time across primary shards |                                        |    19.3951      |    min |
|                        Cumulative flush time of primary shards |                                        |    72.752       |    min |
|                       Cumulative flush count of primary shards |                                        |  7713           |        |
|                Min cumulative flush time across primary shards |                                        |     0.47355     |    min |
|             Median cumulative flush time across primary shards |                                        |     1.58855     |    min |
|                Max cumulative flush time across primary shards |                                        |     5.32642     |    min |

Contender:

|                                                         Metric |                                   Task |          Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------:|---------------:|-------:|
|                     Cumulative indexing time of primary shards |                                        |   724.972      |    min |
|             Min cumulative indexing time across primary shards |                                        |     2.45815    |    min |
|          Median cumulative indexing time across primary shards |                                        |     9.29517    |    min |
|             Max cumulative indexing time across primary shards |                                        |   127          |    min |
|            Cumulative indexing throttle time of primary shards |                                        |     0          |    min |
|    Min cumulative indexing throttle time across primary shards |                                        |     0          |    min |
| Median cumulative indexing throttle time across primary shards |                                        |     0          |    min |
|    Max cumulative indexing throttle time across primary shards |                                        |     0          |    min |
|                        Cumulative merge time of primary shards |                                        |   611.619      |    min |
|                       Cumulative merge count of primary shards |                                        |  1135          |        |
|                Min cumulative merge time across primary shards |                                        |     0.991983   |    min |
|             Median cumulative merge time across primary shards |                                        |     3.72563    |    min |
|                Max cumulative merge time across primary shards |                                        |   144.073      |    min |
|               Cumulative merge throttle time of primary shards |                                        |   277.04       |    min |
|       Min cumulative merge throttle time across primary shards |                                        |     0.426367   |    min |
|    Median cumulative merge throttle time across primary shards |                                        |     1.7056     |    min |
|       Max cumulative merge throttle time across primary shards |                                        |    65.306      |    min |
|                      Cumulative refresh time of primary shards |                                        |    62.5784     |    min |
|                     Cumulative refresh count of primary shards |                                        | 18632          |        |
|              Min cumulative refresh time across primary shards |                                        |     0.01085    |    min |
|           Median cumulative refresh time across primary shards |                                        |     0.0686167  |    min |
|              Max cumulative refresh time across primary shards |                                        |    19.9901     |    min |
|                        Cumulative flush time of primary shards |                                        |    69.0186     |    min |
|                       Cumulative flush count of primary shards |                                        |  6940          |        |
|                Min cumulative flush time across primary shards |                                        |     0.448817   |    min |
|             Median cumulative flush time across primary shards |                                        |     1.4913     |    min |
|                Max cumulative flush time across primary shards |                                        |     4.95223    |    min |

…lueAsByteRef()

jordan-powers · 2025-04-15T17:54:23Z

Ok, I figured out a way to avoid cloning the file. Basically the problem was that detecting the encoding may possibly consume some bytes from the beginning of the input if there is a BOM present, and the property keeping track of how many bytes had been consumed was internal to ByteSourceJsonBootstrapper and not visible. But I figured out that I can use ByteSourceJsonBootstrapper to detect the encoding, then if it's UTF-8, I can manually check for and skip over the BOM.

Also, looking at the API docs, it seems elasticsearch only supports UTF-8 inputs anyway. So maybe I don't need to check the character encoding? Although I don't know for certain that elasticsearch won't parse json from another source (such as a configuration file or something) that's not UTF-8 encoded, so I'll leave it in

martijnvg

This looks great Jordan! Happy to see we don't need to fork entire files to get this.

Did you also see an increase in bulk indexing median throughput when running the elastic/logs track?

martijnvg · 2025-04-16T13:15:30Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+    /**
+     * Class that holds either a UTF-16 String or a UTF-8 BytesRef, and lazily converts between the two.
+     */
+    private static class RawString {


Can this be a record?

Possibly, but it would be a bit tough since records are immutable. The lazy conversion works by only setting the bytesValue or stringValue once they've been converted, which wouldn't work if the properties were final.

Ok, I falsely assumed that making RawString's properties final would be straightforward. In that case turning this class into a record doesn't make sense.

martijnvg · 2025-04-16T13:17:45Z

...tent/impl/src/main/java/org/elasticsearch/xcontent/provider/json/ESUTF8StreamJsonParser.java

+    /**
+     * Method that will try to get underlying UTF-8 encoded bytes of the current string token.
+     * This is only a best-effort attempt; if there is some reason the bytes cannot be retrieved, this method will return null.
+     * Currently, this is only implemented for ascii-only strings that do not contain escaped characters.


I suspect we need to support escaped characters? I don't think we have a test for this. Maybe a rest test (yaml or java) that tries to index a escaped json string field using keyword field mapper?

I'll definitely add a test.

I was thinking that support for escaped characters and/or possibly full unicode (not just ascii) could happen in a follow-up after this is merged. The current implementation does still function with such characters, it just won't be optimized (since this method will return null so the KeywordFieldMapper will fall back to getValueAsString()).

The current implementation does still function with such characters, it just won't be optimized (since this method will return null so the KeywordFieldMapper will fall back to getValueAsString()).

That makes sense to me.

++ it makes sense to have it in a follow-up, this PR is already big enough :)
The important thing is to test nothing is broken, and it takes the non-optimized path.

jordan-powers · 2025-04-16T16:36:12Z

Did you also see an increase in bulk indexing median throughput when running the elastic/logs track?

No it was actually 0.27% slower, but that's such a small change I think it's noise. Latency was much lower though.

I plan to rerun the benchmarks with logsdb. Hopefully that shows some improvement to throughput.

Baseline:

|                  Min Throughput |                             bulk-index |  1133.11        | docs/s |
|                 Mean Throughput |                             bulk-index | 22013.5         | docs/s |
|               Median Throughput |                             bulk-index | 21980.4         | docs/s |
|                  Max Throughput |                             bulk-index | 23748.9         | docs/s |
|         50th percentile latency |                             bulk-index |  2566.75        |     ms |
|         90th percentile latency |                             bulk-index |  3637.47        |     ms |
|         99th percentile latency |                             bulk-index |  4794.34        |     ms |
|       99.9th percentile latency |                             bulk-index |  5885.33        |     ms |
|      99.99th percentile latency |                             bulk-index |  7442.52        |     ms |
|        100th percentile latency |                             bulk-index |  8522.51        |     ms |
|    50th percentile service time |                             bulk-index |  2564.74        |     ms |
|    90th percentile service time |                             bulk-index |  3637.85        |     ms |
|    99th percentile service time |                             bulk-index |  4799.66        |     ms |
|  99.9th percentile service time |                             bulk-index |  5910.21        |     ms |
| 99.99th percentile service time |                             bulk-index |  7434.99        |     ms |
|   100th percentile service time |                             bulk-index |  8522.51        |     ms |
|                      error rate |                             bulk-index |     0           |      % |

Contender:

|                  Min Throughput |                             bulk-index |   563.25       | docs/s |
|                 Mean Throughput |                             bulk-index | 21945.6        | docs/s |
|               Median Throughput |                             bulk-index | 21920.7        | docs/s |
|                  Max Throughput |                             bulk-index | 23439.9        | docs/s |
|         50th percentile latency |                             bulk-index |   493.49       |     ms |
|         90th percentile latency |                             bulk-index |   757.997      |     ms |
|         99th percentile latency |                             bulk-index |  1166.96       |     ms |
|       99.9th percentile latency |                             bulk-index |  2113.26       |     ms |
|      99.99th percentile latency |                             bulk-index |  2723.74       |     ms |
|        100th percentile latency |                             bulk-index |  3567.55       |     ms |
|    50th percentile service time |                             bulk-index |   492.466      |     ms |
|    90th percentile service time |                             bulk-index |   757.44       |     ms |
|    99th percentile service time |                             bulk-index |  1163.02       |     ms |
|  99.9th percentile service time |                             bulk-index |  2109.98       |     ms |
| 99.99th percentile service time |                             bulk-index |  2719.76       |     ms |
|   100th percentile service time |                             bulk-index |  3567.55       |     ms |
|                      error rate |                             bulk-index |     0          |      % |

jordan-powers · 2025-04-17T20:18:37Z

Ok, ran another benchmark. This time a multi-node logsdb run, similar to the nightlies.

It seems that in this run, cumulative indexing time, latency, service time, and indexing throughput all increased.

I'm not sure why indexing time, latency, and service time increased. Maybe those values are just noisy?
But the median indexing throughput increased significantly by ~9%, which is a good sign.

Baseline:

|                                                         Metric |                                   Task |          Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------:|---------------:|-------:|
|                     Cumulative indexing time of primary shards |                                        |   865.972      |    min |
|             Min cumulative indexing time across primary shards |                                        |     2.79818    |    min |
|          Median cumulative indexing time across primary shards |                                        |    11.1573     |    min |
|             Max cumulative indexing time across primary shards |                                        |   159.704      |    min |
|            Cumulative indexing throttle time of primary shards |                                        |     0          |    min |
|    Min cumulative indexing throttle time across primary shards |                                        |     0          |    min |
| Median cumulative indexing throttle time across primary shards |                                        |     0          |    min |
|    Max cumulative indexing throttle time across primary shards |                                        |     0          |    min |
|                        Cumulative merge time of primary shards |                                        |   415.579      |    min |
|                       Cumulative merge count of primary shards |                                        |  1316          |        |
|                Min cumulative merge time across primary shards |                                        |     0.556283   |    min |
|             Median cumulative merge time across primary shards |                                        |     2.5208     |    min |
|                Max cumulative merge time across primary shards |                                        |   101.172      |    min |
|               Cumulative merge throttle time of primary shards |                                        |   118.28       |    min |
|       Min cumulative merge throttle time across primary shards |                                        |     0.14245    |    min |
|    Median cumulative merge throttle time across primary shards |                                        |     0.527983   |    min |
|       Max cumulative merge throttle time across primary shards |                                        |    28.2791     |    min |
|                      Cumulative refresh time of primary shards |                                        |    86.7655     |    min |
|                     Cumulative refresh count of primary shards |                                        | 14976          |        |
|              Min cumulative refresh time across primary shards |                                        |     0.0321333  |    min |
|           Median cumulative refresh time across primary shards |                                        |     0.146817   |    min |
|              Max cumulative refresh time across primary shards |                                        |    29.7167     |    min |
|                        Cumulative flush time of primary shards |                                        |   108.617      |    min |
|                       Cumulative flush count of primary shards |                                        |  6216          |        |
|                Min cumulative flush time across primary shards |                                        |     0.69835    |    min |
|             Median cumulative flush time across primary shards |                                        |     2.60913    |    min |
|                Max cumulative flush time across primary shards |                                        |     7.87545    |    min |
|                                        Total Young Gen GC time |                                        |   202.142      |      s |
|                                       Total Young Gen GC count |                                        |  8909          |        |
|                                          Total Old Gen GC time |                                        |     0          |      s |
|                                         Total Old Gen GC count |                                        |     0          |        |
|                                                   Dataset size |                                        |    73.4386     |     GB |
|                                                     Store size |                                        |    73.4386     |     GB |
|                                                  Translog size |                                        |     3.03224    |     GB |
|                                         Heap used for segments |                                        |     0          |     MB |
|                                       Heap used for doc values |                                        |     0          |     MB |
|                                            Heap used for terms |                                        |     0          |     MB |
|                                            Heap used for norms |                                        |     0          |     MB |
|                                           Heap used for points |                                        |     0          |     MB |
|                                    Heap used for stored fields |                                        |     0          |     MB |
|                                                  Segment count |                                        |  1628          |        |
|                                    Total Ingest Pipeline count |                                        |     4.8861e+08 |        |
|                                     Total Ingest Pipeline time |                                        | 29324.6        |      s |
|                                   Total Ingest Pipeline failed |                                        |     0          |        |
|                                                 Min Throughput |                       insert-pipelines |     6.78       |  ops/s |
|                                                Mean Throughput |                       insert-pipelines |     6.78       |  ops/s |
|                                              Median Throughput |                       insert-pipelines |     6.78       |  ops/s |
|                                                 Max Throughput |                       insert-pipelines |     6.78       |  ops/s |
|                                       100th percentile latency |                       insert-pipelines |  2159.06       |     ms |
|                                  100th percentile service time |                       insert-pipelines |  2159.06       |     ms |
|                                                     error rate |                       insert-pipelines |     0          |      % |
|                                                 Min Throughput |                             insert-ilm |    19.81       |  ops/s |
|                                                Mean Throughput |                             insert-ilm |    19.81       |  ops/s |
|                                              Median Throughput |                             insert-ilm |    19.81       |  ops/s |
|                                                 Max Throughput |                             insert-ilm |    19.81       |  ops/s |
|                                       100th percentile latency |                             insert-ilm |    49.4519     |     ms |
|                                  100th percentile service time |                             insert-ilm |    49.4519     |     ms |
|                                                     error rate |                             insert-ilm |     0          |      % |
|                                                 Min Throughput | validate-package-template-installation |    52.07       |  ops/s |
|                                                Mean Throughput | validate-package-template-installation |    52.07       |  ops/s |
|                                              Median Throughput | validate-package-template-installation |    52.07       |  ops/s |
|                                                 Max Throughput | validate-package-template-installation |    52.07       |  ops/s |
|                                       100th percentile latency | validate-package-template-installation |    18.8955     |     ms |
|                                  100th percentile service time | validate-package-template-installation |    18.8955     |     ms |
|                                                     error rate | validate-package-template-installation |     0          |      % |
|                                                 Min Throughput |        update-custom-package-templates |    22.51       |  ops/s |
|                                                Mean Throughput |        update-custom-package-templates |    22.51       |  ops/s |
|                                              Median Throughput |        update-custom-package-templates |    22.51       |  ops/s |
|                                                 Max Throughput |        update-custom-package-templates |    22.51       |  ops/s |
|                                       100th percentile latency |        update-custom-package-templates |   532.922      |     ms |
|                                  100th percentile service time |        update-custom-package-templates |   532.922      |     ms |
|                                                     error rate |        update-custom-package-templates |     0          |      % |
|                                                 Min Throughput |                             bulk-index |   531.71       | docs/s |
|                                                Mean Throughput |                             bulk-index | 28977.3        | docs/s |
|                                              Median Throughput |                             bulk-index | 28876.9        | docs/s |
|                                                 Max Throughput |                             bulk-index | 30829.5        | docs/s |
|                                        50th percentile latency |                             bulk-index |   352.035      |     ms |
|                                        90th percentile latency |                             bulk-index |   657.696      |     ms |
|                                        99th percentile latency |                             bulk-index |  1107.89       |     ms |
|                                      99.9th percentile latency |                             bulk-index |  1804.35       |     ms |
|                                     99.99th percentile latency |                             bulk-index |  3264.32       |     ms |
|                                       100th percentile latency |                             bulk-index |  6313.07       |     ms |
|                                   50th percentile service time |                             bulk-index |   351.93       |     ms |
|                                   90th percentile service time |                             bulk-index |   659.242      |     ms |
|                                   99th percentile service time |                             bulk-index |  1107.04       |     ms |
|                                 99.9th percentile service time |                             bulk-index |  1809.02       |     ms |
|                                99.99th percentile service time |                             bulk-index |  3267.62       |     ms |
|                                  100th percentile service time |                             bulk-index |  6313.07       |     ms |
|                                                     error rate |                             bulk-index |     0          |      % |

Contender:

|                                                         Metric |                                   Task |           Value |   Unit |
|---------------------------------------------------------------:|---------------------------------------:|----------------:|-------:|
|                     Cumulative indexing time of primary shards |                                        |   899.486       |    min |
|             Min cumulative indexing time across primary shards |                                        |     2.95095     |    min |
|          Median cumulative indexing time across primary shards |                                        |    11.6404      |    min |
|             Max cumulative indexing time across primary shards |                                        |   158.792       |    min |
|            Cumulative indexing throttle time of primary shards |                                        |     0           |    min |
|    Min cumulative indexing throttle time across primary shards |                                        |     0           |    min |
| Median cumulative indexing throttle time across primary shards |                                        |     0           |    min |
|    Max cumulative indexing throttle time across primary shards |                                        |     0           |    min |
|                        Cumulative merge time of primary shards |                                        |   426.511       |    min |
|                       Cumulative merge count of primary shards |                                        |  1024           |        |
|                Min cumulative merge time across primary shards |                                        |     0.512583    |    min |
|             Median cumulative merge time across primary shards |                                        |     2.31557     |    min |
|                Max cumulative merge time across primary shards |                                        |   114.794       |    min |
|               Cumulative merge throttle time of primary shards |                                        |   113.117       |    min |
|       Min cumulative merge throttle time across primary shards |                                        |     0.0915333   |    min |
|    Median cumulative merge throttle time across primary shards |                                        |     0.58525     |    min |
|       Max cumulative merge throttle time across primary shards |                                        |    27.3451      |    min |
|                      Cumulative refresh time of primary shards |                                        |    89.7276      |    min |
|                     Cumulative refresh count of primary shards |                                        | 14144           |        |
|              Min cumulative refresh time across primary shards |                                        |     0.0293833   |    min |
|           Median cumulative refresh time across primary shards |                                        |     0.0993667   |    min |
|              Max cumulative refresh time across primary shards |                                        |    29.8377      |    min |
|                        Cumulative flush time of primary shards |                                        |   111.547       |    min |
|                       Cumulative flush count of primary shards |                                        |  6206           |        |
|                Min cumulative flush time across primary shards |                                        |     0.713667    |    min |
|             Median cumulative flush time across primary shards |                                        |     2.71242     |    min |
|                Max cumulative flush time across primary shards |                                        |     7.9553      |    min |
|                                        Total Young Gen GC time |                                        |   253.605       |      s |
|                                       Total Young Gen GC count |                                        |  8500           |        |
|                                          Total Old Gen GC time |                                        |     0           |      s |
|                                         Total Old Gen GC count |                                        |     0           |        |
|                                                   Dataset size |                                        |    71.7906      |     GB |
|                                                     Store size |                                        |    71.7906      |     GB |
|                                                  Translog size |                                        |     3.85838     |     GB |
|                                         Heap used for segments |                                        |     0           |     MB |
|                                       Heap used for doc values |                                        |     0           |     MB |
|                                            Heap used for terms |                                        |     0           |     MB |
|                                            Heap used for norms |                                        |     0           |     MB |
|                                           Heap used for points |                                        |     0           |     MB |
|                                    Heap used for stored fields |                                        |     0           |     MB |
|                                                  Segment count |                                        |  1350           |        |
|                                    Total Ingest Pipeline count |                                        |     4.88622e+08 |        |
|                                     Total Ingest Pipeline time |                                        | 30250.2         |      s |
|                                   Total Ingest Pipeline failed |                                        |     0           |        |
|                                                 Min Throughput |                       insert-pipelines |     7.62        |  ops/s |
|                                                Mean Throughput |                       insert-pipelines |     7.62        |  ops/s |
|                                              Median Throughput |                       insert-pipelines |     7.62        |  ops/s |
|                                                 Max Throughput |                       insert-pipelines |     7.62        |  ops/s |
|                                       100th percentile latency |                       insert-pipelines |  1911.34        |     ms |
|                                  100th percentile service time |                       insert-pipelines |  1911.34        |     ms |
|                                                     error rate |                       insert-pipelines |     0           |      % |
|                                                 Min Throughput |                             insert-ilm |    20.61        |  ops/s |
|                                                Mean Throughput |                             insert-ilm |    20.61        |  ops/s |
|                                              Median Throughput |                             insert-ilm |    20.61        |  ops/s |
|                                                 Max Throughput |                             insert-ilm |    20.61        |  ops/s |
|                                       100th percentile latency |                             insert-ilm |    47.5897      |     ms |
|                                  100th percentile service time |                             insert-ilm |    47.5897      |     ms |
|                                                     error rate |                             insert-ilm |     0           |      % |
|                                                 Min Throughput | validate-package-template-installation |    66.18        |  ops/s |
|                                                Mean Throughput | validate-package-template-installation |    66.18        |  ops/s |
|                                              Median Throughput | validate-package-template-installation |    66.18        |  ops/s |
|                                                 Max Throughput | validate-package-template-installation |    66.18        |  ops/s |
|                                       100th percentile latency | validate-package-template-installation |    14.7938      |     ms |
|                                  100th percentile service time | validate-package-template-installation |    14.7938      |     ms |
|                                                     error rate | validate-package-template-installation |     0           |      % |
|                                                 Min Throughput |        update-custom-package-templates |    23.96        |  ops/s |
|                                                Mean Throughput |        update-custom-package-templates |    23.96        |  ops/s |
|                                              Median Throughput |        update-custom-package-templates |    23.96        |  ops/s |
|                                                 Max Throughput |        update-custom-package-templates |    23.96        |  ops/s |
|                                       100th percentile latency |        update-custom-package-templates |   500.494       |     ms |
|                                  100th percentile service time |        update-custom-package-templates |   500.494       |     ms |
|                                                     error rate |        update-custom-package-templates |     0           |      % |
|                                                 Min Throughput |                             bulk-index |  1290.94        | docs/s |
|                                                Mean Throughput |                             bulk-index | 31518.1         | docs/s |
|                                              Median Throughput |                             bulk-index | 31546.7         | docs/s |
|                                                 Max Throughput |                             bulk-index | 33875.7         | docs/s |
|                                        50th percentile latency |                             bulk-index |  1685.81        |     ms |
|                                        90th percentile latency |                             bulk-index |  3011.78        |     ms |
|                                        99th percentile latency |                             bulk-index |  4668.77        |     ms |
|                                      99.9th percentile latency |                             bulk-index |  6193.05        |     ms |
|                                     99.99th percentile latency |                             bulk-index |  7657.43        |     ms |
|                                       100th percentile latency |                             bulk-index |  9168.88        |     ms |
|                                   50th percentile service time |                             bulk-index |  1692.69        |     ms |
|                                   90th percentile service time |                             bulk-index |  2995.72        |     ms |
|                                   99th percentile service time |                             bulk-index |  4659.98        |     ms |
|                                 99.9th percentile service time |                             bulk-index |  6193.08        |     ms |
|                                99.99th percentile service time |                             bulk-index |  7660.23        |     ms |
|                                  100th percentile service time |                             bulk-index |  9168.88        |     ms |
|                                                     error rate |                             bulk-index |     0           |      % |

This PR is a precursor to elastic#126492. It does three things: - Move org.elasticsearch.common.text.Text from :server to org.elasticsearch.xcontent.Text in :libs:x-content. - Refactor the Text class to use a java-native ByteBuffer instead of the elasticsearch BytesReference. - Add the XContentString interface, with the Text class implementing that interface.

ldematte

@jordan-powers have you added the test on "support" escaped characters? (ensure it will fall back, not error?)
If so, this PR is gtg for me.

jordan-powers · 2025-05-21T15:51:43Z

Yes, those tests exist. ESUTF8StreamJsonParserTests#testGetValueAsText has a couple of explicit tests to test both ascii-only and non-ascii characters at the jackson parser level. I also have a test at the JsonXContentParser level in BaseXContentTestCase#testOptimizedText, which tests using a random unicode string that is likely to contain non-ascii characters

ldematte

Looks good to me, but given the complexity you might want an additional pair of eyes on it

This PR is a precursor to #126492. It does three things: 1. Move org.elasticsearch.common.text.Text from :server to org.elasticsearch.xcontent.Text in :libs:x-content. 2. Refactor the Text class to use a new EncodedBytes record instead of the elasticsearch BytesReference. 3. Add the XContentString interface, with the Text class implementing that interface. These changes were originally implemented in #127666 and #128316, however they were reverted in #128484 due to problems caused by the mutable nature of java ByteBuffers. This is resolved by instead using a new immutable EncodedBytes record.

This PR is a precursor to #126492. It does three things: 1. Move org.elasticsearch.common.text.Text from :server to org.elasticsearch.xcontent.Text in :libs:x-content. 2. Refactor the Text class to use a new EncodedBytes record instead of the elasticsearch BytesReference. 3. Add the XContentString interface, with the Text class implementing that interface. These changes were originally implemented in #127666 and #128316, however they were reverted in #128484 due to problems caused by the mutable nature of java ByteBuffers. This is resolved by instead using a new immutable EncodedBytes record. (cherry picked from commit de40ac4) # Conflicts: # server/src/main/java/org/elasticsearch/search/fetch/subphase/highlight/DefaultHighlighter.java # server/src/test/java/org/elasticsearch/common/xcontent/BaseXContentTestCase.java # server/src/test/java/org/elasticsearch/search/fetch/subphase/highlight/HighlightFieldTests.java # test/framework/src/main/java/org/elasticsearch/search/SearchResponseUtils.java

martijnvg

One comment, LGTM otherwise.

martijnvg · 2025-06-05T09:26:50Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

        }

        boolean indexed = indexValue(context, value);
        if (offsetsFieldName != null && context.isImmediateParentAnArray() && context.canAddIgnoredField()) {
            if (indexed) {
-                context.getOffSetContext().recordOffset(offsetsFieldName, value);
+                context.getOffSetContext().recordOffset(offsetsFieldName, value.string());


I think there is no need to convert to string here? UTF8Bytes already implements Comparable and the recordOffset method isn't bound to string, the provided value just needs to implement Comparable. This should save another unneeded conversation to java string?

Makes sense to me.

While looking into this I found an issue with the implementation of UTF8Bytes#compareTo. It was delegating to ByteBuffer#compareTo, which was performing signed comparison. This was incorrect, and it needs to do unsigned comparison to perform correct lexicographical ordering of encoded UTF8 values. So I also fixed that.

martijnvg · 2025-06-05T09:28:27Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

-        // convert to utf8 only once before feeding postings/dv/stored fields
-        final BytesRef binaryValue = new BytesRef(value);
+        if (fieldType().normalizer() != Lucene.KEYWORD_ANALYZER) {
+            String normalizedString = normalizeValue(fieldType().normalizer(), fullPath(), value.string());


In a follow up we can explore normalizing to not relying on java string as return type here.

elasticsearchmachine · 2025-06-06T02:51:32Z

💚 Backport successful

Status	Branch	Result
✅	8.19

…9023) When parsing documents, we receive the document as UTF-8 encoded data which we then parse and convert the fields to java-native UTF-16 encoded Strings. We then convert these strings back to UTF-8 for storage in lucene. This patch skips the redundant conversion, instead passing lucene a direct reference to the received UTF-8 bytes when possible.

commit e6b8a64 Author: Lorenzo Dematte <lorenzo.dematte@elastic.co> Date: Tue Jun 10 12:02:16 2025 +0200 PR comments commit ad0902e Author: Moritz Mack <mmack@apache.org> Date: Fri Jun 6 13:33:37 2025 +0200 Update ReproduceInfoPrinter to correctly print a reproduction for lucene / BC upgrade tests. Relates to ES-12005 commit 78b4168 Author: Alexander Spies <alexander.spies@elastic.co> Date: Fri Jun 6 11:37:53 2025 +0200 ESQL: Throw ISE instead of IAE for illegal block in page (elastic#128960) IAE gets reported as a 400 status code, but that's inappropriate when inconsistent pages are always bugs, and should be reported with a 500. Throw ISE instead. commit 29e68bd Author: Aurélien FOUCRET <aurelien.foucret@gmail.com> Date: Fri Jun 6 11:24:23 2025 +0200 [ES|QL] Fix test releases for telemetry. (elastic#128971) commit 1a76bc2 Author: Bogdan Pintea <bogdan.pintea@elastic.co> Date: Fri Jun 6 11:01:49 2025 +0200 ESQL: Workaround for RLike handling of empty lang pattern (elastic#128895) Lucene's `org.apache.lucene.util.automaton.Operations#getSingleton` fails with an Automaton for a `REGEXP_EMPTY` `RegExp`. This adds a workaround for that, to check the type of automaton before calling into that failing method. Closes elastic#128813 commit e24fd32 Author: Aurélien FOUCRET <aurelien.foucret@gmail.com> Date: Fri Jun 6 10:25:28 2025 +0200 [ES|QL] Enable the completion command as a tech preview feature (elastic#128948) commit 3f03775 Author: Niels Bauman <33722607+nielsbauman@users.noreply.github.com> Date: Fri Jun 6 09:00:24 2025 +0200 Remove non-test usages of `Metadata.Builder#putCustom` (elastic#128801) This removes all non-test usages of ``` Metadata.Builder.putCustom(String type, ProjectCustom custom) ``` And replaces it with appropriate calls to the equivalent method on `ProjectMetadata.Builder`. In most cases this _does not_ make the code project aware, but does reduce the number of deprecated methods in use. commit 330d127 Author: Niels Bauman <33722607+nielsbauman@users.noreply.github.com> Date: Fri Jun 6 08:07:59 2025 +0200 Make utility methods in `IndexLifecycleTransition` project-aware (elastic#128930) Modifies the methods to work with a project scope rather than a cluster scope. This is part of an iterative process to make ILM project-aware. commit 1b5720d Author: Aurélien FOUCRET <aurelien.foucret@gmail.com> Date: Fri Jun 6 07:50:52 2025 +0200 [ES|QL] Fix test releases for LookupJoinTypesIT. (elastic#128985) commit 40cf2d3 Author: Tim Vernum <tim@adjective.org> Date: Fri Jun 6 13:09:31 2025 +1000 Add "extension" attribute validation to IdP SPs (elastic#128805) This extends the change from elastic#128176 to validate the "custom attributes" on a per Service Provider basis. Each Service Provider (whether registered or wildcard based) has a field "attributes.extensions" which is a list of attribute names that may be provided by the caller of "/_idp/saml/init". Service Providers that have not be configured with extension attributes will reject any custom attributes in SAML init. This necessitates a new field in the service provider index (but only if the new `extensions` attribute is set). The template has been updated, but there is no data migration because the `saml-service-provider` index does not exist in any of the environments into which we wish to deploy this change. commit 496fb2d Author: Jordan Powers <jordan.powers@elastic.co> Date: Thu Jun 5 19:50:09 2025 -0700 Skip UTF8 to UTF16 conversion during document indexing (elastic#126492) When parsing documents, we receive the document as UTF-8 encoded data which we then parse and convert the fields to java-native UTF-16 encoded Strings. We then convert these strings back to UTF-8 for storage in lucene. This patch skips the redundant conversion, instead passing lucene a direct reference to the received UTF-8 bytes when possible. commit c34f8b6 Author: Tim Vernum <tim@adjective.org> Date: Fri Jun 6 12:03:25 2025 +1000 Improve cache invalidation in IdP SP cache (elastic#128890) The Identity Provider's Service Provider cache had two issues: 1. It checked for identity based on sequence numbers, but didn't include the `seq_no_primary_term` parameter on searches, which means the sequence would always by `-2` 2. It didn't track whether the index was deleted, which means it could be caching values from an old version of the index This commit fixes both of these issues. In practice neither issue was a problem because there are no deployments that use index-based service providers, however the 2nd issue did cause some challenges for testing. commit 923f029 Author: Nhat Nguyen <nhat.nguyen@elastic.co> Date: Thu Jun 5 18:09:58 2025 -0700 Fix block loader with missing ignored source (elastic#129006) We miss appending null when ignored_source is not available. Our randomized tests already cover this case, but we do not check it when loading fields. I labelled this non-issue for an unreleased bug. Closes elastic#128959 Relates elastic#119546 commit 0f8178a Author: Bogdan Pintea <bogdan.pintea@elastic.co> Date: Fri Jun 6 02:02:24 2025 +0200 ESQL: Forward port 8.19 RegexMatch serialization change version (elastic#128979) Fwd port ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY_8_19. Related: elastic#128919. commit ee716f1 Author: Simon Chase <simon.chase@elastic.co> Date: Thu Jun 5 15:20:08 2025 -0700 transport: edit TransportConnectionListener for close exceptions (elastic#129015) The TransportConnectionListener interface has previously included the Transport.Connection being closed and unregistered in its onNodeDisconnected callback. This is not in use, and can be removed as it is also available in the onConnectionClosed callback. It is being replaced with a Nullable exception that caused the close. This is being used in pending work (ES-11448) to differentiate network issues from node restarts. Closes ES-12007 commit aceaf23 Merge: f18f4ee 159c57f Author: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co> Date: Thu Jun 5 19:27:54 2025 +0000 Merge patch/serverless-fix into main commit 159c57f Author: Rene Groeschke <rene@elastic.co> Date: Tue Jun 3 11:56:27 2025 +0200 Prepare serverless patch including elastic#128784 elastic#128740 (elastic#128807) * Change default for vector.rescoring.directio to false (elastic#128784) On serverless (and potentially elsewhere), direct IO is not available, which can cause BBQ shards to fail to read with org.apache.lucene.CorruptIndexException when this setting is true. * Optimize sparse vector stats collection (elastic#128740) This change improves the performance of sparse vector statistics gathering by using the document count of terms directly, rather than relying on the field name field to compute stats. By avoiding per-term disk/network reads and instead leveraging statistics already loaded into leaf readers at index opening, we expect to significantly reduce overhead. Relates to elastic#128583 --------- Co-authored-by: Dave Pifke <dave.pifke@elastic.co> Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>

When parsing documents, we receive the document as UTF-8 encoded data which we then parse and convert the fields to java-native UTF-16 encoded Strings. We then convert these strings back to UTF-8 for storage in lucene. This patch skips the redundant conversion, instead passing lucene a direct reference to the received UTF-8 bytes when possible.

Follow-up to #126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters. Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case--removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be more optimized than full String decoding, but it won't be as fast as non-backslashed strings where we can directly reference the input bytes. Relates to #129072.

…129360) Follow-up to #126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters. Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case--removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be more optimized than full String decoding, but it won't be as fast as non-backslashed strings where we can directly reference the input bytes. Relates to #129072.

Follow-up to #126492 to use the json parsing optimizations for match_only_text fields. Relates to #129072.

jordan-powers added 6 commits April 7, 2025 14:10

Prototype avoid UTF8 to UTF16 conversion

2153322

Rename to ESBytesRef

eaa66bb

Apply spotless

c5d71f7

Some cleanup and comments

a277aea

Remove unnecessary throws IOException

74822c2

Fix missing bytesValue call

a9ee991

jordan-powers self-assigned this Apr 8, 2025

elasticsearchmachine added the v9.1.0 label Apr 8, 2025

jordan-powers added 8 commits April 8, 2025 15:59

Fix subsequent calls to parser.getText() after a call to parser.getVa…

ae432e1

…lueAsByteRef()

Use cached stringEnd on subsequent calls to getValueAsByteRef()

3d1bec4

Spotless

702ada2

Add textRefOrNull to DotExpandingXContentParser

e2ebb92

Add textRefOrNull() override to MultiFieldParser

4a6fe60

Avoid cloning ByteSourceJsonBootstrapper

240fbdb

Rename ESBytesRef to XBytesRef

c1affcf

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

ddf7495

martijnvg reviewed Apr 16, 2025

View reviewed changes

jordan-powers added 4 commits April 16, 2025 11:24

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

e285349

Add tests for ESJsonFactory

b0f3336

Add tests for ESUTF8StreamJsonParser

f2f106f

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

b0f701c

jordan-powers added 3 commits April 17, 2025 14:26

Move RawString class into separate file and rename to EncodedString

20616e6

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

1f66ff2

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

fd4ec6c

jordan-powers changed the title ~~Prototype to skip UTF8 to UTF16 conversion~~ Skip UTF8 to UTF16 conversion during document indexing Apr 18, 2025

jordan-powers marked this pull request as ready for review April 18, 2025 20:12

jordan-powers requested a review from a team as a code owner April 18, 2025 20:12

ldematte reviewed May 21, 2025

View reviewed changes

ldematte approved these changes May 22, 2025

View reviewed changes

jordan-powers mentioned this pull request Jun 2, 2025

Move Text class to libs/xcontent #128780

Merged

jordan-powers added 2 commits June 4, 2025 11:23

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

de33cb6

Use new UTF8Bytes class

d75f180

martijnvg approved these changes Jun 5, 2025

View reviewed changes

jordan-powers added 6 commits June 5, 2025 08:35

Use unsigned comparison in UTF8Bytes#compareTo

bcb195e

Use encoded value when recording array offsets

4c76525

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

986d1f6

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

d3b9496

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

9b53320

Merge remote-tracking branch 'upstream/main' into prototype-skip-utf16

91313b5

jordan-powers merged commit 496fb2d into elastic:main Jun 6, 2025
18 checks passed

jordan-powers mentioned this pull request Jun 6, 2025

[8.19] Skip UTF8 to UTF16 conversion during document indexing (#126492) #129023

Merged

This was referenced Jun 6, 2025

Skip redundant UTF8 to UTF16 conversion follow-ups #129072

Open

Optimized text for full unicode and some escape sequences #129169

Merged

jordan-powers deleted the prototype-skip-utf16 branch June 12, 2025 21:17

jordan-powers mentioned this pull request Jun 12, 2025

Use optimized text in match_only_text fields #129371

Merged

jordan-powers added a commit that referenced this pull request Jun 17, 2025

Use optimized text in match_only_text fields (#129371)

5d19997

Follow-up to #126492 to use the json parsing optimizations for match_only_text fields. Relates to #129072.

elasticsearchmachine pushed a commit that referenced this pull request Jun 18, 2025

Use optimized text in match_only_text fields (#129371) (#129565)

a6dd703

Follow-up to #126492 to use the json parsing optimizations for match_only_text fields. Relates to #129072.

Skip UTF8 to UTF16 conversion during document indexing #126492

Skip UTF8 to UTF16 conversion during document indexing #126492

Uh oh!

Conversation

jordan-powers commented Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jordan-powers commented Apr 15, 2025

Uh oh!

martijnvg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martijnvg Apr 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jordan-powers commented Apr 16, 2025

Uh oh!

jordan-powers commented Apr 17, 2025

Uh oh!

ldematte left a comment

Choose a reason for hiding this comment

Uh oh!

jordan-powers commented May 21, 2025

Uh oh!

ldematte left a comment

Choose a reason for hiding this comment

Uh oh!

martijnvg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

elasticsearchmachine commented Jun 6, 2025

💚 Backport successful

Uh oh!

Uh oh!

jordan-powers commented Apr 8, 2025 •

edited

Loading

martijnvg Apr 17, 2025 •

edited

Loading