[Feature] Allowing the _id field in the segment index to be automatically generated by Elasticsearch (ES) will significantly improve the performance of bulk insertion, potentially by several times. #13089

dingsongjie · 2025-03-08T12:44:24Z

dingsongjie
Mar 8, 2025

Search before asking

I had searched in the issues and found no similar feature requirement.

Description

In SkyWalking version 9.x, I noticed that the CPU usage of Elasticsearch (ES) in my cluster was very high. After checking the hot_threads, I found that the majority of the CPU consumption was attributed to PerThreadIDVersionAndSeqNoLookup.lookupVersion. This issue typically arises when ES ensures the uniqueness of _id. Upon inspecting the _id in the segment index, I discovered that the _id was indeed specified by SkyWalking. To address this, I created an ingest pipeline to remove the _id generated by SkyWalking and allowed ES to generate the IDs automatically. After implementing this change, the CPU usage dropped to about 10% of its previous level, and the hot_threads no longer showed this overhead. Additionally, the slow tasks related to segment batching disappeared from the ES task list.
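
For readers who want to run the same check, here is a minimal sketch of the two diagnostics mentioned above, using the nodes hot threads API and the task management API. The `threads` value and the `*bulk*` action filter are illustrative choices, not taken from the original report.

```console
# Show the busiest threads on every node
GET _nodes/hot_threads?threads=5

# List currently running bulk write tasks with details
GET _tasks?actions=*bulk*&detailed=true
```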

PUT _ingest/pipeline/force_auto_id
{
  "description": "Force auto-generated _id by removing client-provided _id",
  "processors": [
    {
      "remove": {
        "field": "_id"
      }
    }
  ]
}

Then set "index.default_pipeline": "force_auto_id" in the settings of the segment index.

Use case

Improve performance

Related issues

No response

Are you willing to submit a pull request to implement this on your own?

Yes I am willing to submit a pull request on my own!

Code of Conduct

I agree to follow this project's Code of Conduct

wu-sheng · 2025-03-08T13:29:32Z

wu-sheng
Mar 8, 2025
Collaborator

First of all, by default each id is inserted or updated at intervals of over 20s, unless you changed that privately. So I can't see a race condition.
Second, by design the id must be known and generated by the OAP. Otherwise you can't do _id-based updates and have to fall back to query-based updates, which would hurt performance much, much more.
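
To illustrate the distinction drawn here, a minimal sketch of the two update styles in Elasticsearch follows. The index name is borrowed from later in this thread. The document id, the `trace_id` field, and the `latency` field are hypothetical placeholders and not SkyWalking's actual schema.

```console
# _id-based update: addresses one document directly,
# which is only possible when the writer already knows the _id
POST sw9_segment-20250310/_update/example-id-1
{
  "doc": { "latency": 120 }
}

# Query-based update: Elasticsearch first searches for matching
# documents, then applies the script to each hit
POST sw9_segment-20250310/_update_by_query
{
  "query": { "term": { "trace_id": "example-trace-1" } },
  "script": { "source": "ctx._source.latency = 120", "lang": "painless" }
}
```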

This discussion took place a long time ago, and the conclusion was very clear.

Concerns about Elasticsearch performance and resource cost are common. That is why we are all in on BanyanDB, which is built by the SkyWalking community itself and focuses on SkyWalking use cases. Elasticsearch is not our first priority anymore, because there are clearly several conflicts that can't be resolved perfectly.

9 replies

dingsongjie Mar 10, 2025
Author

I didn't write any code. I removed the _id during bulk insertion by adding an _ingest pipeline. Previously, I deployed an ES cluster with 4 nodes, each having 6 CPUs and 14GB of memory, and the cluster quickly reached 100% load. Now, I only need to deploy a 4-node ES cluster with 2 CPUs and 8GB of memory per node, and the CPU usage is around 30%. The segment write rate is about 4500 per second, and all SkyWalking functionalities are working perfectly.
This is a screenshot of the current load situation

dingsongjie Mar 10, 2025
Author

Manually creating _id has a significant performance impact on Elasticsearch, especially in high-concurrency bulk insertion scenarios.
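
As an illustration of the difference being described, here is a minimal sketch of the same bulk request written both ways. The index name comes from this thread; the id and `example_field` are hypothetical placeholders. With an explicit `_id`, Elasticsearch has to check whether that id already exists before indexing, which is consistent with the lookupVersion cost reported above. When the `_id` is omitted, Elasticsearch generates one itself.

```console
# Client-supplied _id in the action line
POST _bulk
{ "index": { "_index": "sw9_segment-20250310", "_id": "example-id-1" } }
{ "example_field": "value" }

# No _id in the action line: Elasticsearch generates one
POST _bulk
{ "index": { "_index": "sw9_segment-20250310" } }
{ "example_field": "value" }
```

The trade-off is that a document written the second way cannot later be addressed by an id the client already knows.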

wu-sheng Mar 10, 2025
Collaborator

I feel that is the benefit of the pipeline, which we don't support. We are using regular HTTP writes.

dingsongjie Mar 10, 2025
Author

The pipeline is a data preprocessing feature in Elasticsearch. Before data is written, Elasticsearch invokes this pipeline to process the data. In the pipeline, I removed the _id field from the index. When inserting data, if the _id is not assigned, Elasticsearch will automatically assign an _id field. The documentation for the ingest feature can be found at: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/ingest.html.

PUT _ingest/pipeline/force_auto_id
{
  "description": "Force auto-generated _id by removing client-provided _id",
  "processors": [
    {
      "remove": {
        "field": "_id"
      }
    }
  ]
}

This is the statement that creates the pipeline
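
Before attaching the pipeline to a live index, it may be worth inspecting what it does to a sample document. A minimal sketch using the simulate pipeline API follows; the `_id` value and `example_field` are hypothetical placeholders, and the response is not shown here because it was not part of the original report.

```console
# Run one sample document through force_auto_id without indexing it
POST _ingest/pipeline/force_auto_id/_simulate
{
  "docs": [
    {
      "_index": "sw9_segment-20250310",
      "_id": "client-supplied-id",
      "_source": { "example_field": "value" }
    }
  ]
}
```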

PUT sw9_segment-20250310/_settings
{
  "index.default_pipeline": "force_auto_id"
}

This is the statement that applies the pipeline to the Segment index.
Of course, you can also set the default pipeline in the index template, so that the setting is applied automatically whenever a new index is created.
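
A minimal sketch of the template approach, assuming the legacy `_template` API (deprecated since 7.8 but still available in 7.17). The template name, the `sw9_segment-*` pattern (inferred from the index name above), and the `order` value are assumptions. Legacy templates that match the same index are merged, with higher `order` winning on conflicts, so this is meant to add one setting on top of whatever templates already exist. How it interacts with the templates the OAP manages would need to be checked in your own cluster.

```console
# Hypothetical extra template that only adds the default pipeline setting
PUT _template/force_auto_id_segment
{
  "index_patterns": ["sw9_segment-*"],
  "order": 10,
  "settings": {
    "index.default_pipeline": "force_auto_id"
  }
}

# After the next daily index is created, confirm the setting was applied
GET sw9_segment-*/_settings/index.default_pipeline
```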

wu-sheng Mar 10, 2025
Collaborator

That is my point: we don't support that. We know it has a performance benefit, but no one has implemented it in the SkyWalking OAP.
That performance advantage comes from the pipeline.
We (several core maintainers) focus on BanyanDB, so maybe someone in the community would like to do that.
