[Question]: Why not concurrently execute merge to speedup inserting? #1385


Open
zhouzhou12 opened this issue Apr 16, 2025 · 8 comments
Labels
question Further information is requested

Comments

@zhouzhou12

Do you need to ask a question?

  • I have searched the existing questions and discussions, and this question is not already answered.
  • I believe this is a legitimate question, not just a bug or feature request.

Your Question

Currently, merging entities and relationships is done sequentially:

        # Process and update all entities at once
        for entity_name, entities in all_nodes.items():
            entity_data = await _merge_nodes_then_upsert(
                entity_name,
                entities,
                knowledge_graph_inst,
                global_config,
                pipeline_status,
                pipeline_status_lock,
                llm_response_cache,
            )
            entities_data.append(entity_data)

        # Process and update all relationships at once
        for edge_key, edges in all_edges.items():
            edge_data = await _merge_edges_then_upsert(
                edge_key[0],
                edge_key[1],
                edges,
                knowledge_graph_inst,
                global_config,
                pipeline_status,
                pipeline_status_lock,
                llm_response_cache,
            )
            relationships_data.append(edge_data)

Why not change it to run in parallel? Like this:

        # Process and update all entities at once
        futures = []
        for entity_name, entities in all_nodes.items():
            futures.append(asyncio.create_task(_merge_nodes_then_upsert(
                entity_name,
                entities,
                knowledge_graph_inst,
                global_config,
                pipeline_status,
                pipeline_status_lock,
                llm_response_cache,
            )))
        entities_data = await asyncio.gather(*futures)

        # Process and update all relationships at once
        futures = []
        for edge_key, edges in all_edges.items():
            futures.append(asyncio.create_task(_merge_edges_then_upsert(
                edge_key[0],
                edge_key[1],
                edges,
                knowledge_graph_inst,
                global_config,
                pipeline_status,
                pipeline_status_lock,
                llm_response_cache,
            )))
        relationships_data = await asyncio.gather(*futures)

Additional Context

No response

@zhouzhou12 zhouzhou12 added the question Further information is requested label Apr 16, 2025
@danielaskdd
Collaborator

Too many concurrent requests can generate excessive database connections, consuming a large amount of resources and leading to a performance decline. The correct approach is to use batch mode for data queries. The main branch has already been optimized following this principle. Currently, the performance of PostgreSQL and Neo4j has been significantly improved.
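
In code terms, the batch idea looks roughly like this. It is a minimal sketch: merge_node_records, get_nodes_batch, and upsert_nodes_batch are illustrative stand-ins, not LightRAG's actual API:

    def merge_node_records(existing, records):
        # Placeholder merge: keep any prior fields and append the new
        # records (merge_node_records is hypothetical, for illustration).
        base = dict(existing or {})
        base.setdefault("records", []).extend(records)
        return base

    async def merge_all_nodes_batched(storage, all_nodes):
        # One round trip to read every existing node instead of one query
        # per node, then one batched write back (get_nodes_batch and
        # upsert_nodes_batch are hypothetical batch APIs).
        names = list(all_nodes.keys())
        existing = await storage.get_nodes_batch(names)
        merged = {
            name: merge_node_records(existing.get(name), records)
            for name, records in all_nodes.items()
        }
        await storage.upsert_nodes_batch(merged)
        return list(merged.values())

The point is that the number of database round trips becomes constant per merge phase rather than proportional to the number of entities.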

@danielaskdd
Collaborator

The bottleneck in node merging occurs during LLM calls, and simply increasing Python's concurrency isn't the right solution. In v1.3.2, we've optimized the process by deferring node merging until after each document's processing phase, followed by a comprehensive merge of all nodes with identical names across the entire document. This approach has significantly improved node merging speed.
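
For reference, the grouping step behind that per-document merge can be sketched like this (the shape of chunk_results is assumed for illustration, not taken from the codebase):

    from collections import defaultdict

    def group_entities_by_name(chunk_results):
        # Collect every occurrence of an entity across all chunk extraction
        # results, so each name is merged (and summarized by the LLM)
        # exactly once per document instead of once per occurrence.
        all_nodes = defaultdict(list)
        for maybe_nodes in chunk_results:  # one dict per extracted chunk
            for name, records in maybe_nodes.items():
                all_nodes[name].extend(records)
        return all_nodes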

@zhouzhou12
Author

> The bottleneck in node merging occurs during LLM calls, and simply increasing Python's concurrency isn't the right solution. In v1.3.2, we've optimized the process by deferring node merging until after each document's processing phase, followed by a comprehensive merge of all nodes with identical names across the entire document. This approach has significantly improved node merging speed.

@danielaskdd

Our tests revealed that sequentially running _merge_nodes_then_upsert and _merge_edges_then_upsert also forces _handle_entity_relation_summary to run sequentially, which in turn limits execution to a single concurrent LLM call, wasting GPU computing power.

@danielaskdd
Collaborator

danielaskdd commented Apr 17, 2025

To ensure data consistency, only a single thread is permitted to access the graph database during the node merging stage. This limitation accounts for the observed reduction in LLM concurrency. Resolving this issue requires optimizing the pipeline architecture to enable asynchronous operation between document entity relation extraction and node merging processes. However, this enhancement is not currently a high priority. If any contributor is willing to take on this responsibility, it would be greatly appreciated. @LarFii
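
The decoupling could look roughly like a producer/consumer pipeline. A minimal sketch, where extract_doc and merge_doc stand in for the real extraction and merge stages:

    import asyncio

    async def pipeline(documents, extract_doc, merge_doc):
        # Extraction (LLM-heavy, internally concurrent) feeds a queue;
        # a single consumer performs merges, remaining the sole writer to
        # the graph store while later documents are still being extracted.
        queue = asyncio.Queue()

        async def producer():
            for doc in documents:
                await queue.put(await extract_doc(doc))
            await queue.put(None)  # sentinel: extraction finished

        async def consumer():
            while (extracted := await queue.get()) is not None:
                await merge_doc(extracted)

        await asyncio.gather(producer(), consumer())

This keeps the single-writer guarantee for the graph database while letting extraction and merging overlap across documents.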

@zhouzhou12
Author

zhouzhou12 commented Apr 17, 2025

> To ensure data consistency, only a single thread is permitted to access the graph database during the node merging stage. This limitation accounts for the observed reduction in LLM concurrency. Resolving this issue requires optimizing the pipeline architecture to enable asynchronous operation between document entity relation extraction and node merging processes. However, this enhancement is not currently a high priority. If any contributor is willing to take on this responsibility, it would be greatly appreciated. @LarFii

Within a document, each entity with the same name is merged only once, so this should not compromise data consistency. Could you explain your concern about how this might lead to data inconsistency? @danielaskdd

@TonicZhang

> > The bottleneck in node merging occurs during LLM calls, and simply increasing Python's concurrency isn't the right solution. In v1.3.2, we've optimized the process by deferring node merging until after each document's processing phase, followed by a comprehensive merge of all nodes with identical names across the entire document. This approach has significantly improved node merging speed.
>
> @danielaskdd
>
> Our tests revealed that sequentially running _merge_nodes_then_upsert and _merge_edges_then_upsert also forces _handle_entity_relation_summary to run sequentially, which in turn limits execution to a single concurrent LLM call, wasting GPU computing power.

Agree. After the extraction phase, where "chunk_results = await asyncio.gather(*tasks)" runs the chunk tasks concurrently, the LLM becomes free for subsequent operations, but its calls are serialized during the merging phase.

@danielaskdd
Collaborator

danielaskdd commented Apr 17, 2025

> > To ensure data consistency, only a single thread is permitted to access the graph database during the node merging stage. This limitation accounts for the observed reduction in LLM concurrency. Resolving this issue requires optimizing the pipeline architecture to enable asynchronous operation between document entity relation extraction and node merging processes. However, this enhancement is not currently a high priority. If any contributor is willing to take on this responsibility, it would be greatly appreciated. @LarFii
>
> Within a document, each entity with the same name is merged only once, so this should not compromise data consistency. Could you explain your concern about how this might lead to data inconsistency? @danielaskdd

For example, chunks A and B both contain an entity named "ent". Each chunk attempts to locate and merge "ent" independently, which may result in one chunk's outcome overwriting the other's.
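
To make the race concrete: the merge is a read-modify-write, so two tasks handling "ent" can interleave as read/read/write/write and silently lose one update. A per-name lock would serialize merges of the same entity while letting different entities proceed in parallel; this is an illustrative pattern, not code from the repository, and read_node, write_node, and combine are stand-ins for the real storage accessors and merge function:

    import asyncio
    from collections import defaultdict

    # One asyncio.Lock per entity name (hypothetical helper state).
    entity_locks = defaultdict(asyncio.Lock)

    async def merge_entity_safely(name, records, read_node, write_node, combine):
        async with entity_locks[name]:
            current = await read_node(name)      # read (may be None)
            updated = combine(current, records)  # modify
            await write_node(name, updated)      # write back under the lock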

@danielaskdd
Collaborator

The latest version on main branch significantly optimizes file parallel processing logic. The entity/relationship merge phase no longer blocks the extraction phase tasks, ensuring LLM resources maintain full utilization without idle periods.
