[Question]: Why not concurrently execute merge to speedup inserting? #1385


Open
zhouzhou12 opened this issue Apr 16, 2025 · 8 comments
Labels
question Further information is requested

Comments

@zhouzhou12

Do you need to ask a question?

  • I have searched the existing questions and discussions, and this question is not already answered.
  • I believe this is a legitimate question, not just a bug or feature request.

Your Question

Currently, merging entities and relationships is done sequentially:

        # Process and update all entities at once
        for entity_name, entities in all_nodes.items():
            entity_data = await _merge_nodes_then_upsert(
                entity_name,
                entities,
                knowledge_graph_inst,
                global_config,
                pipeline_status,
                pipeline_status_lock,
                llm_response_cache,
            )
            entities_data.append(entity_data)

        # Process and update all relationships at once
        for edge_key, edges in all_edges.items():
            edge_data = await _merge_edges_then_upsert(
                edge_key[0],
                edge_key[1],
                edges,
                knowledge_graph_inst,
                global_config,
                pipeline_status,
                pipeline_status_lock,
                llm_response_cache,
            )
            relationships_data.append(edge_data)

Why not change it to run in parallel? Like this:

        # Process and update all entities at once
        futures = []
        for entity_name, entities in all_nodes.items():
            futures.append(asyncio.create_task(_merge_nodes_then_upsert(
                entity_name,
                entities,
                knowledge_graph_inst,
                global_config,
                pipeline_status,
                pipeline_status_lock,
                llm_response_cache,
            )))
        entities_data = await asyncio.gather(*futures)

        # Process and update all relationships at once
        futures = []
        for edge_key, edges in all_edges.items():
            futures.append(asyncio.create_task(_merge_edges_then_upsert(
                edge_key[0],
                edge_key[1],
                edges,
                knowledge_graph_inst,
                global_config,
                pipeline_status,
                pipeline_status_lock,
                llm_response_cache,
            )))
        relationships_data = await asyncio.gather(*futures)

Additional Context

No response

@zhouzhou12 zhouzhou12 added the question Further information is requested label Apr 16, 2025
@danielaskdd
Collaborator

Too many concurrent requests can generate excessive database connections, consuming a large amount of resources and leading to a performance decline. The correct approach is to use batch mode for data queries. The main branch has already been optimized following this principle. Currently, the performance of PostgreSQL and Neo4j has been significantly improved.
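
In code terms, the batch idea looks roughly like this. It is a minimal sketch: merge_node_records, get_nodes_batch, and upsert_nodes_batch are illustrative stand-ins, not LightRAG's actual API:

    def merge_node_records(existing, records):
        # Placeholder merge: keep any prior fields and append the new
        # records (merge_node_records is hypothetical, for illustration).
        base = dict(existing or {})
        base.setdefault("records", []).extend(records)
        return base

    async def merge_all_nodes_batched(storage, all_nodes):
        # One round trip to read every existing node instead of one query
        # per node, then one batched write back (get_nodes_batch and
        # upsert_nodes_batch are hypothetical batch APIs).
        names = list(all_nodes.keys())
        existing = await storage.get_nodes_batch(names)
        merged = {
            name: merge_node_records(existing.get(name), records)
            for name, records in all_nodes.items()
        }
        await storage.upsert_nodes_batch(merged)
        return list(merged.values())

The point is that the number of database round trips becomes constant per merge phase rather than proportional to the number of entities.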

@danielaskdd
Collaborator

The bottleneck in node merging occurs during LLM calls, and simply increasing Python's concurrency isn't the right solution. In v1.3.2, we've optimized the process by deferring node merging until after each document's processing phase, followed by a comprehensive merge of all nodes with identical names across the entire document. This approach has significantly improved node merging speed.
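
For reference, the grouping step behind that per-document merge can be sketched like this (the shape of chunk_results is assumed for illustration, not taken from the codebase):

    from collections import defaultdict

    def group_entities_by_name(chunk_results):
        # Collect every occurrence of an entity across all chunk extraction
        # results, so each name is merged (and summarized by the LLM)
        # exactly once per document instead of once per occurrence.
        all_nodes = defaultdict(list)
        for maybe_nodes in chunk_results:  # one dict per extracted chunk
            for name, records in maybe_nodes.items():
                all_nodes[name].extend(records)
        return all_nodes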

@zhouzhou12
Author

> The bottleneck in node merging occurs during LLM calls, and simply increasing Python's concurrency isn't the right solution. In v1.3.2, we've optimized the process by deferring node merging until after each document's processing phase, followed by a comprehensive merge of all nodes with identical names across the entire document. This approach has significantly improved node merging speed.

@danielaskdd

Our tests revealed that sequentially running _merge_nodes_then_upsert and _merge_edges_then_upsert also forces _handle_entity_relation_summary to run sequentially, which in turn limits execution to a single concurrent LLM call, wasting GPU computing power.

@danielaskdd
Collaborator

danielaskdd commented Apr 17, 2025

To ensure data consistency, only a single thread is permitted to access the graph database during the node merging stage. This limitation accounts for the observed reduction in LLM concurrency. Resolving this issue requires optimizing the pipeline architecture to enable asynchronous operation between document entity relation extraction and node merging processes. However, this enhancement is not currently a high priority. If any contributor is willing to take on this responsibility, it would be greatly appreciated. @LarFii
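
The decoupling could look roughly like a producer/consumer pipeline. A minimal sketch, where extract_doc and merge_doc stand in for the real extraction and merge stages:

    import asyncio

    async def pipeline(documents, extract_doc, merge_doc):
        # Extraction (LLM-heavy, internally concurrent) feeds a queue;
        # a single consumer performs merges, remaining the sole writer to
        # the graph store while later documents are still being extracted.
        queue = asyncio.Queue()

        async def producer():
            for doc in documents:
                await queue.put(await extract_doc(doc))
            await queue.put(None)  # sentinel: extraction finished

        async def consumer():
            while (extracted := await queue.get()) is not None:
                await merge_doc(extracted)

        await asyncio.gather(producer(), consumer())

This keeps the single-writer guarantee for the graph database while letting extraction and merging overlap across documents.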

@zhouzhou12
Author

zhouzhou12 commented Apr 17, 2025

> To ensure data consistency, only a single thread is permitted to access the graph database during the node merging stage. This limitation accounts for the observed reduction in LLM concurrency. Resolving this issue requires optimizing the pipeline architecture to enable asynchronous operation between document entity relation extraction and node merging processes. However, this enhancement is not currently a high priority. If any contributor is willing to take on this responsibility, it would be greatly appreciated. @LarFii

Within a document, each entity with the same name is merged only once, so this should not compromise data consistency. Could you explain your concern about how this might lead to data inconsistency? @danielaskdd

@TonicZhang

> > The bottleneck in node merging occurs during LLM calls, and simply increasing Python's concurrency isn't the right solution. In v1.3.2, we've optimized the process by deferring node merging until after each document's processing phase, followed by a comprehensive merge of all nodes with identical names across the entire document. This approach has significantly improved node merging speed.
>
> @danielaskdd
>
> Our tests revealed that sequentially running _merge_nodes_then_upsert and _merge_edges_then_upsert also forces _handle_entity_relation_summary to run sequentially, which in turn limits execution to a single concurrent LLM call, wasting GPU computing power.

Agree. After the extraction phase, where "chunk_results = await asyncio.gather(*tasks)" runs the chunk tasks concurrently, the LLM becomes free for subsequent operations, but its calls are serialized during the merging phase.

@danielaskdd
Collaborator

danielaskdd commented Apr 17, 2025

> > To ensure data consistency, only a single thread is permitted to access the graph database during the node merging stage. This limitation accounts for the observed reduction in LLM concurrency. Resolving this issue requires optimizing the pipeline architecture to enable asynchronous operation between document entity relation extraction and node merging processes. However, this enhancement is not currently a high priority. If any contributor is willing to take on this responsibility, it would be greatly appreciated. @LarFii
>
> Within a document, each entity with the same name is merged only once, so this should not compromise data consistency. Could you explain your concern about how this might lead to data inconsistency? @danielaskdd

For example, chunks A and B both contain an entity named "ent". Each chunk attempts to locate and merge "ent" independently, which may result in one chunk's outcome overwriting the other's.
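
To make the race concrete: the merge is a read-modify-write, so two tasks handling "ent" can interleave as read/read/write/write and silently lose one update. A per-name lock would serialize merges of the same entity while letting different entities proceed in parallel; this is an illustrative pattern, not code from the repository, and read_node, write_node, and combine are stand-ins for the real storage accessors and merge function:

    import asyncio
    from collections import defaultdict

    # One asyncio.Lock per entity name (hypothetical helper state).
    entity_locks = defaultdict(asyncio.Lock)

    async def merge_entity_safely(name, records, read_node, write_node, combine):
        async with entity_locks[name]:
            current = await read_node(name)      # read (may be None)
            updated = combine(current, records)  # modify
            await write_node(name, updated)      # write back under the lock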

@danielaskdd
Collaborator

The latest version on main branch significantly optimizes file parallel processing logic. The entity/relationship merge phase no longer blocks the extraction phase tasks, ensuring LLM resources maintain full utilization without idle periods.
