[Question]: Why not concurrently execute merge to speedup inserting? #1385
Comments
Too many concurrent requests can open an excessive number of database connections, consuming significant resources and degrading performance. The correct approach is to query data in batch mode. The main branch has already been optimized along these lines, and the performance of PostgreSQL and Neo4j has improved significantly.
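As a rough illustration of the batching idea (the `db.query`/`db.get_node` API and the Neo4j-style Cypher below are hypothetical, not the project's actual interface), the point is to replace N concurrent per-entity lookups with a single batched read:

```python
import asyncio

async def get_nodes_one_by_one(db, names):
    # Anti-pattern: one request (and potentially one connection) per entity.
    return await asyncio.gather(*(db.get_node(name) for name in names))

async def get_nodes_batched(db, names):
    # Batch mode: a single round trip fetches all entities at once.
    return await db.query(
        "MATCH (n:Entity) WHERE n.name IN $names RETURN n",
        names=list(names),
    )
```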
The bottleneck in node merging is the LLM calls, and simply increasing Python-level concurrency isn't the right solution. In v1.3.2 we optimized the process by deferring node merging until after each document's extraction phase completes, then performing a comprehensive merge of all nodes with identical names across the entire document. This has significantly improved node-merging speed.
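A minimal sketch of that deferred approach (names are assumptions modeled on the functions discussed in this thread, such as `_merge_nodes_then_upsert`): collect the extraction results for the whole document first, group candidate nodes by entity name, then merge each name exactly once.

```python
from collections import defaultdict

async def merge_document(chunk_results, graph, config):
    # First, collect every candidate node from every chunk of the document,
    # grouped by entity name...
    maybe_nodes = defaultdict(list)
    for nodes, _edges in chunk_results:
        for name, candidates in nodes.items():
            maybe_nodes[name].extend(candidates)
    # ...then merge (and LLM-summarize) each entity name exactly once.
    for name, candidates in maybe_nodes.items():
        await _merge_nodes_then_upsert(name, candidates, graph, config)
```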
Our tests revealed that running _merge_nodes_then_upsert and _merge_edges_then_upsert sequentially also forced _handle_entity_relation_summary to run sequentially, which left only one LLM call in flight at a time and wasted GPU computing power.
To ensure data consistency, only a single task is permitted to access the graph database during the node-merging stage, which accounts for the observed reduction in LLM concurrency. Resolving this requires restructuring the pipeline so that document entity-relation extraction and node merging can run asynchronously with respect to each other. However, this enhancement is not currently a high priority; if any contributor is willing to take it on, it would be greatly appreciated. @LarFii
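A minimal sketch of why this serializes LLM calls (all names here are hypothetical, not the project's actual code): if the summary LLM call happens inside the section guarded by a single graph-DB lock, at most one summary request can be in flight across all merge tasks.

```python
import asyncio

graph_db_lock = asyncio.Lock()  # hypothetical single lock guarding the graph store

async def merge_one(entity_name, nodes, graph, llm_summarize):
    async with graph_db_lock:
        existing = await graph.get_node(entity_name)  # DB read
        description = "\n".join(n["description"] for n in nodes)
        if existing:
            description = existing["description"] + "\n" + description
        # The LLM summary call runs while the lock is held, so at most one
        # summary request is in flight across all merge tasks.
        summary = await llm_summarize(entity_name, description)
        await graph.upsert_node(entity_name, {"description": summary})  # DB write
```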
Within a document, each entity name is only merged once, so this should not compromise data consistency. Could you explain your concern about how this might lead to data inconsistency? @danielaskdd
Agreed. Once the extract phase completes (where `chunk_results = await asyncio.gather(*tasks)` is executed), LLM capacity becomes available for subsequent operations, but those calls are serialized during the merging phase.
For example, if chunks A and B both contain an entity named "ent", each chunk attempts to look up and merge "ent" independently, so one chunk's result may overwrite the other's.
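A sketch of that lost-update hazard (hypothetical names): two concurrent merges read the same node, each combines only its own chunk's data with the stale read, and the later upsert silently discards the earlier one.

```python
import asyncio

async def unsafe_merge(graph, chunk_nodes):
    existing = await graph.get_node("ent") or []  # A and B may both read the same stale state
    merged = existing + chunk_nodes               # each task merges only its own chunk's data
    await graph.upsert_node("ent", merged)        # the later write discards the earlier one

# e.g. await asyncio.gather(unsafe_merge(g, nodes_from_A), unsafe_merge(g, nodes_from_B))
```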
The latest version on the main branch significantly optimizes the file parallel-processing logic. The entity/relationship merge phase no longer blocks extraction-phase tasks, so LLM resources stay fully utilized with no idle periods.
Do you need to ask a question?
Your Question
Currently, entities and relationships are merged sequentially:
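(The original snippet was not captured here; the following is an assumed reconstruction of the sequential pattern being described, with names modeled on the functions mentioned in this thread.)

```python
all_entities_data = []
for entity_name, nodes in maybe_nodes.items():
    entity_data = await _merge_nodes_then_upsert(
        entity_name, nodes, knowledge_graph_inst, global_config
    )
    all_entities_data.append(entity_data)
```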
Why not change it to be parallel? Like this:
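(Again, the original snippet was not captured; a hedged reconstruction of the proposal, which amounts to dispatching all merges at once with `asyncio.gather`:)

```python
import asyncio

all_entities_data = await asyncio.gather(*(
    _merge_nodes_then_upsert(entity_name, nodes, knowledge_graph_inst, global_config)
    for entity_name, nodes in maybe_nodes.items()
))
```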
Additional Context
No response