[Feature Request]: "entity_continue_extraction" should be formulated a bit differently & new chunking function #1379
Comments
Could you submit a PR to improve the chunking logic to prevent small trailing chunks? The preferred implementation should ensure the last chunk's size is at least half of the standard chunk size.
@LarFii could you please review the prompt optimization proposal and assess whether it should be incorporated into our standard prompts?
I have an idea for providing a more powerful chunker:
This would be great as a new default chunker.
This would allow people to reuse existing chunking methods like the great
Regardless of the optimization proposal, #1401 could help tune such baked-in prompts (like this
@drahnreb can you implement a more powerful chunker and make chunking_func a flexible Callable?
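A hedged sketch of what a pluggable chunking_func could look like; LightRAG's real interface may differ, and the names `ChunkingFunc`, `default_chunker`, and `ingest` are hypothetical.

```python
from typing import Callable

# A chunking function is any Callable mapping (text, chunk_size) to a
# list of chunk dicts. (Hypothetical type alias, not LightRAG's API.)
ChunkingFunc = Callable[[str, int], list]

def default_chunker(text: str, chunk_size: int = 1200) -> list:
    """Naive whitespace-token chunker used as the default."""
    words = text.split()
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [{"content": " ".join(c), "tokens": len(c)} for c in chunks]

def ingest(doc: str, chunking_func: ChunkingFunc = default_chunker) -> list:
    """The pipeline calls whatever chunker the caller passed in."""
    return chunking_func(doc, 1200)
```

This would let people drop in existing chunking libraries by wrapping them in a function with the same signature, without touching the pipeline itself.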
Do you need to file a feature request?
Feature Request Description
This is a very minor change. I am currently experimenting with the new gpt-4.1-mini, and it is wonderful and works perfectly with my given instructions.
The only thing I noticed when doing the "entity_continue_extraction" is that gpt-4.1-mini starts renaming the same entities and relationships from the previous message, which isn't necessary and causes duplicate descriptions for a lot of entities.
Here I added step 7 and changed the opening sentence to say that relationships and entities "might" have been missed in the previous extraction.
This makes the AI perform better and produces fewer duplicate descriptions. It is a very minor but powerful change.
Additional Context
I request a different chunking method. I have noticed that you can sometimes end up with a final chunk of, for example, only 5 tokens. Let's say I have the chunk size set to 1200 and a document that is 2500 tokens; then I get 3 chunks. The first 2 chunks are fine, but the final chunk has no context whatsoever. That's why I want the chunking function to make sure chunks are at least 800 tokens.
Especially in the future, when AIs are even better at extracting, there is no reason to be afraid of sending a chunk that is a bit bigger. Now that I am looking at the function, it might be better to just check the final entry of the results (its "tokens" count) and, if it is less than 800, combine it with the chunk before it.
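The post-hoc fix described above could be sketched as follows. This assumes the chunking results are dicts carrying "content" and "tokens" keys (the "tokens" key is mentioned above; the exact result shape in LightRAG may differ), and the function name is illustrative.

```python
def merge_small_tail(results, min_tokens=800):
    """Fold a too-small final chunk into the previous one.

    `results` is a list of dicts with "content" and "tokens" keys.
    If the last chunk has fewer than `min_tokens` tokens, its content
    is appended to the previous chunk and the token counts are summed.
    """
    if len(results) > 1 and results[-1]["tokens"] < min_tokens:
        tail = results.pop()
        results[-1]["content"] += " " + tail["content"]
        results[-1]["tokens"] += tail["tokens"]
    return results
```

Running this once after chunking is enough, since only the final chunk can fall below the standard chunk size.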