
[Feature Request]: "entity_continue_extraction" should be formulated a bit differently & new chunking function #1379


Open · 2 tasks done
frederikhendrix opened this issue Apr 15, 2025 · 5 comments
Labels: enhancement (New feature or request)

Comments

frederikhendrix (Contributor) commented Apr 15, 2025

Do you need to file a feature request?

  • I have searched the existing feature requests and this feature request is not already filed.
  • I believe this is a legitimate feature request, not just a question or bug.

Feature Request Description

This is a very minor change. I am currently experimenting with the new gpt-4.1-mini, and it is wonderful and works perfectly with my given instructions.

The only thing I noticed with "entity_continue_extraction" is that gpt-4.1-mini starts repeating the same entities and relationships from the previous message, which isn't necessary and causes duplicate descriptions for a lot of entities.

PROMPTS["entity_continue_extraction"] = """
MANY entities and relationships might have been missed in the last extraction. This is critical for our dense database, which is essential to the company.

---Remember Steps---

1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

4. If there is existing data in the existing data result, use that to add relationships where needed or to improve other relations.

5. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

6. When finished, output {completion_delimiter}

7. Do not write down the same entities or relationships already mentioned in your previous answer. This step should only output previously missed entities or relationships.

---Output---

Add them below using the same format:\n
""".strip()

Here I added step 7, and I changed the starting sentence to say that entities and relationships "might" have been missed in the previous extraction.

This makes the AI perform better and produces fewer duplicate descriptions. It is a very minor but powerful change.

Additional Context

I would also like to request a different chunking method. I have noticed that you can sometimes end up with a chunk of, for example, only 5 tokens at the end. Let's say I have the chunk size set to 1200 and a document that is 2500 tokens; then I get 3 chunks. The first 2 chunks are fine, but the final chunk has no context whatsoever. That's why I want the chunking function to make sure chunks are at least 800 tokens.

def chunking_by_token_size(
    content: str,
    split_by_character: str | None = None,
    split_by_character_only: bool = False,
    overlap_token_size: int = 128,
    max_token_size: int = 1024,
    tiktoken_model: str = "gpt-4o",
    min_chunk_size: int = 800,  # new parameter to control merging
) -> list[dict[str, Any]]:
    tokens = encode_string_by_tiktoken(content, model_name=tiktoken_model)
    results: list[dict[str, Any]] = []
    if split_by_character:
        raw_chunks = content.split(split_by_character)
        new_chunks = []
        if split_by_character_only:
            for chunk in raw_chunks:
                _tokens = encode_string_by_tiktoken(chunk, model_name=tiktoken_model)
                new_chunks.append((len(_tokens), chunk))
        else:
            for chunk in raw_chunks:
                _tokens = encode_string_by_tiktoken(chunk, model_name=tiktoken_model)
                if len(_tokens) > max_token_size:
                    for start in range(
                        0, len(_tokens), max_token_size - overlap_token_size
                    ):
                        chunk_content = decode_tokens_by_tiktoken(
                            _tokens[start : start + max_token_size],
                            model_name=tiktoken_model,
                        )
                        new_chunks.append(
                            (min(max_token_size, len(_tokens) - start), chunk_content)
                        )
                else:
                    new_chunks.append((len(_tokens), chunk))
        for index, (_len, chunk) in enumerate(new_chunks):
            results.append(
                {
                    "tokens": _len,
                    "content": chunk.strip(),
                    "chunk_order_index": index,
                }
            )
    else:
        for index, start in enumerate(
            range(0, len(tokens), max_token_size - overlap_token_size)
        ):
            chunk_content = decode_tokens_by_tiktoken(
                tokens[start : start + max_token_size], model_name=tiktoken_model
            )
            results.append(
                {
                    "tokens": min(max_token_size, len(tokens) - start),
                    "content": chunk_content.strip(),
                    "chunk_order_index": index,
                }
            )
    
    # Merging step: iterate through the chunks and merge any chunk
    # that has fewer tokens than the min_chunk_size with the previous chunk.
    if results:
        merged_results = []
        # Start with the first chunk
        current_chunk = results[0]
        for chunk in results[1:]:
            # If a chunk has fewer tokens than the minimum size, merge it.
            if chunk["tokens"] < min_chunk_size:
                # Concatenate text with a space separator (you may adjust as needed)
                current_chunk["content"] = current_chunk["content"].rstrip() + " " + chunk["content"].lstrip()
                # Update the token count (you could also re-encode if you need exact counts)
                current_chunk["tokens"] += chunk["tokens"]
            else:
                merged_results.append(current_chunk)
                current_chunk = chunk
        # Append the last (merged) chunk.
        merged_results.append(current_chunk)
        # Re-number chunk_order_index so the indices stay contiguous after merging.
        for index, merged_chunk in enumerate(merged_results):
            merged_chunk["chunk_order_index"] = index
        results = merged_results

    return results

Especially in the future, when AIs are even better at extraction, there is no reason to be afraid of sending a chunk that is a bit bigger. Now that I am looking at the function, it might be better to just check the "tokens" of the final entry in results and, if it is less than 800, combine it with the chunk before it.
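
A minimal sketch of that simpler variant (untested; it assumes the same results structure as above and only touches the trailing chunk):

from typing import Any


def merge_short_trailing_chunk(
    results: list[dict[str, Any]], min_chunk_size: int = 800
) -> list[dict[str, Any]]:
    # Only act when there are at least two chunks and the last one is too short.
    if len(results) >= 2 and results[-1]["tokens"] < min_chunk_size:
        last = results.pop()
        results[-1]["content"] = (
            results[-1]["content"].rstrip() + " " + last["content"].lstrip()
        )
        # Approximate count; re-encode with tiktoken if an exact number is needed.
        results[-1]["tokens"] += last["tokens"]
    return results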

frederikhendrix added the enhancement label on Apr 15, 2025
frederikhendrix changed the title from [Feature Request]: "entity_continue_extraction" should be formulated a bit differently to [Feature Request]: "entity_continue_extraction" should be formulated a bit differently & new chunking function on Apr 16, 2025
danielaskdd (Collaborator) commented:

Could you submit a PR to improve the chunking logic to prevent small trailing chunks? The preferred implementation should ensure the last chunk's size is at least half of the standard chunk size.

danielaskdd (Collaborator) commented:

@LarFii could you please review the prompt optimization proposal and assess whether it should be incorporated into our standard prompts?

danielaskdd (Collaborator) commented:

I have an idea of providing a more powerful chunker:

  • Identify the chapter structure of the document to avoid cross-chapter splits.
  • Provide metadata for each chunk, such as chapter, domain, and necessary contextual information.
  • Recognize table column headers and output RAG-friendly table chunks: providing column header to each line.
  • Prevent the last chunk of the document from being too short.
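
For illustration, a rough sketch of what a chunk record carrying that kind of metadata could look like (the chapter, domain, and context fields are placeholders, not an agreed schema):

from typing import TypedDict


class StructuredChunk(TypedDict, total=False):
    # Fields LightRAG already produces per chunk.
    tokens: int
    content: str
    chunk_order_index: int
    # Possible extra metadata from a structure-aware chunker (placeholder names).
    chapter: str  # heading/chapter the chunk was taken from
    domain: str  # e.g. "finance" or "legal"
    context: str  # short contextual note carried along with the chunk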

drahnreb (Contributor) commented:

I have an idea of providing a more powerful chunker:

  • Identify the chapter structure of the document to avoid cross-chapter splits.
  • Provide metadata for each chunk, such as chapter, domain, and necessary contextual information.
  • Recognize table column headers and output RAG-friendly table chunks: providing column header to each line.
  • Prevent the last chunk of the document from being too short.

This would be great as the new default chunker.
I suggest making chunking_func a flexible Callable[[str], List[Dict[str, Any]]] with an interface definition of

{
    "tokens": int,
    "content": str,
    "chunk_order_index": int,
}

This would allow people to reuse existing chunking methods, like the great Text Splitters from LangChain, by wrapping them into a custom chunking_func.
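
For example, a minimal sketch of wrapping LangChain's RecursiveCharacterTextSplitter into such a chunking_func, assuming it simply takes the raw text and returns a list of dicts in the interface above (the exact signature is still open):

from typing import Any

import tiktoken
from langchain_text_splitters import RecursiveCharacterTextSplitter


def langchain_chunking_func(content: str) -> list[dict[str, Any]]:
    # Token-based splitting so chunk_size/chunk_overlap are measured in tokens.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        model_name="gpt-4o", chunk_size=1200, chunk_overlap=100
    )
    encoder = tiktoken.encoding_for_model("gpt-4o")
    return [
        {
            "tokens": len(encoder.encode(chunk)),
            "content": chunk.strip(),
            "chunk_order_index": index,
        }
        for index, chunk in enumerate(splitter.split_text(content))
    ]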

@LarFii could you please review the prompt optimization proposal and assess whether it should be incorporated into our standard prompts?

Regardless of the optimization proposal, #1401 could help tune such baked-in prompts (like this entity_continue_extraction) for certain domains or strategies (like chunking or modality).

danielaskdd (Collaborator) commented:

@drahnreb, can you implement a more powerful chunker and make chunking_func a flexible Callable?
