
[Feature Request]: "entity_continue_extraction" should be formulated a bit differently & new chunking function #1379

Open
@frederikhendrix

Description


Do you need to file a feature request?

  • I have searched the existing feature request and this feature request is not already filed.
  • I believe this is a legitimate feature request, not just a question or bug.

Feature Request Description

This is a very minor change. I am currently experimenting with the new gpt-4.1-mini and it is wonderful and works perfectly with my given instructions.

The only thing I noticed when doing the "entity_continue_extraction" is that gpt-4.1-mini starts naming the same entities and relationships from the previous message again, which isn't necessary and will cause duplicate descriptions for a lot of entities.

PROMPTS["entity_continue_extraction"] = """
MANY entities and relationships might have been missed in the last extraction. This is critical for our dense database, which is essential to the company.

---Remember Steps---

1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalize the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)

2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)

3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)

4. If there is existing data in the existing data result, use that to add relationships where needed or to improve other relationships.

5. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.

6. When finished, output {completion_delimiter}

7. Do not write down the same entities or relationships already mentioned in your previous answer. This step should only output previously missed entities or relationships.

---Output---

Add them below using the same format:\n
""".strip()

Here I added step 7 and I changed the starting sentence to mention that there "might" have been relationships and entities missed in the previous extraction.

This makes the AI perform better and produces fewer duplicate descriptions. It is a very minor but powerful change.
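For context, here is a minimal sketch of how the reworded prompt would be used in the gleaning pass, i.e. sent as a follow-up message after the model's first extraction answer so that step 7 can refer back to "your previous answer". The delimiter values and the surrounding variables are assumptions for illustration, not necessarily LightRAG's actual defaults or internal call:

# Hypothetical fill-in of the proposed template. The delimiter values below
# are assumptions for illustration; LightRAG's real defaults may differ.
continue_prompt = PROMPTS["entity_continue_extraction"].format(
    entity_types="organization,person,location,event",
    tuple_delimiter="<|>",
    record_delimiter="##",
    completion_delimiter="<|COMPLETE|>",
    language="English",
)

# Gleaning pass: the follow-up prompt is appended after the first extraction
# answer, so "your previous answer" in step 7 has something to point at.
# first_extraction_prompt and first_extraction_answer are hypothetical variables.
messages = [
    {"role": "user", "content": first_extraction_prompt},
    {"role": "assistant", "content": first_extraction_answer},
    {"role": "user", "content": continue_prompt},
]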

Additional Context

I request a different chunking method. I have noticed that sometimes you can get a chunk which is, for example, only 5 tokens at the end. Let's say I have the chunk size set to 1200 and a document which is 2500 tokens, then I get 3 chunks. The first 2 chunks are fine but the final chunk has no context whatsoever. That's why I want the chunking function to make sure chunks are at least 800 tokens.

from typing import Any

# Assumed import path: these tiktoken helpers live in lightrag.utils in LightRAG.
from lightrag.utils import decode_tokens_by_tiktoken, encode_string_by_tiktoken


def chunking_by_token_size(
    content: str,
    split_by_character: str | None = None,
    split_by_character_only: bool = False,
    overlap_token_size: int = 128,
    max_token_size: int = 1024,
    tiktoken_model: str = "gpt-4o",
    min_chunk_size: int = 800,  # new parameter to control merging
) -> list[dict[str, Any]]:
    # Encode the whole document once; the token-size fallback below slices this list.
    tokens = encode_string_by_tiktoken(content, model_name=tiktoken_model)
    results: list[dict[str, Any]] = []
    if split_by_character:
        raw_chunks = content.split(split_by_character)
        new_chunks = []
        if split_by_character_only:
            for chunk in raw_chunks:
                _tokens = encode_string_by_tiktoken(chunk, model_name=tiktoken_model)
                new_chunks.append((len(_tokens), chunk))
        else:
            for chunk in raw_chunks:
                _tokens = encode_string_by_tiktoken(chunk, model_name=tiktoken_model)
                if len(_tokens) > max_token_size:
                    for start in range(
                        0, len(_tokens), max_token_size - overlap_token_size
                    ):
                        chunk_content = decode_tokens_by_tiktoken(
                            _tokens[start : start + max_token_size],
                            model_name=tiktoken_model,
                        )
                        new_chunks.append(
                            (min(max_token_size, len(_tokens) - start), chunk_content)
                        )
                else:
                    new_chunks.append((len(_tokens), chunk))
        for index, (_len, chunk) in enumerate(new_chunks):
            results.append(
                {
                    "tokens": _len,
                    "content": chunk.strip(),
                    "chunk_order_index": index,
                }
            )
    else:
        for index, start in enumerate(
            range(0, len(tokens), max_token_size - overlap_token_size)
        ):
            chunk_content = decode_tokens_by_tiktoken(
                tokens[start : start + max_token_size], model_name=tiktoken_model
            )
            results.append(
                {
                    "tokens": min(max_token_size, len(tokens) - start),
                    "content": chunk_content.strip(),
                    "chunk_order_index": index,
                }
            )
    
    # Merging step: iterate through the chunks and merge any chunk
    # that has fewer tokens than the min_chunk_size with the previous chunk.
    if results:
        merged_results = []
        # Start with the first chunk
        current_chunk = results[0]
        for chunk in results[1:]:
            # If a chunk has fewer tokens than the minimum size, merge it.
            if chunk["tokens"] < min_chunk_size:
                # Concatenate text with a space separator (you may adjust as needed)
                current_chunk["content"] = current_chunk["content"].rstrip() + " " + chunk["content"].lstrip()
                # Update the token count (you could also re-encode if you need exact counts)
                current_chunk["tokens"] += chunk["tokens"]
            else:
                merged_results.append(current_chunk)
                current_chunk = chunk
        # Append the last (merged) chunk and re-number chunk_order_index so the
        # indices stay contiguous after merging.
        merged_results.append(current_chunk)
        for index, chunk in enumerate(merged_results):
            chunk["chunk_order_index"] = index
        results = merged_results

    return results
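
A quick usage sketch with the numbers from the example above (long_document_text is a hypothetical variable holding a roughly 2500-token document; the output is what I would expect, not measured):

# With max_token_size=1200 and overlap_token_size=100, a ~2500-token document
# previously split into roughly 1200/1200/<tiny>; the tiny tail is now merged
# into the chunk before it, so only two chunks come back.
chunks = chunking_by_token_size(
    long_document_text,
    overlap_token_size=100,
    max_token_size=1200,
    min_chunk_size=800,
)
for chunk in chunks:
    print(chunk["chunk_order_index"], chunk["tokens"])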

Especially in the future, when AIs are even better at extraction, there is no reason to be afraid to send a chunk which is a bit bigger. Now that I am looking at the function, it might be better to just check the "tokens" of the final entry in results and, if it is less than 800, combine it with the chunk before it. A sketch of that simpler variant is below.
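
Roughly something like this, run after the main loop instead of the full merging pass (the token count stays approximate unless you re-encode):

# Alternative sketch: only fold a too-short final chunk into the one before it,
# instead of merging every short chunk.
if len(results) > 1 and results[-1]["tokens"] < min_chunk_size:
    tail = results.pop()
    results[-1]["content"] = results[-1]["content"].rstrip() + " " + tail["content"].lstrip()
    results[-1]["tokens"] += tail["tokens"]  # approximate; re-encode for an exact count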
