Conversation

urvishp80
Contributor

You can set the SEMANTIC_CHUNKING_PARAMS environment variable to the path of a JSON file with override params.
Example params (the defaults, semantic_chunker_params.json):

{
    "breakpoint_threshold_type": "standard_deviation",
    "buffer_size": 4,
    "breakpoint_threshold_amount": 3,
    "sentence_split_regex": "(?<=[\\.\\!\\?\\n])(\\n?)\\s+",
    "oai_embedding_model": "text-embedding-3-small",
    "oai_gen_model": "gpt-4o-mini",
    "chunk_threshold": 6000,
    "use_markdown_headers": true,
    "max_chunk_size": 6000,
    "generate_titles": false
}

Notes on the params (JSON does not allow comments, so they are listed here):
- breakpoint_threshold_type, buffer_size, breakpoint_threshold_amount, sentence_split_regex: semantic chunking params passed to SemanticChunker (see its documentation).
- oai_embedding_model: OpenAI model used for embeddings.
- oai_gen_model: OpenAI model used for generating chunk titles.
- chunk_threshold: chunk only if the number of tokens is larger than this.
- use_markdown_headers: use markdown headers for splitting.
- max_chunk_size: max number of tokens in a chunk.
- generate_titles: use an LLM to generate titles for each chunk.
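A minimal sketch of how such a file could be loaded and merged over the defaults. The variable name SEMANTIC_CHUNKING_PARAMS comes from the comment above; the load_chunking_params helper and DEFAULT_PARAMS dict are illustrative, not the actual implementation:

```python
import json
import os

# Defaults mirror the example file above.
DEFAULT_PARAMS = {
    "breakpoint_threshold_type": "standard_deviation",
    "buffer_size": 4,
    "breakpoint_threshold_amount": 3,
    "sentence_split_regex": r"(?<=[\.\!\?\n])(\n?)\s+",
    "oai_embedding_model": "text-embedding-3-small",
    "oai_gen_model": "gpt-4o-mini",
    "chunk_threshold": 6000,
    "use_markdown_headers": True,
    "max_chunk_size": 6000,
    "generate_titles": False,
}


def load_chunking_params() -> dict:
    """Return the defaults, updated with any overrides from the JSON file
    pointed to by SEMANTIC_CHUNKING_PARAMS (if the variable is set)."""
    params = dict(DEFAULT_PARAMS)
    path = os.environ.get("SEMANTIC_CHUNKING_PARAMS")
    if path:
        with open(path) as f:
            params.update(json.load(f))
    return params
```

With this shape, the JSON file only needs to contain the keys you want to change; anything omitted falls back to the default.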
