Dynamic batch size for openai embedding models #153
Conversation
Pull Request Overview
This PR implements dynamic batching for OpenAI embedding models by grouping input strings based on their token counts, rather than using a fixed batch size.
- Added a new helper function (_create_token_aware_batches) that creates batches based on token limits.
- Updated embed_strings_without_late_chunking to use the new batching logic for OpenAI models.
- Applied formatting improvements throughout the file.
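For context, here is a minimal sketch of what a token-aware batching helper could look like. This is not the PR's actual code: the real `_create_token_aware_batches` takes the embedder identifier as its second argument (see the diff below), whereas this sketch assumes tiktoken's `cl100k_base` encoding and an illustrative 8192-token limit.

```python
# Hypothetical sketch of token-aware batching (not the exact PR implementation).
import tiktoken


def create_token_aware_batches(
    strings: list[str], max_tokens_per_batch: int = 8192
) -> list[list[str]]:
    """Group strings into batches whose total token count stays under a limit."""
    encoding = tiktoken.get_encoding("cl100k_base")  # Assumed encoding.
    batches: list[list[str]] = []
    current_batch: list[str] = []
    current_tokens = 0
    for string in strings:
        tokens = len(encoding.encode(string))
        # Start a new batch if adding this string would exceed the limit.
        if current_tokens + tokens > max_tokens_per_batch and current_batch:
            batches.append(current_batch)
            current_batch = [string]
            current_tokens = tokens
        else:
            current_batch.append(string)
            current_tokens += tokens
    if current_batch:
        batches.append(current_batch)
    return batches
```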
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
src/raglite/_embed.py | Introduced dynamic, token-aware batching and improved multi-line formatting in various functions. |
pyproject.toml | Added the tiktoken dependency required for token-based batching. |
# If adding this string exceeds limit, start new batch
if current_tokens + tokens > max_tokens and current_batch:
    batches.append(current_batch)
    current_batch = [string]
    current_tokens = tokens
else:
    current_batch.append(string)
    current_tokens += tokens
Consider adding a check to handle cases where an individual string's token count exceeds the maximum allowed tokens, to ensure the batching logic gracefully handles such edge cases.
Suggested change:

# Handle strings that exceed the max token limit
if tokens > max_tokens:
    # Split the string into smaller chunks
    start = 0
    while start < len(string):
        chunk = string[start:start + max_tokens]
        chunk_tokens = len(encoding.encode(chunk))
        if current_tokens + chunk_tokens > max_tokens and current_batch:
            batches.append(current_batch)
            current_batch = []
            current_tokens = 0
        current_batch.append(chunk)
        current_tokens += chunk_tokens
        start += len(chunk)
else:
    # If adding this string exceeds limit, start new batch
    if current_tokens + tokens > max_tokens and current_batch:
        batches.append(current_batch)
        current_batch = [string]
        current_tokens = tokens
    else:
        current_batch.append(string)
        current_tokens += tokens
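As a side note on the suggestion above: slicing with `string[start:start + max_tokens]` splits by characters, which only approximates the token limit (multi-byte characters can encode to more tokens than characters). A token-level split would guarantee the limit. A hypothetical sketch, assuming tiktoken (the helper name `split_by_tokens` is not from the PR):

```python
# Hypothetical: split an oversized string on token boundaries so each chunk fits.
import tiktoken


def split_by_tokens(string: str, max_tokens: int) -> list[str]:
    encoding = tiktoken.get_encoding("cl100k_base")  # Assumed encoding.
    token_ids = encoding.encode(string)
    # Decode fixed-size slices of the token list back into text chunks.
    return [
        encoding.decode(token_ids[start : start + max_tokens])
        for start in range(0, len(token_ids), max_tokens)
    ]
```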
@lsorber WDYT?
IMO, it's impossible for a group of sentences (chunklets) to exceed 8192 tokens.
The modifications cover these functions; the remainder is related to linting.
    batches = _create_token_aware_batches(strings, config.embedder)
else:
    # Original fixed batching for non-OpenAI models
    batch_size = 96
The batch size should never be more than 96, because when using an Azure AI Foundry embedding model (such as Cohere embed4), the maximum batch size allowed by Azure AI Foundry is 96.
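For illustration, a per-request item cap like this could be combined with the token-aware batching; a hypothetical sketch (the helper name and both limits are illustrative, not the PR's code):

```python
# Illustrative only: cap batches by total tokens and by item count,
# e.g. to respect a provider-side maximum of 96 inputs per request.
def create_capped_batches(
    strings: list[str],
    token_counts: list[int],
    max_tokens: int = 8192,
    max_items: int = 96,
) -> list[list[str]]:
    batches: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for string, tokens in zip(strings, token_counts):
        # Close the current batch if adding this string would break either limit.
        if current and (current_tokens + tokens > max_tokens or len(current) >= max_items):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(string)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```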
Closing: the discussion with @ThomasDelsart led to the conclusion that this PR is not needed.
Answers issue #151. Please read the context from that issue.