Is It Possible to Await Each Streamed Token or Chunk? #467
Replies: 2 comments 2 replies
-
You're right, not all of the tokens the model generates are passed to those callbacks; some are reserved for special purposes, like end-of-response markers or the thought section of thinking models. Why do you need to delay the generation?
-
Oh, I see. So you're saying we don't see all the tokens the model actually generates because some are reserved for special moments, like when the model finishes or when thinking models emit a thought section first. That makes sense.

The reason I want the ability to delay isn't critical, although having it in the future would be really nice. I have a custom wrapper module we've been using for a project. It provides a simple way to handle logic for `onStart`, `onResponse`, `onAbort`, `onEnd`, and `onError`. All of them are async, and each one, including `onResponse`, is expected to properly await any time-consuming logic. But `onResponse` doesn't behave that way, because it is only awaited inside the `onResponseChunk` callback, which itself isn't awaited, so what I did for the other four doesn't work for the fifth. I was just trying to make sure it worked the same way. It wasn't for a specific use case, just for future-proofing. I still think it would be a useful feature.

If I'm not mistaken, when using the Python Ollama module, I was able to delay generation per token: we used the generator method to iterate over each chunk. I wonder whether that was truly delaying the generation, or whether it was just looping over chunks that were already being populated in some cached array, so that even though we delayed the loop, the generation itself might not have been affected.
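For illustration, here is a minimal sketch of the kind of wrapper described above (the names and shape are hypothetical, not the actual project code). It shows the core issue: the wrapper awaits `onResponse` inside the streaming chunk callback, but if the library doesn't await that chunk callback itself, the delay never reaches generation.

```ts
// Hypothetical wrapper shape (not the actual project code): five async
// lifecycle callbacks, with onResponse driven from the streaming chunk callback.
interface WrapperCallbacks {
    onStart?: () => Promise<void>;
    onResponse?: (chunkText: string) => Promise<void>;
    onAbort?: () => Promise<void>; // abort handling omitted in this sketch
    onEnd?: (fullText: string) => Promise<void>;
    onError?: (error: unknown) => Promise<void>;
}

async function promptWithCallbacks(
    session: {prompt(text: string, options?: object): Promise<string>},
    text: string,
    callbacks: WrapperCallbacks
): Promise<void> {
    await callbacks.onStart?.();
    try {
        const fullText = await session.prompt(text, {
            // The wrapper awaits onResponse here, but if the library does not
            // await this chunk callback, the await only delays the wrapper's
            // own handling of the chunk, not the generation of the next token.
            onTextChunk: async (chunkText: string) => {
                await callbacks.onResponse?.(chunkText);
            }
        });
        await callbacks.onEnd?.(fullText);
    } catch (error) {
        await callbacks.onError?.(error);
    }
}
```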
-
It seems that when we supply an `onTextChunk` or `onResponseChunk` callback to the `.prompt` function, and those callbacks are asynchronous functions, their internal `await` behavior doesn't affect the token generation timing as expected. For example, if each callback `await`s a 1-second delay, I would expect the next token not to be processed until that second has passed. However, this doesn't appear to be the case.

Example callback:
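Something along these lines (a minimal sketch of the callback described; the model path, prompt text, and `sleep` helper are placeholders, assuming a node-llama-cpp-style `session.prompt()` call where `onTextChunk` receives the streamed text):

```ts
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "models/my-model.gguf"}); // placeholder path
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

await session.prompt("Tell me a short story.", {
    // Async callback: delay for 1 second whenever the chunk contains a "t",
    // then print it. The expectation is that this await would also hold back
    // generation of the next token, but it doesn't appear to.
    onTextChunk: async (chunk) => {
        if (chunk.includes("t"))
            await sleep(1000);

        process.stdout.write(chunk);
    }
});
```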
In this example, whenever the chunk contains a `'t'`, it should delay for 1 second. But what actually happens is that all the chunks without `'t'` print immediately, and then the delayed ones print afterward, each one a second apart.

This suggests that the callback itself isn't being awaited before continuing to the next token. Is there currently a way to ensure that token generation respects asynchronous behavior within these callbacks?