Fix mirostat state when using multiple sequences #3543

Merged: 6 commits merged into ggerganov:master from fix-parseq-mirostat on Oct 11, 2023

KerfuffleV2
Collaborator

As mentioned in #3537, mirostat currently isn't compatible with using multiple sequences.

The main selling point of this implementation is that it's pretty simple and non-invasive.

However, I really don't like storing mutable sampler state in gpt_params (even though it's only in common and not the main llama.cpp API). This also required removing the const from the params argument to llama_sample_token. As far as I can see, the existing examples don't care about that.
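A minimal sketch of the workaround described above, using stand-in types rather than the real llama.cpp definitions. `toy_params`, `toy_seq_state` and `toy_mirostat_update` are hypothetical names, the starting value of twice the target surprise is an assumption, and the update rule is a simplified Mirostat-style step. It illustrates why the params argument can no longer be const once mutable per-sequence state is stored inside it.

```cpp
#include <cstdint>
#include <unordered_map>

using toy_seq_id = int32_t; // stand-in for llama_seq_id

// Hypothetical per-sequence sampler state; only Mirostat's mu is modeled.
struct toy_seq_state {
    float mirostat_mu;
};

// Hypothetical stand-in for gpt_params with the mutable state bolted on.
struct toy_params {
    float mirostat_tau = 5.0f; // target surprise
    float mirostat_eta = 0.1f; // learning rate
    std::unordered_map<toy_seq_id, toy_seq_state> sampler_state;
};

// Simplified Mirostat-style update: nudge mu toward the target surprise.
// `params` has to be non-const because the map is modified here.
float toy_mirostat_update(toy_params & params, toy_seq_id seq, float observed_surprise) {
    auto it = params.sampler_state.find(seq);
    if (it == params.sampler_state.end()) {
        // First use of this sequence id: start from an assumed default of 2 * tau.
        it = params.sampler_state.emplace(seq, toy_seq_state{2.0f * params.mirostat_tau}).first;
    }
    it->second.mirostat_mu -= params.mirostat_eta * (observed_surprise - params.mirostat_tau);
    return it->second.mirostat_mu;
}
```

Each sequence id gets its own mu, so one sequence's surprise history no longer leaks into another's.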

I feel like the right way to do this is probably to move the sampler state out of gpt_params and have it passed separately. In that case, this is probably also where grammar should be since it's a type of sampler state. So we wouldn't add a new argument to llama_sample_token, we'd replace the current grammar one with sampler state. This of course would require changing a lot more stuff, including all the examples that use llama_sample_token (I don't think it would be too bad though).

Thoughts?

Closes #3537

Owner

@ggerganov ggerganov left a comment

Yes, a dedicated sampling state with grammar and mirostat would be better. Maybe implemented in common/sampling.h/.cpp. It should probably inherit all sampling-related parameters from gpt_params, such as temperature, top_p, top_k, etc., so that llama_sample_token accepts struct llama_sampling_state instead of struct gpt_params.

For now we can have this workaround

@KerfuffleV2
Collaborator Author

@ggerganov

> For now we can have this workaround

Do you actually prefer doing it this way for now?

I don't mind changing this to do it the other way I suggested as long as you agree that approach is okay.

> Maybe implemented in common/sampling.h/.cpp.

This raises another question that I'm currently dealing with in my seqrep sampler. Right now it's really awkward to have multiple source files in common. This is how I dealt with it:

COMMON_DEPS = common/common.cpp common/common.h build-info.h common/log.h
COMMON_OBJS = common.o
ifndef LLAMA_DISABLE_SEQREP_SAMPLER
COMMON_DEPS += common/seqrep-sampler.cpp common/seqrep-sampler.h
COMMON_OBJS += seqrep-sampler.o
endif
common.o: $(COMMON_DEPS)
	$(CXX) $(CXXFLAGS) -c $< -o $@

simple: examples/simple/simple.cpp                            build-info.h ggml.o llama.o $(COMMON_OBJS) $(OBJS)
	$(CXX) $(CXXFLAGS) $(filter-out %.h,$^) -o $@ $(LDFLAGS)

and so on. It's probably less of a pain in cmake.

@ggerganov
Owner

We should try separating the sampling state. It would be better than the current fix, so let's give it a try if you are up for it.

The proposed Makefile looks OK to me.

@FSSRepo
Collaborator

FSSRepo commented Oct 8, 2023

Once this is merged into master, the llama_sample_token call in server-parallel (#3490) will need the extra slot.id argument: llama_sample_token(ctx, NULL, NULL, params, slot.tokens_prev, candidates, slot.i_batch, slot.id). It will also need a params.sampler_state.erase(slot.id); call.

@KerfuffleV2
Collaborator Author

Well, this went from small and self-contained to huge and complicated. I sure hope I'm on the right track after all this.

Pretty much all the sampling stuff got moved into common/sampling.{cpp,h}. Creating sampling state takes gpt_params and a llama_grammar * (can be NULL). The sampling-related params in gpt_params are also in a llama_sampling_params struct now.

The sampling state (llama_sampling_state) holds a copy of the params that were created at init time. It also holds per-sequence state for samplers that use it (Mirostat 1/2, and grammar).

If the per-sequence state doesn't exist when llama_sample_token is called, it will be created with default values. In the case of grammar, this does llama_grammar_copy on the llama_grammar * that was supplied at init time.
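The lazy creation described above can be sketched roughly as follows. This is an illustration with stand-in types, not the code from the pull: `toy_sampling_state`, `toy_seq_state`, `toy_grammar` and `toy_get_seq_state` are hypothetical names, the grammar is modeled as a plain copyable struct in place of llama_grammar_copy, and the default mu of twice tau is an assumption.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

using toy_seq_id = int32_t; // stand-in for llama_seq_id

struct toy_grammar { std::vector<int> stacks; };            // stand-in for llama_grammar
struct toy_sampling_params { float mirostat_tau = 5.0f; };  // stand-in for llama_sampling_params

struct toy_seq_state {
    float mirostat_mu;
    std::unique_ptr<toy_grammar> grammar; // per-sequence copy, may be null
};

struct toy_sampling_state {
    toy_sampling_params params;            // copy taken at init time
    const toy_grammar * grammar = nullptr; // prototype supplied at init, may be null
    std::unordered_map<toy_seq_id, toy_seq_state> seq_states;
};

// Fetch the state for a sequence, creating it with defaults on first use.
toy_seq_state & toy_get_seq_state(toy_sampling_state & state, toy_seq_id seq) {
    auto it = state.seq_states.find(seq);
    if (it == state.seq_states.end()) {
        toy_seq_state fresh;
        fresh.mirostat_mu = 2.0f * state.params.mirostat_tau; // assumed default
        if (state.grammar != nullptr) {
            // Each sequence gets its own copy of the prototype grammar.
            fresh.grammar.reset(new toy_grammar(*state.grammar));
        }
        it = state.seq_states.emplace(seq, std::move(fresh)).first;
    }
    return it->second;
}
```

Because every sequence works on its own grammar copy, accepting a token in one sequence cannot advance the grammar of another.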

If you want to use a separate grammar, or separate grammar states, per sequence, then you'd probably have to manage the grammar part yourself. (Not sure this is so easy right now, so I might need to add an interface.)

In the case of parallel generation, when a sequence ends and you want to reuse its id (or just free up memory), you need to call llama_sampling_state_reset with the sequence id. This will reset stuff like Mirostat mu to the initial value so it won't mess with future generation using that sequence id.
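One way such a reset could work is to drop the per-sequence entry so the next use starts from defaults again. The sketch below assumes that approach. `toy_sampling_state`, `toy_mu` and `toy_sampling_state_reset` are hypothetical stand-ins, not the functions from the pull, and only Mirostat's mu is modeled.

```cpp
#include <cstdint>
#include <unordered_map>

using toy_seq_id = int32_t; // stand-in for llama_seq_id

struct toy_sampling_state {
    float mirostat_tau = 5.0f;
    std::unordered_map<toy_seq_id, float> mu_by_seq; // per-sequence Mirostat mu
};

// Fetch mu for a sequence, creating it with the assumed default on first use.
float & toy_mu(toy_sampling_state & state, toy_seq_id seq) {
    auto it = state.mu_by_seq.find(seq);
    if (it == state.mu_by_seq.end()) {
        it = state.mu_by_seq.emplace(seq, 2.0f * state.mirostat_tau).first;
    }
    return it->second;
}

// Drop a finished sequence's state; the next use starts from defaults again.
void toy_sampling_state_reset(toy_sampling_state & state, toy_seq_id seq) {
    state.mu_by_seq.erase(seq);
}
```

Erasing instead of overwriting also frees the memory held for sequence ids that are never reused.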

I also randomly threw in support for LLAMA_SANITIZE_{THREAD,ADDRESS,UNDEFINED} in the Makefile.

I think this currently doesn't break stuff, but there were some tricky parts like server and speculative.

@KerfuffleV2
Collaborator Author

@FSSRepo You're going to hate me when you see the next step in this pull.

Calling sampling is going to look like:

llama_sample_token(ctx, NULL, sampling_state, slot.tokens_prev, candidates, slot.i_batch, slot.id);

It looks like #3490 doesn't support grammar currently? So that's going to make your life easier. Pretty much the only other thing to worry about is calling llama_sampling_state_reset when you're done with a sequence id (but may want to reuse it for a different generation). Pretty much when you hit the EOS token or reach the quota of tokens to generate, you can just reset that sequence id.
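As a rough illustration of when to reset, the sketch below models a slot that is finished on EOS or when its token quota is reached. `toy_slot`, `toy_slot_done` and `toy_run_slot` are hypothetical, and the place where the real reset call would go is only marked with a comment.

```cpp
#include <cstdint>
#include <vector>

using toy_token = int32_t; // stand-in for llama_token

struct toy_slot {
    int32_t id;          // sequence id used by this slot
    int32_t n_generated; // tokens produced for the current request
    int32_t n_predict;   // quota of tokens to generate
};

// True when the slot's sequence is finished and its sampler state should be reset.
bool toy_slot_done(const toy_slot & slot, toy_token last, toy_token eos) {
    return last == eos || slot.n_generated >= slot.n_predict;
}

// Feed already-sampled tokens to a slot; returns how many times it was reset.
int toy_run_slot(toy_slot & slot, const std::vector<toy_token> & sampled, toy_token eos) {
    int n_resets = 0;
    for (toy_token tok : sampled) {
        slot.n_generated++;
        if (toy_slot_done(slot, tok, eos)) {
            // Real code would reset the sampler state for slot.id here,
            // e.g. via llama_sampling_state_reset, before reusing the id.
            slot.n_generated = 0;
            n_resets++;
        }
    }
    return n_resets;
}
```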

@ggerganov ggerganov self-requested a review October 8, 2023 17:29
@ggerganov ggerganov added the need feedback Testing and feedback with results are needed label Oct 8, 2023
Code formatting cleanups and add some comments

Silence a warning about id not being used when logging is disabled
@KerfuffleV2
Collaborator Author

KerfuffleV2 commented Oct 8, 2023

I exported the function to fetch/create default instances of sampler state. This should fix the problem I mentioned earlier about how it would be hard to do something like parallel generation where each sequence used its own grammar.

By the way, since the ggml-alloc changes I've been seeing these runtime errors:

ggml-alloc.c:212:32: runtime error: pointer index expression with base 0x00000100a020 overflowed to 0xffffffffffffffff
llama_new_context_with_model: compute buffer total size = 552.88 MB
llama_new_context_with_model: VRAM scratch buffer: 546.75 MB
llama_new_context_with_model: total VRAM used: 6515.44 MB (model: 3368.69 MB, context: 3146.75 MB)
ggml-cuda.cu:6787:51: runtime error: applying non-zero offset 1152 to null pointer

Not sure if that's anything to worry about. The ggml-alloc one is new, the other one isn't. (I was assuming it was just because GCC's address sanitizing stuff doesn't know about CUDA/ROCM.) edit: Might not be a problem since it seems like it's triggering just on doing pointer math that results in something out of bounds rather than actually accessing it.

@KerfuffleV2
Collaborator Author

This isn't really approved, right? Even without a full review, any changes I can/should start working on?

Comment on lines +54 to +55
const std::vector<llama_token> & last_tokens,
std::vector<llama_token_data> & candidates,
Owner

Should we absorb last_tokens and candidates into llama_sampling_state?

Collaborator Author

Last tokens can be specific to the sequence, right? So this would kind of mean stuff using last tokens would have to be aware of sequences. I kind of feel like this also might limit how people can manipulate last tokens and if there are currently examples that do that kind of thing it might be difficult to adapt them (for me anyway, since I'm not really deeply familiar with most of them).

candidates I'm less sure about, it's basically just a scratch area for the logits in a form samplers can work with (right?) so I think moving it in there is less of a big deal. It's a pretty large structure though, I don't know if that's a consideration. Right now the current stuff in those structs is pretty lightweight.
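To make the "scratch area" point concrete, here is a sketch of how a candidates array is typically filled from raw logits. `toy_token_data` mirrors the shape of llama_token_data (id, logit, p) but is a stand-in type, and `toy_make_candidates` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using toy_token = int32_t; // stand-in for llama_token

// Stand-in with the same shape as llama_token_data.
struct toy_token_data {
    toy_token id;
    float     logit;
    float     p; // probability, filled in later by the samplers
};

// Build one candidate entry per vocabulary item from the raw logits.
std::vector<toy_token_data> toy_make_candidates(const std::vector<float> & logits) {
    std::vector<toy_token_data> candidates;
    candidates.reserve(logits.size());
    for (size_t i = 0; i < logits.size(); i++) {
        candidates.push_back(toy_token_data{static_cast<toy_token>(i), logits[i], 0.0f});
    }
    return candidates;
}
```

The array has one entry per vocabulary item, which is why it is much heavier than the few scalars currently held in the sampling structs.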

I don't have very strong feelings about this. I'd like to say "These changes are already complicated enough, let's come back to that" but... I probably never would. :)

Comment on lines 161 to 163
if (seq_state.grammar != NULL) {
llama_grammar_accept_token(ctx, seq_state.grammar, id);
}
Owner

We should probably add llama_sampling_accept_token() and move this call in there, together with the update of the last_tokens member (if we decide it should become part of llama_sampling_state).
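A sketch of what such an accept function might do if last_tokens were a member of the sampling state. All names here are hypothetical stand-ins, and the grammar is reduced to a counter in place of the real llama_grammar_accept_token call.

```cpp
#include <cstdint>
#include <vector>

using toy_token = int32_t; // stand-in for llama_token

struct toy_grammar { int n_accepted = 0; }; // stand-in for llama_grammar

struct toy_sampling_state {
    std::vector<toy_token> last_tokens; // fixed-size window, pre-sized by the caller
    toy_grammar * grammar = nullptr;    // may be null
};

// Record a sampled token: slide the window and advance the grammar, if any.
void toy_sampling_accept_token(toy_sampling_state & state, toy_token id) {
    if (!state.last_tokens.empty()) {
        state.last_tokens.erase(state.last_tokens.begin()); // drop the oldest token
    }
    state.last_tokens.push_back(id);
    if (state.grammar != nullptr) {
        state.grammar->n_accepted++; // stand-in for llama_grammar_accept_token
    }
}
```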

Collaborator Author

This might be a bit of a pain since it depends on a number of other static functions like decode_utf8, llama_grammar_accept, llama_token_to_str. It looks like the grammar stuff is the only thing that uses them, so maybe they could be moved too. I'm not sure what other parts of the grammar code depend on them though, so it might not be that simple.

Owner

@ggerganov ggerganov left a comment

This is a great change. I actually think that we should merge llama_sampling directly into llama.cpp. But let's do this after this PR is merged and tested for some time.

> Not sure if that's anything to worry about.

These errors look benign, but we will look into ways to fix them anyway.

Fix comments that were out of sync with the pull.
@KerfuffleV2
Collaborator Author

KerfuffleV2 commented Oct 11, 2023

Current status: I took the suggestions to rename the functions/types but I didn't do stuff like moving last_tokens into the sampling context (yet). edit: Just to be clear, the "(yet)" doesn't mean I'm actually planning to unless someone insists on it.

@KerfuffleV2 KerfuffleV2 requested a review from ggerganov October 11, 2023 10:18
llama_token llama_sampling_sample(
struct llama_context * ctx,
struct llama_context * ctx_guidance,
struct llama_sampling_context & sampling_ctx,
Owner

Suggested change
- struct llama_sampling_context & sampling_ctx,
+ struct llama_sampling_context & ctx_sampling,

There are a few other places that need a similar change for consistency's sake.

Collaborator Author

By "a few" you mean 20 or so? :) Hopefully I caught them all. Everything seems to compile/work still.

@ggerganov ggerganov merged commit 70c29da into ggerganov:master Oct 11, 2023
33 of 38 checks passed
@ggerganov
Owner

Thanks for this - I accidentally merged this too quickly with the old title. Should have updated to the more relevant change of introducing the llama_sampling_context.

Comment on lines +50 to +53

// map of sequence ids to sampler contexts
std::unordered_map<llama_seq_id, llama_sampler_sequence_context> sequence_contexts;

Owner

@ggerganov ggerganov Oct 12, 2023

@KerfuffleV2

Any reason not to keep only single-sequence data in llama_sampling_context?
When we want to sample multiple sequences, we will create one llama_sampling_context for each.

This way, each sequence can also have a separate llama_grammar instance which seems to make sense.

I.e. have it like this:

// general sampler context
typedef struct llama_sampling_context {
    ~llama_sampling_context();

    // parameters that will be used for sampling
    llama_sampling_params params;

    float mirostat_mu; // mirostat sampler state

    llama_grammar * grammar;
} llama_sampling_context;
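Under this proposal, parallel sampling would hold one context per sequence instead of a map inside a shared context. A sketch of that usage with stand-in types follows. `toy_sampling_context`, `toy_sampling_init` and `toy_make_contexts` are hypothetical, the grammar copy stands in for llama_grammar_copy, and the initial mu of twice tau is an assumption.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct toy_grammar { std::vector<int> stacks; };            // stand-in for llama_grammar
struct toy_sampling_params { float mirostat_tau = 5.0f; };  // stand-in for llama_sampling_params

// One context per sequence: no map and no sequence ids inside the context.
struct toy_sampling_context {
    toy_sampling_params params;
    float mirostat_mu;
    std::unique_ptr<toy_grammar> grammar; // owned copy, may be null
};

toy_sampling_context toy_sampling_init(const toy_sampling_params & params, const toy_grammar * grammar) {
    toy_sampling_context ctx;
    ctx.params = params;
    ctx.mirostat_mu = 2.0f * params.mirostat_tau; // assumed default
    if (grammar != nullptr) {
        ctx.grammar.reset(new toy_grammar(*grammar));
    }
    return ctx;
}

// Build one independent context for each of n_seq parallel sequences.
std::vector<toy_sampling_context> toy_make_contexts(size_t n_seq,
                                                    const toy_sampling_params & params,
                                                    const toy_grammar * grammar) {
    std::vector<toy_sampling_context> out;
    out.reserve(n_seq);
    for (size_t i = 0; i < n_seq; i++) {
        out.push_back(toy_sampling_init(params, grammar));
    }
    return out;
}
```

Resetting a sequence then amounts to replacing its context with a freshly initialized one.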

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 12, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggerganov#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggerganov#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggerganov#3592)
  ci : check if there is enough VRAM (ggerganov#3596)
  server : add completion mode (no chat) (ggerganov#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggerganov#3588)
  main : fix session loading bug (ggerganov#3400)
  server : add parameter -tb N, --threads-batch N (ggerganov#3584)
  common : fix mirostat state when using multiple sequences (ggerganov#3543)
  batched : add bench tool (ggerganov#3545)
  examples : add batched.swift + improve CI for swift (ggerganov#3562)
  Add MPT model to supported models in README.md (ggerganov#3574)
  Minor improvements in GPT2 tokenizer (ggerganov#3567)
  readme : add bloom (ggerganov#3570)
  llm : add bloom models (ggerganov#3553)
  swift : improvements and fixes (ggerganov#3564)
  llm : add MPT support (ggerganov#3417)
  infill. : fix tokenization (ggerganov#3508)
  ...
@KerfuffleV2 KerfuffleV2 deleted the fix-parseq-mirostat branch November 17, 2023 03:11
Labels
bug Something isn't working need feedback Testing and feedback with results are needed
Development

Successfully merging this pull request may close these issues.

[Bug] Mirostat samplers don't work properly with parallel generation
3 participants