ggml : expose hash table API from ggml.c and reuse in ggml-alloc #549

Closed
ggerganov opened this issue Oct 4, 2023 · 17 comments · Fixed by #586
Labels: good first issue, refactoring

Comments

@ggerganov
Owner

#548 adds a hash table implementation:

ggml/src/ggml.c

Lines 16926 to 16985 in a094d9c

static_assert(GGML_GRAPH_HASHTABLE_SIZE > GGML_MAX_NODES * 2, "GGML_GRAPH_HT_SIZE is too small");

static size_t hash(void * p) {
    return (size_t)p % GGML_GRAPH_HASHTABLE_SIZE;
}

static size_t hash_find(void * hash_table[], void * p) {
    size_t h = hash(p);

    // linear probing
    size_t i = h;
    while (hash_table[i] != NULL && hash_table[i] != p) {
        i = (i + 1) % GGML_GRAPH_HASHTABLE_SIZE;
        if (i == h) {
            // visited all hash table entries -> not found
            return GGML_GRAPH_HASHTABLE_SIZE;
        }
    }
    return i;
}

static bool hash_insert(void * hash_table[], void * p) {
    size_t i = hash_find(hash_table, p);

    GGML_ASSERT(i < GGML_GRAPH_HASHTABLE_SIZE); // assert that not full

    if (hash_table[i] == p) {
        return true;
    }

    // insert
    GGML_ASSERT(hash_table[i] == NULL);
    hash_table[i] = p;
    return false;
}

static bool hash_contains(void * hash_table[], void * p) {
    size_t i = hash_find(hash_table, p);
    return (i < GGML_GRAPH_HASHTABLE_SIZE) && (hash_table[i] == p);
}

struct hash_map {
    void * keys[GGML_GRAPH_HASHTABLE_SIZE];
    void * vals[GGML_GRAPH_HASHTABLE_SIZE];
};

static struct hash_map * new_hash_map(void) {
    struct hash_map * result = malloc(sizeof(struct hash_map));
    for (int i=0; i<GGML_GRAPH_HASHTABLE_SIZE; ++i) {
        result->keys[i] = NULL;
        result->vals[i] = NULL;
    }
    return result;
}

static void free_hash_map(struct hash_map * map) {
    free(map);
}

It should be exposed through the API and reused in ggml-alloc.

ggerganov added the good first issue and refactoring labels on Oct 4, 2023
@slaren
Collaborator

slaren commented Oct 10, 2023

I need a hash table in ggml-backend to map tensors between different backends, so I am looking into this. I don't think that this can be exported and reused as is, because supporting only values of type void * is not enough, but that can be solved. However, I am not convinced that this should be shared in ggml.h: it would become part of the public API, and inevitably someone would try to use it in their code. I have said this before, but I think we need to be more careful about what we expose in the public headers. This will also need changes after we implement dynamically sized graphs, so the API would change in the near future.

So I am not sure it is worth doing this. Should we add a private header file for internal use within ggml only?

@ggerganov
Owner Author

ggerganov commented Oct 10, 2023

Sure, we should add an internal header (e.g. ggml-impl.h or ggml-util.h) in src. We can also put things behind #ifdef GGML_PRIVATE_API when we think a new header isn't really necessary, but we still want to restrict exposure to the users. Eventually, we will hide all internal structs behind accessors. It's just that at the start it is easier to have the internals exposed and not think about what the best API would be. But as people start to rely more on ggml, we will indeed have to be careful.
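For illustration only, such an internal header could look something like this (the ggml_hash_* names and the size constant are hypothetical, not the final API):

// ggml-impl.h - internal header, lives in src/ and is not installed (sketch)
#pragma once

#include <stdbool.h>
#include <stddef.h>

// must stay larger than 2x the maximum number of graph nodes
#define GGML_HASHTABLE_SIZE 32771 // illustrative value only

// returns the slot of p, or GGML_HASHTABLE_SIZE if the table is full
size_t ggml_hash_find    (void * hash_table[], void * p);

// returns true if p was already in the table
bool   ggml_hash_insert  (void * hash_table[], void * p);

bool   ggml_hash_contains(void * hash_table[], void * p);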

Btw, I'm still not really worried about breaking things in the ggml API, so if there are improvements that you see in ggml_cplan, ggml_cgraph, etc., we should make them without hesitating about breaking other projects. Changes are still easy to adapt to at this stage. In llama.cpp and whisper.cpp we have more stable APIs since these are used more intensively, so there it is already more difficult to break the API.

@slaren
Collaborator

slaren commented Oct 10, 2023

if there are improvements that you see in ggml_cplan, ggml_cgraph, etc., we should make them without hesitating about breaking other projects.

I am very concerned about breaking our own examples, though; the training examples in particular depend very strongly on the internals of all of this. I think that the whole graph building code should be redesigned. I don't know how to solve every problem with dependencies, but I think a good start would be to automatically attach all the tensors created in a context to a graph, with an option to explicitly detach them (a sketch of the idea follows below). Currently we do the opposite: we create all tensors detached and then attach them with calls to ggml_build_forward_expand, and I think that makes ggml harder to use. If we start from that, we may also conclude that ggml_context and ggml_cgraph could be unified. But I don't think any of this would be doable without introducing a new API version and slowly adapting all the code.
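To illustrate the idea only (the ggml_detach call is hypothetical, nothing like it exists today):

// current API: tensors are created detached and have to be attached explicitly
struct ggml_tensor * cur = ggml_mul_mat(ctx, model.w, inp);
ggml_build_forward_expand(gf, cur); // easy to forget for ops with no visible consumer

// proposed: every op created in the context would be part of the graph automatically
struct ggml_tensor * cur2 = ggml_mul_mat(ctx, model.w, inp); // already attached
// ggml_detach(ctx, tmp);                                    // hypothetical opt-out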

@ggerganov
Owner Author

ggerganov commented Oct 10, 2023

I don't know how to solve every problem with dependencies, but I think a good start would be to automatically attach all the tensors created in a context to a graph, with an option to explicitly detach them.

That's a good idea. I think it is totally doable. The order of the nodes in the graph will be determined by the order of creation (i.e. the order of calling ggml_xxx()), so we avoid having to compute the dependencies. Not only would we resolve the dependency problems, we would likely also speed things up for small networks.

The graph construction where we traverse the children / parents is probably completely unnecessary. It was one of the very first features in ggml and it was an experimental approach. I can imagine some use cases where you could benefit from having the option to compute certain nodes based only on their dependencies, but for the majority of practical uses this is irrelevant.

Edit: Actually, it is relevant for the backward pass, since in that situation you could have many extra ops that are not necessary for the calculation of the gradients that you are interested in. So in this case, traversing the dependencies will automatically avoid any unnecessary ops.

Maybe we can think of a way to support both modes - would make the API transition easier too.

@slaren
Collaborator

slaren commented Oct 10, 2023

The backwards pass should always start from a loss, right? So I think in that case it is perfectly reasonable to start from the loss node and follow the parents to build the backwards graph. But if we have all the nodes somewhere, we may be able to add missing dependencies automatically. The typical case is the KV update:

ggml_cpy(ctx0, Kcur, ggml_view_1d(ctx0, kv_self.k, ...));

struct ggml_tensor * K = ggml_view_3d(ctx0, kv_self.k, ...);

struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);

The ggml_cpy is completely disconnected from KQ, but if we have all the nodes we can figure out that there is a dependency there and add it automatically. This is more what I am thinking about with this. Currently it is very easy to build a graph with several nodes disconnected, and I can only imagine that this is very confusing unless you know the internals of how graphs are built in ggml. So I think we should find ways to simplify this.
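For reference, this is roughly how the hidden dependency has to be forced by hand today (simplified from the llama.cpp-style KV update):

// force the cache update into the graph before anything reads from kv_self.k
ggml_build_forward_expand(gf, ggml_cpy(ctx0, Kcur, ggml_view_1d(ctx0, kv_self.k, ...)));

// only then build the ops that read from the cache
struct ggml_tensor * K  = ggml_view_3d(ctx0, kv_self.k, ...);
struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);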

@ggerganov
Owner Author

Let me think some more about the backward pass.

The ggml_cpy is completely disconnected from KQ, but if we have all the nodes we can figure out that there is a dependency there and add it automatically.

Why would we need to figure out the dependency? The operations will be executed in the order in which they were declared, so there is no need to find the dependency. I am thinking more that we would basically record the node order while constructing the graph, instead of trying to determine it by traversing the nodes post-construction.

@slaren
Collaborator

slaren commented Oct 10, 2023

For parallel execution we need to know all the dependencies. To minimize backend switches with partial offloading we may need to change the evaluation order, but we can only do that if we know all the dependencies. I assume that we also need to know the dependencies to build the correct backwards pass, but I don't know enough about that.

We may also be able to find the best evaluation order to minimize memory usage during training if we know all the dependencies.

@ggerganov
Owner Author

The order in which we invoke the ops (A) + the existing dependency information through ggml_tensor.src (B) is enough to achieve all goals. We currently do not store the order (A), and therefore we have cases where we cannot figure out the correct execution order.

If we start recording the order, it will add an extra constraint: an op x invoked before op y cannot be executed after y. Note that this does not mean that x and y cannot be executed together.

The KV cache update example would be trivially solved if we start recording (A) and apply the new constraint. The parallelisation can be achieved with an extra pass that analyses the graph and takes both (A) and (B) into account. The same goes for the training use cases.

Having (A), we also avoid the ggml_visit_parents() traversal for trivial cases, i.e. when we don't care about parallelisation, reordering, etc. We just evaluate the ops directly in the order they were invoked during graph creation.
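A very rough sketch of what recording (A) could look like (the hook is hypothetical, this is not the current ggml code):

struct ggml_cgraph {
    int                  n_nodes;
    struct ggml_tensor * nodes[GGML_MAX_NODES]; // (A): nodes in invocation order
    // ...
};

// hypothetical hook called from every ggml_xxx() op constructor
static void ggml_graph_record(struct ggml_cgraph * gf, struct ggml_tensor * node) {
    GGML_ASSERT(gf->n_nodes < GGML_MAX_NODES);
    gf->nodes[gf->n_nodes++] = node; // (A): invocation order
    // (B): node->src[] already carries the data dependencies
}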

Let me know if you agree - is there any other use case that you think we might need to cover?

@slaren
Collaborator

slaren commented Oct 11, 2023

an op x invoked before op y cannot be executed after y. Note that this does not mean that x and y cannot be executed together.

I think this should be "an op x invoked before op y cannot be executed after y if x is a source of y (*)". Otherwise this is a contradiction.

(*): or if a view of x, or a view with the same view_src as x, is a source of y.

@ggerganov
Owner Author

How is it a contradiction? I mean that in the example above, ggml_cpy() cannot be executed after ggml_mul_mat() because it is invoked earlier; we can enforce this using the order of invocation (i.e. the order of calling the ggml ops).

@slaren
Collaborator

slaren commented Oct 11, 2023

The contradiction is that x must be executed before y, but at the same time x and y can be executed together. If they can be executed together, that must imply that the order of evaluation doesn't matter. Maybe I am misunderstanding what you mean by "executed together"? As it is, this restriction implies that every op must be executed strictly in the order in which it is invoked.

@ggerganov
Owner Author

ggerganov commented Oct 11, 2023

"x cannot be executed after y" does not imply that x must be executed before y.
Here is another example to clarify:

Let's say we have the following ops:

// these can go in parallel
K = ggml_mul_mat(ctx, model.wk, cur);
Q = ggml_mul_mat(ctx, model.wq, cur);
V = ggml_mul_mat(ctx, model.wv, cur);

// these can go in parallel
Q_scaled = ggml_scale(ctx, Q, scale);
V_scaled = ggml_scale(ctx, V, scale);

// these can go in parallel
Q_masked = ggml_add(ctx, Q_scaled, mask);
V_masked = ggml_add(ctx, V_scaled, mask);

K cannot be executed after Q and it cannot be executed after V, but we can execute K, Q and V in parallel.
We can optionally decide to do that by analysing their sources with a function similar to ggml_visit_parents().

Similarly, Q_scaled and V_scaled can be executed together in parallel after V, since we can analyze their dependencies. But neither of them can be executed after Q_masked or V_masked, because the invocation order forbids it.

In contrast, the following graph will not allow parallelization:

K = ggml_mul_mat(ctx, model.wk, cur);

Q = ggml_mul_mat(ctx, model.wq, cur);
Q_scaled = ggml_scale(ctx, Q, scale);
Q_masked = ggml_add(ctx, Q_scaled, mask);

V = ggml_mul_mat(ctx, model.wv, cur);
V_scaled = ggml_scale(ctx, V, scale);
V_masked = ggml_add(ctx, V_scaled, mask);

Or technically, in this case we could parallelize (K, Q) and (Q_masked, V) at most.

With this approach, I don't think we even have to consider view_src ever; at least I cannot think of a scenario where we would need to consider it.
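To make the rule concrete, a naive pass that extends a concurrent group over the recorded invocation order could look like this (illustrative only, using just the src dependencies (B)):

static bool is_src_of(const struct ggml_tensor * a, const struct ggml_tensor * b) {
    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        if (b->src[i] == a) {
            return true;
        }
    }
    return false;
}

// nodes[] is the graph in invocation order (A); returns one past the last node that
// can still run concurrently with the group starting at nodes[first]
static int concurrent_group_end(struct ggml_tensor ** nodes, int first, int n_nodes) {
    for (int end = first + 1; end < n_nodes; ++end) {
        for (int i = first; i < end; ++i) {
            if (is_src_of(nodes[i], nodes[end])) {
                return end; // nodes[end] depends on an earlier member of the group
            }
        }
    }
    return n_nodes;
}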

@slaren
Collaborator

slaren commented Oct 11, 2023

K cannot be executed after Q and it cannot be executed after V, but we can execute K, Q and V in parallel.

I think I must be missing something. If K, Q and V can be executed in parallel, then necessarily that must mean that K, Q and V can be executed in any order. Purely from a practical perspective, this must be true simply because we cannot control the scheduling of the threads. What is the definition of executed in parallel (in this context)?

@ggerganov
Owner Author

Parallel means we can run the ops concurrently, e.g. in multiple CUDA streams, or with Metal using a concurrent dispatch command buffer followed by a sync / barrier.

@slaren
Collaborator

slaren commented Oct 11, 2023

As with CPU threads, there is no guarantee that two kernels launched on different CUDA streams will run at the same time. They may run in any order. I think we need to be very clear that running multiple operations in parallel requires the operations to be invariant to the order of execution.

To add another example: with this wording, I don't see what would prevent executing the ggml_cpy and KQ at the same time. But that's not correct: KQ has a hidden dependency on the ggml_cpy that we currently are not able to detect.

ggml_cpy(ctx0, Kcur, ggml_view_1d(ctx0, kv_self.k, ...));
KQ = ggml_mul_mat(ctx0, ggml_view_3d(ctx0, kv_self.k, ...), Q);

The wording "an op x invoked before op y cannot be executed after y if x is a source of y" solves this issue (it is implied that this also applies to views x).

@ggerganov
Owner Author

Nothing would prevent the ops from executing in parallel, but I'm not convinced that this is actually wrong.

In a similar example, I could be copying data into the first half of a tensor k and multiplying the second half of tensor k, and this would be perfectly valid to execute in parallel, since we know there would be no data race.
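Something along these lines (simplified, the shapes and names are made up):

// k has 2*n elements; the two views below do not overlap, so there is no data race
struct ggml_tensor * k_lo = ggml_view_1d(ctx, k, n, 0);
struct ggml_tensor * k_hi = ggml_view_1d(ctx, k, n, n*ggml_element_size(k));

// writing the first half and reading the second half can safely run concurrently
struct ggml_tensor * upd = ggml_cpy(ctx, new_data, k_lo);
struct ggml_tensor * out = ggml_mul(ctx, x, k_hi);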

I would rather have the user be explicit about the dependencies in the graph than try to solve the dependency problem in the general case. The logic seems to become too complicated, and the only benefit is that we save an extra op in the graph.

In the default scenario, the parallelization optimization would be disabled and the example above would evaluate correctly, without the need for the carefully placed ggml_build_forward_expand() calls that we currently rely on. When the parallel optimization is enabled, the user has to be aware of the effects of this option and explicitly declare the hidden dependencies where necessary.

We are still just brainstorming, so I am open to considerations and more discussion. But I'm thinking that starting to respect the invocation order would give us multiple benefits and should be easy to implement.

@slaren
Collaborator

slaren commented Oct 11, 2023

In the case of the KV update, it is undoubtedly wrong not to add a dependency between the ggml_cpy update and KQ. We could also extend the requirement to consider only overlapping ranges of a tensor rather than just any views, but that would be just an optimization and wouldn't fundamentally change the requirement.

Admittedly, the main difficulty of the full requirement is the complexity of the implementation. However, I think that this could actually be pretty simple to implement if we are just a bit smarter about the way we handle views. For example, if we kept a list of the views of a tensor, we could in principle simply modify the view ops to use as a source the most recent view of the source tensor (again, if we want to optimize this, we could revise this definition to "the most recent view of the source tensor whose range overlaps the current view"). Note that this may require allowing negative offsets internally in the view ops, but that's not a big issue.
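A bare-bones sketch of the idea, reusing the hash_map from the snippet at the top (the track_* helpers are made up for illustration):

// map a tensor to the most recent op that wrote to it (or to a view of it)
static void track_write(struct hash_map * views, struct ggml_tensor * dst, struct ggml_tensor * op) {
    size_t i = hash_find(views->keys, dst);
    GGML_ASSERT(i < GGML_GRAPH_HASHTABLE_SIZE); // table not full
    views->keys[i] = dst;
    views->vals[i] = op;
}

// when creating a new view of src, depend on the latest write to it, if any
static struct ggml_tensor * track_dependency(struct hash_map * views, struct ggml_tensor * src) {
    size_t i = hash_find(views->keys, src);
    if (i < GGML_GRAPH_HASHTABLE_SIZE && views->keys[i] == src) {
        return (struct ggml_tensor *) views->vals[i];
    }
    return src;
}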

The problem of requiring the user to handle these dependencies explicitly is that if they don't, then the program will silently produce wrong results, and that's a very difficult problem to diagnose unless you are already an expert in ggml. In my opinion that's not good if we want ggml to be widely used.
