
ggml_backend_sched_graph_compute fails to compute a hybrid(?) graph when called for the second time #892

Closed
MollySophia opened this issue Jul 18, 2024 · 2 comments


@MollySophia
Contributor

Hi!
When a graph has weight buffers on several backends, the scheduler partitions the graph into several splits.
ggml-backend.c#L1593

                if (src_backend_id != cur_backend_id && !supported) {
                    // create a copy of the input in the split's backend
                    const size_t id = hash_id(src);
                    if (sched->tensor_copies[id][cur_backend_id][0] == NULL) {
                        ggml_backend_t backend = sched->backends[cur_backend_id];
                        for (int c = 0; c < sched->n_copies; c++) {
                            struct ggml_tensor * tensor_copy = ggml_dup_tensor_layout(sched->ctx, src);
                            ggml_format_name(tensor_copy, "%s#%s#%d", ggml_backend_name(backend), src->name, c);
                            if (sched->n_copies > 1) {
                                ggml_set_input(tensor_copy);
                                ggml_set_output(tensor_copy); // prevent ggml-alloc from overwriting the tensor
                            }
                            sched->tensor_copies[id][cur_backend_id][c] = tensor_copy;
                            SET_CAUSE(tensor_copy, "4.cpy");
                        }
                        int n_inputs = split->n_inputs++;
                        GGML_ASSERT(n_inputs < GGML_SCHED_MAX_SPLIT_INPUTS);
                        split->inputs[n_inputs] = src;
                    }
                    node->src[j] = sched->tensor_copies[id][cur_backend_id][sched->cur_copy];
                }

If a node's backend_id differs from that of one of its srcs, the scheduler creates a new split, along with the tensor_copies and the bookkeeping needed to copy data between splits later.
However, it also modifies the original cgraph here, making the node's src point to the tensor_copy:

node->src[j] = sched->tensor_copies[id][cur_backend_id][sched->cur_copy];

But ggml_backend_sched_reset doesn't reset this.

void ggml_backend_sched_reset(ggml_backend_sched_t sched) {
    // reset state for the next run
    if (!sched->is_reset) {
        size_t hash_size = sched->hash_set.size;
        memset(sched->hash_set.keys,      0, sizeof(sched->hash_set.keys[0])     * hash_size); // NOLINT
        memset(sched->tensor_backend_id, -1, sizeof(sched->tensor_backend_id[0]) * hash_size);
        memset(sched->tensor_copies,      0, sizeof(sched->tensor_copies[0])     * hash_size);

        sched->is_reset = true;
    }
    sched->is_alloc = false;
}
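
As an illustration, here is a hypothetical fragment (not from the issue; it reuses sched, graph and z from the reproduction program below) showing that the rewritten src survives a reset:

    ggml_backend_sched_alloc_graph(sched, graph);
    ggml_backend_sched_reset(sched);
    // expected to print the scheduler-created copy, e.g. "CUDA0#node_0#0",
    // rather than the original src "node_0"
    printf("z->src[0] = %s\n", z->src[0]->name);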

As a result, the next time ggml_backend_sched_graph_compute is called, one of two situations occurs:

  1. The allocated tensor_copy still exists: the node's backend_id no longer differs from that of its srcs, so the copy between splits doesn't happen again.
  2. The allocated tensor_copy has been freed: an access violation or other undefined behavior may occur.

Here's a simple program to reproduce this issue:


#include <stdlib.h>
#include <stdio.h>

#include <math.h>
#include <string.h>

#include <ggml.h>
#include <ggml-alloc.h>
#include <ggml-backend.h>

#include "ggml/include/ggml-cuda.h"

#define ELEMENT_COUNT 64

#define ASSERT(cond, ...) if (!(cond)) { fprintf(stderr, __VA_ARGS__); exit(1); }

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 96 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = true,
    };

    struct ggml_context * ctx = ggml_init(params);

    ggml_backend_t backend_gpu = ggml_backend_cuda_init(0);

    ggml_backend_t backend_cpu = ggml_backend_cpu_init();

    ggml_backend_t backends[2] = { backend_gpu, backend_cpu };

    ASSERT(backend_gpu && backend_cpu, "ggml_backend init failed\n");

    // weight buffers
    ggml_backend_buffer_t buffer_cpu = ggml_backend_alloc_buffer(backend_cpu, 
        ELEMENT_COUNT * ggml_type_size(GGML_TYPE_F32) + ggml_tensor_overhead());
    ggml_backend_buffer_set_usage(buffer_cpu, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);

    ggml_backend_buffer_t buffer_gpu = ggml_backend_alloc_buffer(backend_gpu, 
        ELEMENT_COUNT * ggml_type_size(GGML_TYPE_F32) + ggml_tensor_overhead());
    ggml_backend_buffer_set_usage(buffer_gpu, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);

    float * data = (float *) malloc(ELEMENT_COUNT * ggml_type_size(GGML_TYPE_F32));
    for (int i = 0; i < ELEMENT_COUNT; i++) {
        // initialize every element (odd indices were previously left uninitialized)
        data[i] = (i % 2 == 0) ? 1.0F * i / 2 : 0.0F;
    }

    // cpu w0
    struct ggml_tensor * w0 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, ELEMENT_COUNT, 1);
    struct ggml_tallocr tallocr_cpu = ggml_tallocr_new(buffer_cpu);
    ggml_tallocr_alloc(&tallocr_cpu, w0);
    ggml_backend_tensor_set(w0, data, 0, ggml_nbytes(w0));

    // gpu w1
    struct ggml_tensor * w1 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, ELEMENT_COUNT, 1);
    struct ggml_tallocr tallocr_gpu = ggml_tallocr_new(buffer_gpu);
    ggml_tallocr_alloc(&tallocr_gpu, w1);
    ggml_backend_tensor_set(w1, data, 0, ggml_nbytes(w1));

    // on cpu
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, ELEMENT_COUNT, 1);
    struct ggml_tensor * y = ggml_mul(ctx, x, w0);

    // on gpu
    struct ggml_tensor * z = ggml_mul(ctx, y, w1);


    struct ggml_cgraph * graph = ggml_new_graph(ctx);
    ggml_build_forward_expand(graph, z);


    ggml_backend_sched_t sched = ggml_backend_sched_new(backends, NULL, 2, 4096, false);

    // first run
    ggml_backend_sched_reset(sched);
    ggml_backend_sched_alloc_graph(sched, graph);
    ggml_backend_tensor_set(x, data, 0, ggml_nbytes(x));
    ggml_backend_sched_graph_compute(sched, graph);

    /*
        ## SPLIT #0: CPU # 0 inputs: 
        node #  0 (       MUL):               node_0 (   0K) [  CPU         ]:               leaf_0 (   0K) [  CPU         ]               leaf_1 (   0K) [  CPU         ]
        ## SPLIT #1: CUDA0 # 1 inputs: [node_0 (   0K)] 
        node #  1 (       MUL):               node_1 (   0K) [CUDA0         ]:       CUDA0#node_0#0 (   0K) [ NULL         ]               leaf_2 (   0K) [CUDA0         ]
    */

    // second run
    ggml_backend_sched_reset(sched);
    ggml_backend_sched_alloc_graph(sched, graph);
    ggml_backend_tensor_set(x, data, 0, ggml_nbytes(x));
    ggml_backend_sched_graph_compute(sched, graph);

    /*
        ## SPLIT #0: CPU # 0 inputs: 
        node #  0 (       MUL):               node_0 (   0K) [  CPU         ]:               leaf_0 (   0K) [  CPU         ]               leaf_1 (   0K) [  CPU         ]

        ## SPLIT #1: CUDA0 # 0 inputs: 
        node #  1 (       MUL):               node_1 (   0K) [CUDA0         ]:       CUDA0#node_0#0 (   0K) [CUDA0         ]               leaf_2 (   0K) [CUDA0         ]
    */

    ggml_free(ctx);
    free(data);

    return 0;
}

Is this a bug? Feel free to point it out if I've misunderstood something :D
Thanks in advance

@slaren
Collaborator

slaren commented Jul 18, 2024

As you noted, ggml_backend_sched_alloc_graph modifies the graph in a way that prevents it from being used again after ggml_backend_sched_reset is called. You should still be able to evaluate the same graph multiple times by calling ggml_backend_sched_graph_compute only. I would consider this a limitation rather than a bug.
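
A minimal sketch of that pattern (not from the thread), reusing the variables from the reproduction program above; the loop count is illustrative:

    // one-time setup, exactly as in the first run of the reproduction program
    ggml_backend_sched_reset(sched);
    ggml_backend_sched_alloc_graph(sched, graph);

    // subsequent evaluations call only ggml_backend_sched_graph_compute,
    // without resetting or re-allocating, so the splits and tensor copies
    // created at allocation time are reused as-is
    for (int run = 0; run < 4; run++) {
        ggml_backend_tensor_set(x, data, 0, ggml_nbytes(x));
        ggml_backend_sched_graph_compute(sched, graph);
    }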

@MollySophia
Contributor Author

As you noted, ggml_backend_sched_alloc_graph modifies the graph in a way that prevents it from being used again after ggml_backend_sched_reset is called. You should still be able to evaluate the same graph multiple times by calling ggml_backend_sched_graph_compute only. I would consider this a limitation rather than a bug.

It seems that you are right :P
Thanks a lot for your help!
