
ggml_backend_sched_graph_compute fails to compute a hybrid(?) graph when called for the second time #892

Closed
MollySophia opened this issue Jul 18, 2024 · 2 comments


@MollySophia
Contributor

Hi!
When a graph has weight buffers on several backends, the scheduler partitions the graph into several splits.
ggml-backend.c#L1593

                if (src_backend_id != cur_backend_id && !supported) {
                    // create a copy of the input in the split's backend
                    const size_t id = hash_id(src);
                    if (sched->tensor_copies[id][cur_backend_id][0] == NULL) {
                        ggml_backend_t backend = sched->backends[cur_backend_id];
                        for (int c = 0; c < sched->n_copies; c++) {
                            struct ggml_tensor * tensor_copy = ggml_dup_tensor_layout(sched->ctx, src);
                            ggml_format_name(tensor_copy, "%s#%s#%d", ggml_backend_name(backend), src->name, c);
                            if (sched->n_copies > 1) {
                                ggml_set_input(tensor_copy);
                                ggml_set_output(tensor_copy); // prevent ggml-alloc from overwriting the tensor
                            }
                            sched->tensor_copies[id][cur_backend_id][c] = tensor_copy;
                            SET_CAUSE(tensor_copy, "4.cpy");
                        }
                        int n_inputs = split->n_inputs++;
                        GGML_ASSERT(n_inputs < GGML_SCHED_MAX_SPLIT_INPUTS);
                        split->inputs[n_inputs] = src;
                    }
                    node->src[j] = sched->tensor_copies[id][cur_backend_id][sched->cur_copy];
                }

If a node's backend_id differs from that of one of its srcs, the scheduler creates a new split, along with the tensor_copies and the bookkeeping needed to copy data between splits later.
However, it also modifies the original cgraph here, making the node's src point to the tensor_copy:

node->src[j] = sched->tensor_copies[id][cur_backend_id][sched->cur_copy];

But ggml_backend_sched_reset doesn't reset this.

void ggml_backend_sched_reset(ggml_backend_sched_t sched) {
    // reset state for the next run
    if (!sched->is_reset) {
        size_t hash_size = sched->hash_set.size;
        memset(sched->hash_set.keys,      0, sizeof(sched->hash_set.keys[0])     * hash_size); // NOLINT
        memset(sched->tensor_backend_id, -1, sizeof(sched->tensor_backend_id[0]) * hash_size);
        memset(sched->tensor_copies,      0, sizeof(sched->tensor_copies[0])     * hash_size);

        sched->is_reset = true;
    }
    sched->is_alloc = false;
}
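
As an illustration, here is a hypothetical fragment (not from the issue; it reuses sched, graph and z from the reproduction program below) showing that the rewritten src survives a reset:

    ggml_backend_sched_alloc_graph(sched, graph);
    ggml_backend_sched_reset(sched);
    // expected to print the scheduler-created copy, e.g. "CUDA0#node_0#0",
    // rather than the original src "node_0"
    printf("z->src[0] = %s\n", z->src[0]->name);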

As a result, the next time ggml_backend_sched_graph_compute is called, one of two situations occurs:

  1. The allocated tensor_copy still exists: the node's backend_id no longer differs from that of its srcs, so the copy between splits doesn't happen again.
  2. The allocated tensor_copy has been freed: an access violation or other undefined behavior may occur.

Here's a simple program to reproduce this issue:


#include <stdlib.h>
#include <stdio.h>

#include <math.h>
#include <string.h>

#include <ggml.h>
#include <ggml-alloc.h>
#include <ggml-backend.h>

#include "ggml/include/ggml-cuda.h"

#define ELEMENT_COUNT 64

#define ASSERT(cond, ...) if (!(cond)) { fprintf(stderr, __VA_ARGS__); exit(1); }

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 96 * 1024,
        .mem_buffer = NULL,
        .no_alloc   = true,
    };

    struct ggml_context * ctx = ggml_init(params);

    ggml_backend_t backend_gpu = ggml_backend_cuda_init(0);

    ggml_backend_t backend_cpu = ggml_backend_cpu_init();

    ggml_backend_t backends[2] = { backend_gpu, backend_cpu };

    ASSERT(backend_gpu && backend_cpu, "ggml_backend init failed\n");

    // weight buffers
    ggml_backend_buffer_t buffer_cpu = ggml_backend_alloc_buffer(backend_cpu, 
        ELEMENT_COUNT * ggml_type_size(GGML_TYPE_F32) + ggml_tensor_overhead());
    ggml_backend_buffer_set_usage(buffer_cpu, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);

    ggml_backend_buffer_t buffer_gpu = ggml_backend_alloc_buffer(backend_gpu, 
        ELEMENT_COUNT * ggml_type_size(GGML_TYPE_F32) + ggml_tensor_overhead());
    ggml_backend_buffer_set_usage(buffer_gpu, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);

    float * data = (float *) malloc(ELEMENT_COUNT * ggml_type_size(GGML_TYPE_F32));
    for (int i = 0; i < ELEMENT_COUNT; i++) {
        // initialize every element (odd indices were previously left uninitialized)
        data[i] = (i % 2 == 0) ? 1.0F * i / 2 : 0.0F;
    }

    // cpu w0
    struct ggml_tensor * w0 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, ELEMENT_COUNT, 1);
    struct ggml_tallocr tallocr_cpu = ggml_tallocr_new(buffer_cpu);
    ggml_tallocr_alloc(&tallocr_cpu, w0);
    ggml_backend_tensor_set(w0, data, 0, ggml_nbytes(w0));

    // gpu w1
    struct ggml_tensor * w1 = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, ELEMENT_COUNT, 1);
    struct ggml_tallocr tallocr_gpu = ggml_tallocr_new(buffer_gpu);
    ggml_tallocr_alloc(&tallocr_gpu, w1);
    ggml_backend_tensor_set(w1, data, 0, ggml_nbytes(w1));

    // on cpu
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, ELEMENT_COUNT, 1);
    struct ggml_tensor * y = ggml_mul(ctx, x, w0);

    // on gpu
    struct ggml_tensor * z = ggml_mul(ctx, y, w1);


    struct ggml_cgraph * graph = ggml_new_graph(ctx);
    ggml_build_forward_expand(graph, z);


    ggml_backend_sched_t sched = ggml_backend_sched_new(backends, NULL, 2, 4096, false);

    // first run
    ggml_backend_sched_reset(sched);
    ggml_backend_sched_alloc_graph(sched, graph);
    ggml_backend_tensor_set(x, data, 0, ggml_nbytes(x));
    ggml_backend_sched_graph_compute(sched, graph);

    /*
        ## SPLIT #0: CPU # 0 inputs: 
        node #  0 (       MUL):               node_0 (   0K) [  CPU         ]:               leaf_0 (   0K) [  CPU         ]               leaf_1 (   0K) [  CPU         ]
        ## SPLIT #1: CUDA0 # 1 inputs: [node_0 (   0K)] 
        node #  1 (       MUL):               node_1 (   0K) [CUDA0         ]:       CUDA0#node_0#0 (   0K) [ NULL         ]               leaf_2 (   0K) [CUDA0         ]
    */

    // second run
    ggml_backend_sched_reset(sched);
    ggml_backend_sched_alloc_graph(sched, graph);
    ggml_backend_tensor_set(x, data, 0, ggml_nbytes(x));
    ggml_backend_sched_graph_compute(sched, graph);

    /*
        ## SPLIT #0: CPU # 0 inputs: 
        node #  0 (       MUL):               node_0 (   0K) [  CPU         ]:               leaf_0 (   0K) [  CPU         ]               leaf_1 (   0K) [  CPU         ]

        ## SPLIT #1: CUDA0 # 0 inputs: 
        node #  1 (       MUL):               node_1 (   0K) [CUDA0         ]:       CUDA0#node_0#0 (   0K) [CUDA0         ]               leaf_2 (   0K) [CUDA0         ]
    */

    ggml_free(ctx);
    free(data);

    return 0;
}

Is this a bug? Feel free to point it out if I've misunderstood something :D
Thanks in advance

@slaren
Collaborator

slaren commented Jul 18, 2024

As you noted, ggml_backend_sched_alloc_graph modifies the graph in a way that prevents it from being used again after ggml_backend_sched_reset is called. You should still be able to evaluate the same graph multiple times by calling ggml_backend_sched_graph_compute only. I would consider this a limitation rather than a bug.
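
A minimal sketch of that pattern (not from the thread), reusing the variables from the reproduction program above; the loop count is illustrative:

    // one-time setup, exactly as in the first run of the reproduction program
    ggml_backend_sched_reset(sched);
    ggml_backend_sched_alloc_graph(sched, graph);

    // subsequent evaluations call only ggml_backend_sched_graph_compute,
    // without resetting or re-allocating, so the splits and tensor copies
    // created at allocation time are reused as-is
    for (int run = 0; run < 4; run++) {
        ggml_backend_tensor_set(x, data, 0, ggml_nbytes(x));
        ggml_backend_sched_graph_compute(sched, graph);
    }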

@MollySophia
Contributor Author

As you noted, ggml_backend_sched_alloc_graph modifies the graph in a way that prevents it from being used again after ggml_backend_sched_reset is called. You should still be able to evaluate the same graph multiple times by calling ggml_backend_sched_graph_compute only. I would consider this a limitation rather than a bug.

It seems that you are right :P
Thanks a lot for your help!
