Batch inference #325
Conversation
I'm surprised by
Could be round-off errors. Are you accumulating the sum into a
Yes, it's just for debugging, so I accumulate the sum into a double. The actual issue is that the output vectors for identical images batched together are different; the sum is just a proxy to understand where they start to diverge. The first place where the sum of the two-sample batch is not twice the sum of the single-sample batch is just after the first
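A minimal sketch of the kind of check described above, assuming nothing about the actual clip.cpp code: sum each activation buffer into a double accumulator and compare a batch containing the same image twice against twice the single-image sum. The helper names here are illustrative only.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>

// Accumulate a float buffer into a double to keep round-off out of the comparison.
static double sum_f32(const float * data, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

// If the batch contains the same image twice, its summed activations should be
// roughly 2x the single-image sum; a large relative error points at the op where
// the batched path starts to diverge. The tolerance is an arbitrary choice.
static bool batch_sum_matches(const float * single, size_t n_single,
                              const float * batch,  size_t n_batch) {
    const double s1 = sum_f32(single, n_single);
    const double s2 = sum_f32(batch,  n_batch);
    const double rel_err = std::fabs(s2 - 2.0 * s1) / std::fabs(2.0 * s1);
    std::printf("single sum = %f, batch sum = %f, rel err = %g\n", s1, s2, rel_err);
    return rel_err < 1e-4;
}
```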
Yes, it turned out that I had calculated a wrong value for the offset to be added to the
Now I can confirm that batch inference works correctly. I'll resolve the conflicts and then mark this ready for review.
This is ready for review now. Working usage is in monatis/clip.cpp#30. As future work, I'll implement batch inference for text encoding in clip.cpp, which will require some extra work on attention masking to handle inputs of varying lengths, and probably generalization of
CC @ggerganov
Great! Will come back to reviewing this soon. Need to merge some refactoring PRs first and sync things with
Fixed the conflicts after syncing with llama.cpp yesterday.
@monatis do you have any benchmarks on performance with and without batch inference, by any chance?
I'll probably take a look tomorrow and merge if everything is good.
@okpatil4u I updated the benchmark utility to make use of the proposed batch inference in monatis/clip.cpp#30 and got a ~1.4x speedup in per-image inference time compared to the main branch (190 ms down to 136 ms per image over 10k images). I can see numerous other places where we can further improve it soon. This is exactly what this PR is for: we will have a playground where we can test and benchmark ideas for batch inference and hopefully improve performance further.
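For reference, a sketch of how such a per-image timing could be collected; the encode call is a placeholder, not the clip.cpp API. The quoted speedup is simply 190 ms / 136 ms ≈ 1.4.

```cpp
#include <chrono>

// Time a whole run and report milliseconds per image. `run` is any callable
// that encodes `n_images` images (a stand-in for the actual encode call).
template <typename F>
static double ms_per_image(F && run, int n_images) {
    const auto t0 = std::chrono::steady_clock::now();
    run();
    const auto t1 = std::chrono::steady_clock::now();
    const double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return total_ms / n_images;
}
```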
Sounds great. We have conflicts again; I'll fix them tomorrow.
@monatis thanks, this is promising. Was the batch size chosen automatically, or does the user have to define it?
@okpatil4u currently the signature of the batch encoding function is:
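The signature itself did not survive this export. Purely as an illustration (hypothetical names, not the actual clip.cpp API), a batch-encode entry point in this style usually takes an explicit array of preprocessed images, so the batch size is simply however many images the caller passes:

```cpp
// Hypothetical sketch only; not the function from the PR.
struct clip_ctx;        // opaque model context (assumed)
struct clip_image_f32;  // preprocessed image (assumed)

// The caller chooses the batch size via n_imgs; the output buffer must hold
// n_imgs * embedding_dim floats.
bool clip_image_batch_encode_sketch(
        clip_ctx             * ctx,
        int                    n_threads,
        const clip_image_f32 * imgs,
        int                    n_imgs,
        float                * out_vecs);
```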
@ggerganov conflicts fixed and tests passed. Working usage is in the
I am implementing batch inference support in monatis/clip.cpp#30. This is still WIP but almost there. I removed the `ne[3] == 1` asserts and generalized Conv2D to an N-batched `src1`. I also included #224 in this PR for batched matrix multiplication. I believe the only thing left to make batchwise is normalization now; I hope to solve it very soon. Hope it might be a playground to test batch inference.
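To make the `ne[3] == 1` point concrete, here is a hedged sketch (not the PR's actual diff) of what generalizing an op to an N-batched `src1` typically means in ggml terms: treat `ne[3]` as the batch size and step through it with the `nb[3]` byte stride. `process_sample` is a hypothetical placeholder for the per-sample work.

```cpp
#include "ggml.h"

// Hypothetical per-sample worker, standing in for whatever the op does on one sample.
void process_sample(const char * sample_data, const struct ggml_tensor * src1);

// Before batching, ops like this asserted src1->ne[3] == 1 and touched only
// sample 0. Generalizing means looping over ne[3] with the nb[3] byte stride.
void process_batched(const struct ggml_tensor * src1) {
    const int64_t n_batch = src1->ne[3];
    for (int64_t b = 0; b < n_batch; ++b) {
        const char * sample = (const char *) src1->data + b * src1->nb[3];
        process_sample(sample, src1);
    }
}
```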