
Batch inference #325

Merged
merged 8 commits into from
Jul 12, 2023

Conversation

monatis
Copy link
Contributor

@monatis monatis commented Jun 30, 2023

I am implementing batch inference support in monatis/clip.cpp#30. This is still WIP but almost there. I removed the ne[3] == 1 asserts and generalized Conv2D to an N-batched src1. I also included #224 in this PR for batched matrix multiplication. I believe the only remaining op to make batch-aware is normalization; I hope to solve it very soon.

Hope it might be a playground to test batch inference.

@monatis
Copy link
Contributor Author

monatis commented Jul 2, 2023

I'm surprised by ggml_norm. It works in the feature dimension, i.e., ne00, as it should, but it gives a different result for the second of two identical samples in a batch. For example, the sum of the normed tensor of a single sample is 0.000058, but the sum is 0.000089 when I add the same sample twice in a batch, whereas it should have been 0.000058 * 2 = 0.000116. Similarly, the output vector for the batch with a single sample and output vector 0 for the batch with two identical samples are exactly the same, but output vector 1 is different even though the input sample is the same. The only place where they seem to diverge is ggml_norm, but I couldn't see why.

@ggerganov
Copy link
Owner

For example, the sum of the normed tensor of a single sample is 0.000058, but the sum is 0.000089 when I add the same sample twice in a batch, while it should have been 0.000058 * 2 = 0.000116

Could be round-off errors. Are you accumulating the sum into a double?

@monatis
Copy link
Contributor Author

monatis commented Jul 2, 2023

Are you accumulating the sum into a double?

Yes, it's just for debugging, so I accumulate the sum into a double. The actual issue is that the output vectors for identical images batched together are different; the sum is just a proxy to see where they start to diverge. The first place where the sum of the two-sample batch is not twice the sum of the single-sample batch is right after the first ggml_norm op. Before that, the modified conv_2d_sk_p0 with batch support, the concat op with ggml_acc, and adding positional embeddings with ggml_add all produce the expected values. But I can see no reason why ggml_norm would cause this. I will revisit the ops prior to ggml_norm with a fresh mind if we don't find a better clue.

@monatis
Copy link
Contributor Author

monatis commented Jul 3, 2023

Yes, it turned out that I had calculated a wrong offset to add to the wdata pointer in conv_2d_sk_p0. So even though input 0 gave a correct output, input 1 gave a wrong one. A note to future me: the sum of a tensor is not always reliable for debugging :)

Now I can confirm the batch inference works correctly. I'll solve conflicts and then mark this ready for review.

@monatis monatis marked this pull request as ready for review July 3, 2023 21:25
@monatis
Copy link
Contributor Author

monatis commented Jul 3, 2023

This is ready for review now. Working usage is in monatis/clip.cpp#30

As a future work, I'll implement batch inference for text encoding in clip.cpp, which will require some extra work for attention masking to handle inputs with varying lengths, and probably generalization of ggml_get_rows to get position embeddings, so I'll leave it to another PR.

CC @ggerganov

@ggerganov ggerganov self-requested a review July 4, 2023 18:10
@ggerganov ggerganov self-assigned this Jul 4, 2023
@ggerganov
Copy link
Owner

Great!

Will come back to reviewing this soon. Need to merge some refactoring PRs first and sync things with llama.cpp.
Also need to check whether the proposed change for mul mat broadcast in ggerganov/llama.cpp#1602 (comment) is the correct way to broadcast, and apply the changes if it is.

@monatis
Copy link
Contributor Author

monatis commented Jul 6, 2023

Fixed the conflicts after syncing with llama.cpp yesterday.

@okpatil4u
Copy link

@monatis do you have any benchmarks on performance with and without batch inference, by any chance?

@ggerganov
Copy link
Owner

@monatis

I'll probably take a look tomorrow and merge if everything is good

@monatis
Copy link
Contributor Author

monatis commented Jul 12, 2023

@okpatil4u I updated the benchmark utility to make use of the proposed batch inference in monatis/clip.cpp#30 and got a ~1.4x speedup in per-image inference time compared to the main branch (190 ms down to 136 ms per image over 10k images). And I can see numerous other places where we can improve it further soon. This is exactly what this PR is for: a playground where we can test and benchmark ideas for batch inference and hopefully improve performance further.

@ggerganov

I'll probably take a look tomorrow and merge if everything is good

Sounds great. We have conflicts again, I'll fix them tomorrow.

@okpatil4u
Copy link

@monatis thanks. This is promising. Is the batch size chosen automatically, or does the user have to define it?

@monatis
Copy link
Contributor Author

monatis commented Jul 12, 2023

@okpatil4u Currently the signature of the batch encoding function is bool clip_image_batch_encode(const clip_ctx *ctx, int n_threads, const std::vector<clip_image_f32> &imgs, float *vec), and the batch size is essentially imgs.size(), i.e., it processes all the images in imgs as a single batch. We could improve this further by implementing the function so that it processes N mini batches of size B from an N * B-sized std::vector of images, reusing the graph, but I'll leave that experiment to a future PR.

@monatis
Copy link
Contributor Author

monatis commented Jul 12, 2023

@ggerganov conflicts fixed and tests passed. Working usage is in the benchmark binary:
https://github.com/monatis/clip.cpp/blob/8c09113e154b8f3a589a47d5780a19e4546c227a/tests/benchmark.cpp#L122-L126
