Groth16 Circom is slower than Rapidsnark when circuit constraint is large #460

Closed
doutv opened this issue Jul 4, 2024 · 11 comments

@doutv

doutv commented Jul 4, 2024

Issue type

Performance

OS platform and distribution

Ubuntu 22

Current behavior?

Machine: AMD Ryzen Threadripper PRO 5975WX 32-Cores
I benchmarked the Tachyon Circom vendor against Rapidsnark and compared their performance.
When the circuit has fewer than 400k constraints, Tachyon is faster and uses less memory.
The advantage is largest when the circuit is small.

However, when the circuit has more than 400k constraints, Rapidsnark is faster and uses less memory.

| Circuit | Device | System | Time | Memory | zkey size | Constraints |
| --- | --- | --- | --- | --- | --- | --- |
| RSA | Server 32 Cores | Tachyon | 0.4 | 473 | 78 | 157746 |
| RSA | Server 32 Cores | Rapidsnark | 1.0 | 1300 | 78 | 157746 |
| Dummy-1200k | Server 32 Cores | Rapidsnark | 3.0 | 2420 | 600 | 1200000 |
| Dummy-1200k | Server 32 Cores | Tachyon | 3.3 | 2410 | 600 | 1200000 |

Expected Behavior?

I want to figure out the reason.

Standalone code or description to reproduce the issue

Repo: https://github.com/doutv/circom-benchmark.git

Additional context

No response

@chokobole
Contributor

One big difference is that the Tachyon prover parses the ZKey, whereas Rapidsnark avoids parsing it and just takes pointers into the data. As the ZKey gets larger, the Tachyon prover incurs more overhead because of this.

@chokobole
Contributor

We also tried to avoid parsing the ZKey, as Rapidsnark does, in this branch, but ran into some issues and got stuck. Basically, two changes are needed. First, the Groth16 proving key and verifying key should be modified so that their members can be pointers instead of std::vector. Then, after the data is read from the ZKey file, those members should simply point into it.
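
To illustrate the difference between the two approaches described above, here is a minimal sketch. It is not Tachyon's or Rapidsnark's actual code: `Element`, `ElementView`, `ParseByCopy`, and `ViewInPlace` are hypothetical names, and the layout of a real ZKey section is not modeled. It only contrasts copying a section into a std::vector with taking a pointer into an already loaded buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-size element standing in for a field element or curve point.
struct Element {
  uint64_t limbs[4];
};

// Non-owning view: a pointer into the loaded file buffer plus an element count.
struct ElementView {
  const Element* data;
  size_t size;
};

// "Parsing" approach: copy every element out of the buffer into a std::vector.
// The cost grows linearly with the section size, and the data exists twice
// while the source buffer is still alive.
std::vector<Element> ParseByCopy(const uint8_t* section, size_t count) {
  std::vector<Element> out(count);
  if (count > 0) {
    std::memcpy(out.data(), section, count * sizeof(Element));
  }
  return out;
}

// "Pointer" approach: reinterpret the section in place. Constant time, no
// allocation. Assumes the buffer outlives the view and is aligned for Element.
ElementView ViewInPlace(const uint8_t* section, size_t count) {
  assert(reinterpret_cast<std::uintptr_t>(section) % alignof(Element) == 0);
  return ElementView{reinterpret_cast<const Element*>(section), count};
}
```

With the view, the proving key members would hold `ElementView` (or a raw pointer and a length) rather than `std::vector<Element>`, which is why the key types themselves have to change.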

@chokobole
Contributor

To be fair, a benchmark that uses num_runs should time only this portion.

@doutv
Author

doutv commented Jul 4, 2024

Thanks!

Rapidsnark has a server mode, which reads the ZKey once and keeps it loaded in memory.

To avoid re-parsing the ZKey, you could consider adding a server mode as well. I don't think it would be hard to add.

@chokobole
Contributor

Yes, it's possible, but it isn't prioritized at this moment. By the way, could you tell me what purpose you use Circom for?

chokobole added a commit that referenced this issue Jul 8, 2024
Updated the proving process to repeat only the prove part,
excluding the zkey parsing. This change is made because
the zkey parsing takes significant time, but typically
this is not the part users want to measure. Additionally,
as rapidsnark does not include zkey parsing,
this adjustment ensures fairness in benchmarking.

Related: #460
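
The measurement split described in this commit message can be sketched as follows. This is an illustrative harness under stated assumptions, not the code from the commit: `BenchResult` and `Benchmark` are hypothetical names, and `setup` and `prove` stand in for zkey/witness parsing and the proving step.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

struct BenchResult {
  double setup_seconds = 0.0;       // one-time cost, e.g. zkey and witness parsing
  std::vector<double> run_seconds;  // per-run proving time only

  double Avg() const {
    if (run_seconds.empty()) return 0.0;
    return std::accumulate(run_seconds.begin(), run_seconds.end(), 0.0) /
           static_cast<double>(run_seconds.size());
  }
  double Max() const {
    if (run_seconds.empty()) return 0.0;
    return *std::max_element(run_seconds.begin(), run_seconds.end());
  }
};

inline double SecondsSince(std::chrono::steady_clock::time_point start) {
  return std::chrono::duration<double>(std::chrono::steady_clock::now() - start)
      .count();
}

// Runs `setup` once and times it separately, then times `prove` num_runs
// times, so the parsing cost never leaks into the per-run numbers.
BenchResult Benchmark(const std::function<void()>& setup,
                      const std::function<void()>& prove, size_t num_runs) {
  assert(setup && prove);
  BenchResult result;
  auto setup_start = std::chrono::steady_clock::now();
  setup();
  result.setup_seconds = SecondsSince(setup_start);

  result.run_seconds.reserve(num_runs);
  for (size_t i = 0; i < num_runs; ++i) {
    auto run_start = std::chrono::steady_clock::now();
    prove();
    result.run_seconds.push_back(SecondsSince(run_start));
  }
  return result;
}
```

Reporting the setup time and the per-run times separately keeps both numbers visible, so a reader can still add them back together when comparing against a tool that loads the zkey on every invocation.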
@doutv
Author

doutv commented Jul 9, 2024

For server proving and mobile proving, since Tachyon performs better than Rapidsnark.

Mopro plans to integrate Tachyon Circom for mobile proving: zkmopro/mopro#143

Their benchmarks found Tachyon to be the fastest: https://docs.google.com/spreadsheets/d/1irKg_TOP-yXms8igwCN_3OjVrtFe5gTHkuF0RbrVuho/edit?gid=289866675#gid=289866675

batzor pushed a commit that referenced this issue Jul 16, 2024
@batzor
Contributor

batzor commented Jul 16, 2024

@doutv Can you try benchmarking again using this branch?
Tachyon is now around 40% faster in my local Linux setup.

========== Rapidsnark CPU ==========
~/rapidsnark/build_prover/src/prover complex-circuit-1200k-1200k.zkey build/complex-circuit-1200k-1200k.wtns proof.json public.json

prove : 5.00701seconds
entire : 5.01327seconds
mem 1107036
time 5.91
cpu 2951%
========== Tachyon CPU ==========
Start parsing zkey
Time taken for parsing zkey: 0.224168 s
Start parsing witness
Time taken for parsing witness: 0.016893 s
Start proving
Time taken for proving #0: 3.39144 s
Time taken for proving #1: 2.90737 s
Time taken for proving #2: 2.95133 s
Time taken for proving #3: 3.0924 s

@doutv
Author

doutv commented Jul 18, 2024

@batzor Nice job!
Based on https://github.com/kroma-network/tachyon/releases/tag/v0.3.0,
my benchmark shows that Tachyon is about 10% faster than Rapidsnark.
In terms of memory, Tachyon uses only about half as much as Rapidsnark.

For a fair comparison, I disabled num_runs. With num_runs enabled, Tachyon loads the zkey only once, while Rapidsnark in standalone mode loads the zkey on every run.

---------- complex-circuit-1200k-1200k ----------
Sample Size:  10
========== Rapidsnark CPU ==========
prover complex-circuit-1200k-1200k.zkey build/complex-circuit-1200k-1200k.wtns proof.json public.json
mem 2419 MB
time 2.868000 s
cpu 4450 
========== Tachyon CPU ==========
prover_main complex-circuit-1200k-1200k.zkey build/complex-circuit-1200k-1200k.wtns proof.json public.json
mem 1321 MB
time 2.547000 s
cpu 4594
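
As a quick check of the "about 10% faster" and "half of the memory" figures, using only the numbers in the log above:

$$
\frac{t_{\text{Tachyon}}}{t_{\text{Rapidsnark}}} = \frac{2.547}{2.868} \approx 0.888,
\qquad
\frac{m_{\text{Tachyon}}}{m_{\text{Rapidsnark}}} = \frac{1321}{2419} \approx 0.546
$$

So in this run Tachyon takes roughly 11% less wall time and uses roughly 55% of the memory.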

@doutv
Author

doutv commented Jul 18, 2024

Btw, for a larger circuit with 3200k constraints, the times are very close:

./2-benchmark.sh complex-circuit complex-circuit-3200k-3200k
---------- complex-circuit-3200k-3200k ----------
Sample Size:  10
========== Rapidsnark CPU ==========
mem 4481 MB
time 5.823000 s
cpu 4428 
========== Tachyon CPU ==========
mem 2928 MB
time 5.814000 s
cpu 4560

Proving #0 is slower. Is that because the zkey hasn't been loaded into memory yet?

prover_main complex-circuit-3200k-3200k.zkey build/complex-circuit-3200k-3200k.wtns proof.json public.json --num_runs 10
========== Tachyon CPU ==========
Start parsing zkey
Time taken for parsing zkey: 1.00079 s
Start parsing witness
Time taken for parsing witness: 0.067337 s
Start proving
Time taken for proving #0: 4.53864 s
Time taken for proving #1: 3.94007 s
Time taken for proving #2: 3.93159 s
Time taken for proving #3: 3.96555 s
Time taken for proving #4: 3.93451 s
Time taken for proving #5: 3.934 s
Time taken for proving #6: 3.92835 s
Time taken for proving #7: 3.93559 s
Time taken for proving #8: 3.94272 s
Time taken for proving #9: 3.93657 s
Avg time taken for proving: 3.99876 s
Max time taken for proving: 4.53864 s

@chokobole
Contributor

chokobole commented Jul 18, 2024

This should improve with the faster vector initialization feature, which is in progress. When Tachyon creates a std::vector<T>, it allocates memory and zero-initializes the values serially. Rapidsnark, as you can see, only allocates the memory and then initializes it in parallel; that is what makes the difference.

But I am not sure why only proving #0 is slow.
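
The vector initialization point can be sketched like this. It is a minimal illustration, not the in-progress Tachyon feature or Rapidsnark's code: `MakeZeroedSerial` and `MakeZeroedParallel` are hypothetical names, `uint64_t` stands in for the real element type, and plain `std::thread` is used in place of whatever thread pool the provers actually use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

// Serial path: std::vector<T>(n) value-initializes all n elements on the
// calling thread, so allocation and zeroing happen in one serial step.
std::vector<uint64_t> MakeZeroedSerial(size_t n) {
  return std::vector<uint64_t>(n);
}

// Alternative: allocate without initializing, then zero the buffer in
// parallel chunks, one chunk per worker thread.
std::unique_ptr<uint64_t[]> MakeZeroedParallel(size_t n, unsigned num_threads) {
  assert(num_threads > 0);
  // `new T[n]` without parentheses default-initializes; for a trivial T this
  // writes nothing, so the allocation itself does not touch the elements.
  std::unique_ptr<uint64_t[]> buf(new uint64_t[n]);
  uint64_t* data = buf.get();

  const size_t chunk = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    const size_t begin = std::min(n, static_cast<size_t>(t) * chunk);
    const size_t end = std::min(n, begin + chunk);
    if (begin == end) break;  // nothing left to hand out
    workers.emplace_back(
        [data, begin, end] { std::fill(data + begin, data + end, uint64_t{0}); });
  }
  for (std::thread& w : workers) w.join();
  return buf;
}
```

The trade-off is that the parallel version gives up std::vector's interface for a raw buffer, which is one reason such a change tends to ripple through the types that own the data.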

@doutv
Author

doutv commented Jul 30, 2024

Benchmark result after #490

---------- complex-circuit-1200k-1200k ----------
Sample Size:  10
========== Rapidsnark CPU ==========
mem 2419 MB
time 2.909000 s
cpu 4357 
========== Tachyon CPU ==========
mem 1111 MB
time 2.169000 s
cpu 5442
---------- complex-circuit-3200k-3200k ----------
Sample Size:  10
========== Rapidsnark CPU ==========
mem 4481 MB
time 5.837000 s
cpu 4345 
========== Tachyon CPU ==========
mem 2749 MB
time 4.858000 s
cpu 5499

Wow! You guys are making rapid progress!

@doutv doutv closed this as completed Jul 30, 2024