Groth16 Circom is slower than Rapidsnark when circuit constraint is large #460

doutv · 2024-07-04T02:09:54Z

Issue type

Performance

OS platform and distribution

Ubuntu 22

Current behavior?

Machine: AMD Ryzen Threadripper PRO 5975WX 32-Cores
I benchmarked the Tachyon Circom vendor against Rapidsnark and compared their performance.
When the circuit has fewer than 400k constraints, Tachyon is faster and requires less memory.
Tachyon's advantage is especially large when the circuit is small.

However, when the circuit has more than 400k constraints, Rapidsnark is faster and requires less memory.

Circuit	Device	System	Time	Memory	zkey size	Constraints
RSA	Server 32 Cores	Tachyon	0.4	473	78	157746
RSA	Server 32 Cores	Rapidsnark	1.0	1300	78	157746
Dummy-1200k	Server 32 Cores	Rapidsnark	3.0	2420	600	1200000
Dummy-1200k	Server 32 Cores	Tachyon	3.3	2410	600	1200000

Expected Behavior?

I want to figure out the reason.

Standalone code or description to reproduce the issue

Repo: https://github.com/doutv/circom-benchmark.git

Additional context

No response

chokobole · 2024-07-04T04:16:52Z

One big difference is that the Tachyon prover parses the ZKey, whereas Rapidsnark avoids parsing it and just takes pointers. As the ZKey gets larger, the Tachyon prover incurs more overhead because of this.

chokobole · 2024-07-04T04:19:30Z

We also tried to avoid parsing the ZKey, as Rapidsnark does, in this branch, but ran into some issues and got stuck. The plan has two steps. First, the Groth16 proving key or verifying key should be modified so that its members can be pointers instead of std::vector. Then, after the data has been read from the ZKey file, those members should simply take pointers to it.
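The difference between parsing into an owned container and taking a pointer into already-loaded data can be sketched as follows. This is an illustration in Python, not Tachyon's C++ code: the file layout (a flat run of native-endian 64-bit words) and the helper names `write_dummy_file` and `parse_copy` are hypothetical, and the real ZKey format is not modeled.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: a flat run of native-endian unsigned 64-bit words.
WORD = struct.Struct("Q")


def write_dummy_file(path, n_words):
    """Write n_words consecutive integers as a stand-in for key data."""
    with open(path, "wb") as f:
        for i in range(n_words):
            f.write(WORD.pack(i))


def parse_copy(path):
    """'Parsing' approach: decode every word into a new list (an O(n) copy)."""
    with open(path, "rb") as f:
        data = f.read()
    return [w for (w,) in WORD.iter_unpack(data)]


fd, path = tempfile.mkstemp()
os.close(fd)
write_dummy_file(path, 1000)

copied = parse_copy(path)

# 'Pointer' approach: map the file and reinterpret the bytes in place.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
view = memoryview(mapped).cast("Q")  # no per-element copy is made here
try:
    n_view = len(view)
    sample = [view[0], view[1], view[999]]
finally:
    view.release()   # the view must be released before the mapping is closed
    mapped.close()
    os.remove(path)

print(len(copied), n_view, sample)
```

The copying version does work proportional to the file size before any proving can start, while the view only records where the data lives. The cost is that the view is valid only as long as the underlying mapping stays open, which is the ownership question the key types would have to handle.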

chokobole · 2024-07-04T04:21:51Z

To be fair, the code benchmarked with num_runs should include only this portion.

doutv · 2024-07-04T10:30:40Z

Thanks!

Rapidsnark has a server mode, which reads the ZKey and keeps it loaded in memory.

To avoid parsing the ZKey each time, you could consider adding a server mode. I don't think it would be hard to add.

chokobole · 2024-07-05T00:33:26Z

Yes, it's possible, but it isn't prioritized at this moment. By the way, could you tell me what you use Circom for?

Updated the proving process to repeat only the prove part, excluding the zkey parsing. This change is made because the zkey parsing takes significant time, but typically this is not the part users want to measure. Additionally, as rapidsnark does not include zkey parsing, this adjustment ensures fairness in benchmarking. Related: #460
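The benchmarking structure described here, loading the key once and repeating only the prove step, can be sketched like this. It is a minimal sketch, not the Tachyon prover's code: `benchmark`, `fake_load_key`, and `fake_prove` are hypothetical stand-ins chosen so the example runs on its own.

```python
import time


def benchmark(load_key, prove, witness, num_runs):
    """Time only the prove step; key loading happens once, outside the loop."""
    t0 = time.perf_counter()
    key = load_key()
    load_seconds = time.perf_counter() - t0

    prove_seconds = []
    for _ in range(num_runs):
        t0 = time.perf_counter()
        prove(key, witness)
        prove_seconds.append(time.perf_counter() - t0)
    return load_seconds, prove_seconds


# Hypothetical stand-ins for the real key loader and prover.
calls = {"load": 0, "prove": 0}


def fake_load_key():
    calls["load"] += 1
    return list(range(100_000))


def fake_prove(key, witness):
    calls["prove"] += 1
    return sum(k * witness for k in key)


load_s, prove_s = benchmark(fake_load_key, fake_prove, witness=3, num_runs=4)
print(f"load: {load_s:.6f} s")
for i, s in enumerate(prove_s):
    print(f"prove #{i}: {s:.6f} s")
print(f"avg prove: {sum(prove_s) / len(prove_s):.6f} s")
print(f"max prove: {max(prove_s):.6f} s")
```

Reporting the load time separately keeps it visible without letting it dilute the per-proof numbers.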

doutv · 2024-07-09T03:08:40Z

Server proving and mobile proving, since Tachyon is better than rapidsnark

Mopro plans to integrate Tachyon Circom for mobile proving: zkmopro/mopro#143

Their benchmarks found Tachyon to be the fastest: https://docs.google.com/spreadsheets/d/1irKg_TOP-yXms8igwCN_3OjVrtFe5gTHkuF0RbrVuho/edit?gid=289866675#gid=289866675

batzor · 2024-07-16T06:44:44Z

@doutv Can you try benchmarking again using this branch?
Tachyon is now around 40% faster on my local Linux setup.

========== Rapidsnark CPU ==========
~/rapidsnark/build_prover/src/prover complex-circuit-1200k-1200k.zkey build/complex-circuit-1200k-1200k.wtns proof.json public.json

prove : 5.00701seconds
entire : 5.01327seconds
mem 1107036
time 5.91
cpu 2951%
========== Tachyon CPU ==========
Start parsing zkey
Time taken for parsing zkey: 0.224168 s
Start parsing witness
Time taken for parsing witness: 0.016893 s
Start proving
Time taken for proving #0: 3.39144 s
Time taken for proving #1: 2.90737 s
Time taken for proving #2: 2.95133 s
Time taken for proving #3: 3.0924 s

doutv · 2024-07-18T03:43:53Z

@batzor Nice job!
Based on https://github.com/kroma-network/tachyon/releases/tag/v0.3.0,
my benchmark shows that Tachyon is about 10% faster than Rapidsnark.
In terms of memory usage, Tachyon uses only about half as much memory as Rapidsnark.

For a fair comparison, I disabled num_runs. With num_runs enabled, Tachyon loads the zkey only once, whereas Rapidsnark in standalone mode loads the zkey on every run.

---------- complex-circuit-1200k-1200k ----------
Sample Size:  10
========== Rapidsnark CPU ==========
prover complex-circuit-1200k-1200k.zkey build/complex-circuit-1200k-1200k.wtns proof.json public.json
mem 2419 MB
time 2.868000 s
cpu 4450 
========== Tachyon CPU ==========
prover_main complex-circuit-1200k-1200k.zkey build/complex-circuit-1200k-1200k.wtns proof.json public.json
mem 1321 MB
time 2.547000 s
cpu 4594

doutv · 2024-07-18T03:53:41Z

Btw, on a larger circuit with 3200k constraints, the times are very close.

./2-benchmark.sh complex-circuit complex-circuit-3200k-3200k
---------- complex-circuit-3200k-3200k ----------
Sample Size:  10
========== Rapidsnark CPU ==========
mem 4481 MB
time 5.823000 s
cpu 4428 
========== Tachyon CPU ==========
mem 2928 MB
time 5.814000 s
cpu 4560

Proving #0 is slower. Is that because the zkey hasn't been loaded into memory yet?

prover_main complex-circuit-3200k-3200k.zkey build/complex-circuit-3200k-3200k.wtns proof.json public.json --num_runs 10
========== Tachyon CPU ==========
Start parsing zkey
Time taken for parsing zkey: 1.00079 s
Start parsing witness
Time taken for parsing witness: 0.067337 s
Start proving
Time taken for proving #0: 4.53864 s
Time taken for proving #1: 3.94007 s
Time taken for proving #2: 3.93159 s
Time taken for proving #3: 3.96555 s
Time taken for proving #4: 3.93451 s
Time taken for proving #5: 3.934 s
Time taken for proving #6: 3.92835 s
Time taken for proving #7: 3.93559 s
Time taken for proving #8: 3.94272 s
Time taken for proving #9: 3.93657 s
Avg time taken for proving: 3.99876 s
Max time taken for proving: 4.53864 s
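To quantify how much slower run #0 is, the per-run lines of the log above can be averaged with and without the first run. The sketch below only does arithmetic on the numbers already reported; it does not explain the cause of the slower first run.

```python
import re

# Per-run timings copied from the prover output above.
LOG = """\
Time taken for proving #0: 4.53864 s
Time taken for proving #1: 3.94007 s
Time taken for proving #2: 3.93159 s
Time taken for proving #3: 3.96555 s
Time taken for proving #4: 3.93451 s
Time taken for proving #5: 3.934 s
Time taken for proving #6: 3.92835 s
Time taken for proving #7: 3.93559 s
Time taken for proving #8: 3.94272 s
Time taken for proving #9: 3.93657 s
"""

times = [float(t) for t in re.findall(r"proving #\d+: ([\d.]+) s", LOG)]

avg_all = sum(times) / len(times)              # matches the reported average
avg_warm = sum(times[1:]) / len(times[1:])     # average of runs #1 to #9
first_overhead = times[0] / avg_warm - 1       # relative slowdown of run #0

print(f"avg over all runs:   {avg_all:.5f} s")
print(f"avg over runs #1-#9: {avg_warm:.5f} s")
print(f"run #0 is {first_overhead:.1%} slower than the later runs")
```

Runs #1 to #9 average about 3.94 s, so run #0 is roughly 15% slower than the steady state, and it alone accounts for the gap between that figure and the reported 3.99876 s average.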

chokobole · 2024-07-18T04:04:39Z

This should improve with the faster vector initialization feature. When Tachyon creates a std::vector<T>, it allocates memory and initializes the values to zero serially. Rapidsnark, as you can see, only allocates and then initializes in parallel, and that is what makes the difference. This feature is in progress.

But I am not sure why only proving #0 is slow.

doutv · 2024-07-30T08:32:22Z

Benchmark result after #490

---------- complex-circuit-1200k-1200k ----------
Sample Size:  10
========== Rapidsnark CPU ==========
mem 2419 MB
time 2.909000 s
cpu 4357 
========== Tachyon CPU ==========
mem 1111 MB
time 2.169000 s
cpu 5442

---------- complex-circuit-3200k-3200k ----------
Sample Size:  10
========== Rapidsnark CPU ==========
mem 4481 MB
time 5.837000 s
cpu 4345 
========== Tachyon CPU ==========
mem 2749 MB
time 4.858000 s
cpu 5499

Wow! You guys are making rapid progress!
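For reference, the relative differences implied by the numbers reported in this thread can be computed directly. The sketch below uses only the wall-clock times and memory figures posted above for v0.3.0 and after #490; the dictionary layout and names are just for illustration.

```python
# (time in seconds, memory in MB), copied from the benchmark outputs in this thread.
RESULTS = {
    "v0.3.0": {
        "1200k": {"rapidsnark": (2.868, 2419), "tachyon": (2.547, 1321)},
        "3200k": {"rapidsnark": (5.823, 4481), "tachyon": (5.814, 2928)},
    },
    "after #490": {
        "1200k": {"rapidsnark": (2.909, 2419), "tachyon": (2.169, 1111)},
        "3200k": {"rapidsnark": (5.837, 4481), "tachyon": (4.858, 2749)},
    },
}

summary = {}
for version, circuits in RESULTS.items():
    for circuit, r in circuits.items():
        rs_time, rs_mem = r["rapidsnark"]
        t_time, t_mem = r["tachyon"]
        time_saved = 1 - t_time / rs_time   # fraction of Rapidsnark's time saved
        mem_ratio = t_mem / rs_mem          # Tachyon memory relative to Rapidsnark
        summary[(version, circuit)] = (time_saved, mem_ratio)
        print(f"{version:>10} {circuit}: "
              f"time saved {time_saved:.1%}, memory ratio {mem_ratio:.1%}")
```

On these figures, Tachyon's time saving over Rapidsnark goes from about 11% to about 25% on the 1200k circuit, and from under 1% to about 17% on the 3200k circuit, while using roughly 46% and 61% of Rapidsnark's memory respectively after #490.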

chokobole assigned chokobole and batzor Jul 4, 2024

chokobole mentioned this issue Jul 14, 2024

perf: optimize circom proof generation #469

Merged

doutv closed this as completed Jul 30, 2024
