
Stress test adding data to the tree #16

Closed · arlolra opened this issue Jul 13, 2016 · 23 comments · Fixed by #43

arlolra (Member) commented Jul 13, 2016

5ff85ad indicates that we should have a test that generates a large number of key/value pairs to set in and retrieve from the tree.
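A minimal sketch of what such a stress test could look like; the `treeStore` interface and the `Set`/`Get` method names here are assumptions for illustration, not the actual coniks-go API:

```go
package merkletree_test

import (
	"bytes"
	"crypto/rand"
	"fmt"
	"testing"
)

// treeStore is a stand-in for the tree/PAD under test; the Set/Get
// method names are assumptions, not the actual coniks-go API.
type treeStore interface {
	Set(key string, value []byte) error
	Get(key string) ([]byte, bool)
}

// stressTree inserts n random key/value pairs and verifies that every
// pair round-trips through the tree unchanged.
func stressTree(t *testing.T, tr treeStore, n int) {
	want := make(map[string][]byte, n)
	for i := 0; i < n; i++ {
		key := fmt.Sprintf("key%d", i)
		val := make([]byte, 32)
		if _, err := rand.Read(val); err != nil {
			t.Fatal(err)
		}
		if err := tr.Set(key, val); err != nil {
			t.Fatalf("Set(%q): %v", key, err)
		}
		want[key] = val
	}
	for key, w := range want {
		got, ok := tr.Get(key)
		if !ok || !bytes.Equal(got, w) {
			t.Fatalf("lookup mismatch for %q", key)
		}
	}
}
```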

vqhuy self-assigned this Jul 13, 2016
vqhuy (Member) commented Jul 13, 2016

That's right. I had the same idea last night. We should also add tests that exercise all features of the library.

liamsi (Member) commented Jul 29, 2016

As @arlolra pointed out by mail, this is also about profiling:

particularly automatically generating a lot of data to stick in the tree and then profiling to see where the bottlenecks may be / what the memory consumption is.

liamsi (Member) commented Jul 29, 2016

Here is a list of missing tests (per package):

merkletree:

  • pad.TB

crypto:

  • Digest
  • MakeRand

utils:

  • ToBits
  • LongToBytes
  • ~~ULongToBytes~~
  • IntToBytes

In general, only the success cases are well tested. Branches where we panic, return an error, or return a negative result on a check like crypto.Verify aren't covered by tests most of the time.

Just wanted to leave this here for discussion.
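As an illustration of the kind of failure-case test meant above, here's a minimal sketch using the standard library's ed25519 as a stand-in for the project's own crypto (the actual crypto.Verify signature isn't assumed):

```go
package crypto_test

import (
	"crypto/ed25519"
	"crypto/rand"
	"testing"
)

// TestVerifyRejectsTamperedMessage covers the negative branch: a valid
// signature must be accepted, and a tampered message must be rejected.
func TestVerifyRejectsTamperedMessage(t *testing.T) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		t.Fatal(err)
	}
	msg := []byte("hello coniks")
	sig := ed25519.Sign(priv, msg)

	if !ed25519.Verify(pub, msg, sig) {
		t.Fatal("valid signature was rejected")
	}
	msg[0] ^= 0xff // flip a bit to simulate tampering
	if ed25519.Verify(pub, msg, sig) {
		t.Fatal("tampered message was accepted")
	}
}
```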

masomel (Member) commented Jul 29, 2016

In general, only the success cases are well tested. Branches where we panic, return an error, or return a negative result on a check like crypto.Verify aren't covered by tests most of the time.

This seems like an issue to me. I think we should definitely be covering failure cases as well. This will be especially important if/when we create a common test suite for coniks-go and coniks-java (I think we discussed this in #25 (comment)).

Another thing I wanted to open up for discussion is whether we should be using a code coverage measurement tool like Coveralls or Codecov.io. Those are the main ones I've seen; I'm sure there are more out there.

arlolra (Member, Author) commented Jul 29, 2016

Another thing I wanted to open up for discussion is whether we should be using a code coverage measurement tool like Coveralls or Codecov.io.

Couldn't hurt, and goveralls seems pretty straightforward.

masomel (Member) commented Jul 29, 2016

Couldn't hurt, and goveralls seems pretty straightforward.

Cool! I've authorized Coveralls to access the coniks-sys repos. I'll go set up goveralls.

liamsi (Member) commented Jul 29, 2016

Good idea. I'd also thought about that. After the changes, code coverage is already quite high (90–100%). I'm not necessarily a fan of aiming for 100% coverage, though. But goveralls seems like a good idea anyway!

masomel (Member) commented Jul 29, 2016

So I've set up goveralls. I wasn't able to get it to work without testing each package and subpackage individually (goveralls doesn't support multiple Go packages), so now we just have to remember to add each new package to the YAML file.

liamsi (Member) commented Jul 29, 2016

Besides some tests, I've added benchmarks in #43 as a first indicator of whether something is slow (and needs to be profiled).

It's still pretty much WIP, but feel free to have an early look and/or test it on your machine(s). Maybe the first benchmark (BenchmarkCreateLargePAD) can be removed, as recreating the tree happens rarely (?). The second benchmark (BenchmarkLookUpFromLargeDirectory) measures (single) lookups in a large tree.

I think it looks OK: for a tree with (only) 1,000 entries a lookup takes 566,781 ns (~0.00057 s); with 10,000 entries, around 1,098,726 ns (~0.0011 s); with 100,000 entries, also around 1,166,422 ns (~0.0012 s); and with 250,000 entries, around 1,207,726 ns (~0.0012 s). For even larger trees, creating the tree takes so much time that benchmarking lookups no longer makes sense (all on a recent MacBook Pro). Also, wouldn't the key-value pairs in practice be stored in a DB (independent of the tree structure), so that lookups won't necessarily happen from memory (like in the tests)?
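For reference, the rough shape of such a lookup benchmark, reusing the hypothetical treeStore interface from the sketch earlier in this thread (buildTree is a hypothetical helper, not the actual code in #43):

```go
// benchmarkLookup measures single lookups in a tree of n entries.
// buildTree is a hypothetical helper that fills a fresh tree with n
// random entries; building happens before the timer is reset, so only
// the lookups themselves are measured.
func benchmarkLookup(b *testing.B, n int) {
	tr := buildTree(n)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		tr.Get(fmt.Sprintf("key%d", i%n))
	}
}

func BenchmarkLookup1K(b *testing.B)   { benchmarkLookup(b, 1000) }
func BenchmarkLookup10K(b *testing.B)  { benchmarkLookup(b, 10000) }
func BenchmarkLookup100K(b *testing.B) { benchmarkLookup(b, 100000) }
```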

arlolra (Member, Author) commented Jul 30, 2016

So I've set up goveralls.

Nice work @masomel

vqhuy (Member) commented Jul 30, 2016

Maybe the first benchmark (BenchmarkCreateLargePAD) can be removed, as recreating the tree happens rarely (?).

Could we replace this test with tree clone tests, since that task must be performed in each epoch? I imagine the scenario would be: create an empty tree, insert a lot of records (e.g., 1,000 per epoch), then clone the tree; repeat this a large number of times. A rough sketch is below.
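Here, newEmptyTree, Set, and Clone are assumed names, and the epoch/batch counts are arbitrary illustration sizes:

```go
// BenchmarkEpochInsertAndClone models the per-epoch scenario above:
// each epoch inserts a batch of records and then snapshots (clones)
// the tree, as a key server would at the end of every epoch.
func BenchmarkEpochInsertAndClone(b *testing.B) {
	const epochs, batch = 10, 1000
	for i := 0; i < b.N; i++ {
		tr := newEmptyTree() // hypothetical constructor
		for e := 0; e < epochs; e++ {
			for j := 0; j < batch; j++ {
				tr.Set(fmt.Sprintf("key%d-%d", e, j), []byte("blob"))
			}
			tr = tr.Clone() // end-of-epoch snapshot
		}
	}
}
```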

Also, wouldn't the key-value pairs in practice be stored in a DB (independent of the tree structure), so that lookups won't necessarily happen from memory (like in the tests)?

Maybe I didn't get your idea completely, but I think we do need to perform the lookup from memory, not only to retrieve the user's data blob but also to check whether the given user is in the tree (i.e., registered with the keyserver).

For a tree with (only) 1,000 entries a lookup takes 566,781 ns (~0.00057 s); with 10,000 entries, around 1,098,726 ns (~0.0011 s); with 100,000 entries, also around 1,166,422 ns (~0.0012 s); and with 250,000 entries, around 1,207,726 ns (~0.0012 s).

Great work!

For even larger trees, creating the tree takes so much time that benchmarking lookups no longer makes sense (all on a recent MacBook Pro).

I will give it a try on my machine.

vqhuy (Member) commented Jul 30, 2016

Should we perform the evaluation described in section 5 of the paper?

liamsi (Member) commented Jul 31, 2016

Could we replace this test with tree clone tests, since that task must be performed in each epoch? I imagine the scenario would be: create an empty tree, insert a lot of records (e.g., 1,000 per epoch), then clone the tree; repeat this a large number of times.

Sounds good.

Should we perform the evaluation described in section 5 of the paper?

Makes sense to me. Parts of the evaluation deal with network communication, though; for that to be realistic we could also consider other means (DeterLab or similar testbeds).

liamsi (Member) commented Jul 31, 2016

Maybe I didn't get your idea completely, but I think we do need to perform the lookup from memory, not only to retrieve the user's data blob but also to check whether the given user is in the tree (i.e., registered with the keyserver).

I thought there must be a way to do a single lookup without holding the whole tree in memory (but I didn't think it through). Currently, it seems like creating/querying/updating a large tree takes very long and consumes a lot of memory (even with quite modest tree sizes of only 100,000 entries; also, I don't have 64 GB of RAM as per section 5 ;-)). I'll investigate this further in the coming week. Maybe the benchmarks I'm doing aren't the best/right approach, either.

liamsi (Member) commented Jul 31, 2016

I did some profiling (while benchmarking pad.Update on a tree with 100,000 entries). Interestingly, most of the time is spent doing scalar multiplication (edwards25519.FeMul) and most memory is consumed in crypto.Digest.

Update: I should add that this makes sense, as most of the time (CPU- and memory-wise) is spent creating the 100K directory, where the VRF operations and Digest are called often.
Here is also the benchmark output, to show how long an update operation on a tree with 100K elements takes on average:

 go test -bench=BenchmarkPADUpdate100K -benchmem -cpuprofile cpu.out -memprofile mem.out -memprofilerate 1 -v 
PASS
BenchmarkPADUpdate100K-4               1        6632919902 ns/op        181321712 B/op   1554422 allocs/op
ok      github.com/coniks-sys/coniks-go/merkletree      189.444s

# and without the -memprofilerate flag we are about 30x faster (!) ... :

go test -bench=BenchmarkPADUpdate100K -benchmem -cpuprofile cpu.out -memprofile mem.out 
PASS
BenchmarkPADUpdate100K-4               5         235303081 ns/op        71895660 B/op     876137 allocs/op
ok      github.com/coniks-sys/coniks-go/merkletree      452.044s

Memory usage (including tree creation): https://imagebin.ca/v/2pyfIzrKD7li
CPU usage (incl. tree creation): https://imagebin.ca/v/2pyg7GQ4sDCs

Update: CPU usage (update op. only): https://imagebin.ca/v/2q5DJGSE651d
Update: memory usage (PAD.Update only): https://imagebin.ca/v/2q778EJzdBxd (Digest should/could be optimized)

(All files can be downloaded as .svg and viewed in a browser.)

masomel (Member) commented Jul 31, 2016

It's still pretty much WIP, but feel free to have an early look and/or test it on your machine(s).

I'll run some benchmarks on my lab machine, hopefully early this week, and can report the results back to you, too. Really nice work on the benchmarks so far, @liamsi! It would be cool to post some results in a wiki at some point.

Should we perform the evaluation described in section 5 of the paper?

If you'd like, I could try to run some benchmarks on a setup similar to the one in the paper. We ran them on one of the servers in my group's cluster, and I could probably get that set up again.

liamsi (Member) commented Aug 2, 2016

I think this is ready to be reviewed in PR #43 (unless there are further suggestions on what should be benchmarked/profiled).

My conclusions regarding the profiling: we should refactor the memory-intensive hashing (currently done by crypto.Digest) to use hash.Write wherever it makes sense (and only Sum at the very end). Currently, all bytes are collected and Digest is called at the end.
Other than that, LookUp seems sufficiently fast; and when creating the tree via Set, the crypto (field operations) seems to use a large part of the CPU time (especially edwards25519.FeMul). I'm not sure whether we can/should do anything about those operations, though. (See the updated links in #16 (comment).)
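A minimal sketch of the two hashing patterns, using SHA-512/256 as a stand-in for whatever crypto.Digest actually uses:

```go
package crypto

import "crypto/sha512"

// digestConcat mirrors the current pattern: copy every chunk into one
// intermediate buffer, then hash it in a single call.
func digestConcat(chunks ...[]byte) []byte {
	var buf []byte
	for _, c := range chunks {
		buf = append(buf, c...) // allocates proportionally to the total input
	}
	sum := sha512.Sum512_256(buf)
	return sum[:]
}

// digestStream is the proposed refactor: feed each chunk into the hash
// state with Write and only Sum at the very end, avoiding the buffer.
func digestStream(chunks ...[]byte) []byte {
	h := sha512.New512_256()
	for _, c := range chunks {
		h.Write(c) // hash.Hash's Write never returns an error
	}
	return h.Sum(nil)
}
```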

If you'd like, I could try to run some benchmarks on a (similar) setup to the one in the paper. We ran them on one of the servers in my group's cluster, and I could probably get that set up again.

@masomel I think it makes a lot of sense to compare both implementations, but this isn't very urgent, I guess.

vqhuy (Member) commented Aug 2, 2016

Currently, all bytes are collected and Digest is called at the end.

Can you make it clearer? I think there isn't much memory allocation here, except when we call hash.Write. Maybe the memory overhead comes from 1) how we pass the arguments and 2) the long list of arguments (when hashing the leaf nodes). What do you think?

P.S. This pull request is great/cool work! Thanks @liamsi

liamsi (Member) commented Aug 2, 2016

Currently, all bytes are collected and Digest is called at the end.

Can you make it clearer?

I thought all the data was collected and then Digested at the end. Looking through the code, I see this is not the case; sorry for the noise. Still, we should experiment to see whether we can reduce the memory consumption here. At least the two points you mentioned are worth experimenting with.

liamsi (Member) commented Aug 2, 2016

Using h.Write on each chunk of data instead of crypto.Digest in emptyNode.Hash and userLeafNode.Hash, and removing the additional []byte() conversions where they are obsolete, reduces the total memory consumed while hashing (in exactly the same example, namely BenchmarkPADUpdate100K) from 253.6 MB to 175.7 MB (see here: https://imagebin.ca/v/2qAUxQ3b70Dq).
That is motivation enough for me to open a separate issue for this: #51.

We should also keep in mind that we want devs to be able to specify their own hashing scheme (see #50).

liamsi (Member) commented Aug 2, 2016

Pushed the changes from the above comment to the hash_experiment branch in d128e43 (in case you want to reproduce the benchmark/comparison).

masomel (Member) commented Aug 2, 2016

I think it makes a lot of sense to compare both implementations, but this isn't very urgent, I guess.

No, it's not particularly urgent, especially since the Java implementation we used for our benchmarks in the paper is now out of date.

vqhuy closed this as completed in #43 Aug 2, 2016
arlolra reopened this Aug 2, 2016
liamsi (Member) commented Aug 4, 2016

Closed by #54.

liamsi closed this as completed Aug 4, 2016