
kvdb+channeldb: speed up graph cache #6111

Merged: 8 commits from the cache-loading branch into lightningnetwork:master, Jan 21, 2022

Conversation

joostjager (Contributor) commented Dec 23, 2021:

This PR adds ForAll to kvdb to allow for efficient range queries. The graph cache loader is rewritten to take advantage of this new method, improving loading time on my machine by approx 150x.

Fixes #6041
Fixes #6107
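(Not part of the original description: a minimal sketch of the kind of single-pass range read kvdb.ForAll enables, assuming the bucket-and-callback shape that appears in the review snippets below; bucket and error names are borrowed from channeldb purely for illustration.)

// Load every edge in one pass instead of issuing one cursor query per key.
// The callback must not run further queries in the same transaction.
err := kvdb.View(db, func(tx kvdb.RTx) error {
	edges := tx.ReadBucket(edgeBucket)
	if edges == nil {
		return ErrGraphNoEdgesFound
	}

	return kvdb.ForAll(edges, func(k, v []byte) error {
		// Deserialize v and add the channel to the in-memory cache.
		return nil
	})
}, func() {})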

joostjager (Contributor Author):

@guggero @bhandras looking for concept ack


func (b *readWriteBucket) Prefetch(paths ...[]string) {}

func (b *readWriteBucket) ForEachFast(cb func(k, v []byte) error) error {
Collaborator:

How about naming this ForAll? Since it'd be a valid question to ask why ForEach is not as fast as ForEachFast.

joostjager (Contributor Author) Dec 23, 2021:

Hm, naming is difficult. ForEach and ForAll also sound like the same thing to me. The functional difference is that the faster one doesn't allow further queries in the callback. Maybe that can somehow be expressed in the name.

Collaborator:

Aha, I missed the part about not allowing more queries in the callback. What I expected instead was that it would simply batch-fetch more items at once under the hood.

Collaborator:

Also weighing in on the naming discussion :) How about ForEachUnordered (the main promise of the cursor-based ForEach is the sort-by-key and seek functionality, right?) or ForEachNative (implying that the underlying implementation might decide on the best approach for executing it; we could also mention that no specific ordering is guaranteed).

joostjager (Contributor Author) Dec 23, 2021:

This is what I wanted to do originally with the cursor. In that case, no new method is needed either. But I remember there was a problem with that; only, what was it?

I don't think it was writing to the bucket inside the loop, because that is already not supported with bbolt, or is it? Don't we have these iterate-and-delete loops? We could also apply the optimization only to read-only transactions.

joostjager (Contributor Author):

Or alternatively we can use start (index). That would translate to OFFSET .. LIMIT .. in SQL.

joostjager (Contributor Author) Dec 23, 2021:

I think it is still the case with GetAllPaginated that the callback isn't allowed to do other queries, so maybe this is just scope creep beyond a single ForAll?

joostjager (Contributor Author):

Or maybe your idea is to not have a callback with GetAllPaginated and instead return []KV?

Collaborator:

> Interesting. So this would also allow nested keys to be retrieved in a single query. And besides limit you'd also want to specify some kind of fromKey, right?

Yes.

> Or alternatively we can use start (index). That would translate to OFFSET .. LIMIT .. in SQL.

I think fromKey is more "portable".

> I think it is still the case with GetAllPaginated that the callback isn't allowed to do other queries, so maybe this is just scope creep beyond a single ForAll?

Maybe we should go even further and add it to the kvdb.DB interface as an extension. That would indicate that it's purely an optimization and doesn't need to be compatible with the rest of the kvdb interfaces.

> Or maybe your idea is to not have a callback with GetAllPaginated and instead return []KV?

Yeah, sgtm!
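(A purely hypothetical sketch of the paginated shape being discussed, not code from the PR; the KV type, interface name, and method signature are illustrative only.)

// KV is a plain key/value pair copied out of the store.
type KV struct {
	Key   []byte
	Value []byte
}

// PaginatedReader sketches the extension idea: resume from a key, cap the
// result size, and return slices instead of driving a callback.
type PaginatedReader interface {
	// GetAllPaginated returns up to limit key/value pairs from the bucket
	// at path, starting after fromKey. No ordering beyond the key
	// comparison is guaranteed.
	GetAllPaginated(path [][]byte, fromKey []byte, limit int) ([]KV, error)
}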

joostjager (Contributor Author):

I get the idea, but after sleeping on it, I think we should do the minimum that is necessary at this point. Just fix the immediate problem with postgres startup loading.

If another critical optimization comes up, we can look at what is needed for that. I don't expect it to be a problem to change the ForAllNative we use now into something slightly different in the future.

bhandras (Collaborator):

Concept ACK.

guggero (Collaborator) left a comment:

Concept ACK from my side as well.

I'm asking myself why ForEach would be implemented using cursors in the first place. I get that this is coming from porting over the bbolt bucket logic. But thinking about it now, is there ever a reason to use cursors instead of basically doing what ForEachFast does now (but maybe with an ORDER BY clause)?



joostjager (Contributor Author):

> I'm asking myself why ForEach would be implemented using cursors in the first place. I get that this is coming from porting over the bbolt bucket logic. But thinking about it now, is there ever a reason to use cursors instead of basically doing what ForEachFast does now (but maybe with an ORDER BY clause)?

The reason to use a cursor (or really a single query per row) is that otherwise it wouldn't be possible to query within the callback. While the SELECT is running, nothing else can be done in that same tx.
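(A minimal sketch of the single-SELECT alternative, assuming a database/sql-style Postgres backend with one key/value table per bucket; this is an illustration of the trade-off, not the PR's actual implementation.)

// import "database/sql"

// forAll streams one SELECT over the bucket's table and invokes cb per row.
// Because the connection is busy streaming this result set, cb must not run
// further queries against the same transaction -- the restriction above.
func forAll(tx *sql.Tx, table string, cb func(k, v []byte) error) error {
	rows, err := tx.Query("SELECT key, value FROM " + table)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var k, v []byte
		if err := rows.Scan(&k, &v); err != nil {
			return err
		}
		if err := cb(k, v); err != nil {
			return err
		}
	}
	return rows.Err()
}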

joostjager force-pushed the cache-loading branch 3 times, most recently from 0d43346 to 6270ba1 on December 24, 2021 at 12:23
joostjager (Contributor Author):

Cleaned up PR, added test coverage for graph cache population.

joostjager force-pushed the cache-loading branch 3 times, most recently from 4bd2ad3 to f3802e2 on December 26, 2021 at 14:41
joostjager (Contributor Author):

I discovered that DescribeGraph suffers from the same slowness as the cache loading. I restructured the code so that DescribeGraph also uses the efficient ForEachChannel. I briefly worried about memory usage on low-end devices, but then realized that DescribeGraph already loads the full graph into memory for gRPC serialization anyway. Perhaps a streaming version would be more appropriate if memory usage is really a concern.

lightninglabs-deploy:

@bhandras: review reminder
@guggero: review reminder

Roasbeef added this to the v0.14.2 milestone Jan 3, 2022
Roasbeef added the database and graph labels Jan 3, 2022
guggero (Collaborator) left a comment:

Did a first level pass. Very nice performance improvement!
Will run some performance and other manual tests during the next pass.

Also, can be rebased now that #6116 is merged to fix the postgres itests.

(Several inline review comments on channeldb/graph.go, channeldb/graph_test.go, and docs/release-notes/release-notes-0.14.2.md were marked resolved.)
(commit message) Allows for pure deserialization without depending on a database connection.
guggero (Collaborator) commented Jan 12, 2022:

> So previously the problem was the number of allocations and garbage collection of those, and not so much the total memory requirement?

Yes, the problem for low-ish memory environments was that there was a huge initial spike in memory, but most of it could then be garbage collected. So a lot of throw-away instances/allocations.

guggero (Collaborator) commented Jan 13, 2022:

I ran some tests (imported a mainnet channel graph into a regtest node with the help of #6149).

Current master (ed511bb)

Database open time: 2m 8s

Memory profile:

Showing nodes accounting for 45.52MB, 85.78% of 53.07MB total
Showing top 10 nodes out of 124
      flat  flat%   sum%        cum   cum%
   10.50MB 19.79% 19.79%    10.50MB 19.79%  github.com/lightningnetwork/lnd/channeldb.NewCachedPolicy
      10MB 18.85% 38.63%    26.02MB 49.04%  github.com/lightningnetwork/lnd/channeldb.(*GraphCache).AddChannel
    7.01MB 13.20% 51.84%     7.01MB 13.20%  runtime.allocm
    5.52MB 10.41% 62.24%     5.52MB 10.41%  github.com/lightningnetwork/lnd/channeldb.(*GraphCache).updateOrAddEdge
    4.39MB  8.26% 70.51%     4.39MB  8.26%  github.com/lightningnetwork/lnd/channeldb.NewGraphCache
    2.66MB  5.01% 75.52%     2.66MB  5.01%  google.golang.org/grpc/internal/transport.newBufWriter
    2.28MB  4.30% 79.82%     2.28MB  4.30%  github.com/lightningnetwork/lnd/channeldb.newRejectCache
    1.11MB  2.09% 81.91%     1.11MB  2.09%  github.com/btcsuite/btcd/btcec.loadS256BytePoints
    1.03MB  1.94% 83.85%     1.03MB  1.94%  bufio.NewReaderSize
    1.03MB  1.93% 85.78%     1.03MB  1.93%  regexp/syntax.(*compiler).inst

This PR (0cca4549)

Database open time: 3.3s

Memory profile:

Showing nodes accounting for 78MB, 93.68% of 83.27MB total
Showing top 10 nodes out of 126
      flat  flat%   sum%        cum   cum%
   45.01MB 54.05% 54.05%    47.01MB 56.45%  github.com/lightningnetwork/lnd/channeldb.deserializeChanEdgePolicyRaw
   13.55MB 16.27% 70.32%    60.55MB 72.72%  github.com/lightningnetwork/lnd/channeldb.(*ChannelGraph).getChannelMap.func1
       5MB  6.01% 76.33%        5MB  6.01%  runtime.allocm
    4.39MB  5.27% 81.60%     4.39MB  5.27%  github.com/lightningnetwork/lnd/channeldb.NewGraphCache
    2.66MB  3.19% 84.79%     2.66MB  3.19%  google.golang.org/grpc/internal/transport.newBufWriter
    2.28MB  2.74% 87.53%     2.28MB  2.74%  github.com/lightningnetwork/lnd/channeldb.newRejectCache
    1.50MB  1.80% 89.34%     1.50MB  1.80%  runtime.malg
    1.50MB  1.80% 91.14%     1.50MB  1.80%  github.com/btcsuite/btcd/wire.ReadVarBytes
    1.11MB  1.33% 92.47%     1.11MB  1.33%  github.com/btcsuite/btcd/btcec.loadS256BytePoints
    1.01MB  1.21% 93.68%     1.01MB  1.21%  github.com/jackc/chunkreader/v2.(*ChunkReader).newBuf

So while this PR massively decreases the time needed to initialize the graph cache (and also fixes the DescribeGraph RPC being very slow), it comes at some allocation cost. I assume this is because we have yet another layer of cache (the channelMap), which can be seen in the profile. But that is memory that should eventually be garbage collected.

While I think we could still optimize the memory usage (by effectively making it possible to re-use the channelMap for the graph cache directly), I don't think it's worth doing at this point. Mobile users, for example, won't be turning on the graph cache in the first place in order to save on memory, so they won't run into this increased use anyway (unless DescribeGraph is used).

bhandras (Collaborator) left a comment:

LGTM 👍


err := kvdb.ForAll(edges, func(k, edgeBytes []byte) error {
// Skip embedded buckets.
if bytes.Equal(k, edgeIndexBucket) ||
Collaborator:

I'm thinking maybe a more future-proof way to avoid iterating the embedded buckets is to check whether the value is nil. Not perfect either, but AFAIK there's no nil-valued key in this bucket other than the sub-buckets. The unknown policy is an empty slice ([]byte{}), so that would work too.
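(A sketch of the value-based check being suggested, assuming bbolt's convention, mirrored by the other kvdb backends, that ForEach-style iteration passes a nil value for nested buckets; illustrative only.)

err := kvdb.ForAll(edges, func(k, edgeBytes []byte) error {
	// A nil value means k refers to a nested bucket rather than an edge
	// entry, so it can be skipped without naming each sub-bucket.
	if edgeBytes == nil {
		return nil
	}

	// ... process the edge entry ...
	return nil
})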

joostjager (Contributor Author):

I made it explicit as a sanity check that there is nothing unexpected in this bucket. But you could also consider it less future-proof to not skip over unexpected data.


// Skip ahead:
// - LastUpdate (8 bytes)
if _, err := r.Read(node.nodeScratch[:]); err != nil {
return err
if _, err := r.Read(nodeScratch[:]); err != nil {
Collaborator:

I dug into this a bit and I think many of the allocations might just be due to us using the io.Reader interface here, since both Read and ReadFull do allocate. We may want to switch in the future to just passing the byte slices around to these deserialize functions and copying out only the important parts.
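(An illustration of the slice-based approach described above; the 8-byte LastUpdate field follows the comment in the diff, but the exact layout and the helper name are assumptions, not channeldb's real code.)

// import "encoding/binary", "fmt", "time"

// deserializeLastUpdate reads the LastUpdate timestamp directly from the raw
// value instead of going through an io.Reader, avoiding per-call scratch
// buffers that may escape to the heap.
func deserializeLastUpdate(nodeBytes []byte) (time.Time, error) {
	if len(nodeBytes) < 8 {
		return time.Time{}, fmt.Errorf("node record too short")
	}

	unix := binary.BigEndian.Uint64(nodeBytes[:8])
	return time.Unix(int64(unix), 0), nil
}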

Member:

Seems related to: #4884

joostjager (Contributor Author) commented Jan 13, 2022:

@guggero at what point exactly did you insert the memory profile write action? I haven't used the mem profiler much and am trying to repro.

[update]
I can repro now, but only if I don't garbage collect before writing the heap profile. Did you call the GC? Otherwise I think it's a bit more random what you're getting, depending on when the GC last ran.
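(A minimal sketch of the force-GC-then-write approach described here; the output file name is just an example.)

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("heap-after-cache.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Collect first so the profile reflects live memory rather than
	// garbage that simply has not been collected yet.
	runtime.GC()
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}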

guggero (Collaborator) commented Jan 13, 2022:

I extracted the memory profile through HTTP (using --profile=1111) so I didn't explicitly call GC.

joostjager (Contributor Author):

So you timed it somewhere during the cache population phase? As mentioned, I know embarrassingly little about mem profiling, but will what you did get you the info you're looking for?

guggero (Collaborator) commented Jan 13, 2022:

I waited until the log said the cache was filled. Then I ran curl http://localhost:1111/debug/pprof/heap > heap-6111-done.prof, followed by go tool pprof heap-6111-done.prof, where I typed top to get the output I posted.
I'm also not extremely proficient with profiling, so maybe I'm doing it incorrectly (or at the wrong time)...

Roasbeef (Member):

Does this PR also need #6136? Or will that one be what ultimately gets merged?

Roasbeef (Member):

> I waited until the log said the cache was filled. Then I ran curl http://localhost:1111/debug/pprof/heap > heap-6111-done.prof, followed by go tool pprof heap-6111-done.prof, where I typed top to get the output I posted.

Alternatively you could do something like this @joostjager:

Assuming --profile=5000 is set, this'll open up the nicer web-based UI that can also give you breakdowns like the above, but with SVGs and flame graphs, etc.

Timing is somewhat tricky, as Oliver mentions, since it's based on that logging point. Alternatively you could rig lnd to shut down as soon as the graph is populated, set the cpuprofile (I think this'll also write the memory as well?), and use Oli's command above.
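(The snippet referred to here isn't preserved in this dump. A plausible invocation of the pprof web UI against lnd's profile server, assuming --profile=5000 and an arbitrary local UI port, would be:)

go tool pprof -http=localhost:8083 http://localhost:5000/debug/pprof/heap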

defer cancel()

for rows.Next() {
var key, value []byte
Member:

Declare at the top of this loop, with the assumption that the caller won't use the byte slices outside of the scope of the closure? A similar assumption exists w.r.t the way bbolt works.

joostjager (Contributor Author):

For ForEach, the bbolt docs say:

> Please note that keys and values in ForEach() are only valid while the transaction is open. If you need to use a key or value outside of the transaction, you must use copy() to copy it to another byte slice.

So the scope for this is wider than just the callback?
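(A minimal illustration of the copy the bbolt docs call for, using the standard go.etcd.io/bbolt API; the bucket name is illustrative and db is assumed to be an open *bolt.DB.)

// import bolt "go.etcd.io/bbolt"

var values [][]byte
err := db.View(func(tx *bolt.Tx) error {
	b := tx.Bucket([]byte("edges"))
	if b == nil {
		return nil
	}
	return b.ForEach(func(k, v []byte) error {
		// k and v are only valid while the transaction is open, so
		// copy v before keeping it for use outside db.View.
		values = append(values, append([]byte(nil), v...))
		return nil
	})
})
if err != nil {
	log.Fatal(err)
}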

(Two more inline review comments on channeldb/graph.go were marked resolved.)

// First, load all edges in memory indexed by node and channel
// id.
channelMap, err := c.getChannelMap(edges)
Member:

This impacts all other callers of this method (which ideally should just be hitting the main graph cache instead?) to optimize for only the start up case.

joostjager (Contributor Author) Jan 14, 2022:

There are no other callers besides graph cache population and DescribeGraph which needs this optimization as well.

It can't hit the main graph cache, because that only contains a subset of the graph data needed for pathfinding.

Member:

> and DescribeGraph which needs this optimization as well.

FWIW we now have in-memory caching of the proto serialization here.


Roasbeef (Member) left a comment:

hit wrong button...

Roasbeef (Member):

> Another heavy call that shows up in profiles: func (c *ChannelGraph) FilterKnownChanIDs. It is called with 85k channels during initial sync.

Yeah, this can just hit the channel graph cache instead. Prob worthy of spinning out into another issue. Alternatively it can use a bloom filter; it's not that big of a deal if we fetch some stuff we don't actually need.

},
)
if err != nil {
return nil, err
}

err = g.ForEachChannel(func(info *ChannelEdgeInfo,
Member:

Related to my comment elsewhere: perhaps we just want to have a new private forEachChannelX method here that skips the intermediate map and inserts directly into the cache?

joostjager (Contributor Author):

> Timing is somewhat tricky, as Oliver mentions, since it's based on that logging point. Alternatively you could rig lnd to shut down as soon as the graph is populated, set the cpuprofile (I think this'll also write the memory as well?), and use Oli's command above.

Yes, this is what I did. I saved a mem profile right after graph population. But interestingly some allocations were already gone. Perhaps because the Go compiler already knew that I wasn't going to use that data anymore, even though the function hadn't returned yet.

I guess what we really want to know here is peak allocations rather than a snapshot.
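(Not from the thread, but related: the same heap profile also records cumulative allocation data, so viewing it with the alloc_space sample index shows total bytes allocated since process start, which gets closer to the churn being discussed than an in-use snapshot:)

go tool pprof -sample_index=alloc_space heap-6111-done.prof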

(commit messages)
In this commit, we modify the implementation of ForEachChannel to utilize the new kvdb method ForAll. This greatly reduces the number of round-trips to the database needed to iterate over all channels in the graph.

Allows cacheableNode to be used outside of the callback. This is a preparation for optimization of the graph cache population.

Use the optimized ForEachChannel method to reduce the graph cache loading time.
joostjager (Contributor Author):

> Does this PR also need #6136? Or will that one be what ultimately gets merged?

This PR doesn't need #6136.

bhandras (Collaborator):

> Does this PR also need #6136? Or will that one be what ultimately gets merged?

We can cherry-pick it into this PR; however, it's OK to keep them separate too. See #6111 (comment).

Roasbeef (Member) left a comment:

LGTM ☄️



Roasbeef merged commit d67e6d5 into lightningnetwork:master on Jan 21, 2022
joostjager (Contributor Author):

@guggero @bhandras the repo needs a kvdb v1.3.0 tag

guggero (Collaborator) commented Jan 25, 2022:

Ah right, thanks for the reminder! Pushed the tag now.

Labels: database, graph, optimization, postgres