
[Argus] Fix distributor node syncing query #4921

Merged
mnaamani merged 1 commit into Joystream:master from zeeshanakram3:argus_fix-distributor-node-syncing-query on Oct 10, 2023

Conversation

@zeeshanakram3
Contributor

Problem

The distributor node executes the getDistributionBucketsWithObjectsByWorkerId query to fetch all the data objects that a given node is supposed to distribute. The problem is that, with the steady growth of the storage directory, the response size of this query keeps increasing. For some time this query had been timing out; upon investigation it turned out that while executing it, the graphql-server was consistently crashing with FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

[Screenshot: graphql-server crash log showing the heap out-of-memory error]

Fix

This PR divides the query into multiple smaller queries so that the graphql-server can process them successfully, until we find the root cause of the memory leak in the graphql-server and implement a proper fix.
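A minimal sketch of the batching approach described above. All names here (`QueryNodeApi`, `getDataObjectsByBagIds`, `DataObject`, `BATCH_SIZE`) are illustrative assumptions, not the PR's actual identifiers:

```ts
// Illustrative sketch: split one huge query into fixed-size batches.
interface DataObject {
  id: string
}

interface QueryNodeApi {
  // Assumed client wrapper around the GraphQL query.
  getDataObjectsByBagIds(bagIds: string[]): Promise<DataObject[]>
}

const BATCH_SIZE = 1000 // assumed batch size, not taken from the PR

async function getAllDataObjects(api: QueryNodeApi, bagIds: string[]): Promise<DataObject[]> {
  const results: DataObject[] = []
  for (let i = 0; i < bagIds.length; i += BATCH_SIZE) {
    const bagIdsBatch = bagIds.slice(i, i + BATCH_SIZE)
    // Each request carries a bounded number of bag IDs, so the
    // graphql-server builds a much smaller response per query.
    results.push(...(await api.getDataObjectsByBagIds(bagIdsBatch)))
  }
  return results
}
```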

@kdembler (Collaborator) left a comment

Tested this and it works properly! I've checked that the number of objects is the same as in the previous version.
You can also see less pressure on resources:
[Screenshot: reduced resource usage after the fix]

Thanks for your work @zeeshanakram3!

The inline review comment below refers to this hunk from the batched query code:

```ts
    GetDataObjectsWithBagsByIdsQuery,
    GetDataObjectsWithBagsByIdsQueryVariables
  >(GetDataObjectsWithBagsByIds, { bagIds: bagIdsBatch, limit: bagIdsBatch.length }, 'storageBags'))
)
```
@kdembler (Collaborator)

What do you think of adding a small delay between batches, like 0.5s? I'd rather have this operation take slightly longer than put such a heavy load on the server.
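A hedged sketch of that suggestion, reusing the illustrative names from the earlier sketch; the `sleep` helper and the 500 ms figure are for illustration only:

```ts
// Illustrative only: same batching loop as above, with a fixed pause between requests.
const BATCH_DELAY_MS = 500 // 0.5s, as suggested in the review

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms))
}

async function getAllDataObjectsThrottled(api: QueryNodeApi, bagIds: string[]): Promise<DataObject[]> {
  const results: DataObject[] = []
  for (let i = 0; i < bagIds.length; i += BATCH_SIZE) {
    results.push(...(await api.getDataObjectsByBagIds(bagIds.slice(i, i + BATCH_SIZE))))
    if (i + BATCH_SIZE < bagIds.length) {
      await sleep(BATCH_DELAY_MS) // ease load on the graphql-server between batches
    }
  }
  return results
}
```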

@mnaamani (Member) left a comment

Great work.
I wouldn't be surprised if the storage-node needs a similar fix.
@kdembler suggested adding a small delay between fetching batches. That is probably a good idea. What might work even better is to use generators, yielding one batch at a time and allowing the consumer/caller to process each batch of objects before coming back for more; see the sketch below.
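A rough sketch of the generator idea, again with the illustrative names from the earlier sketches (none of this is the PR's actual code):

```ts
// Sketch: an async generator yields one batch at a time, so the caller
// naturally throttles fetching by how fast it processes each batch.
async function* dataObjectBatches(
  api: QueryNodeApi,
  bagIds: string[],
  batchSize = BATCH_SIZE
): AsyncGenerator<DataObject[]> {
  for (let i = 0; i < bagIds.length; i += batchSize) {
    yield await api.getDataObjectsByBagIds(bagIds.slice(i, i + batchSize))
  }
}

// Usage: the next query is only issued after the previous batch is handled,
// giving back-pressure without an explicit sleep.
async function syncAll(api: QueryNodeApi, bagIds: string[]): Promise<void> {
  for await (const batch of dataObjectBatches(api, bagIds)) {
    console.log(`processing ${batch.length} objects`) // e.g. schedule downloads here
  }
}
```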

I will merge, then bump the version of Argus and prepare a Docker release (using the #4886 branch).

@mnaamani merged commit 462d157 into Joystream:master on Oct 10, 2023
@bwhm (Collaborator) commented Oct 10, 2023

> Great work. I wouldn't be surprised if the storage-node needs a similar fix. @kdembler suggested adding a small delay between fetching batches. That is probably a good idea. What might work better is to use generators, yielding one batch at a time, allow consumer/caller to process the batch of objects before coming back for more.
>
> I will merge then bump the version of argus and prepare a docker release (using the #4886 branch)

The storage node already uses pagination at least, and from my testing, it wasn't as bad. It would help the overall system load to bump the syncInterval from the current 1 minute...

@mnaamani (Member)

> The storage node already uses pagination at least, and from my testing, it wasn't as bad. It would help the overall system load to bump the syncInterval from the current 1 minute...

#4924
