atproto-hub ndb context failures in handle #1315

Closed
snarfed opened this issue Sep 10, 2024 · 8 comments

Comments

snarfed commented Sep 10, 2024

Such a weird one, I don't understand this at all yet.

When we run more than one handle thread in atproto-hub, we're fine for a while, but after 9-12h we eventually start seeing this:

google.cloud.ndb.exceptions.ContextError: No current context. NDB calls must be made in context established by google.cloud.ndb.Client.context.
  at .get_toplevel_context ( google/cloud/ndb/context.py:151 )
  at .rpc_call ( google/cloud/ndb/_datastore_api.py:89 )
  at ._advance_tasklet ( google/cloud/ndb/tasklets.py:323 )
  at .begin_transaction ( google/cloud/ndb/_datastore_api.py:1037 )
  at ._advance_tasklet ( google/cloud/ndb/tasklets.py:319 )
  at ._transaction_async ( google/cloud/ndb/_transaction.py:257 )
  at ._advance_tasklet ( google/cloud/ndb/tasklets.py:319 )
  at .retry_wrapper ( google/cloud/ndb/_retry.py:82 )
  at .retry_wrapper ( google/cloud/ndb/_retry.py:97 )
  at ._advance_tasklet ( google/cloud/ndb/tasklets.py:319 )
  at .check_success ( google/cloud/ndb/tasklets.py:157 )
  at .result ( google/cloud/ndb/tasklets.py:210 )
  at .transaction ( google/cloud/ndb/_transaction.py:186 )
  at .transactional_inner_wrapper ( google/cloud/ndb/_transaction.py:347 )
  at ._handle ( /workspace/atproto_firehose.py:340 )

This is on Object.get_or_create:

try:
    obj = Object.get_or_create(id=obj_id, authed_as=op.repo, status='new',
                               users=[ATProto(id=op.repo).key],
                               source_protocol=ATProto.LABEL, **record_kwarg)
    create_task(queue='receive', obj=obj.key.urlsafe(), authed_as=op.repo)
    # when running locally, comment out above and uncomment this
    # logger.info(f'enqueuing receive task for {at_uri}')
except BaseException:
    report_exception()

The ndb context is long-lived, outside the handle loop:

with ndb_client.context(
        cache_policy=cache_policy, global_cache=global_cache,
        global_cache_policy=global_cache_policy,
        global_cache_timeout_policy=global_cache_timeout_policy):
    while True:
        try:
            handle()
            # if we return cleanly, that means we hit the limit
            break
        except BaseException:
            report_exception()
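
For illustration only, here is a minimal stdlib sketch of that layout: several threads, each entering its own long-lived context and then retrying `handle()` until it returns cleanly. `fake_context`, `worker`, and `run` are hypothetical stand-ins, not the real ndb API or the actual atproto_firehose.py code, and the sketch catches `Exception` rather than `BaseException` to keep the demo interruptible.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for ndb's per-thread context tracking.
_local = threading.local()


@contextmanager
def fake_context():
    """Stand-in for ndb_client.context(); marks this thread as in-context."""
    _local.in_context = True
    try:
        yield
    finally:
        _local.in_context = False


def worker(name, log):
    """Enter one long-lived context, then loop until handle() returns cleanly."""
    attempts = 0

    def handle():
        nonlocal attempts
        attempts += 1
        # Record whether this thread currently sees its own context.
        log.append((name, attempts, getattr(_local, 'in_context', False)))
        if attempts == 1:
            raise RuntimeError('simulated failure')

    with fake_context():
        while True:
            try:
                handle()
                break  # a clean return means handle() hit its limit
            except Exception:
                pass  # stand-in for report_exception()


def run(num_threads=3):
    """Start num_threads workers and return everything they logged."""
    log = []
    threads = [threading.Thread(target=worker, args=(f't{i}', log))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In this toy version every `handle()` call sees its own thread's context, including the retry after the simulated failure, which is the behavior we'd expect from the real setup too.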

So, sure, I buy that ndb contexts may not be thread safe enough, even though we're making a different one in each thread and so should be fine. From https://googleapis.dev/python/python-ndb/latest/client.html#google.cloud.ndb.client.Client.context :

Code within an asynchronous context should be single threaded. Internally, a threading.local instance is used to track the current event loop.
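
To make the `threading.local` point concrete, here is a small stdlib-only sketch. It does not use ndb; `current` and `demo` are hypothetical names. It only shows the general behavior the docs refer to: a value stored on a `threading.local` in one thread is invisible from any other thread, which is why each thread needs to establish its own context.

```python
import threading

# Hypothetical stand-in for the thread-local slot a library might use
# to track the "current context".
current = threading.local()


def read_context(out):
    """Append what this thread sees as the current context (None if unset)."""
    out.append(getattr(current, 'context', None))


def demo():
    """Set a context in the calling thread, then read it from a new thread."""
    current.context = 'main-thread-context'
    seen = []
    t = threading.Thread(target=read_context, args=(seen,))
    t.start()
    t.join()
    # The calling thread still sees its value; the new thread saw nothing.
    return current.context, seen[0]
```

Calling `demo()` returns `('main-thread-context', None)`: the second thread never sees the value set by the first.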

(We're not async.)

It's weird that this doesn't happen until 9-12h after the server starts though, right?!

snarfed commented Sep 11, 2024

This time it's lasted 20h and counting since the last restart. 🤷

snarfed commented Sep 12, 2024

Still hasn't started back up again yet. Surprising!

snarfed commented Sep 13, 2024

Hasn't recurred in >3d. Not sure why, but I won't argue.

@snarfed snarfed closed this as completed Sep 13, 2024

snarfed commented Sep 15, 2024

Dammit, it started up again. Wish I knew why this was so intermittent! For now I'll restart atproto-hub and see what happens.

@snarfed snarfed reopened this Sep 15, 2024

snarfed commented Sep 29, 2024

Hasn't recurred since we cut a lot of load with #1329 (comment). I don't think it's actually fixed, but triggering it may be load related, so we may not see it again until we get back to that level of load organically.

@snarfed snarfed closed this as not planned Sep 29, 2024

snarfed commented Oct 1, 2024

Goddammit, this started up again last night.

snarfed commented Oct 1, 2024

trying this tweak ^, will see.

snarfed added a commit that referenced this issue Oct 2, 2024

snarfed commented Oct 2, 2024

Also dropped handle threads in atproto-hub from 100 down to 10. 🤞🤞🤞

@snarfed snarfed closed this as completed Oct 2, 2024