atproto-hub ndb context failures in handle #1315

snarfed · 2024-09-10T18:04:51Z

Such a weird one, I don't understand this at all yet.

When we run more than one handle thread in atproto-hub, we're fine for a while, but after 9-12h we eventually start seeing this:

google.cloud.ndb.exceptions.ContextError: No current context. NDB calls must be made in context established by google.cloud.ndb.Client.context.
  at .get_toplevel_context ( google/cloud/ndb/context.py:151 )
  at .rpc_call ( google/cloud/ndb/_datastore_api.py:89 )
  at ._advance_tasklet ( google/cloud/ndb/tasklets.py:323 )
  at .begin_transaction ( google/cloud/ndb/_datastore_api.py:1037 )
  at ._advance_tasklet ( google/cloud/ndb/tasklets.py:319 )
  at ._transaction_async ( google/cloud/ndb/_transaction.py:257 )
  at ._advance_tasklet ( google/cloud/ndb/tasklets.py:319 )
  at .retry_wrapper ( google/cloud/ndb/_retry.py:82 )
  at .retry_wrapper ( google/cloud/ndb/_retry.py:97 )
  at ._advance_tasklet ( google/cloud/ndb/tasklets.py:319 )
  at .check_success ( google/cloud/ndb/tasklets.py:157 )
  at .result ( google/cloud/ndb/tasklets.py:210 )
  at .transaction ( google/cloud/ndb/_transaction.py:186 )
  at .transactional_inner_wrapper ( google/cloud/ndb/_transaction.py:347 )
  at ._handle ( /workspace/atproto_firehose.py:340 )

This is on Object.get_or_create:

bridgy-fed/atproto_firehose.py

Lines 334 to 342 in 60c92e1

    try:
        obj = Object.get_or_create(id=obj_id, authed_as=op.repo, status='new',
                                   users=[ATProto(id=op.repo).key],
                                   source_protocol=ATProto.LABEL, **record_kwarg)
        create_task(queue='receive', obj=obj.key.urlsafe(), authed_as=op.repo)
        # when running locally, comment out above and uncomment this
        # logger.info(f'enqueuing receive task for {at_uri}')
    except BaseException:
        report_exception()

The ndb context is long-lived, outside the handle loop:

bridgy-fed/atproto_firehose.py

Lines 290 to 300 in 60c92e1

    with ndb_client.context(
            cache_policy=cache_policy, global_cache=global_cache,
            global_cache_policy=global_cache_policy,
            global_cache_timeout_policy=global_cache_timeout_policy):
        while True:
            try:
                handle()
                # if we return cleanly, that means we hit the limit
                break
            except BaseException:
                report_exception()

So, sure, I buy that ndb contexts may not be thread safe enough, even though we're making a different one in each thread, so we should be fine. From https://googleapis.dev/python/python-ndb/latest/client.html#google.cloud.ndb.client.Client.context :

Code within an asynchronous context should be single threaded. Internally, a threading.local instance is used to track the current event loop.

(We're not async.)
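
For reference, here's a minimal stdlib-only sketch of what thread-local tracking implies: state set in one thread isn't visible from another, so each handle thread needs to establish its own context. This is not ndb's code. `_state`, `get_context`, `NoContextError`, and `check_in_new_thread` are hypothetical names made up for illustration.

```python
import threading

# Hypothetical stand-in for a library that tracks its "current context"
# in thread-local storage. This is NOT ndb's actual implementation.
_state = threading.local()


class NoContextError(Exception):
    """Raised when no context is set for the calling thread."""


def get_context():
    """Return the calling thread's context, or raise if it has none."""
    ctx = getattr(_state, 'context', None)
    if ctx is None:
        raise NoContextError('No current context in this thread')
    return ctx


def check_in_new_thread():
    """Call get_context() from a fresh thread and report what happened."""
    outcome = []

    def worker():
        try:
            outcome.append(get_context())
        except NoContextError:
            outcome.append('no context')

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return outcome[0]


_state.context = 'main-thread context'
in_main = get_context()            # visible in the thread that set it
in_other = check_in_new_thread()   # a new thread starts with empty state
```

That only shows why a context has to be set up per thread. It doesn't explain why a thread that did set one up would lose it hours later.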

It's weird that this doesn't happen until 9-12h after the server starts though, right?!

snarfed · 2024-09-11T14:11:17Z

This time it's lasted 20h and counting since the last restart. 🤷

snarfed · 2024-09-12T15:46:50Z

Still hasn't started back up again yet. Surprising!

snarfed · 2024-09-13T22:07:15Z

Hasn't recurred in >3d. Not sure why, but I won't argue.

snarfed · 2024-09-15T22:40:55Z

Dammit, it started up again. Wish I knew why this was so intermittent! For now I'll restart atproto-hub and see what happens.

snarfed · 2024-09-29T23:04:23Z

Hasn't recurred since we cut a lot of load with #1329 (comment). I don't think it's actually fixed, but triggering it may be load related, so we may not see it again until we get back to that level of load organically.

snarfed · 2024-10-01T15:38:07Z

Goddammit, this started up again last night.

snarfed · 2024-10-01T15:54:44Z

Trying this tweak ^ (commit 6147753: when we get an ndb ContextError, recreate the context), will see.
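
Roughly the shape of that idea, as a stdlib-only sketch rather than the actual change in the commit: on a ContextError, leave the `with` block and enter a fresh one, instead of reporting and looping inside the same context. Everything here is a stand-in. `ContextError` mimics `google.cloud.ndb.exceptions.ContextError`, `make_context` mimics `ndb_client.context(...)`, and `run`, `max_restarts`, and `flaky_handle` are hypothetical names. It also catches `Exception` rather than `BaseException` to keep the sketch interruptible.

```python
import contextlib


class ContextError(Exception):
    """Stand-in for google.cloud.ndb.exceptions.ContextError."""


@contextlib.contextmanager
def make_context(log):
    """Hypothetical stand-in for ndb_client.context(...); records enter/exit."""
    log.append('enter')
    try:
        yield
    finally:
        log.append('exit')


def run(handle, log, max_restarts=3):
    """Run handle() in a context; on ContextError, exit it and create a new one."""
    for _ in range(max_restarts + 1):
        with make_context(log):
            while True:
                try:
                    handle()
                    return True  # clean return: handle() hit its limit
                except ContextError:
                    log.append('recreate')
                    break  # leave the with block; the for loop makes a new context
                except Exception as err:
                    log.append(f'error: {err}')  # report, retry in the same context
    return False  # gave up after max_restarts context recreations


# Usage: a handle() that loses its context once, then returns cleanly.
calls = {'n': 0}


def flaky_handle():
    calls['n'] += 1
    if calls['n'] == 1:
        raise ContextError('No current context')


log = []
ok = run(flaky_handle, log)
```

The `max_restarts` cap is an assumption added so the sketch can't spin forever if the new context fails immediately too.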

snarfed · 2024-10-02T22:18:27Z

Also dropped handle threads in atproto-hub from 100 down to 10. 🤞🤞🤞

snarfed added infra now labels Sep 10, 2024

snarfed closed this as completed Sep 13, 2024

snarfed reopened this Sep 15, 2024

snarfed closed this as not planned Sep 29, 2024

snarfed reopened this Oct 1, 2024

snarfed added a commit that referenced this issue Oct 1, 2024

atproto_firehose.handle: when we get an ndb ContextError, recreate th…

6147753

…e context for #1315, https://console.cloud.google.com/errors/detail/CIvwj_7MmsfOWw;time=PT1H;locations=global?project=bridgy-federated 😠

snarfed added a commit that referenced this issue Oct 2, 2024

atproto_hub: drop handle threads down from 100 to 10

fcfc047

for #1315, maybe

snarfed closed this as completed Oct 2, 2024
