Problem description
We use the @google-cloud/datastore dependency in our code. Since some version of grpc-js (we're currently on 0.6.9) we started to receive the following error in our production and staging backends, and also in our cron jobs that stream over ~100K / 1M records in Datastore (sometimes after ~5 minutes, sometimes after ~30 minutes). Error details as seen in our Sentry:
```
Error: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted
    at Object.callErrorFromStatus (/root/repo/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Http2CallStream.<anonymous> (/root/repo/node_modules/@grpc/grpc-js/build/src/client.js:96:33)
    at Http2CallStream.emit (events.js:215:7)
    at Http2CallStream.EventEmitter.emit (domain.js:476:20)
    at /root/repo/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:22
    at processTicksAndRejections (internal/process/task_queues.js:75:11) {
  code: 8,
  details: 'Bandwidth exhausted',
  metadata: Metadata { internalRepr: Map {}, options: {} },
  note: 'Exception occurred in retry method that was not classified as transient'
}
```
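For reference, code 8 maps to the RESOURCE_EXHAUSTED status constant exported by @grpc/grpc-js. An illustrative check (e.g. inside a stream 'error' handler), not something we currently do in our code:

```js
const { status } = require('@grpc/grpc-js');

// Illustrative only: recognize the error above by its gRPC status code.
function isBandwidthExhausted(err) {
  return err != null && err.code === status.RESOURCE_EXHAUSTED; // === 8
}
```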
Reproduction steps
Very hard to give reproduction steps. The stack trace is not "async", in the sense that it doesn't link to the exact place in our code where the call was made (like it would have with return await). We know that in the backend service we're doing all kinds of Datastore calls, but NOT stream. In the cron jobs we DO stream, as well as make other (get, save) API calls.
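For context, here is a minimal sketch of the kind of calls our cron jobs make (the kind name and processing logic are illustrative, not our actual code), using the public @google-cloud/datastore API:

```js
const { Datastore } = require('@google-cloud/datastore');

const datastore = new Datastore();

// Illustrative cron-job style usage: stream an entire kind and perform
// ordinary get/save calls along the way. The RESOURCE_EXHAUSTED error
// surfaces on the stream's 'error' event.
async function processAllRecords() {
  const query = datastore.createQuery('Record');

  await new Promise((resolve, reject) => {
    datastore
      .runQueryStream(query)
      .on('error', reject)
      .on('data', async (entity) => {
        const key = entity[datastore.KEY];
        // ... transform the entity, then write it back
        await datastore.save({ key, data: entity });
      })
      .on('end', resolve);
  });
}
```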
Environment
- Backend service runs in App Engine (was on Node 10, now on the Node 12 beta runtime, which runs Node 12.4.0)
- @grpc/grpc-js@0.6.9

We definitely did NOT see this error on 0.5.x, but I don't remember exactly with which version of 0.6.x it started to appear.
Additional context
The error happens quite seldom, maybe ~1-2 times a day on a backend service that serves ~1M requests a day. But when it fails, it fails hard: it's impossible to try/catch such an error, and usually one "occurrence" of it fails multiple requests from our clients. For example, last night it failed in our staging environment while e2e tests were running (many browsers open in parallel), which produced ~480 errors in one spike. So it looks like the connection does not "recover" from this error very quickly.
Another annoying thing about this error is that if it happens inside a long-running cron job that streams over some table, we have no way to recover from it and the whole cron job ends up "failed in the middle" (imagine a DB migration that fails halfway through in a non-transactional way). So if a cron job needs to run for ~3 hours and fails after 2 hours, we have no choice but to restart it from the very beginning (paying all the Datastore costs again).
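Not part of our current code, but for illustration: one hypothetical way to make such a job resumable would be to replace the stream with the paginated runQuery API and checkpoint the query cursor after every batch, so a crash only costs the current batch. A minimal sketch, assuming persistCursor/loadCursor helpers backed by whatever storage one chooses:

```js
const { Datastore } = require('@google-cloud/datastore');

const datastore = new Datastore();

// Hypothetical sketch: page through a kind with explicit cursors so the
// job can be restarted from the last checkpoint instead of from scratch.
// loadCursor/persistCursor are placeholders, not part of the library.
async function resumableJob({ loadCursor, persistCursor }) {
  let cursor = await loadCursor(); // undefined on a fresh run

  while (true) {
    let query = datastore.createQuery('Record').limit(500);
    if (cursor) {
      query = query.start(cursor);
    }

    const [entities, info] = await datastore.runQuery(query);

    for (const entity of entities) {
      // ... process one entity, e.g. datastore.save(...)
    }

    if (info.moreResults === Datastore.NO_MORE_RESULTS) {
      break;
    }

    cursor = info.endCursor;
    await persistCursor(cursor); // checkpoint: a later run can resume here
  }
}
```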