
Bigtable: read_rows: no deadline causing stuck clients; no way to change it #6

Closed · googleapis/python-api-core#20 · @evanj

Description


Table.read_rows does not set any deadline, so it can hang forever if the Bigtable server connection hangs. We see this happening once every week or two when running inside GCP, which causes our server to get stuck indefinitely. There appears to be no way in the API to set a deadline, even though the documentation says that the retry parameter should do this. Due to a bug, it does not.

Details:

We are calling Table.read_rows to read ~2 rows from Bigtable. Using pyflame on a stuck process, we found both worker threads waiting on Bigtable, with the stack trace below. I believe the bug is the following:

  1. Call Table.read_rows. This constructs a PartialRowsData, passing the retry argument, which defaults to DEFAULT_RETRY_READ_ROWS; the default misleadingly sets deadline=60.0. It also passes read_method=self._instance._client.table_data_client.transport.read_rows to PartialRowsData, which is a method on BigtableGrpcTransport.
  2. PartialRowsData.__init__ calls read_method(); this is actually the raw gRPC _UnaryStreamMultiCallable, not the gapic BigtableClient.read_rows, which as far as I can see is never called. Hence, the gRPC streaming call is started without any deadline.
  3. PartialRowsData.__iter__ calls self._read_next_response, which calls return self.retry(self._read_next, on_error=self._on_error)(). This gives the impression that retry is used, but if I understand gRPC streams correctly, I'm not sure that even makes sense: even if the gRPC stream returns an error, calling next again won't retry the RPC; it will just immediately raise the same exception. To actually retry, I believe you need to restart the stream by calling read_rows again (see the sketch after this list).
  4. If the Bigtable server now "hangs", the client hangs forever.
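To make step 3 concrete, here is a minimal sketch of the difference (read_next, retrying_read_rows, start_stream, and is_retryable are hypothetical names for illustration, not library code):

# What row_data.py effectively does today: wrap next() on a single,
# already-started gRPC stream. If that stream has failed, calling
# next() on the SAME iterator re-raises immediately; no new RPC is sent.
def read_next(stream):
    return next(stream)


# What an actual retry needs: start a NEW ReadRows stream on each
# attempt, resuming from the last row key that was already yielded.
def retrying_read_rows(start_stream, is_retryable):
    last_key = None
    while True:
        stream = start_stream(start_key=last_key)  # fresh RPC per attempt
        try:
            for row in stream:
                last_key = row.row_key
                yield row
            return  # stream finished cleanly
        except Exception as exc:
            if not is_retryable(exc):
                raise
            # otherwise loop around and re-issue the RPC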

Possible fix:

Change Table.read_rows to call the gapic BigtableClient.read_rows with the retry parameter, and change PartialRowsData.__init__ to take this response iterator instead of taking a retry parameter at all. This would at least allow setting the gRPC streaming call's deadline, although I don't think it would make retrying work (since I think the gRPC streaming client immediately returns an iterator without actually waiting for a response from the server). A sketch follows.
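A minimal sketch of that direction, assuming the google-cloud-bigtable 0.33.0 surface (table_data_client and the gapic read_rows signature are taken from reading that version's source; read_rows_with_deadline is a hypothetical helper, not library code):

# Hypothetical fix sketch. The gapic read_rows wraps the transport call
# with retry/timeout handling, unlike the transport.read_rows call the
# library makes today, so the stream is started with a real deadline.
def read_rows_with_deadline(table, retry, timeout):
    data_client = table._instance._client.table_data_client  # gapic client
    return data_client.read_rows(table.name, retry=retry, timeout=timeout)

PartialRowsData would then wrap the returned response iterator directly.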

I haven't actually tried implementing this to see if it works. For now, we will probably just make a raw gRPC read_rows call so we can set an appropriate timeout; a sketch of that workaround is below.
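For reference, a sketch of that workaround, again assuming 0.33.0 (the proto import path and transport attribute are taken from that version and may differ elsewhere; table and client are set up as in the repro program below). Note this yields raw response chunks rather than assembled rows:

from google.cloud.bigtable_v2.proto import bigtable_pb2

# Call the transport's _UnaryStreamMultiCallable directly. gRPC
# multicallables accept a per-call timeout, so the stream gets a real
# deadline and fails with DEADLINE_EXCEEDED instead of hanging.
request = bigtable_pb2.ReadRowsRequest(table_name=table.name)
stream = client.table_data_client.transport.read_rows(request, timeout=5.0)
for response in stream:
    for chunk in response.chunks:
        print(chunk.row_key, chunk.value)  # raw chunks; caller must merge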

Environment details

OS: Linux, Container-Optimized OS (GKE); container is Debian 9 (distroless)
Python: 3.5.3
API: google-cloud-bigtable 0.33.0

Steps to reproduce

This program loads the Bigtable emulator with 1000 rows, calls read_rows(retry=DEFAULT_RETRY_READ_ROWS.with_deadline(5.0)), then sends SIGSTOP to pause the emulator. This SHOULD cause a DeadlineExceeded exception to be raised after 5 seconds. Instead, it hangs forever.

  1. Start the Bigtable emulator: gcloud beta emulators bigtable start
  2. Find the PID: ps ax | grep cbtemulator
  3. Run the following program with BIGTABLE_EMULATOR_HOST=localhost:8086 python3 bug.py $PID
from google.api_core import exceptions
from google.cloud import bigtable
from google.rpc import code_pb2
import os
import signal
import sys

COLUMN_FAMILY_ID = 'column_family_id'

def main():
    emulator_pid = int(sys.argv[1])
    client = bigtable.Client(project="testing", admin=True)
    instance = client.instance("emulator")

    # create/open a table
    table = instance.table("emulator_table")
    column_family = table.column_family(COLUMN_FAMILY_ID)
    try:
        table.create()
        column_family.create()
    except exceptions.AlreadyExists:
        print('table exists')

    # write a bunch of data
    for i in range(1000):
        k = 'some_key_{:04d}'.format(i)
        print(k)
        row = table.row(k)
        row.set_cell(COLUMN_FAMILY_ID, 'column', 'some_value{:d}'.format(i) * 1000)
        result = table.mutate_rows([row])
        assert len(result) == 1 and result[0].code == code_pb2.OK
        assert table.read_row(k) is not None

    print('starting read')
    rows = table.read_rows(retry=bigtable.table.DEFAULT_RETRY_READ_ROWS.with_deadline(5.0))
    rows_iter = iter(rows)
    r1 = next(rows_iter)
    print('read', r1)
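    # pause the emulator: the server stops responding mid-stream, so the
    # loop below SHOULD fail with DeadlineExceeded after 5 seconds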
    os.kill(emulator_pid, signal.SIGSTOP)
    print('sent sigstop')
    for r in rows_iter:
        print(r)
    print('done')


if __name__ == '__main__':
    main()

Stack trace of hung server (using a slightly older version of the google-cloud-bigtable library):

/usr/local/lib/python2.7/threading.py:wait:340
/usr/local/lib/python2.7/site-packages/grpc/_channel.py:_next:348
/usr/local/lib/python2.7/site-packages/grpc/_channel.py:next:366
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:_read_next:426
/usr/local/lib/python2.7/site-packages/google/api_core/retry.py:retry_target:179
/usr/local/lib/python2.7/site-packages/google/api_core/retry.py:retry_wrapped_func:270
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:_read_next_response:430
/usr/local/lib/python2.7/site-packages/google/cloud/bigtable/row_data.py:__iter__:441

Metadata

Labels

  🚨 This issue needs some love.
  api: bigtable — Issues related to the googleapis/python-bigtable API.
  priority: p2 — Moderately-important priority. Fix may not be included in next release.
  type: bug — Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
