
GCS Blob does not support the io.IOBase interface #3903

Closed
@eap

Description

I'm branching this issue from issue #2871 (currently resolved). The title covers the request; some context follows.

Many Python libraries (pandas, Biopython, Pillow, etc.) support reading from a Python 'file object', which canonically is an object that implements the io.IOBase interface. If we could open a blob as a file object, we could do operations like the example below, which reads just a few records from a (possibly very large) file.

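from Bio import SeqIO
from google.cloud.storage import blob

# Proposed usage: Blob.open() is the method being requested in this issue,
# and the constructor arguments are illustrative.
with blob.Blob('/data/example.fastq', 'mybucket').open('rb') as fileobj:
    # Parse records lazily and stop early, so only the start of the blob is needed.
    for i, rec in enumerate(SeqIO.parse(fileobj, 'fastq')):
        print(rec.seq)
        print('  name=%s\n  annotations=%r' % (rec.name, rec.annotations))
        if i > 5:
            break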

google-resumable-media covers some of the need expressed in this issue, but it does not satisfy users who need blobs to be parsed by libraries expecting standard file objects. The standard advice might be to download files first, but that advice ignores the cost and time of downloading large files when indexed operations are available.
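
To make the request concrete, here is a minimal sketch of the kind of wrapper I have in mind. It is not part of google-cloud-storage: the class name `RangedReader` and the `fetch_range(start, end)` callable (end exclusive) are hypothetical, and the remote blob is simulated with an in-memory bytes object so the example runs without network access.

```python
import io


class RangedReader(io.RawIOBase):
    """Read-only, seekable file object backed by a byte-range fetch callable.

    `fetch_range(start, end)` is a hypothetical callable that returns the
    bytes in [start, end); `size` is the total object size in bytes.
    """

    def __init__(self, fetch_range, size):
        super().__init__()
        self._fetch_range = fetch_range
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer):
        # Fetch only the bytes the caller asked for, never the whole object.
        if self._pos >= self._size:
            return 0  # end of file
        end = min(self._pos + len(buffer), self._size)
        data = self._fetch_range(self._pos, end)
        n = len(data)
        buffer[:n] = data
        self._pos += n
        return n


# Simulated remote object: 10,000 four-line FASTQ-like records.
payload = b"@r1\nACGT\n+\nIIII\n" * 10000
calls = []  # records every (start, end) range that was requested


def fetch_range(start, end):
    calls.append((start, end))
    return payload[start:end]


raw = RangedReader(fetch_range, len(payload))
# Standard io wrappers add buffering and text decoding on top of the raw reader.
text = io.TextIOWrapper(io.BufferedReader(raw, buffer_size=4096), encoding="ascii")
first_lines = [text.readline() for _ in range(4)]
print(first_lines)
print("bytes fetched:", sum(e - s for s, e in calls), "of", len(payload))
```

Because the class implements `readinto`, `seek`, and `tell`, anything that accepts a standard file object should be able to consume it. With a real blob, `fetch_range` could plausibly wrap a ranged download call; that wiring is an assumption and is not tested here.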

Some work has been done toward this end: the following two repositories have solutions. That said, I think most people would be much happier to adopt a solution from this project, since it has a far larger community and more active maintenance and governance.

Labels

api: storage (Issues related to the Cloud Storage API.)
type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)
