Download and save a large file as an artifact #135

Open
multimeric opened this issue Feb 20, 2020 · 5 comments

multimeric commented Feb 20, 2020

One of the steps of my workflow is simply downloading a large data file:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.content

Now, this fails with a MemoryError because req.content reads the whole file into memory. requests does have a streaming API via iter_content(), but I don't think it can be used here because Metaflow doesn't expose a file object to write into. If I try to store the generator object as an artifact, it doesn't work either:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.iter_content(chunk_size=1024)
TypeError: can't pickle generator objects

Finally, I can't use req.raw:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.raw
TypeError: cannot serialize '_io.BufferedReader' object

If you somehow exposed the file object being written to, I could stream the file chunk by chunk and pickle each chunk separately:

req = requests.get(self.input['url'], allow_redirects=True)
for chunk in req.iter_content(chunk_size=1024):
    pickle.dump(chunk, fp)

Or ideally not use pickle at all:

req = requests.get(self.input['url'], allow_redirects=True)
for chunk in req.iter_content(chunk_size=1024):
    fp.write(chunk)

Is exposing the file object, or allowing non-pickle files currently possible? If not, is it on the radar?

multimeric (Author) commented:

Discussion on Gitter from @tuulos:

@TMiguelT internally at Netflix we rely mostly on in-memory processing. While this might not be feasible on a laptop, it works fine with the @resources decorator which allows you to request large cloud instances (e.g. with AWS Batch).

When a dataset doesn't fit in a single instance, we shard the data.

also when it comes to handling large datasets as artifacts, we tend to store pointers to (immutable) datasets as artifacts, not the dataset itself. This is what we do e.g. with Hive tables that are often used as datasets

we are actively working on improving the data layer (related to #4). It'd be great to hear more about your use case / size of data etc., so we can make sure it'll be handled smoothly in upcoming releases
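
To make the pointer-as-artifact pattern concrete, here is a minimal sketch (my illustration, not code from the thread): the file is streamed to disk in chunks and only its path is stored as an artifact, so nothing large is ever pickled. The Parameter name and the temp-file location are assumptions; with remote execution (e.g. AWS Batch) the file would instead need to be staged somewhere shared, such as S3, and that URI stored as the pointer.

import tempfile

import requests
from metaflow import FlowSpec, Parameter, step


class DownloadFlow(FlowSpec):
    # Hypothetical parameter standing in for self.input['url'] above
    url = Parameter('url', help='Location of the large file to download')

    @step
    def start(self):
        resp = requests.get(self.url, stream=True, allow_redirects=True)
        resp.raise_for_status()
        # Stream the body to disk in chunks so it never sits in memory whole
        handle, path = tempfile.mkstemp()
        with open(handle, 'wb') as fp:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                fp.write(chunk)
        # Store the pointer, not the data: only this short string is pickled
        self.large_file_path = path
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    DownloadFlow()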

multimeric commented Feb 21, 2020

I'm happy to write a PR to handle this.

My immediate thought on how to fix this for the common use-case of "storing a file" is to add a new check in MetaflowDataStore._save_object for file-like objects (instances of io.IOBase): if the artifact is one, don't pickle it, and instead save the file directly using shutil.copyfileobj() or similar.
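
Roughly, the idea would look something like this sketch (illustrative only, not the actual MetaflowDataStore code; the function name is hypothetical, and fp stands for the datastore's open file object):

import io
import pickle
import shutil


def save_object(obj, fp):
    if isinstance(obj, io.IOBase):
        # File-like artifact: copy the raw bytes straight through, no pickling
        shutil.copyfileobj(obj, fp)
    else:
        # Everything else keeps the existing pickle path
        pickle.dump(obj, fp, protocol=pickle.HIGHEST_PROTOCOL)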

A more flexible approach, which would allow any data type to be persisted even if it isn't picklable, would be to expose the target file object somehow (a local file for the local datastore, an S3 stream for the AWS datastore, and so on), and let us define types that know how to load and save themselves using that file:

class RawFile(ArtifactType):
    # Artifact types all have a value field that indicates the data they're storing
    value: IOBase

    # Tell metaflow how to save this to a file
    def serialize(self, fp):
        shutil.copyfileobj(self.value, fp)

    # Tell metaflow how to hydrate this from a file
    def deserialize(self, fp):
        self.value = fp

# Use the RawFile type in a step
class Workflow(FlowSpec):
    @step
    def download_file(self):
        req = requests.get(self.input['url'], stream=True, allow_redirects=True)
        self.big_file = RawFile(req.raw)
        self.next(self.use_file)

    @step
    def use_file(self):
        process_file(self.big_file)

serialize() and deserialize() could then be called by the internal workflow machinery.
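
As an illustration, the dispatch inside the datastore could look roughly like this (hypothetical names, assuming the ArtifactType base class sketched above; the stored type would presumably be recorded as metadata so the right class can be found again on load):

import pickle


def save_artifact(artifact, fp):
    # Types that know how to write themselves get handed the file object;
    # everything else falls back to pickle as today
    if isinstance(artifact, ArtifactType):
        artifact.serialize(fp)
    else:
        pickle.dump(artifact, fp)


def load_artifact(artifact_cls, fp):
    if issubclass(artifact_cls, ArtifactType):
        artifact = artifact_cls.__new__(artifact_cls)
        artifact.deserialize(fp)
        return artifact
    return pickle.load(fp)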

holmrenser commented:

Handling (intermediate) files is a useful feature for data scientists working in the life sciences (aka bioinformaticians), as the data files are often too big to keep in memory, and many efficient algorithms are implemented as standalone applications.

Given that most of these bioinformatics tools are available through the bioconda conda channel, using metaflow seems straightforward for anything except handling (intermediate) data files.

savingoyal (Collaborator) commented:

The new datastore implementation now allows for custom serde.

multimeric (Author) commented:

Great! I guess that isn't yet stable though? Are there usage examples that involve file storage anywhere?
