Download and save a large file as an artifact #135

Open
multimeric opened this issue Feb 20, 2020 · 5 comments

multimeric commented Feb 20, 2020

One of the steps of my workflow is simply downloading a large data file:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.content

Now, this fails with a MemoryError because req.content reads the whole file into memory. requests does have a streaming API via iter_content(), but I don't think it can be used here because Metaflow doesn't expose a file object to write into. If I try to store the generator object as an artifact, it doesn't work either:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.iter_content(chunk_size=1024)
TypeError: can't pickle generator objects

Finally, I can't use req.raw:

@step
def download_file(self):
    req = requests.get(self.input['url'], allow_redirects=True)
    self.large_file = req.raw
TypeError: cannot serialize '_io.BufferedReader' object

If you somehow exposed the file object being written to, I could stream the file chunk by chunk and pickle each chunk separately:

req = requests.get(self.input['url'], allow_redirects=True)
for chunk in req.iter_content(chunk_size=1024):
    pickle.dump(chunk, fp)

Or ideally not use pickle at all:

req = requests.get(self.input['url'], allow_redirects=True)
for chunk in req.iter_content(chunk_size=1024):
    fp.write(chunk)

Is exposing the file object, or allowing non-pickle files currently possible? If not, is it on the radar?

multimeric (Author) commented:

Discussion on Gitter from @tuulos:

@TMiguelT internally at Netflix we rely mostly on in-memory processing. While this might not be feasible on a laptop, it works fine with the @resources decorator which allows you to request large cloud instances (e.g. with AWS Batch).

When a dataset doesn't fit in a single instance, we shard the data.

also when it comes to handling large datasets as artifacts, we tend to store pointers to (immutable) datasets as artifacts, not the dataset itself. This is what we do e.g. with Hive tables that are often used as datasets

we are actively working on improving the data layer (related to #4). It'd be great to hear more about your use case / size of data etc., so we can make sure it'll be handled smoothly in upcoming releases
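
To make the pointer-as-artifact pattern concrete, here is a minimal sketch (my illustration, not code from the thread): the file is streamed to disk in chunks and only its path is stored as an artifact, so nothing large is ever pickled. The Parameter name and the temp-file location are assumptions; with remote execution (e.g. AWS Batch) the file would instead need to be staged somewhere shared, such as S3, and that URI stored as the pointer.

import tempfile

import requests
from metaflow import FlowSpec, Parameter, step


class DownloadFlow(FlowSpec):
    # Hypothetical parameter standing in for self.input['url'] above
    url = Parameter('url', help='Location of the large file to download')

    @step
    def start(self):
        resp = requests.get(self.url, stream=True, allow_redirects=True)
        resp.raise_for_status()
        # Stream the body to disk in chunks so it never sits in memory whole
        handle, path = tempfile.mkstemp()
        with open(handle, 'wb') as fp:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                fp.write(chunk)
        # Store the pointer, not the data: only this short string is pickled
        self.large_file_path = path
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    DownloadFlow()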

multimeric commented Feb 21, 2020

I'm happy to write a PR to handle this.

My immediate thought on how to fix this for the common use-case of "storing a file" is to add a new check in MetaflowDataStore._save_object for file-like objects (instances of io.IOBase): if the artifact is one, don't pickle it, and instead save the file directly using shutil.copyfileobj() or similar.
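
Roughly, the idea would look something like this sketch (illustrative only, not the actual MetaflowDataStore code; the function name is hypothetical, and fp stands for the datastore's open file object):

import io
import pickle
import shutil


def save_object(obj, fp):
    if isinstance(obj, io.IOBase):
        # File-like artifact: copy the raw bytes straight through, no pickling
        shutil.copyfileobj(obj, fp)
    else:
        # Everything else keeps the existing pickle path
        pickle.dump(obj, fp, protocol=pickle.HIGHEST_PROTOCOL)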

A more flexible approach, which would allow any data type to be persisted even if it isn't picklable, would be to expose the target file object somehow (a local file for the local datastore, an S3 stream for the AWS datastore, and so on), and let us define types that know how to load and save themselves using that file:

class RawFile(ArtifactType):
    # Artifact types all have a value field that indicates the data they're storing
    value: IOBase

    # Tell metaflow how to save this to a file
    def serialize(self, fp):
        shutil.copyfileobj(self.value, fp)

    # Tell metaflow how to hydrate this from a file
    def deserialize(self, fp):
        self.value = fp

# Use the RawFile type in a step
class Workflow(FlowSpec):
    @step
    def download_file(self):
        req = requests.get(self.input['url'], stream=True, allow_redirects=True)
        self.big_file = RawFile(req.raw)
        self.next(self.use_file)

    @step
    def use_file(self):
        process_file(self.big_file)

serialize() and deserialize() could then be called by the internal workflow machinery.
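
As an illustration, the dispatch inside the datastore could look roughly like this (hypothetical names, assuming the ArtifactType base class sketched above; the stored type would presumably be recorded as metadata so the right class can be found again on load):

import pickle


def save_artifact(artifact, fp):
    # Types that know how to write themselves get handed the file object;
    # everything else falls back to pickle as today
    if isinstance(artifact, ArtifactType):
        artifact.serialize(fp)
    else:
        pickle.dump(artifact, fp)


def load_artifact(artifact_cls, fp):
    if issubclass(artifact_cls, ArtifactType):
        artifact = artifact_cls.__new__(artifact_cls)
        artifact.deserialize(fp)
        return artifact
    return pickle.load(fp)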

holmrenser commented:

Handling (intermediate) files is a useful feature for data scientists working in the life sciences (aka bioinformaticians), as the data files are often too big to keep in memory, and many efficient algorithms are implemented as standalone applications.

Given that most of these bioinformatics tools are available through the bioconda conda channel, using metaflow seems straightforward for anything except handling (intermediate) data files.

savingoyal (Collaborator) commented:

The new datastore implementation now allows for custom serde.

multimeric (Author) commented:

Great! I guess that isn't yet stable though? Are there usage examples that involve file storage anywhere?
