
Support gcs caching for parallel processing #113

Merged
merged 1 commit into master on Apr 24, 2020

Conversation

namanjain (Collaborator)

No description provided.

@namanjain namanjain requested a review from jqmp April 23, 2020 23:49
@namanjain namanjain force-pushed the naman/parallel-processing-gcs branch from 814b06f to 59bb496 on April 23, 2020 23:53

jqmp commented Apr 24, 2020

Would you mind adding some text to your commit message explaining what problem this addresses, and how? I'm assuming there's an issue where the GCS client and bucket are not being sent from one process to another, but it seems like we could solve that by adding some custom pickling behavior. But maybe this caching is also intended to speed up tests?


jqmp commented Apr 24, 2020

Also, if this fixes any of the tests, should we remove some no_parallel annotations?

@namanjain (Collaborator, Author)

> Also, if this fixes any of the tests, should we remove some no_parallel annotations?

This change fixes the GCS persistence tests, and I already removed the no_parallel annotation from them 😄


jqmp commented Apr 24, 2020

> > Also, if this fixes any of the tests, should we remove some no_parallel annotations?

> This change fixes the GCS persistence tests, and I already removed the no_parallel annotation from them 😄

Oops, somehow I missed that -- sorry!

@namanjain (Collaborator, Author)

> Would you mind adding some text to your commit message explaining what problem this addresses, and how? I'm assuming there's an issue where the GCS client and bucket are not being sent from one process to another, but it seems like we could solve that by adding some custom pickling behavior. But maybe this caching is also intended to speed up tests?

Whoops, I probably lost my original commit message with the explanation here 😬. I'll add one.

Yeah, the issue is that the google-cloud-storage library doesn't support pickling client objects. This was raised before, and the maintainers decided not to support client pickling; they just raise an error if someone tries to pickle a client.

I tried the workaround of setting client._datastore_api_internal = None, but that still didn't work for me. The way they raise the error now, they may not check the datastore attribute at all and instead fail as soon as they see a client.

I can also see why the caching is tripping you up a bit. I'll add some explanation as a code comment too, but I'm caching buckets since that's what GcsTool ultimately uses. Since we can't send the bucket over (it contains the client too), we'd otherwise end up creating a bucket object for every function call on GcsTool, which turned out to be pretty inefficient even in our persistence tests and increased the test time by roughly 50% (15 secs).
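The per-process bucket caching described above could be sketched like this. This is an illustrative sketch only, using stand-in classes rather than the real google-cloud-storage types or the actual Bionic code; `get_cached_bucket` is a hypothetical helper name:

```python
import functools

# Stand-ins for google.cloud.storage's Client and Bucket types;
# the real Client holds network state and refuses to be pickled.
class FakeClient:
    def bucket(self, name):
        return FakeBucket(self, name)

class FakeBucket:
    def __init__(self, client, name):
        self.client = client
        self.name = name

@functools.lru_cache(maxsize=None)
def get_cached_bucket(bucket_name):
    # One client + bucket per bucket name, per process; repeated calls
    # within the same process reuse the cached object instead of
    # reconstructing it on every operation.
    return FakeClient().bucket(bucket_name)
```

Because the cache lives in module state, each worker process builds its own client and bucket once on first use, rather than on every call.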

@namanjain namanjain force-pushed the naman/parallel-processing-gcs branch from 59bb496 to 1a2d180 on April 24, 2020 19:30
@namanjain (Collaborator, Author)

> Would you mind adding some text to your commit message explaining what problem this addresses, and how? I'm assuming there's an issue where the GCS client and bucket are not being sent from one process to another, but it seems like we could solve that by adding some custom pickling behavior. But maybe this caching is also intended to speed up tests?


As we discussed this morning, I changed it to use custom pickling logic: the client objects are recreated when deserializing the GcsTool object in the subprocess, using __getstate__ and __setstate__.
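The __getstate__/__setstate__ approach could look roughly like this. This is a hedged sketch, not the actual Bionic GcsTool implementation; `FakeGcsClient` and `GcsToolSketch` are stand-in names, and the real code talks to google.cloud.storage:

```python
import pickle

class FakeGcsClient:
    """Stand-in for google.cloud.storage.Client, which raises if pickled."""
    def __init__(self, project):
        self.project = project

    def __reduce__(self):
        # Mimics the library's refusal to pickle clients.
        raise TypeError("cannot pickle GCS client")

class GcsToolSketch:
    """Illustrative sketch: drop the unpicklable client before pickling
    and rebuild it after unpickling in the worker subprocess."""
    def __init__(self, project, bucket_name):
        self.project = project
        self.bucket_name = bucket_name
        self._client = FakeGcsClient(project)

    def __getstate__(self):
        # Copy the instance dict but leave the client behind.
        state = self.__dict__.copy()
        state["_client"] = None
        return state

    def __setstate__(self, state):
        # Restore plain attributes, then recreate the client fresh;
        # this runs when the object is deserialized in the subprocess.
        self.__dict__.update(state)
        self._client = FakeGcsClient(self.project)
```

Only the plain configuration (project, bucket name) crosses the process boundary; the expensive, unpicklable client is rebuilt lazily on the other side.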

GCS caching is broken for parallel processing because GCS client objects
cannot be pickled, even with cloudpickle. This change stops attempting to
pickle those GCS objects and recreates them in the subprocess.
@namanjain namanjain force-pushed the naman/parallel-processing-gcs branch from 1a2d180 to 8f9aab6 on April 24, 2020 19:53

@jqmp jqmp left a comment


LGTM!

bionic/cache.py (review thread resolved)
@namanjain namanjain merged commit c49b47a into master Apr 24, 2020
@namanjain namanjain deleted the naman/parallel-processing-gcs branch April 26, 2020 22:58