
Support gcs caching for parallel processing #113

Merged
merged 1 commit into master on Apr 24, 2020

Conversation

namanjain (Collaborator)

No description provided.

@namanjain namanjain requested a review from jqmp April 23, 2020 23:49
@namanjain namanjain force-pushed the naman/parallel-processing-gcs branch from 814b06f to 59bb496 on April 23, 2020 23:53

jqmp commented Apr 24, 2020

Would you mind adding some text to your commit message explaining what problem this addresses, and how? I'm assuming there's an issue where the GCS client and bucket are not being sent from one process to another, but it seems like we could solve that by adding some custom pickling behavior. But maybe this caching is also intended to speed up tests?


jqmp commented Apr 24, 2020

Also, if this fixes any of the tests, should we remove some no_parallel annotations?

@namanjain (Collaborator, Author)

> Also, if this fixes any of the tests, should we remove some no_parallel annotations?

This change fixes the GCS persistence tests, and I already removed the no_parallel annotation from them 😄


jqmp commented Apr 24, 2020

> > Also, if this fixes any of the tests, should we remove some no_parallel annotations?

> This change fixes the GCS persistence tests, and I already removed the no_parallel annotation from them 😄

Oops, somehow I missed that -- sorry!

@namanjain (Collaborator, Author)

> Would you mind adding some text to your commit message explaining what problem this addresses, and how? I'm assuming there's an issue where the GCS client and bucket are not being sent from one process to another, but it seems like we could solve that by adding some custom pickling behavior. But maybe this caching is also intended to speed up tests?

Whoops, I probably lost my original commit message with the explanation here 😬. I'll add one.

Yeah, the issue is that the google-cloud-storage library doesn't support pickling client objects. This was raised before, and the maintainers decided not to support client pickling; they just raise an error if someone tries to pickle a client.

I tried the workaround of setting client._datastore_api_internal = None, but that still didn't work for me. The way they raise the error now, they may not check the datastore attribute at all and instead fail as soon as they see a client.

I can also see why the caching is tripping you up a bit. I'll add some explanation as a code comment too, but I'm caching buckets since that's what GcsTool ultimately uses. Since we can't send the bucket over (it contains the client too), we'd otherwise end up creating a bucket object for every function call on GcsTool, which turned out to be pretty inefficient even in our persistence tests and increased the test time by roughly 50% (15 secs).
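The per-process bucket caching described above could be sketched like this. This is an illustrative sketch only, using stand-in classes rather than the real google-cloud-storage types or the actual Bionic code; `get_cached_bucket` is a hypothetical helper name:

```python
import functools

# Stand-ins for google.cloud.storage's Client and Bucket types;
# the real Client holds network state and refuses to be pickled.
class FakeClient:
    def bucket(self, name):
        return FakeBucket(self, name)

class FakeBucket:
    def __init__(self, client, name):
        self.client = client
        self.name = name

@functools.lru_cache(maxsize=None)
def get_cached_bucket(bucket_name):
    # One client + bucket per bucket name, per process; repeated calls
    # within the same process reuse the cached object instead of
    # reconstructing it on every operation.
    return FakeClient().bucket(bucket_name)
```

Because the cache lives in module state, each worker process builds its own client and bucket once on first use, rather than on every call.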

@namanjain namanjain force-pushed the naman/parallel-processing-gcs branch from 59bb496 to 1a2d180 on April 24, 2020 19:30
@namanjain (Collaborator, Author)

> Would you mind adding some text to your commit message explaining what problem this addresses, and how? I'm assuming there's an issue where the GCS client and bucket are not being sent from one process to another, but it seems like we could solve that by adding some custom pickling behavior. But maybe this caching is also intended to speed up tests?


As we discussed this morning, I changed it to use custom pickling logic: the client objects are recreated when deserializing the GcsTool object in the subprocess, using __getstate__ and __setstate__.
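The __getstate__/__setstate__ approach could look roughly like this. This is a hedged sketch, not the actual Bionic GcsTool implementation; `FakeGcsClient` and `GcsToolSketch` are stand-in names, and the real code talks to google.cloud.storage:

```python
import pickle

class FakeGcsClient:
    """Stand-in for google.cloud.storage.Client, which raises if pickled."""
    def __init__(self, project):
        self.project = project

    def __reduce__(self):
        # Mimics the library's refusal to pickle clients.
        raise TypeError("cannot pickle GCS client")

class GcsToolSketch:
    """Illustrative sketch: drop the unpicklable client before pickling
    and rebuild it after unpickling in the worker subprocess."""
    def __init__(self, project, bucket_name):
        self.project = project
        self.bucket_name = bucket_name
        self._client = FakeGcsClient(project)

    def __getstate__(self):
        # Copy the instance dict but leave the client behind.
        state = self.__dict__.copy()
        state["_client"] = None
        return state

    def __setstate__(self, state):
        # Restore plain attributes, then recreate the client fresh;
        # this runs when the object is deserialized in the subprocess.
        self.__dict__.update(state)
        self._client = FakeGcsClient(self.project)
```

Only the plain configuration (project, bucket name) crosses the process boundary; the expensive, unpicklable client is rebuilt lazily on the other side.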

GCS caching is broken for parallel processing because GCS client objects
cannot be pickled, even with cloudpickle. This change stops attempting to
pickle those GCS objects and recreates them in the subprocess.
@namanjain namanjain force-pushed the naman/parallel-processing-gcs branch from 1a2d180 to 8f9aab6 on April 24, 2020 19:53

@jqmp jqmp left a comment


LGTM!

bionic/cache.py (review thread resolved)
@namanjain namanjain merged commit c49b47a into master Apr 24, 2020
@namanjain namanjain deleted the naman/parallel-processing-gcs branch April 26, 2020 22:58