
Parallel recursive breadth-first w.workspace.list(..., recursive=True, threads=os.cpu_count()) to iterate over 10K notebooks faster #284

Draft · wants to merge 5 commits into base: main
Conversation

@nfx (Contributor) commented Aug 11, 2023

For when we need to list 10K notebooks a bit faster.
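To illustrate the idea behind the PR, here is a minimal sketch of parallel breadth-first listing: each level of the tree is scanned concurrently by a thread pool before descending to the next level. This uses the local filesystem (`os.scandir`) as a stand-in for the workspace API; the SDK's actual implementation in `databricks/sdk/mixins/workspace.py` differs, and the function name here is hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_walk(root, threads=None):
    """Breadth-first parallel listing sketch (hypothetical helper, not the
    SDK's code). Directories discovered at each level are scanned
    concurrently; files are collected as they are found."""
    results = []
    with ThreadPoolExecutor(max_workers=threads or os.cpu_count()) as pool:
        frontier = [root]
        while frontier:
            next_frontier = []
            # scan every directory on the current level in parallel
            for entries in pool.map(lambda d: list(os.scandir(d)), frontier):
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        next_frontier.append(entry.path)
                    else:
                        results.append(entry.path)
            frontier = next_frontier
    return results
```

The breadth-first shape matters: shallow, wide workspaces keep all threads busy, whereas a depth-first walk would serialize on one branch at a time.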

@pietern (Contributor) left a comment:

A couple of notes; I think this probably hangs as-is.

Four review threads on databricks/sdk/mixins/workspace.py (three marked outdated; all resolved).
@nfx nfx changed the title (PoC) Parallel recursive listing for w.workspace.list() Parallel recursive listing for w.workspace.list() Sep 21, 2023
@nfx nfx changed the title Parallel recursive listing for w.workspace.list() Parallel recursive breadth-first listing for w.workspace.list() to iterate over 10K notebooks faster Sep 21, 2023
@nfx nfx changed the title Parallel recursive breadth-first listing for w.workspace.list() to iterate over 10K notebooks faster Parallel recursive breadth-first w.workspace.list(..., recursive=True, threads=os.cpu_count()) to iterate over 10K notebooks faster Sep 21, 2023
@nfx nfx added the ergonomics UX of SDK label Sep 25, 2023
@mgyucht (Contributor) left a comment:

I think this is a good addition to the SDK. Let's drop the reporter for now, or if you want, add it as a parameter to this API. Aside from this, I think there is one bug in the _list method.

Can we add an integration test where we actually create some recursive directory structure, try to list everything, and check that it worked correctly?

def _list(self, path):
    listing = self._listing(path, notebooks_modified_after=self._notebooks_modified_after)
    for object_info in sorted(listing, key=lambda _: _.path):
        if object_info.object_type != ObjectType.DIRECTORY:
A contributor commented:

I think this should be ==? The code in this block refers to what happens in a directory.
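The suggested fix, sketched below with stand-in types: assuming the body of that branch handles directories (queueing them for further recursive listing), the comparison should be `==`, not `!=`. The `ObjectType` enum and `ObjectInfo` tuple here are simplified stand-ins for the SDK's types, and `split_listing` is a hypothetical name, not the PR's method.

```python
from collections import namedtuple
from enum import Enum


class ObjectType(Enum):
    # simplified stand-in for databricks.sdk.service.workspace.ObjectType
    DIRECTORY = "DIRECTORY"
    NOTEBOOK = "NOTEBOOK"


ObjectInfo = namedtuple("ObjectInfo", ["path", "object_type"])


def split_listing(listing):
    """Partition a listing into directories (to be queued for recursive
    listing) and other objects (to be yielded). Sketch of the reviewer's
    suggested fix, not the PR's actual code."""
    directories, objects = [], []
    for object_info in sorted(listing, key=lambda o: o.path):
        if object_info.object_type == ObjectType.DIRECTORY:  # was `!=` in the PR
            directories.append(object_info.path)
        else:
            objects.append(object_info)
    return directories, objects
```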

with self._cond:
    return self._in_progress > 0

def _reporter(self):
A contributor commented:

The reporter is a bit weird. Given that this is a generator and users may not list all of the items "right away", this would just continue to print logs until the generator is consumed. Alternatively, this could be provided as a parameter to the parallel list operation, allowing users to provide a callback that is periodically called with statistics like this if they desire, then they can control when/whether they want to log or do something else.
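The callback design suggested above could be sketched like this: instead of a background reporter thread printing logs, the caller passes an optional callback that is invoked with a running count as items are consumed, throttled to a reporting interval. All names and parameters here (`list_with_progress`, `on_progress`, `report_every`) are hypothetical, not the SDK's API.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def list_with_progress(items: Iterable,
                       *,
                       on_progress: Optional[Callable[[int], None]] = None,
                       report_every: float = 1.0) -> Iterator:
    """Yield items, periodically calling `on_progress(count)` with the
    number of items yielded so far. Because reporting happens inside the
    generator, it stops when the caller stops consuming -- avoiding the
    "logs until the generator is consumed" problem noted in the review."""
    count = 0
    last_report = time.monotonic()
    for item in items:
        count += 1
        now = time.monotonic()
        if on_progress is not None and now - last_report >= report_every:
            on_progress(count)
            last_report = now
        yield item
```

This keeps the decision of when and how to log (or do something else with the statistics) in the caller's hands, as the reviewer suggests.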

3 participants