Distributed labeling #94

Merged 47 commits on Feb 10, 2019
Conversation

jni
Contributor

@jni commented Feb 2, 2019

None of the below guidelines are met, but that's why there's a WIP in the title. =D

Currently `dask_image.ndmeasure.label` creates one giant in-memory array to do the labeling. This is suboptimal. This PR presents initial work to do it in a distributed fashion, using a graph to relabel independently-labeled blocks.
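
For intuition only (this is not the code in this PR), here is a rough sketch of the block-wise idea on a tiny 1-D example with two blocks: label each block independently, offset the labels so they are globally unique, record which labels touch across the shared face, and merge them with connected components. All names and values here are illustrative.

import numpy as np
from scipy import ndimage as ndi
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

# Two blocks of a 1-D binary image, labeled independently.
block_a = np.array([0, 1, 1, 0, 1])
block_b = np.array([1, 1, 0, 1, 0])
lab_a, n_a = ndi.label(block_a)
lab_b, n_b = ndi.label(block_b)
lab_b[lab_b > 0] += n_a  # offset so labels are globally unique

# Labels that touch across the shared face must be merged.
face_pairs = np.array([[lab_a[-1], lab_b[0]]])
face_pairs = face_pairs[np.all(face_pairs > 0, axis=1)]  # ignore background

# Build a tiny adjacency graph over all labels (node 0 is background) and use
# its connected components as the final relabeling.
n = n_a + n_b + 1
graph = coo_matrix(
    (np.ones(len(face_pairs)), (face_pairs[:, 0], face_pairs[:, 1])),
    shape=(n, n),
)
_, relabel = connected_components(graph, directed=False)
print(relabel[lab_a], relabel[lab_b])  # the region spanning the face shares one label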

PR template:

  1. The pull request should include tests.
  2. If the pull request adds functionality, the docs should be updated. Put
    your new functionality into a function with a docstring, and add the
    feature to the list in README.rst.
  3. The pull request should work for Python 2.7, 3.5, and 3.6. Check
    https://travis-ci.org/dask/dask-image/pull_requests
    and make sure that the tests pass for all supported Python versions.

Fixes #29

@jni
Contributor Author

jni commented Feb 4, 2019

This is now working on this tiny example:

[[0 0 1 3 0 0]
 [0 2 0 0 0 0]
 [0 2 0 4 4 0]
 [5 5 0 0 7 0]
 [0 0 6 0 0 0]
 [0 0 0 0 8 8]]

If you read this into a dask array with chunks (3, 3), you get the following awesome graph:

[task graph image (mydask) produced by labeled.visualize()]

import numpy as np
from scipy import ndimage as ndi
import dask.array as da
from dask_image.ndmeasure import label

# 4-connected structuring element for 2-D labeling
selem = ndi.generate_binary_structure(2, 1)
# the tiny example array shown above, saved to labels.npy
labeled_array = np.load('labels.npy')
# four 3x3 chunks, so each block is labeled independently
dalabels = da.from_array(labeled_array, chunks=(3, 3))
labeled = label(dalabels, selem)
print(labeled.compute())
labeled.visualize()

jakirkham and others added 20 commits February 4, 2019 23:46
By specifying the chunking of the result from `map_blocks`, we are able
to work with an older version of Dask.
As we can concatenate along a different axis, which serves our purpose
just as well, go ahead and change the code accordingly to avoid a
transpose.
As older versions of NumPy that we support and test against don't
include `isin`, switch to using `in1d`, which has been around longer.
Since the array in question is already 1-D, there is no need for us to
reshape the result after calling `in1d`. So this is sufficient for our
use case.
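
For reference, a small check (with arbitrary example values) of the `in1d`/`isin` equivalence on 1-D input:

import numpy as np

a = np.array([1, 5, 2, 7, 5])
test = np.array([2, 5])
# np.in1d predates np.isin (added in NumPy 1.13) and returns a flattened
# boolean array; for 1-D input that is already the shape we want, so no
# reshape is needed afterwards.
assert np.array_equal(np.in1d(a, test),
                      np.array([False, True, True, False, True]))
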
When there is a singleton chunk, there are no shared faces between
chunks to construct an adjacency graph from. In this case ensure there
is at least an empty array to start with. This doesn't alter the cases
where there are multiple chunks that do share faces. Though it does
avoid branching when there is a single chunk with no shared faces.
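
A hypothetical illustration of that fallback (shapes and names assumed, not taken from the PR):

import numpy as np

# With a single chunk there are no shared faces, so start from an empty
# array of face label pairs; concatenating real pairs onto it later needs
# no special casing.
face_pairs = np.empty((0, 2), dtype=np.intp)
more_pairs = np.array([[2, 3], [4, 7]], dtype=np.intp)
all_pairs = np.concatenate([face_pairs, more_pairs])  # works whether or not faces exist
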
Previously we were incorrectly determining the connected components
dtype. This fixes it by inspecting the result on a trivial case to see
what the dtype is, then using that to set the delayed type when
converting to a Dask Array.
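
A minimal sketch of that probing idea (the constant name is illustrative):

import numpy as np
from scipy.sparse.csgraph import connected_components

# Run connected_components on a trivial one-node graph and read off the dtype
# of the labels it returns, rather than hard-coding it; that dtype can then be
# passed when wrapping the delayed result into a Dask array.
CONN_COMP_DTYPE = connected_components(np.zeros((1, 1)))[1].dtype
print(CONN_COMP_DTYPE)
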
As we are already calling a delayed wrapped copy of `label`, there is no
need to use `partial` to bind arguments first. So go ahead and drop
`partial` and pass the arguments directly.
As we are already calling a delayed wrapped copy of
`connected_components`, there is no need to use `partial` to bind
arguments first. So go ahead and drop `partial` and pass the arguments
directly.
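
To illustrate the pattern (a sketch, not the PR's code): a delayed-wrapped function takes its arguments directly at call time, so pre-binding with functools.partial adds nothing.

import numpy as np
import dask
from scipy import ndimage as ndi

block = np.array([[0, 1],
                  [1, 0]])
# The delayed wrapper accepts arguments directly, just like the original
# function, so no functools.partial binding is required.
labeled, num = dask.delayed(ndi.label, nout=2)(block, None)
print(dask.compute(labeled, num))
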
As we are now making use of the `blocks` accessor of Dask Arrays and
this requires a newer version of Dask, bump the minimum version of Dask
to 0.18.2.
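
For reference, the `blocks` accessor indexes individual chunks of a Dask Array as smaller Dask Arrays (illustrative example, not code from this PR):

import numpy as np
import dask.array as da

x = da.from_array(np.arange(36).reshape(6, 6), chunks=(3, 3))
# x.blocks[i, j] is the (i, j)-th chunk, itself a small Dask array.
print(x.blocks[0, 1].compute())  # the top-right 3x3 block
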
Ensure that `total` is a `LABEL_DTYPE` scalar. This is needed by
`where`, which checks the `dtype` of the arguments it is provided.
Make sure that `0` in `da.where` is also of `LABEL_DTYPE`. This way we
can ensure that the array generated by `where` has the expected type and
thus avoid using `astype` to copy and cast the array.
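
A sketch of the dtype-preserving pattern these two commits describe (`LABEL_DTYPE` here is a stand-in, not necessarily the PR's value):

import numpy as np
import dask.array as da

LABEL_DTYPE = np.dtype(np.intp)  # assumed label dtype for this sketch

mask = da.from_array(np.array([True, False, True]), chunks=3)
total = LABEL_DTYPE.type(5)      # a LABEL_DTYPE scalar rather than a plain Python int

# Both branches of where() already have LABEL_DTYPE, so the result does too
# and no trailing astype() copy is needed.
out = da.where(mask, total, LABEL_DTYPE.type(0))
assert out.dtype == LABEL_DTYPE
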
Go ahead and make `n` an array in `block_ndi_label_delayed`. This
ensures it matches what we expect later. Plus it avoids some boilerplate
in `label` that makes things less clear.
Make sure to exercise the test case where a labeled region crosses the
chunk boundary in two locations instead of just one. This is done to
ensure that the multiple chunk implementation is able to resolve this
down to a single labeled region.
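
As an illustration of that scenario (not the actual test data): a ring-shaped region that a chunk boundary cuts in two places must still come out as a single label, as plain SciPy confirms on the full array.

import numpy as np
from scipy import ndimage as ndi

# With chunks of width 2, the boundary between columns 1 and 2 crosses this
# ring in both the top row and the bottom row.
a = np.array([[1, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 1]])
_, num_features = ndi.label(a)
assert num_features == 1
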
Changes the `_utils` module into the `_utils` package. This should make
it easier to add more specific collections of utility functions and
group them accordingly.
Moves some utility functions from `dask_image.ndmeasure`'s `__init__`
used to perform `label` over multiple chunks to `_utils._label`.
@jni changed the title from "WIP: proper distributed labeling" to "Distributed labeling" on Feb 7, 2019
By viewing the array as a structured array that merely groups together
the other dimensions within the structure, `_unique_axis_0` is able to
call NumPy's `unique` function on the array keeping the type information
unchanged. Thus if `unique` is able to handle the specific type more
efficiently, we benefit from that speed up still.

Note: More work would be needed if we wanted to handle F-contiguous
arrays, but that is not needed for our use case.
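
A minimal sketch of the row-deduplication trick described here, assuming a 2-D C-contiguous input (the helper name is illustrative):

import numpy as np

def unique_rows(a):
    # View each row as one record of a structured dtype whose fields keep the
    # original element type, so np.unique deduplicates whole rows while any
    # type-specific fast path in unique/sort still applies.
    a = np.ascontiguousarray(a)  # F-contiguous input would need extra care
    fields = [("f%d" % i, a.dtype) for i in range(a.shape[1])]
    structured = a.view(np.dtype(fields)).ravel()
    uniq = np.unique(structured)
    return np.stack([uniq[name] for name in uniq.dtype.names], axis=1)

print(unique_rows(np.array([[1, 2], [1, 2], [3, 4]])))  # [[1 2] [3 4]]
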
Member

@jakirkham left a comment

Thanks so much for working on this with me, @jni! 😄

Have done some tidying as discussed. This looks ready to merge to me. Do you want to give this another look before we merge?

jakirkham and others added 2 commits February 9, 2019 18:20
Allow an arbitrary `axis` to be specified in `_unique_axis`, but have it
default to `0` if not specified. This keeps the previous behavior while
making the function more generally useful.
@jni
Contributor Author

jni commented Feb 10, 2019

@jakirkham I fixed a minor indentation issue in a doctest, but otherwise feel free to pull the trigger when the builds pass! 🎉

@jakirkham
Member

Thanks @jni! Merging 😄

Successfully merging this pull request may close these issues.

Support multiple chunks in label