masked_select: Filtering tensors with a boolean mask tensor #400

Closed
brentp opened this issue Nov 4, 2019 · 8 comments · Fixed by #434

Comments

@brentp
Contributor

brentp commented Nov 4, 2019

Given a 2D array/tensor, how would I write the arraymancer equivalent of this numpy expression:

a[:,a.mean(axis=0) > 0.5] = -1

I think it's probably some combination of the map/apply/fold_inline, but it's not obvious how to do that.
Once I understand, I can open a PR with some common examples like this if it's helpful.
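
To make the target behaviour concrete, here is a minimal sketch of what the numpy expression above does, spelled out step by step. It uses plain nested lists in place of a tensor, and the function name `fill_columns_where_mean_exceeds` is a made-up helper for illustration, not a numpy or arraymancer call.

```python
# Plain-Python sketch of what `a[:, a.mean(axis=0) > 0.5] = -1` does.
# Nested lists stand in for a 2D tensor; no NumPy or arraymancer calls are used.

def fill_columns_where_mean_exceeds(a, threshold, value):
    """Overwrite, in place, every column of `a` whose mean is > threshold."""
    n_rows = len(a)
    n_cols = len(a[0])
    # Step 1: reduce along axis 0 to get one mean per column.
    col_means = [sum(row[j] for row in a) / n_rows for j in range(n_cols)]
    # Step 2: compare against the threshold to get a 1D boolean mask.
    mask = [m > threshold for m in col_means]
    # Step 3: assign `value` to every row of each selected column.
    for row in a:
        for j in range(n_cols):
            if mask[j]:
                row[j] = value
    return mask


a = [[0.9, 0.1, 0.6],
     [0.8, 0.2, 0.3]]
mask = fill_columns_where_mean_exceeds(a, 0.5, -1)
```

Only the first column has a mean above 0.5 here, so only that column is overwritten. The mask has one entry per column, not one per element.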

@brentp
Contributor Author

brentp commented Nov 12, 2019

I have come up with this:

import arraymancer

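# 8x3 tensor of random float32 values, with 1 as the upper bound
var T = randomTensor[float32](8, 3, 1'f32)
# Column means (shape 1x3), broadcast back to the shape of T
let m = T.mean(axis=0).broadcast(T.shape)

# In place: keep x where it exceeds its column mean, otherwise write -1
apply2_inline(T, m, if x > y: x else: -1'f32)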
echo T

This seems reasonable, but I'm not sure whether there's a simpler way.

@mratsim
Owner

mratsim commented Nov 13, 2019

So what's happening is that Numpy accepts a Tensor of booleans as a mask for selecting things.

Your solution works for your case because you don't need to discard/filter the values. However, implementing filtering with a Tensor of bool would also be very useful for dataframe and analysis-related work.

For filtering, I see the following difficulties:

  • Implementing the algorithm: it may be easier to use PyTorch's masked_select as a reference, but although it is not as complex as Numpy's, the codebase is very complex at the moment because its 3 backends are mixed together: C Torch, ATen and C10.
  • Updating the indexing macro to accept a Tensor of bool and dispatch to the masked_select proc(s). This may require multiple dispatch/allocations if 2 dimensions are sliced at the same time.
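
As a rough sketch of the filtering operation itself: PyTorch's masked_select returns a 1D tensor holding the elements where the mask is true. The version below works on nested lists and walks them in row-major order. The count-then-copy structure (so the output is allocated once) is an assumption about how an implementation might be organised, not a description of PyTorch's or arraymancer's code, and `masked_select_2d` is a hypothetical helper name.

```python
# Sketch of masked_select on plain lists. Not arraymancer or PyTorch code.

def masked_select(values, mask):
    """Return the elements of the flat list `values` whose mask entry is True."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    # Pass 1: count the selected elements so the output can be sized up front.
    n_selected = sum(1 for keep in mask if keep)
    out = [None] * n_selected
    # Pass 2: copy the selected elements in order.
    k = 0
    for v, keep in zip(values, mask):
        if keep:
            out[k] = v
            k += 1
    return out


def masked_select_2d(t, mask):
    """Flatten a 2D nested list and its mask in row-major order, then select."""
    if len(t) != len(mask) or any(len(r) != len(m) for r, m in zip(t, mask)):
        raise ValueError("tensor and mask must have the same shape")
    flat_values = [x for row in t for x in row]
    flat_mask = [b for row in mask for b in row]
    return masked_select(flat_values, flat_mask)


selected = masked_select_2d([[1, 2], [3, 4]],
                            [[True, False], [False, True]])
```

The result is always 1D because the number of selected elements per row can differ, which is one reason filtering needs a fresh allocation while in-place filling does not.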

mratsim changed the title from "docs on common operations" to "masked_select: Filtering tensors with a boolean mask tensor" on Nov 13, 2019
@brentp
Contributor Author

brentp commented Nov 13, 2019

Thanks for considering it. For my additional 0.02...
I guess that indexing with a tensor of booleans or indexes, as in numpy, is a fundamental feature. For other stuff, I think I'd be fine just writing the 2-3 lines myself if I had a better handle on the map* and apply* functions.

To that end, it'd be nice if the broadcasting was done auto-magically (as in numpy) as well.

@mratsim
Owner

mratsim commented Nov 13, 2019

Broadcasting is done auto-magically by .+ and the rest of the dot-operator family. I intentionally keep 2 different sets of operators because in Numpy it's a pain when you did not mean to broadcast and end up with a silent error to debug.

proc `.+`*[T: SomeNumber|Complex[float32]|Complex[float64]](a, b: Tensor[T]): Tensor[T] {.noInit,inline.} =
  ## Broadcasted addition for tensors of incompatible but broadcastable shape.
  let (tmp_a, tmp_b) = broadcast2(a, b)
  result = tmp_a + tmp_b
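
The design choice of two separate operators can be illustrated with a small sketch on nested lists. The names `strict_add` and `broadcast_add` are hypothetical stand-ins for the plain and dotted operators. The sketch only covers the 2D case and assumes the usual rule that a dimension of size 1 is stretched to match the other operand.

```python
# Sketch: a strict addition that rejects mismatched shapes, next to an
# explicit broadcasting addition. Plain lists only, not arraymancer code.

def shape2d(t):
    return (len(t), len(t[0]))


def strict_add(a, b):
    """Element-wise addition that refuses operands of different shape."""
    if shape2d(a) != shape2d(b):
        raise ValueError(f"shape mismatch: {shape2d(a)} vs {shape2d(b)}")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def broadcast_add(a, b):
    """Element-wise addition where size-1 dimensions are stretched."""
    (ar, ac), (br, bc) = shape2d(a), shape2d(b)
    for da, db in ((ar, br), (ac, bc)):
        if da != db and da != 1 and db != 1:
            raise ValueError("shapes are not broadcastable")
    rows, cols = max(ar, br), max(ac, bc)

    def at(t, n_rows, n_cols, i, j):
        # A dimension of size 1 is re-read at index 0 for every output position.
        return t[i if n_rows > 1 else 0][j if n_cols > 1 else 0]

    return [[at(a, ar, ac, i, j) + at(b, br, bc, i, j) for j in range(cols)]
            for i in range(rows)]


m = [[1, 2], [3, 4]]
row = [[10, 20]]
summed = broadcast_add(m, row)
```

Calling `strict_add(m, row)` raises instead of stretching `row`, which is the behaviour the separate operators are meant to guarantee: a shape mistake fails loudly unless broadcasting was requested explicitly.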

@brentp
Contributor Author

brentp commented Nov 21, 2019

I am also having trouble just using the map/apply stuff. For example, I can't understand how to change this (which either segfaults or runs endlessly):

import arraymancer

var T = randomTensor[float32](250, 17384, 1'f32)
let m = T.map_inline():
  if x < 0: 1'f32 else: 0'f32

echo m.shape
echo m
echo m.mean(axis=0)

@Vindaar
Collaborator

Vindaar commented Mar 31, 2020

@brentp It's been a while since your last post, so maybe you've noticed this after posting at some point (or you encountered a real bug):

I believe the reason you're seeing the code run endlessly is simply that the tensor you create is huge and arraymancer's printing is pretty slow (and doesn't just cut off after a fixed N elements).

@mratsim
Owner

mratsim commented Apr 1, 2020

Note that a masked_select implementation is planned soon.

@mratsim
Owner

mratsim commented Apr 5, 2020

Tentative implementation and names at #429

If you have suggestions on proc names and descriptions to limit confusion, they are welcome, especially for:

proc masked_axis_fill(t: var Tensor[T], mask: Tensor[bool], axis: int, value: T) 
  ## Take a 1D-mask 
  ## iterate on t along the axis and fill the slice of t with `value`
  ## if the mask[current_iteration_index] is true

proc masked_fill_along_axis(t: var Tensor[T], mask: Tensor[bool], axis: int, value: T) 
  ## Take a N-D mask. Dimension along the axis must be 1
  ## iterate on t along the axis
  ##   On the slice of t, apply masked_fill

I'm taking them (let's have the name discussion in the PR)
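
To show how the two proposed procs differ, here is a sketch of their described semantics on 2D nested lists. It is derived only from the doc comments above, not from the arraymancer implementation, and the shape checks are an assumption about what a reasonable version would enforce.

```python
# Sketch of the two proposed fill variants, restricted to 2D nested lists.

def masked_axis_fill(t, mask, axis, value):
    """1D mask with one entry per index along `axis`; fills whole slices."""
    n_rows, n_cols = len(t), len(t[0])
    expected = n_rows if axis == 0 else n_cols
    if len(mask) != expected:
        raise ValueError("mask length must match the size of the axis")
    for i in range(n_rows):
        for j in range(n_cols):
            k = i if axis == 0 else j
            if mask[k]:
                t[i][j] = value


def masked_fill_along_axis(t, mask, axis, value):
    """2D mask of size 1 along `axis`; applied to every slice along `axis`."""
    n_rows, n_cols = len(t), len(t[0])
    expected = (1, n_cols) if axis == 0 else (n_rows, 1)
    if (len(mask), len(mask[0])) != expected:
        raise ValueError("mask must have size 1 along the axis")
    for i in range(n_rows):
        for j in range(n_cols):
            keep = mask[0][j] if axis == 0 else mask[i][0]
            if keep:
                t[i][j] = value


t1 = [[1, 2, 3], [4, 5, 6]]
masked_axis_fill(t1, [True, False, True], axis=1, value=-1)

t2 = [[1, 2, 3], [4, 5, 6]]
masked_fill_along_axis(t2, [[True, False, True]], axis=0, value=-1)
```

In 2D the two calls above produce the same result even though they take different axes, which hints at why the names are easy to confuse. With more dimensions they diverge, since the N-D mask can vary over the remaining dimensions while the 1D mask cannot.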

mratsim added a commit that referenced this issue Apr 19, 2020
* index_select should use SomeInteger not SomeNumber

* Overload index_select for arrays and sequences

* Masked Selector overload for openarrays

* Add masked overload for regular arrays and sequences

* Initial support of Numpy fancy indexing: index select

* Fix broadcast operators from #429 using deprecated syntax

* Stash dispatcher, working with types in macros is a minefield nim-lang/Nim#14021

* Masked indexing: closes #400, workaround nim-lang/Nim#14021

* Test for full masked fancy indexing

* Add index_fill

* Tensor mutation via fancy indexing

* Add tests for index mutation via fancy indexing

* Fancy indexing: supports broadcasting a value to a masked assignation

* Detect wrong mask or tensor axis length

* masked axis assign value test

* Add masked assign of broadcastable tensor

* Tag for changelog [skip ci]