-
Notifications
You must be signed in to change notification settings - Fork 3.9k
ARROW-4226: [C++] Add sparse CSF tensor support #5716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also: |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a pre-review.
0afbe8e to
8f337af
Compare
|
This is missing the |
7a31b0e to
5c89552
Compare
44720aa to
d49074c
Compare
|
@rok Thank you for working to implement CSF format. We should research the existing implementations of CSF format so that our implementation can be used to exchange the data without copying buffers. Did you find the existing libraries that have CSF format? If you know some such libraries, could you tell me them for the investigation of the data layout? |
e33143d to
f93be66
Compare
Agreed! :)
To my knowledge CSF is implemented in taco and in pydata/sparse as My main reference for implementation is |
|
Another point: CSF is a generalization of CSR and CSC. Because of this transformations |
|
I confirmed that taco supports CSF format. Now I'm reading this paper and the implementation of taco to study taco's abstraction sparse tensor format. I guess we need more researches on the real-world implementations of CSF format right now. |
|
There is also a CSF implementation by @ShadenSmith in SPLATT. |
|
Hello! It's great to see your interest in the CSF format. I'm happy to answer any questions that you have. |
|
Hi @ShadenSmith. We're currently discussing layout of arrow's CSF implementation. As @mrkn mentioned we want to design it in a way that lets us avoid copying data when interfacing with existing libraries and rather make it possible to just pass pointers. This PR currently proposes a structure like: Here If I understand correctly SPLATT uses a different structure where @ShadenSmith could you say:
|
|
Hi @rok, the other CSF implementations that I have seen either include the code from SPLATT or also just use the 2D layout for the per-mode structures. Here is an example. I'm not sure if taco uses the 1D or 2D implementation and did not find the details from a quick search of the repo. I think five or six modes (dimensions) is the largest I have encountered in an application, with three being the most common. I have heard of some ongoing research with O(100-1000) mode tensors, but haven't seen results yet and am not sure something so high dimensional will pan out. |
Got it, thanks a lot! This is good to know. :)
Interesting, do you think higher dimensional tensors will be used if available? |
Let me first say that I can really only speak for the tensor factorization community, and even there I'm more of a scalability researcher than an application researcher. Today's software can deal with very high dimensional tensors (SPLATT for example just needs a compile-time flag to increase the max dimensionality). I think the issue is more on the side of the data and quality of a tensor factorization with many dimensions; you may just be learning noise unless the data truly has a nice tensor form with many dimensions. It's quite possible that future application areas do use tensors of that form...I just haven't seen it. Off the top of my head, some of the electronic health record tensors have ~5-10 dimensions, but still not 100-1000 AFAIK. Also, I would usually not recommend using CSF for tensors with more than 5 or 6 dimensions, unless you have a lot of data that has a non-uniform sparsity pattern. A simple coordinate format would be better unless the non-zeros arrange nicely into fibers. As you increase the number of dimensions in a tensor, the idea of a fiber gets more and more specific, meaning that the savings you get from storing the sparse tensor in CSF disappear because most fibers will only have a few non-zeros in them. |
e7c67a0 to
aaffe58
Compare
|
Thanks for the review @pitrou! I've pushed changes you suggested. |
|
It seems flight tests are failing in appveyor. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of details yet. We're getting there :-)
cpp/src/arrow/sparse_tensor_test.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't seem to mutate those vector inputs, so should make them a const reference really :-)
|
Thanks for the review @pitrou. I pushed the suggested changes :). |
|
Thank you @rok :-) |
|
Thanks @pitrou @mrkn and @ShadenSmith! |
This is to resolve [ARROW-4226](https://issues.apache.org/jira/browse/ARROW-4226). Closes #5716 from rok/ARROW-4226 and squashes the following commits: 9ca93ab <Rok> Implementing review feedback. 1b922f6 <Rok> Implementing review feedback. 11b81bb <Rok> Factoring out index incrementing for dense to COO and CSF indices. 6f4f4a8 <Rok> Implementing feedback review. 28d38cb <Rok> Removing backslashes from comments. 3291abc <Rok> Marking indptrBuffers, indicesBuffers and axisOrder required. d9ff47e <Rok> Further work and implementing review feedback. 24a831f <Rok> Style. 4f2bf00 <Rok> Work on CSF index tests. 6ceb406 <Rok> Implementing review feedback. bd0d8c2 <Rok> Dense to sparse CSF conversion now in order of dimension size. eb51947 <Rok> Switching SparseCSFIndex to '2D' data structure. a322ff5 <Rok> Adding tests for multiple index value types for SparseCSFIndex. f44d92c <Rok> Adding SparseCSFIndex::Make. 7d17995 <Rok> Adding Tensor to SparseCSFTensor conversion. 05a47a5 <Rok> Using axis_order in CSF. 6b938f7 <Rok> Documentation. 2d10104 <Rok> WIP Authored-by: Rok <rok@mihevc.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
This is to resolve ARROW-4226.