@rok
Member

@rok rok commented Oct 22, 2019

This is to resolve ARROW-4226.

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

Member

@pitrou pitrou left a comment

Just a pre-review.

@rok rok force-pushed the ARROW-4226 branch 2 times, most recently from 0afbe8e to 8f337af Compare November 10, 2019 00:20
@rok
Member Author

rok commented Nov 10, 2019

This is missing the MakeSparseTensorFromTensor part for CSF.

@rok rok force-pushed the ARROW-4226 branch 3 times, most recently from 7a31b0e to 5c89552 Compare November 18, 2019 17:06
@rok rok force-pushed the ARROW-4226 branch 2 times, most recently from 44720aa to d49074c Compare November 25, 2019 01:03
@mrkn
Member

mrkn commented Nov 25, 2019

@rok Thank you for working to implement CSF format.

We should research the existing implementations of CSF format so that our implementation can be used to exchange the data without copying buffers.

Did you find any existing libraries that implement the CSF format? If you know of some, could you point me to them so that I can investigate their data layouts?

@rok rok force-pushed the ARROW-4226 branch 4 times, most recently from e33143d to f93be66 Compare November 25, 2019 04:01
@rok
Member Author

rok commented Nov 25, 2019

> We should research the existing implementations of CSF format so that our implementation can be used to exchange the data without copying buffers.

Agreed! :)

> Did you find any existing libraries that implement the CSF format? If you know of some, could you point me to them so that I can investigate their data layouts?

To my knowledge CSF is implemented in taco and in pydata/sparse as GCXS, see this PR. I did not research the taco implementation.
As for sparse.GCXS: I'm skeptical of the way it indexes with a global index, see discussion here. My proposal uses offsets to enable per-dimension indexing, and it should be easy to interface it with sparse.GCXS.

My main reference for implementation is Figure 2 in this paper. I'm not sure this implementation is optimal, please give me your thoughts.

@rok
Member Author

rok commented Nov 25, 2019

Another point: CSF is a generalization of CSR and CSC. Because of this, the transformations CSF <-> CSC and CSF <-> CSR don't require changes to the index. I'm proposing a potential COO -> CSF in this PR, so we just need COO <- CSF and we have all sparse index transformations covered.
This would enable us to compare SparseTensors with different indices.
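
For illustration, the missing COO <- CSF direction could look roughly like the minimal Python sketch below. It assumes a per-dimension layout (one `indices` list per dimension and one `indptr` list between each pair of consecutive dimensions); that layout and the helper name `csf_to_coo` are assumptions made for this example, not the structure or API defined by this PR.

```python
def csf_to_coo(indptr, indices):
    """Expand a CSF index into a list of COO coordinate tuples.

    Assumed layout (hypothetical, for illustration only):
      indices[d][n]  -> coordinate of node n along dimension d
      indptr[d][n], indptr[d][n + 1] -> range of n's children in dimension d + 1
    """
    ndim = len(indices)
    coords = []

    def walk(level, node, prefix):
        path = prefix + (indices[level][node],)
        if level == ndim - 1:
            # Leaf level: one complete coordinate per stored non-zero.
            coords.append(path)
            return
        for child in range(indptr[level][node], indptr[level][node + 1]):
            walk(level + 1, child, path)

    for root in range(len(indices[0])):
        walk(0, root, ())
    return coords


# A 2 x 2 x 3 tensor with five non-zeros.
indptr = [[0, 2, 3], [0, 2, 3, 5]]
indices = [[0, 1], [0, 1, 1], [0, 2, 1, 0, 2]]
coo = csf_to_coo(indptr, indices)
```

In the 2-D case with no empty rows, `indptr[0]` and `indices[1]` have the same contents as the usual CSR `indptr` and `indices` arrays.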

@mrkn
Member

mrkn commented Nov 26, 2019

GCXS in pydata/sparse seems to be an implementation of GCRS and GCCS proposed in this article.

I confirmed that taco supports CSF format.

Now I'm reading this paper and the implementation of taco to study taco's abstraction of sparse tensor formats.

I guess we need more research on the real-world implementations of CSF format right now.

@rok
Member Author

rok commented Nov 26, 2019

There is also a CSF implementation by @ShadenSmith in SPLATT.
Another nice resource for this discussion is FROSTT, a collection of sparse tensors by the same author.

@ShadenSmith

Hello! It's great to see your interest in the CSF format. I'm happy to answer any questions that you have.

@rok
Member Author

rok commented Nov 27, 2019

Hi @ShadenSmith. We're currently discussing the layout of Arrow's CSF implementation. As @mrkn mentioned, we want to design it in a way that lets us avoid copying data when interfacing with existing libraries, so that we can just pass pointers.

This PR currently proposes a structure like:

table SparseTensorIndexCSF {
  indptrType: Int;
  indptrBuffer: Buffer;
  indptrOffsets: [int];
  indicesType: Int;
  indicesBuffer: Buffer;
  indicesOffsets: [int];
  axisOrder: [long];
}

Here indptrs and indices are stored in indptrBuffer and indicesBuffer respectively. Offsets are used to split these buffers into per-dimension indptrs and indices. See here for comments.
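
A minimal Python sketch of how such offsets could carve per-dimension slices out of one flat buffer without copying. It assumes that `offsets[i]` marks where dimension i's slice starts and that each slice ends at the next offset (or at the end of the buffer); the proposal above does not spell this out, and `split_by_offsets` is a hypothetical helper.

```python
from array import array


def split_by_offsets(buffer, offsets):
    """Return one zero-copy memoryview per entry in `offsets`.

    Assumption: offsets[i] is the start of slice i, and slice i ends
    where slice i + 1 starts (or at the end of the buffer).
    """
    view = memoryview(buffer)
    bounds = list(offsets) + [len(view)]
    return [view[bounds[i]:bounds[i + 1]] for i in range(len(offsets))]


# Flat buffers for a 2 x 2 x 3 tensor with five non-zeros.
indices_buffer = array("q", [0, 1, 0, 1, 1, 0, 2, 1, 0, 2])
indices_offsets = [0, 2, 5]
indptr_buffer = array("q", [0, 2, 3, 0, 2, 3, 5])
indptr_offsets = [0, 3]

# The views share memory with the buffers; tolist() copies only for display.
indices = [v.tolist() for v in split_by_offsets(indices_buffer, indices_offsets)]
indptr = [v.tolist() for v in split_by_offsets(indptr_buffer, indptr_offsets)]
```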

If I understand correctly, SPLATT uses a different structure where indices and indptrs are stored in a '2D' data structure in which the dimension is used to access the data per dimension, e.g. fptr[dimension][fiber] and fids[dimension][fiber].
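
As a rough Python sketch of such a '2D' layout, the function below builds per-dimension `indptr` and `indices` lists from COO coordinates. The names `coo_to_csf`, `indptr` and `indices` are chosen for this example (they play the roles of `fptr` and `fids` above); this is not SPLATT's code or Arrow's API, and it assumes a non-empty coordinate list.

```python
def coo_to_csf(coords):
    """Build per-dimension CSF arrays from COO coordinates.

    Returns (indptr, indices) where indices[d] lists the nodes of
    dimension d and indptr[d] gives each node's child range in d + 1.
    """
    coords = sorted(set(map(tuple, coords)))  # lexicographic order, no duplicates
    ndim = len(coords[0])
    indices = [[] for _ in range(ndim)]
    indptr = [[] for _ in range(ndim - 1)]
    prev = None
    for c in coords:
        # First dimension at which this coordinate leaves the previous path.
        first_new = 0 if prev is None else next(
            d for d in range(ndim) if c[d] != prev[d])
        for d in range(first_new, ndim):
            if d < ndim - 1:
                # Children of the new node start at the current end of level d + 1.
                indptr[d].append(len(indices[d + 1]))
            indices[d].append(c[d])
        prev = c
    # Close every indptr array with the total node count of the next level.
    for d in range(ndim - 1):
        indptr[d].append(len(indices[d + 1]))
    return indptr, indices


indptr, indices = coo_to_csf(
    [(0, 0, 0), (0, 0, 2), (0, 1, 1), (1, 1, 0), (1, 1, 2)])
```

Here `indices[dimension][node]` and `indptr[dimension][node]` are addressed by dimension first, so no offsets into a shared buffer are needed.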

@ShadenSmith could you say:

  • do you think other CSF implementations lay out their variables the way SPLATT does?
  • what number (range) of dimensions would you expect to encounter / need in real world applications?

@ShadenSmith

Hi @rok, the other CSF implementations that I have seen either include the code from SPLATT or also just use the 2D layout for the per-mode structures. Here is an example. I'm not sure if taco uses the 1D or 2D implementation and did not find the details from a quick search of the repo.

I think five or six modes (dimensions) is the largest I have encountered in an application, with three being the most common. I have heard of some ongoing research with O(100-1000) mode tensors, but haven't seen results yet and am not sure something so high dimensional will pan out.

@rok
Member Author

rok commented Dec 3, 2019

> Hi @rok, the other CSF implementations that I have seen either include the code from SPLATT or also just use the 2D layout for the per-mode structures. Here is an example. I'm not sure if taco uses the 1D or 2D implementation and did not find the details from a quick search of the repo.

Got it, thanks a lot! This is good to know. :)
I'll make another proposal with a 2D structure.

> I think five or six modes (dimensions) is the largest I have encountered in an application, with three being the most common. I have heard of some ongoing research with O(100-1000) mode tensors, but haven't seen results yet and am not sure something so high dimensional will pan out.

Interesting, do you think higher dimensional tensors will be used if available?

@ShadenSmith

> Interesting, do you think higher dimensional tensors will be used if available?

Let me first say that I can really only speak for the tensor factorization community, and even there I'm more of a scalability researcher than an application researcher.

Today's software can deal with very high dimensional tensors (SPLATT for example just needs a compile-time flag to increase the max dimensionality). I think the issue is more on the side of the data and quality of a tensor factorization with many dimensions; you may just be learning noise unless the data truly has a nice tensor form with many dimensions. It's quite possible that future application areas do use tensors of that form...I just haven't seen it. Off the top of my head, some of the electronic health record tensors have ~5-10 dimensions, but still not 100-1000 AFAIK.

Also, I would usually not recommend using CSF for tensors with more than 5 or 6 dimensions, unless you have a lot of data that has a non-uniform sparsity pattern. A simple coordinate format would be better unless the non-zeros arrange nicely into fibers. As you increase the number of dimensions in a tensor, the idea of a fiber gets more and more specific, meaning that the savings you get from storing the sparse tensor in CSF disappear because most fibers will only have a few non-zeros in them.
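
The effect described above can be made concrete with a small Python sketch that counts index entries only (values are ignored). It assumes COO stores one coordinate per dimension per non-zero, and that CSF stores one entry per node per level plus one indptr array between consecutive levels; the helper names are hypothetical.

```python
def coo_index_entries(coords):
    """Index entries needed by COO: one coordinate per dimension per non-zero."""
    return len(coords) * len(coords[0])


def csf_index_entries(coords):
    """Index entries needed by CSF under the assumed layout.

    The number of nodes at level d equals the number of distinct
    coordinate prefixes of length d + 1; each non-leaf level also
    needs an indptr array with one more entry than it has nodes.
    """
    ndim = len(coords[0])
    nodes = [len({c[:d + 1] for c in coords}) for d in range(ndim)]
    return sum(nodes) + sum(n + 1 for n in nodes[:-1])


# 3-D tensor whose 200 non-zeros fall into four long fibers.
clustered = [(i, j, k) for i in range(2) for j in range(2) for k in range(50)]
# 6-D tensor whose 200 non-zeros share no prefix at all.
scattered = [(i,) * 6 for i in range(200)]

clustered_sizes = (coo_index_entries(clustered), csf_index_entries(clustered))
scattered_sizes = (coo_index_entries(scattered), csf_index_entries(scattered))
```

In the clustered case CSF needs 214 index entries against 600 for COO; in the scattered six-dimensional case it needs 2205 against 1200, because every non-zero ends up in its own fiber.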

@rok rok force-pushed the ARROW-4226 branch 3 times, most recently from e7c67a0 to aaffe58 Compare December 8, 2019 18:56
@rok
Member Author

rok commented Feb 4, 2020

Thanks for the review @pitrou! I've pushed changes you suggested.
It's good to see this is slowly coming together.

@rok
Member Author

rok commented Feb 4, 2020

It seems the Flight tests are failing on AppVeyor.

Member

@pitrou pitrou left a comment

A couple of details left. We're getting there :-)

Member

You don't seem to mutate those vector inputs, so you should really make them const references :-)

@rok
Member Author

rok commented Feb 5, 2020

Thanks for the review @pitrou. I pushed the suggested changes :).

@pitrou
Member

pitrou commented Feb 5, 2020

Thank you @rok :-)

@pitrou pitrou closed this in c02d376 Feb 5, 2020
@rok
Member Author

rok commented Feb 6, 2020

Thanks @pitrou @mrkn and @ShadenSmith!

kszucs pushed a commit that referenced this pull request Feb 7, 2020
This is to resolve [ARROW-4226](https://issues.apache.org/jira/browse/ARROW-4226).

Closes #5716 from rok/ARROW-4226 and squashes the following commits:

9ca93ab <Rok> Implementing review feedback.
1b922f6 <Rok> Implementing review feedback.
11b81bb <Rok> Factoring out index incrementing for dense to COO and CSF indices.
6f4f4a8 <Rok> Implementing feedback review.
28d38cb <Rok> Removing backslashes from comments.
3291abc <Rok> Marking indptrBuffers, indicesBuffers and axisOrder required.
d9ff47e <Rok> Further work and implementing review feedback.
24a831f <Rok> Style.
4f2bf00 <Rok> Work on CSF index tests.
6ceb406 <Rok> Implementing review feedback.
bd0d8c2 <Rok> Dense to sparse CSF conversion now in order of dimension size.
eb51947 <Rok> Switching SparseCSFIndex to '2D' data structure.
a322ff5 <Rok> Adding tests for multiple index value types for SparseCSFIndex.
f44d92c <Rok> Adding SparseCSFIndex::Make.
7d17995 <Rok> Adding Tensor to SparseCSFTensor conversion.
05a47a5 <Rok> Using axis_order in CSF.
6b938f7 <Rok> Documentation.
2d10104 <Rok> WIP

Authored-by: Rok <rok@mihevc.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>