Conversation

@mrkn mrkn commented Sep 11, 2018

I'm interested in making a language-agnostic sparse tensor format. I believe one of the most suitable places to do this is Apache Arrow, so let me propose my idea here.

First of all, in my investigation I found that there is no common memory layout for sparse tensor representations. This means we need some kind of conversion to share sparse tensors among different systems, even when the data is logically the same. It is the same situation as with dataframes, and this is why I believe Apache Arrow is the suitable place.

There are many formats to represent a sparse tensor. Most of them are specialized for matrices, which have two dimensions. There are few formats for general sparse tensors with more than two dimensions.

I think the COO format is a suitable starting point because COO can handle any number of dimensions, and many systems support it. In my investigation, the systems that support COO include SciPy, dask, pydata/sparse, TensorFlow, and PyTorch.
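To illustrate the point about dimensions, here is a minimal sketch (plain Python, not Arrow code) of the COO idea: one coordinate tuple per nonzero element, which generalizes naturally to any number of dimensions.

```python
# Illustrative sketch of the COO ("coordinate") format: store one
# coordinate tuple and one value per nonzero element of a dense tensor.
dense = [[1, 0, 2],
         [0, 0, 3]]

coords = []  # one (row, col) pair per nonzero
values = []
for i, row in enumerate(dense):
    for j, v in enumerate(row):
        if v != 0:
            coords.append((i, j))
            values.append(v)

print(coords)  # [(0, 0), (0, 2), (1, 2)]
print(values)  # [1, 2, 3]
```

For an n-dimensional tensor, the coordinate tuples simply grow to length n; nothing else changes.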

Additionally, the CSR format for matrices may also be worth supporting from the start. CSR is efficient for extracting row slices, which may be important for extracting samples from tidy data, and it is supported by SciPy, MXNet, and R's Matrix library.
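A small sketch (plain Python, illustrative) of why CSR row slicing is cheap: values and column indices are stored row by row, and `indptr` records where each row begins, so row i is a contiguous slice with no searching.

```python
# CSR arrays for the matrix [[1, 0, 2], [0, 0, 3]].
data    = [1, 2, 3]   # nonzero values, stored row by row
indices = [0, 2, 2]   # column index of each nonzero
indptr  = [0, 2, 3]   # row i spans data[indptr[i]:indptr[i+1]]

def row_nonzeros(i):
    # Extracting a row is a contiguous slice, which is what makes
    # CSR efficient for row access.
    start, stop = indptr[i], indptr[i + 1]
    return list(zip(indices[start:stop], data[start:stop]))

print(row_nonzeros(0))  # [(0, 1), (2, 2)]
print(row_nonzeros(1))  # [(2, 3)]
```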

I've added my prototype definition of the SparseTensor format in this pull request. I designed this prototype to be extensible so that we can support additional sparse formats later. At minimum, we will need to support another sparse format for tensors with more than two dimensions besides COO, so we will need this extensibility.

Member Author

The name indptr comes from scipy's CSR implementation.
I want to replace this with a more suitable name.

@mrkn mrkn force-pushed the sparse_tensor_proposal branch from 1c903a1 to 43c7cd7 Compare September 11, 2018 14:45
@kou kou changed the title Add tentative SparseTensor format ARROW-854: Add tentative SparseTensor format Sep 12, 2018
@kou kou changed the title ARROW-854: Add tentative SparseTensor format ARROW-854: [Format] Add tentative SparseTensor format Sep 12, 2018

mrkn commented Sep 12, 2018

All the jobs in Travis CI passed.
Could you please give me some comments?


wesm commented Sep 12, 2018

I will review. This is a difficult week because of the Strata conference in New York City, so I won't have much time to review until later in the day on Thursday, my time.

@wesm wesm left a comment

Thank you for working on this. I agree that it would be valuable to continue to have standardized metadata and an IPC protocol for data structures outside of Arrow columnar structures, like tensors and sparse tensors.

We should probably solicit feedback from folks with experience with scipy.sparse. I can help with this if you like.

Member

What do you think about including this definition in Tensor.fbs also?

Member Author

OK. I'll move the definition of SparseTensor into Tensor.fbs.

Member

Could you clarify what data you expect to be in indicesBuffer? I'm not that familiar with the COO representation, and this comment isn't sufficient to understand what I would find here. Maybe a simple example would help?

Member Author

I wrote an example of the COO format in 5094c98.
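A hedged sketch (illustrative, not the exact layout from the commit) of what such a COO indices buffer might hold: an N x ndim integer matrix with one coordinate row per nonzero value, which together with the values buffer fully reconstructs the dense tensor.

```python
import numpy as np

# Hypothetical COO buffers for a 2x3 tensor: one coordinate row per
# nonzero, plus a parallel values array.
indices = np.array([[0, 0],
                    [0, 2],
                    [1, 2]], dtype=np.int64)
values = np.array([1, 2, 3])

# Densify to check: place each value at its coordinates.
dense = np.zeros((2, 3), dtype=values.dtype)
dense[tuple(indices.T)] = values
print(dense)
# [[1 0 2]
#  [0 0 3]]
```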

Member

It seems like this could be large. Should this also be a Buffer? We don't want the metadata to get too big.

Member Author

I agree with you. I should make it a Buffer.

Member Author

Done in 7416960.

Member

Maybe a small example would help here too

Member Author

I've described an example of CSR format in 152cb10.

Member

In this patch maybe we want to write down how the Buffers will be arranged in the IPC body.

I'm thinking

<metadata> <padding> <indicesBuffer> <data>

and possibly in the case of sparse matrix CSR format:

<metadata> <padding> <indptr> <indicesBuffer> <data>

Let me know what you think
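The proposed arrangement can be sketched as simple offset arithmetic. This is a hypothetical sketch (names illustrative), assuming each buffer is padded so the next one starts on a 64-byte boundary, as Arrow does for its columnar buffers.

```python
ALIGN = 64

def pad64(n):
    # Padding bytes needed to reach the next multiple of 64.
    return (-n) % ALIGN

def body_layout(buffer_sizes):
    """Return one (offset, length) pair per buffer, 64-byte aligned."""
    offset, out = 0, []
    for size in buffer_sizes:
        out.append((offset, size))
        offset += size + pad64(size)
    return out

# e.g. indptr, indicesBuffer, and data buffers of 24, 24, and 40 bytes:
print(body_layout([24, 24, 40]))  # [(0, 24), (64, 24), (128, 40)]
```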

Member Author

Both layouts you wrote match what I had in mind.
I'll write them in the comment.

Member Author

Where is the right place to describe the message layout of SparseTensor?
I think adding it to IPC.md is acceptable, but I want to know the policy if one exists.

Member

IPC.md sounds good to me

Member Author

I wrote the description of the sparse tensor format in IPC.md.

I have one question. @wesm, I found that the Tensor format uses the following order in IPC.md:

<padding> <metadata length> <metadata> <tensor body>

I followed this order for the sparse tensor format in 29ed742, but in the example format you mentioned above, the padding and the metadata are swapped.

Would you please tell me which you think is better?

Member Author

I noticed that additional padding between the sparse index and the sparse tensor body is necessary to align the sparse tensor body to a multiple of 64 bytes.

So I added that in f0ec57d.

@dhirschfeld

You may already be aware that the Python sparse ndarray implementation is being developed over at https://github.com/pydata/sparse.

The project was started by Matthew Rocklin and is now maintained by Hameer Abbasi, so they may have some valuable insights to add here.


mrkn commented Sep 17, 2018

We should probably solicit feedback from folks with experience with scipy.sparse. I can help with this if you like.

@wesm @dhirschfeld Yes, I'm considering that, too.

I referred to scipy.sparse and pydata/sparse when writing this pull request, especially the implementation of scipy.sparse and these two issues: scipy/scipy#8162 and pydata/sparse#125.

I need to cooperate with the scipy.sparse and pydata/sparse people to make Arrow's SparseTensor widely interoperable, so I want to get feedback from them.

@wesm Could you please connect me with the folks in scipy.sparse? I don't know who is the best person to mention here.

Also, I'll consult with Hameer Abbasi about cooperating with pydata/sparse.


wesm commented Sep 17, 2018

@rgommers we are looking at adding sparse matrix and sparse ndarray metadata descriptors to allow for zero-copy reads of such objects from shared memory using the Arrow library ecosystem, and eventually develop interop with scipy.sparse and other frameworks. Would you or someone from the scipy.sparse community be able to review and give some feedback on the proposed metadata?

@rgommers

nice:)

Would you or someone from the scipy.sparse community be able to review and give some feedback on the proposed metadata?

I'm pretty much out of bandwidth till the end of October, and there are people with more expertise - let me ping a few: @hameerabbasi @pv @perimosocordiae

@hameerabbasi hameerabbasi left a comment

This looks okay to me in general. Perhaps in the future, if zero-copy and future-proofness are really what we want, we might want to add the CSF (compressed sparse fiber) format, a generalisation of CSR/CSC. I'm currently working on adding it to PyData/Sparse, and I plan to make it the preferred format (COO will still be around, though).

CSF involves two-dimensional jagged arrays; I'm not entirely sure how that would be handled in this grammar, otherwise I'd send a PR myself.


pv commented Sep 17, 2018

COO and CSR are probably sufficient for many use cases with matrices. If CSR is included, I wonder whether it would also be useful to include CSC, since IIRC it was the default I/O format used in several sparse matrix libraries (UMFPACK, SuperLU).


mrkn commented Sep 24, 2018

@wesm @rgommers Thank you very much for arranging this.

@hameerabbasi @pv Thank you for giving your comments.

@hameerabbasi I was hoping to consult you about introducing CSF in Arrow, so it was great that you recommended CSF in your comment.

@pv OK. CSC is just the transpose of CSR, so we can introduce CSC very easily.
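A minimal sketch (plain Python, illustrative) of that transpose relationship: the CSC arrays of a matrix A are exactly the CSR arrays of its transpose, so a CSR implementation gives CSC almost for free.

```python
def dense_to_csr(dense):
    # Build the three CSR arrays from a dense row-of-rows matrix.
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)
                indices.append(j)
        indptr.append(len(data))
    return data, indices, indptr

A = [[1, 0, 2],
     [0, 0, 3]]
A_t = [list(col) for col in zip(*A)]  # transpose of A

# CSR of A.T doubles as CSC of A: indices become row indices,
# and indptr delimits columns instead of rows.
print(dense_to_csr(A_t))  # ([1, 2, 3], [0, 0, 1], [0, 1, 1, 3])
```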


codecov-io commented Sep 24, 2018

Codecov Report

Merging #2546 into master will increase coverage by 1%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #2546     +/-   ##
=========================================
+ Coverage   87.55%   88.56%     +1%     
=========================================
  Files         403      342     -61     
  Lines       62113    58322   -3791     
=========================================
- Hits        54386    51651   -2735     
+ Misses       7657     6671    -986     
+ Partials       70        0     -70
Impacted Files Coverage Δ
rust/src/record_batch.rs
go/arrow/datatype_nested.go
rust/src/util/bit_util.rs
go/arrow/math/uint64_amd64.go
go/arrow/internal/testing/tools/bool.go
go/arrow/internal/bitutil/bitutil.go
go/arrow/memory/memory_avx2_amd64.go
go/arrow/array/null.go
rust/src/lib.rs
rust/src/array.rs
... and 52 more



If you'd like zero-copy writes to scipy.sparse.coo_matrix and pydata/sparse, then I'd recommend that this array be transposed, or stored as F-contiguous.

Member Author

Thank you for your valuable advice.
I made COO use a column-major coordinates tensor.
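A sketch of why column-major (Fortran-order) coordinates help here: scipy.sparse.coo_matrix keeps separate `row` and `col` arrays, and in an F-contiguous N x ndim coordinate matrix each column is already a contiguous run of memory, so those per-dimension arrays can be exposed as views rather than copies.

```python
import numpy as np

# F-contiguous coordinate matrix: all row indices are contiguous,
# then all column indices.
coords = np.array([[0, 0],
                   [0, 2],
                   [1, 2]], dtype=np.int64, order='F')

rows, cols = coords[:, 0], coords[:, 1]
print(np.shares_memory(rows, coords))  # True: a view, not a copy
print(rows.flags['C_CONTIGUOUS'])      # True only because of order='F'
```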

@mrkn mrkn force-pushed the sparse_tensor_proposal branch from cd79dda to 74c6c7b Compare October 16, 2018 02:16
pitrou pushed a commit that referenced this pull request Oct 25, 2018
This commit defines the new NumericTensor<T> class as a subclass
of Tensor class. NumericTensor<T> extends Tensor class by adding
a member function to access element values in a tensor.

I want to use this new feature for writing tests of SparseTensor in #2546.

Author: Kenta Murata <mrkn@mrkn.jp>

Closes #2759 from mrkn/tensor_element_access and squashes the following commits:

37f0bb4 <Kenta Murata> Add tests for column-major strides
14fa527 <Kenta Murata> Remove needless cases
1646461 <Kenta Murata> Introduce NumericTensor class
@mrkn mrkn force-pushed the sparse_tensor_proposal branch from 74c6c7b to 9b7468b Compare October 25, 2018 07:46

wesm commented Nov 10, 2018

What are the next steps here? I am not enough of an expert to give my seal of approval; it would be helpful if someone more experienced with sparse matrices could weigh in.


mrkn commented Nov 10, 2018

@wesm I'm working on implementing the sparse index for the COO and CSR formats.
I think I can finish them by the end of this month. Please wait a moment.

@mrkn mrkn force-pushed the sparse_tensor_proposal branch 5 times, most recently from c61bed1 to cb0efef Compare November 14, 2018 09:03
@mrkn mrkn force-pushed the sparse_tensor_proposal branch 3 times, most recently from 76e205b to 61aa7f5 Compare November 20, 2018 15:35
@mrkn mrkn force-pushed the sparse_tensor_proposal branch from 00a7e3c to 148bff8 Compare January 9, 2019 08:53

mrkn commented Jan 9, 2019

@wesm Thank you for your comments.
I've finished fixing the things you pointed out and rebased the commits.
There is one failure on Travis CI, but it doesn't appear to be caused by my changes.


kou commented Jan 9, 2019

I've restarted the failed job and CI is green now.


wesm commented Jan 9, 2019

Merging. Thank you again for your patience and care on this

@wesm wesm closed this in b8aeb79 Jan 9, 2019
@mrkn mrkn deleted the sparse_tensor_proposal branch January 10, 2019 04:45
data: Buffer;
}

root_type SparseTensor;
Contributor

@wesm @mrkn - (I'm not a FlatBuffers expert by any means), but can you have two root_types in a single .fbs file?

I was updating the C# FlatBuffers generated code, and I noticed that the generated Tensor struct no longer has these methods:

  public static void FinishTensorBuffer(FlatBufferBuilder builder, Offset<Tensor> offset) { builder.Finish(offset.Value); }
  public static void FinishSizePrefixedTensorBuffer(FlatBufferBuilder builder, Offset<Tensor> offset) { builder.FinishSizePrefixed(offset.Value); }

I traced this to the code generator here:

https://github.com/google/flatbuffers/blob/7d3930a2fd71774fdec063160ebd168bbff6db8b/src/idl_gen_general.cpp#L1425-L1442

This makes it look like the code generator expects only a single root_type per parsed schema.

Looking at the docs, it sounds like there should be only one:

The last part of the schema is the root_type. The root type declares what will be the root table for the serialized data. In our case, the root type is our Monster table.

Member

Yes, it looks like we should remove the other root_type. Can you open a JIRA about fixing this?

Member Author

@wesm Shall I make a new pull request to fix it?

Member Author

I'm working on it.

kszucs pushed a commit that referenced this pull request Feb 18, 2019
- Update FlatBuffers code to latest version.
- Mark all FlatBuffer types as internal.

Note: I didn't use the latest `.fbs` file version because it included the SparseTensor support. Using the latest `.fbs` change caused problems with the `Tensor` generation. I have a [question if the latest `.fbs` file is correct](#2546 (comment)).

/cc @chutchinson @stephentoub - FYI

Author: Eric Erhardt <eric.erhardt@microsoft.com>

Closes #3637 from eerhardt/UpdateFlatBufferCode and squashes the following commits:

15e0831 <Eric Erhardt> Adding how to update FlatBuffers code to the README
043ce7e <Eric Erhardt> ARROW-4543:  Update Flat Buffers code to latest version
QuietCraftsmanship pushed a commit to QuietCraftsmanship/arrow that referenced this pull request Jul 7, 2025
kou pushed a commit to apache/arrow-dotnet that referenced this pull request Jul 30, 2025