ARROW-854: [Format] Add tentative SparseTensor format #2546
Conversation
format/SparseTensor.fbs
Outdated
The name indptr comes from scipy's CSR implementation.
I want to replace this with a more suitable name.
(force-pushed 1c903a1 to 43c7cd7)
All the Travis CI jobs passed.

I will review. This is a difficult week because of the Strata conference in New York City, so I won't have much time to review until later in the day on Thursday, my time.
wesm
left a comment
Thank you for working on this. I agree that it would be valuable to be able to continue to have standardized metadata and an IPC protocol for data structures outside of Arrow columnar structures, like tensors and sparse tensors.
We should probably solicit feedback from folks with experience with scipy.sparse. I can help with this if you like.
cpp/src/arrow/ipc/CMakeLists.txt
Outdated
What do you think about including this definition in Tensor.fbs also?
OK. I'll move the definition of SparseTensor into Tensor.fbs.
format/SparseTensor.fbs
Outdated
Could you clarify what data you expect to be in indicesBuffer? I am not that familiar with the COO representation and this comment is not sufficient to understand what I would find here. Maybe a simple example would help?
I wrote an example of the COO format in 5094c98.
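For readers unfamiliar with COO, here is a minimal illustration using scipy.sparse; the matrix values are hypothetical and not taken from the patch:

```python
import numpy as np
from scipy.sparse import coo_matrix

# A small 4x4 matrix with three nonzero entries (hypothetical values).
dense = np.array([
    [0, 0, 5, 0],
    [0, 7, 0, 0],
    [0, 0, 0, 0],
    [9, 0, 0, 0],
])
coo = coo_matrix(dense)

# COO stores one coordinate per dimension for each nonzero value,
# so the indices buffer is conceptually this stacked coordinate list.
print(coo.row)   # [0 1 3]
print(coo.col)   # [2 1 0]
print(coo.data)  # [5 7 9]
```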
format/SparseTensor.fbs
Outdated
It seems like this could be large. Should this be a Buffer also? We don't want the metadata to be too big.
I agree with you. I should make it a Buffer.
Done in 7416960.
format/SparseTensor.fbs
Outdated
Maybe a small example would help here too
I've described an example of CSR format in 152cb10.
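As a companion to the referenced CSR example, a small sketch with scipy.sparse (hypothetical values, not code from the patch):

```python
import numpy as np
from scipy.sparse import csr_matrix

# The same style of hypothetical 4x4 matrix, now in CSR form.
dense = np.array([
    [0, 0, 5, 0],
    [0, 7, 0, 0],
    [0, 0, 0, 0],
    [9, 0, 0, 0],
])
csr = csr_matrix(dense)

# indptr has length nrows + 1: row i's nonzero values live in
# data[indptr[i]:indptr[i+1]], with columns in indices[indptr[i]:indptr[i+1]].
print(csr.indptr)   # [0 1 2 2 3]
print(csr.indices)  # [2 1 0]
print(csr.data)     # [5 7 9]
```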
format/SparseTensor.fbs
Outdated
In this patch maybe we want to write down how the Buffers will be arranged in the IPC body.
I'm thinking
<metadata> <padding> <indicesBuffer> <data>
and possibly in the case of sparse matrix CSR format:
<metadata> <padding> <indptr> <indicesBuffer> <data>
Let me know what you think
Both layouts you wrote match what I had in mind.
I'll document them in the comment.
Where is the most suitable place to describe the message layout of SparseTensor?
I think adding it to IPC.md would be acceptable, but I'd like to know whether a policy exists.
IPC.md sounds good to me
I wrote the description of the sparse tensor format in IPC.md.
I have one question. @wesm, I found that the Tensor format has the following order in IPC.md:
<padding> <metadata length> <metadata> <tensor body>
I followed this order for the sparse tensor format in 29ed742, but in the example format you mentioned above, the padding and the metadata were swapped.
Could you please tell me which you think is better?
I noticed that additional padding between the sparse index and the sparse tensor body is necessary to align the sparse tensor body to a multiple of 64 bytes.
So I added that in f0ec57d.
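The 64-byte alignment rule discussed here can be sketched as follows; `pad_to_64` and the buffer sizes are hypothetical illustrations, not code from the patch:

```python
def pad_to_64(offset: int) -> int:
    """Round an offset up to the next multiple of 64 bytes."""
    return (offset + 63) & ~63

# Hypothetical region sizes in a COO sparse tensor message.
metadata_len = 200
indices_len = 120

# Each region starts at the next 64-byte boundary after the previous one.
indices_start = pad_to_64(metadata_len)               # 256
body_start = pad_to_64(indices_start + indices_len)   # 384
```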
You may already be aware that the Python sparse ndarray implementation is being developed over at https://github.com/pydata/sparse. It was a project started by Matthew Rocklin and is now continued by Hameer Abbasi, so they may have some valuable insights to add here.

@wesm @dhirschfeld Yes, I'm considering that, too. I referred to scipy.sparse and pydata/sparse when I wrote this pull request, especially the implementation of scipy.sparse and these two issues: scipy/scipy#8162 and pydata/sparse#125. I need to cooperate with the people behind scipy.sparse and pydata/sparse to make Arrow's SparseTensor widely interoperable, so I want to get feedback from them. @wesm Could you please connect me with the scipy.sparse folks? I don't know who the best person to mention here is. And I'll consult with Hameer Abbasi about cooperating with pydata/sparse.

@rgommers We are looking at adding sparse matrix and sparse ndarray metadata descriptors to allow for zero-copy reads of such objects from shared memory using the Arrow library ecosystem, and to eventually develop interop with scipy.sparse and other frameworks. Would you or someone from the scipy.sparse community be able to review and give some feedback on the proposed metadata?
Nice :)
I'm pretty much out of bandwidth till the end of October, and there are people with more expertise; let me ping a few: @hameerabbasi @pv @perimosocordiae
hameerabbasi
left a comment
This, in general, looks okay to me. Perhaps in the future, if zero-copy and future-proofness are really what we want, we might want to add the CSF (compressed sparse fiber) format, a generalisation of CSR/CSC. I'm currently working on adding it to PyData/Sparse, and I plan to make it the preferred format (COO will still be around, though).
CSF involves two-dimensional jagged arrays; I'm not entirely sure how that is handled in this grammar, otherwise I'd send a PR myself.
COO and CSR are probably sufficient for many use cases with matrices. If CSR is included, I wonder whether it would also be useful to include CSC, as IIRC it was the default I/O format used in several sparse matrix libraries (UMFPACK, SuperLU).

@wesm @rgommers Thank you very much for arranging this. @hameerabbasi @pv Thank you for your comments. @hameerabbasi I was hoping to consult with you about introducing CSF in Arrow, so it was great that you commented recommending CSF. @pv OK. CSC is just the transpose of CSR, so we can introduce CSC very easily.
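The "CSC is just the transpose of CSR" observation can be checked directly with scipy.sparse (hypothetical matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

dense = np.array([[0, 5],
                  [7, 0]])

csc = csc_matrix(dense)
csr_of_transpose = csr_matrix(dense.T)

# CSC of A carries the same indptr/indices/data buffers as CSR of A^T,
# so supporting CSC on top of CSR is mostly a matter of interpretation.
assert (csc.indptr == csr_of_transpose.indptr).all()
assert (csc.indices == csr_of_transpose.indices).all()
assert (csc.data == csr_of_transpose.data).all()
```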
Codecov Report

@@           Coverage Diff            @@
##           master    #2546    +/-  ##
========================================
+ Coverage   87.55%   88.56%      +1%
========================================
  Files         403      342      -61
  Lines       62113    58322    -3791
========================================
- Hits        54386    51651    -2735
+ Misses       7657     6671     -986
+ Partials       70        0      -70

Continue to review the full report at Codecov.
format/SparseTensor.fbs
Outdated
If you'd like zero-copy writes to scipy.sparse.coo_matrix and sparse, then I'd recommend that this array be transposed, or stored as F-contiguous.
Thank you for your valuable advice.
I made COO use a column-major coordinates tensor.
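To illustrate why column-major coordinates help: in a nonzeros-by-ndim coordinate matrix stored in Fortran order, each dimension's coordinates are contiguous, matching scipy's separate row/col arrays. A sketch with hypothetical values:

```python
import numpy as np
from scipy.sparse import coo_matrix

dense = np.array([[0, 5],
                  [7, 0]])
coo = coo_matrix(dense)

# Stack the per-dimension coordinates into an (nnz, ndim) matrix and
# store it column-major, so each column (one dimension) is contiguous.
coords = np.asfortranarray(np.stack([coo.row, coo.col], axis=1))

row_view = coords[:, 0]  # contiguous view over the row coordinates
col_view = coords[:, 1]  # contiguous view over the column coordinates
assert row_view.flags["C_CONTIGUOUS"]
assert col_view.flags["C_CONTIGUOUS"]
```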
(force-pushed cd79dda to 74c6c7b)
This commit defines the new NumericTensor<T> class as a subclass of the Tensor class. NumericTensor<T> extends the Tensor class by adding a member function to access element values in a tensor. I want to use this new feature for writing tests of SparseTensor in #2546.

Author: Kenta Murata <mrkn@mrkn.jp>

Closes #2759 from mrkn/tensor_element_access and squashes the following commits:
37f0bb4 <Kenta Murata> Add tests for column-major strides
14fa527 <Kenta Murata> Remove needless cases
1646461 <Kenta Murata> Introduce NumericTensor class
(force-pushed 74c6c7b to 9b7468b)
What are the next steps here? I am not enough of an expert to give my seal of approval; if someone more experienced with sparse matrices could weigh in, it would be helpful.

@wesm I'm working on implementing the sparse index for the COO and CSR formats.
(force-pushed c61bed1 to cb0efef)
(force-pushed 76e205b to 61aa7f5)
- Rename SparseTensorBase to SparseTensor
- Rename SparseTensor<...> to SparseTensorImpl<...>
- Rename sparse_tensor_format_id to format_id
(force-pushed 00a7e3c to 148bff8)
@wesm Thank you for your comments.

I've restarted the failed job and CI is green now.

Merging. Thank you again for your patience and care on this.
data: Buffer;
}

root_type SparseTensor;
@wesm @mrkn - (I'm not a FlatBuffers expert by any means), but can you have two root_types in a single .fbs file?
I was updating the C# FlatBuffers generated code, and I noticed that the generated Tensor struct no longer has these methods:
public static void FinishTensorBuffer(FlatBufferBuilder builder, Offset<Tensor> offset) { builder.Finish(offset.Value); }
public static void FinishSizePrefixedTensorBuffer(FlatBufferBuilder builder, Offset<Tensor> offset) { builder.FinishSizePrefixed(offset.Value); }
I traced this to the code generator, which looks like it expects only a single root_type from the parser.
Looking at the docs, it sounds like there should only be one:
"The last part of the schema is the root_type. The root type declares what will be the root table for the serialized data. In our case, the root type is our Monster table."
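For reference, a FlatBuffers schema accepts only one root_type declaration per file; a hypothetical excerpt of the fix might look like:

```
// Excerpt (hypothetical): both tables may be defined in one schema,
// but flatc accepts exactly one root_type per .fbs file.
table Tensor { /* ... */ }
table SparseTensor { /* ... */ }

root_type Tensor;
// root_type SparseTensor;  <-- the duplicate declaration to remove
```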
Yes, it looks like we should remove the other root_type. Can you open a JIRA issue about fixing this?
@wesm Shall I make a new pull-request to fix it?
I've opened https://issues.apache.org/jira/browse/ARROW-4571 for this issue.
I'm working on it.
- Update FlatBuffers code to latest version.
- Mark all FlatBuffer types as internal.

Note: I didn't use the latest `.fbs` file version because it included the SparseTensor support. Using the latest `.fbs` change caused problems with the `Tensor` generation. I have a question whether the latest `.fbs` file is correct (#2546 (comment)). /cc @chutchinson @stephentoub - FYI

Author: Eric Erhardt <eric.erhardt@microsoft.com>

Closes #3637 from eerhardt/UpdateFlatBufferCode and squashes the following commits:
15e0831 <Eric Erhardt> Adding how to update FlatBuffers code to the README
043ce7e <Eric Erhardt> ARROW-4543: Update Flat Buffers code to latest version
I'm interested in making a language-agnostic sparse tensor format. I believe one of the suitable places to do this is Apache Arrow, so let me propose my idea of this here.
First of all, I found in my investigation that there is no common memory layout for sparse tensor representations. It means we need some kind of conversion to share sparse tensors among different systems even when the data is logically the same. This is the same situation as with dataframes, and it is the reason why I believe Apache Arrow is the suitable place.
There are many formats for representing a sparse tensor. Most of them are specialized for a matrix, which has two dimensions. There are few formats for general sparse tensors with more than two dimensions.
I think the COO format is a suitable starting point because COO can handle any number of dimensions, and many systems support it. In my investigation, the systems that support COO include SciPy, Dask, pydata/sparse, TensorFlow, and PyTorch.
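A sketch of why COO generalizes beyond matrices, using only NumPy (the tensor values are hypothetical):

```python
import numpy as np

# COO stores one coordinate per dimension for each nonzero value,
# so it works for any rank. Here, a 2x2x2 tensor with two nonzeros.
dense = np.zeros((2, 2, 2))
dense[0, 1, 0] = 3.0
dense[1, 0, 1] = 4.0

coords = np.array(np.nonzero(dense))  # coords.shape == (ndim, nnz)
values = dense[tuple(coords)]         # the nonzero values, [3.0, 4.0]
```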
Additionally, the CSR format for matrices may also be good to support initially. The reason is that CSR is efficient for extracting row slices, which may be important for extracting samples from tidy data, and it is supported by SciPy, MXNet, and R's Matrix library.
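The row-slicing advantage of CSR can be sketched with scipy.sparse (hypothetical matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [1, 0, 0],
    [0, 2, 0],
    [0, 0, 3],
])
m = csr_matrix(dense)

# Extracting a row range only needs the indptr[i]:indptr[i+1] lookups,
# so row slices (e.g. sampling rows from tidy data) are cheap in CSR.
rows = m[1:3]
assert rows.shape == (2, 3)
assert (rows.toarray() == dense[1:3]).all()
```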
I added my prototype definition of the SparseTensor format in this pull request. I designed this prototype format to be extensible so that we can support additional sparse formats. I think we will at least need an additional sparse tensor format for more than two dimensions besides COO, so we will need this extensibility.