GH-44246: [JS] Add a lightweight array view to Table #44247

swallez · 2024-09-27T19:00:25Z

Rationale for this change

Table.toArray() provides the convenience of using Arrow tables as common object arrays. However it does allocate an array of the size of the table that hold proxies to the table rows. This array can consume a significant amount of memory with large tables.

Providing an array proxy that wraps the table and creates struct row proxies on demand would avoid this and provide a object array view on top of a table at almost zero memory cost.

What changes are included in this PR?

This PR adds a new Table.toArrayView() method that returns an array proxy to the table rows.

It complements Table.toArray() and doesn't replace it for the following reasons:

this is a read-only view, while toArray() returns a plain mutable array. Replacing it would be a breaking change.
the proxy adds some overhead to direct array access. Tests on iterator loops show that access via the array proxy takes 5 more time than direct array access. This can be a concern for some applications, even if it should generally be negligible, and if applications seeking maximum performance should avoid using these convenience wrappers.

Also fixes #30863 by using a singleton proxy handler for StructRowProxyHandler, which further reduces memory usage.

Are these changes tested?

Yes. New tests are added to js/test/unit/table/table-test.ts for both toArrayView() and toArray() that was not tested.

Are there any user-facing changes?

This adds a new Table.toArrayView() method.

GitHub Issue: [JS] Table.toArray() allocates a large array. #44246

github-actions · 2024-09-27T19:00:51Z

⚠️ GitHub issue #44246 has been automatically assigned in GitHub to PR creator.

domoritz

This is great. Would you also add this to vectors or if not, why only tables?

domoritz · 2024-09-28T15:24:05Z

js/src/table.ts

    /**
     * Returns a string representation of the Table rows.
     *
     * @returns A string representation of the Table rows.
     */
    public toString() {
-        return `[\n  ${this.toArray().join(',\n  ')}\n]`;
+        return `[\n  ${this.toArrayView().join(',\n  ')}\n]`;


Since we are stringifying the whole table, is it worth using an array view here?

We need an array (or array-like) to use join(). This was creating an array, and with toArrayView() this now just creates a proxy. Or did I misunderstand your comment?

I understand the difference but I wonder whether this change introduces a performance regression since "the proxy adds some overhead to direct array access."

I'm not against this change but want to make sure we have thought through the implications.

Got it. I don't think this is an issue here, as toString() is generally used for debugging purposes. Actually I'm wondering if it's practically usable for Table beyond tests as it can create huge strings.

After some benchmarking, toString is a perfect candidate for using toArrayView(): the access overhead is actually the same as with toArray(). It is just spread while iterating while with toArrayView() while it's paid once upfront with toString. Since we're not reusing the array, after join, a view that doesn't allocate memory is a better choice.

swallez · 2024-09-28T15:44:53Z

Thank @domoritz!

Would you also add this to vectors or if not, why only tables?

That absolutely makes sense. My immediate need was for tables for an application that uses plain objects, and I will add it for vectors.

Also, do you agree that toArrayView() should not replace toArray()? Or would the slight behavior change be acceptable if we replace it?

domoritz · 2024-09-28T15:47:08Z

Also, do you agree that toArrayView() should not replace toArray()? Or would the slight behavior change be acceptable if we replace it?

Good question. I wonder actually whether we need toArrayView at all. Why not use make the vector/table behave like an array view?

domoritz · 2024-09-28T15:48:46Z

js/test/unit/table/table-test.ts

+        checkArray(table.toArray());
+    });
+    describe('table.toArrayView()', () => {
+        checkArray(table.toArrayView());


Why can't we make this work?

Suggested change

checkArray(table.toArrayView());

checkArray(table);

See the discussion here. We cannot make Table behave like an array.

domoritz · 2024-09-28T15:49:14Z

js/src/table.ts

@@ -238,17 +238,30 @@ export class Table<T extends TypeMap = any> {
     *
     * @returns An Array of Table rows.
     */
-    public toArray() {
+    public toArray(): Array<Struct<T>['TValue']> {


Does this improve the type returned by this method or just make the type explicit?

It just makes it explicit. I added it to know precisely what toArrayView() was supposed to return. And also because I like explicit types 😉

Unlike in other languages, explicitly annotating return types is somewhat of an anti-pattern in TypeScript. Unless there's a special case where we need to inform the compiler what the real return type is, I'm with @domoritz that we should default to letting the compiler infer the type.

Ok, no problem, I removed it in 4b706a9.

swallez · 2024-09-28T16:11:45Z

Why not use make the vector/table behave like an array view?

Mimicking an array is a bit tricky: it has a lot of methods that are implemented by iterating it (like join, map, find, etc) and we would have to reimplement all of them. Also, Array.isArray() would return false even if we reimplemented all array methods.

This is why toArrayView() creates a proxy to an empty array and passes the table to the proxy handler:

return new Proxy([] as Array<Struct<T>['TValue']>, new TableArrayProxyHandler(this));

That way the proxy handler only has to implement iteration and direct access, and benefits from all other array methods by delegating to the wrapped array.

domoritz · 2024-09-28T17:29:57Z

I see. I'll page @trxcllnt to weigh in on this one.

swallez · 2024-09-30T17:37:25Z

I did some extensive benchmarks to understand memory usage and performance overhead of toArray() vs toArrayView(). Discussing this with some colleagues, it turns out that there are a number of options when exposing a Table as an array what we may want to choose from depending on the usage context.

There are options for both the array and its elements:

array: early materialization (like in toArray()), lazy caching (creating and retaining row objects as we iterate), no caching (as in toArrayView())
rows/array items: materialization (creating a plain old JS object by copying the row data), lazy caching of properties (what StructRowProxyHandler does), no caching.

All combinations can be valid depending on how the application uses the resulting array-like object.

To account for this, I'd like to add an optional options parameter to toArray() that will allow configuring how we want the collection and rows be handled. Something like:

enum ToArrayBehavior (eager, cache, noCache);

class ToArrayOptions {
  array: ToArrayBehavior,
  rows: ToArrayBehavior
}

function toArray(options? ToArrayOptions) ...

with a default of { array: eager, rows: cache } which is what toArray() does today.

WDYT?

trxcllnt · 2024-09-30T18:36:53Z

Table itself is already a view... why not just use table.get()?

trxcllnt · 2024-09-30T20:02:29Z

I'm not sure how I feel about this PR yet, and have a few questions. I'm not saying I disagree with it, just trying to provide some context and discuss general strategies.

Table.toArray() provides the convenience of using Arrow tables as common object arrays. However it does allocate an array of the size of the table that hold proxies to the table rows. This array can consume a significant amount of memory with large tables.

Yeah, this is something we've struggled with since the project's inception. Every "convenience" API we've added is almost always also the least-performant way to interact with the data.

While ergonomic for vanilla JS users, these APIs encourage abuse and lead to performance pitfalls in everyday use. Anybody profiling their code blames Arrow for "being slow," when in reality all we did was make it easy for users to do slow things.

I personally would rather focus efforts on making it easy for users to do the fastest thing (so-called helping users "fall into the pit of success").

To date, this has manifested in three ways. First, we ensure all data movement (e.g. IPC) is zero copy and provide/encourage users to use zero-copy views of the data at rest. Second, we provide O(1) sequential access and O(log(m)) random access (where m is the number of RecordBatches in a Table). And third, we try to provide utilities for caching or deferring computation (MemoizedVector, MapRow/StructRow proxies).

We have Table.toArray() because it symmetrical to the (zero-copy where available) Vector.toArray(), while using the O(n) sequential iterator, without adding much to the API surface area or bundle size.

the proxy adds some overhead to direct array access. Tests on iterator loops show that access via the array proxy takes 5 more time than direct array access.

Part of this performance deficit is definitely due to the O(log(m)) random accesses for rows in the chunked Table, compared to the O(1) access for the Array.

A while back I had a branch that attempted this approach for Table/Vector. The idea was to use Object.setPrototypeOf() to give the Table/Vector all the Array methods (as well as pass the Array.isArray() check), and then use a Proxy to implement subscript-style accesses. I abandoned the branch after the performance tests showed similar results as you've reported here.

In reference to to my point above, even though our implementation could be the fastest way to implement a JS-Array-style API via Proxies, we'll still get blamed for encouraging users to use it.

With that context out of the way, I have some questions.

First, why/how are users (or you) using table.toArray() vs. using table.get(), table[Symbol.iterator](), or operating on the underlying TypedArrays directly?

Is it to get the table into a form where it can be used with the list alebra (map/filter/reduce/etc.)?

If so, have you considered a library like IxJS, or the stage 3 TC39 iterator-helpers proposal? One of the main reasons the Arrow types implement the [Async]Iterator spec is to make it easier for users to interoperate with libraries like these, while avoiding the O(n * log(m)) random access pitfalls, and also avoid re-implementing these libraries in Arrow itself.

Second, aside from the Array algebra, what benefit does an Array-like API bring over the existing table.get() method? For example, are there existing 3rd-party APIs that accept Arrays (and do subscript element-accesses) that you'd like to pass a Table?

Or is there a reason that operating directly on the binary data isn't easy/sufficient? For example, is the following insufficient/too inconvenient?

const times2 = table.data.reduce((results, batch) => {
  const [data] = batch.getChild("floats").data;
  return times2.concat(data.values.map((f, i) => data.isNull(i) ? f * 2 : 0));
}, new Float32Array());

Finally, is there any reason this can't live in application code instead (e.g. you create a Proxy from a table that implements the Array-subscript syntax)? I have done this a number of times until I had time to refactor routines handle tables/vectors more efficiently.

I am not intending to come off as obtuse, and I hope my questions make sense in context. I just worry about adding APIs that encourage users to do something convenient at the expense of performance.

swallez · 2024-10-01T15:14:00Z

Thanks for the extensive explanation @trxcllnt. I totally agree that providing helpers like toArray() that provide the comfort of "classic" data structures but come with some inevitable overhead cause some "it's slow compared to my objects" reactions, and I actually experienced this. So let me elaborate on my use case.

I work at Elastic where we're in the process of adding support for Arrow. The products currently use a JSON array-of-objects representation to exchange data, and have done so for more than a decade, even if deep inside Elasticsearch there is some column-oriented data processing happening. We recently added support for Arrow on the Elasticsearch server side and are now considering to consume it from Kibana, the UI front-end to manage and display data. There is a lot of existing code in Kibana based on array-of-objects and refactoring it to use Arrow vectors will take some time, and every iteration needs to bring some benefits in some area to foster adoption.

So the idea is to start from the edge (data exchange) where we can have significant gains on payload size, memory usage and eliminate parsing time, and then progressively use it more deeply in the rest of the code base. Hence the use of toArray(). That's the extent of the first iteration, and the overhead of toArray() has to be acceptable to not outweight the benefits of data transfer and zero copy.

Now I agree that toArray() and similar helpers are really a double-edged sword that may fool people into thinking that this is the way to use Arrow, and in the end be deceptive because of their overhead. But they are necessary nevertheless for transition scenarios in large code bases like the one I described.

So what about providing helpers as what they are, helpers and not primary features, and make this clear by moving them to a different location, like a TableConversionHelper or something similar? That way we don't expose Table.toArray() as something that looks like a nominal feature (and would then be deprecated), while still having room to provide useful utilities for transition scenarios, with the proper warnings saying that this isn't "the real Arrow way".

I hope this makes sense.

domoritz · 2024-10-01T15:25:12Z

Thanks for the additional context. I really appreciate the effort you put into preparing this pull request and hope that you send more in the future as you find wants to improve the library. Maybe for this specific feature it makes sense to just leave arrow as is (with the iterators and toArray as a compatibility method with all the limitations it comes with) and you can use your array view code in your application (or a separate package)? I think that adding it to this repo just adds to the confusion people experience. Thoughts? @swallez @trxcllnt

swallez · 2024-10-02T12:37:41Z

Ok, I understand your concerns and I'm closing this PR. I will keep experimenting on this topic in my benchmarks project, and may ping you at some point to see if anything interesting could be contributed here.

That being said, this discussion outlined the fact that toArray() is convenient but misleading. And yet this is the very first example on the library's home page. This should probably be updated so that users aren't given a wrong first impression.

I also opened PR #44289 with only the change for StructRow's proxy optimization. I guess that will be an easier one 😉

domoritz · 2024-10-02T12:47:48Z

Good point on the examples in the introduction. We should first show how to iterate instead.

domoritz · 2024-10-03T02:27:19Z

I took a stab at encouraging more effective APIs. How's #44292?

swallez added 2 commits September 27, 2024 20:26

Use a singleton row proxy handler to reduce memory usage

65b2b2b

Add a lightweight table array view that only allocates a proxy object

47f04ea

swallez requested review from domoritz and trxcllnt as code owners September 27, 2024 19:00

github-actions bot added Component: JavaScript awaiting review Awaiting review labels Sep 27, 2024

domoritz reviewed Sep 28, 2024

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Sep 28, 2024

domoritz reviewed Sep 28, 2024

View reviewed changes

swallez mentioned this pull request Sep 30, 2024

[Research] Data format improvements for charting (arrow) elastic/kibana#175695

Closed

Remove type annotation

4b706a9

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 1, 2024

swallez closed this Oct 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-44246: [JS] Add a lightweight array view to Table #44247

GH-44246: [JS] Add a lightweight array view to Table #44247

swallez commented Sep 27, 2024 •

edited by github-actions bot

Loading

github-actions bot commented Sep 27, 2024

domoritz left a comment

domoritz Sep 28, 2024

swallez Sep 28, 2024

domoritz Sep 28, 2024 •

edited

Loading

swallez Sep 28, 2024

swallez Sep 30, 2024

swallez commented Sep 28, 2024

domoritz commented Sep 28, 2024

domoritz Sep 28, 2024

swallez Sep 28, 2024

domoritz Sep 28, 2024

swallez Sep 28, 2024

trxcllnt Sep 30, 2024 •

edited

Loading

swallez Oct 1, 2024

swallez commented Sep 28, 2024

domoritz commented Sep 28, 2024

swallez commented Sep 30, 2024

trxcllnt commented Sep 30, 2024

trxcllnt commented Sep 30, 2024 •

edited

Loading

swallez commented Oct 1, 2024 •

edited

Loading

domoritz commented Oct 1, 2024

swallez commented Oct 2, 2024

domoritz commented Oct 2, 2024

domoritz commented Oct 3, 2024

GH-44246: [JS] Add a lightweight array view to Table #44247

GH-44246: [JS] Add a lightweight array view to Table #44247

Conversation

swallez commented Sep 27, 2024 • edited by github-actions bot Loading

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

github-actions bot commented Sep 27, 2024

domoritz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

domoritz Sep 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

swallez commented Sep 28, 2024

domoritz commented Sep 28, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

trxcllnt Sep 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

swallez commented Sep 28, 2024

domoritz commented Sep 28, 2024

swallez commented Sep 30, 2024

trxcllnt commented Sep 30, 2024

trxcllnt commented Sep 30, 2024 • edited Loading

swallez commented Oct 1, 2024 • edited Loading

domoritz commented Oct 1, 2024

swallez commented Oct 2, 2024

domoritz commented Oct 2, 2024

domoritz commented Oct 3, 2024

swallez commented Sep 27, 2024 •

edited by github-actions bot

Loading

domoritz Sep 28, 2024 •

edited

Loading

trxcllnt Sep 30, 2024 •

edited

Loading

trxcllnt commented Sep 30, 2024 •

edited

Loading

swallez commented Oct 1, 2024 •

edited

Loading