Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-44246: [JS] Add a lightweight array view to Table #44247

Closed
wants to merge 3 commits into from

Conversation

swallez
Copy link
Member

@swallez swallez commented Sep 27, 2024

Rationale for this change

Table.toArray() provides the convenience of using Arrow tables as common object arrays. However it does allocate an array of the size of the table that hold proxies to the table rows. This array can consume a significant amount of memory with large tables.

Providing an array proxy that wraps the table and creates struct row proxies on demand would avoid this and provide a object array view on top of a table at almost zero memory cost.

What changes are included in this PR?

This PR adds a new Table.toArrayView() method that returns an array proxy to the table rows.

It complements Table.toArray() and doesn't replace it for the following reasons:

  • this is a read-only view, while toArray() returns a plain mutable array. Replacing it would be a breaking change.
  • the proxy adds some overhead to direct array access. Tests on iterator loops show that access via the array proxy takes 5 more time than direct array access. This can be a concern for some applications, even if it should generally be negligible, and if applications seeking maximum performance should avoid using these convenience wrappers.

Also fixes #30863 by using a singleton proxy handler for StructRowProxyHandler, which further reduces memory usage.

Are these changes tested?

Yes. New tests are added to js/test/unit/table/table-test.ts for both toArrayView() and toArray() that was not tested.

Are there any user-facing changes?

This adds a new Table.toArrayView() method.

Copy link

⚠️ GitHub issue #44246 has been automatically assigned in GitHub to PR creator.

Copy link
Member

@domoritz domoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. Would you also add this to vectors or if not, why only tables?

/**
* Returns a string representation of the Table rows.
*
* @returns A string representation of the Table rows.
*/
public toString() {
return `[\n ${this.toArray().join(',\n ')}\n]`;
return `[\n ${this.toArrayView().join(',\n ')}\n]`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are stringifying the whole table, is it worth using an array view here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need an array (or array-like) to use join(). This was creating an array, and with toArrayView() this now just creates a proxy. Or did I misunderstand your comment?

Copy link
Member

@domoritz domoritz Sep 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the difference but I wonder whether this change introduces a performance regression since "the proxy adds some overhead to direct array access."

I'm not against this change but want to make sure we have thought through the implications.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I don't think this is an issue here, as toString() is generally used for debugging purposes. Actually I'm wondering if it's practically usable for Table beyond tests as it can create huge strings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some benchmarking, toString is a perfect candidate for using toArrayView(): the access overhead is actually the same as with toArray(). It is just spread while iterating while with toArrayView() while it's paid once upfront with toString. Since we're not reusing the array, after join, a view that doesn't allocate memory is a better choice.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Sep 28, 2024
@swallez
Copy link
Member Author

swallez commented Sep 28, 2024

Thank @domoritz!

Would you also add this to vectors or if not, why only tables?

That absolutely makes sense. My immediate need was for tables for an application that uses plain objects, and I will add it for vectors.

Also, do you agree that toArrayView() should not replace toArray()? Or would the slight behavior change be acceptable if we replace it?

@domoritz
Copy link
Member

Also, do you agree that toArrayView() should not replace toArray()? Or would the slight behavior change be acceptable if we replace it?

Good question. I wonder actually whether we need toArrayView at all. Why not use make the vector/table behave like an array view?

checkArray(table.toArray());
});
describe('table.toArrayView()', () => {
checkArray(table.toArrayView());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we make this work?

Suggested change
checkArray(table.toArrayView());
checkArray(table);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the discussion here. We cannot make Table behave like an array.

js/src/table.ts Outdated
@@ -238,17 +238,30 @@ export class Table<T extends TypeMap = any> {
*
* @returns An Array of Table rows.
*/
public toArray() {
public toArray(): Array<Struct<T>['TValue']> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this improve the type returned by this method or just make the type explicit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just makes it explicit. I added it to know precisely what toArrayView() was supposed to return. And also because I like explicit types 😉

Copy link
Contributor

@trxcllnt trxcllnt Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike in other languages, explicitly annotating return types is somewhat of an anti-pattern in TypeScript. Unless there's a special case where we need to inform the compiler what the real return type is, I'm with @domoritz that we should default to letting the compiler infer the type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, no problem, I removed it in 4b706a9.

@swallez
Copy link
Member Author

swallez commented Sep 28, 2024

Why not use make the vector/table behave like an array view?

Mimicking an array is a bit tricky: it has a lot of methods that are implemented by iterating it (like join, map, find, etc) and we would have to reimplement all of them. Also, Array.isArray() would return false even if we reimplemented all array methods.

This is why toArrayView() creates a proxy to an empty array and passes the table to the proxy handler:

return new Proxy([] as Array<Struct<T>['TValue']>, new TableArrayProxyHandler(this));

That way the proxy handler only has to implement iteration and direct access, and benefits from all other array methods by delegating to the wrapped array.

@domoritz
Copy link
Member

I see. I'll page @trxcllnt to weigh in on this one.

@swallez
Copy link
Member Author

swallez commented Sep 30, 2024

I did some extensive benchmarks to understand memory usage and performance overhead of toArray() vs toArrayView(). Discussing this with some colleagues, it turns out that there are a number of options when exposing a Table as an array what we may want to choose from depending on the usage context.

There are options for both the array and its elements:

  • array: early materialization (like in toArray()), lazy caching (creating and retaining row objects as we iterate), no caching (as in toArrayView())
  • rows/array items: materialization (creating a plain old JS object by copying the row data), lazy caching of properties (what StructRowProxyHandler does), no caching.

All combinations can be valid depending on how the application uses the resulting array-like object.

To account for this, I'd like to add an optional options parameter to toArray() that will allow configuring how we want the collection and rows be handled. Something like:

enum ToArrayBehavior (eager, cache, noCache);

class ToArrayOptions {
  array: ToArrayBehavior,
  rows: ToArrayBehavior
}

function toArray(options? ToArrayOptions) ...

with a default of { array: eager, rows: cache } which is what toArray() does today.

WDYT?

@trxcllnt
Copy link
Contributor

Table itself is already a view... why not just use table.get()?

@trxcllnt
Copy link
Contributor

trxcllnt commented Sep 30, 2024

I'm not sure how I feel about this PR yet, and have a few questions. I'm not saying I disagree with it, just trying to provide some context and discuss general strategies.

Table.toArray() provides the convenience of using Arrow tables as common object arrays. However it does allocate an array of the size of the table that hold proxies to the table rows. This array can consume a significant amount of memory with large tables.

Yeah, this is something we've struggled with since the project's inception. Every "convenience" API we've added is almost always also the least-performant way to interact with the data.

While ergonomic for vanilla JS users, these APIs encourage abuse and lead to performance pitfalls in everyday use. Anybody profiling their code blames Arrow for "being slow," when in reality all we did was make it easy for users to do slow things.

I personally would rather focus efforts on making it easy for users to do the fastest thing (so-called helping users "fall into the pit of success").

To date, this has manifested in three ways. First, we ensure all data movement (e.g. IPC) is zero copy and provide/encourage users to use zero-copy views of the data at rest. Second, we provide O(1) sequential access and O(log(m)) random access (where m is the number of RecordBatches in a Table). And third, we try to provide utilities for caching or deferring computation (MemoizedVector, MapRow/StructRow proxies).

We have Table.toArray() because it symmetrical to the (zero-copy where available) Vector.toArray(), while using the O(n) sequential iterator, without adding much to the API surface area or bundle size.

the proxy adds some overhead to direct array access. Tests on iterator loops show that access via the array proxy takes 5 more time than direct array access.

Part of this performance deficit is definitely due to the O(log(m)) random accesses for rows in the chunked Table, compared to the O(1) access for the Array.

A while back I had a branch that attempted this approach for Table/Vector. The idea was to use Object.setPrototypeOf() to give the Table/Vector all the Array methods (as well as pass the Array.isArray() check), and then use a Proxy to implement subscript-style accesses. I abandoned the branch after the performance tests showed similar results as you've reported here.

In reference to to my point above, even though our implementation could be the fastest way to implement a JS-Array-style API via Proxies, we'll still get blamed for encouraging users to use it.


With that context out of the way, I have some questions.

First, why/how are users (or you) using table.toArray() vs. using table.get(), table[Symbol.iterator](), or operating on the underlying TypedArrays directly?

Is it to get the table into a form where it can be used with the list alebra (map/filter/reduce/etc.)?

If so, have you considered a library like IxJS, or the stage 3 TC39 iterator-helpers proposal? One of the main reasons the Arrow types implement the [Async]Iterator spec is to make it easier for users to interoperate with libraries like these, while avoiding the O(n * log(m)) random access pitfalls, and also avoid re-implementing these libraries in Arrow itself.

Second, aside from the Array algebra, what benefit does an Array-like API bring over the existing table.get() method? For example, are there existing 3rd-party APIs that accept Arrays (and do subscript element-accesses) that you'd like to pass a Table?

Or is there a reason that operating directly on the binary data isn't easy/sufficient? For example, is the following insufficient/too inconvenient?

const times2 = table.data.reduce((results, batch) => {
  const [data] = batch.getChild("floats").data;
  return times2.concat(data.values.map((f, i) => data.isNull(i) ? f * 2 : 0));
}, new Float32Array());

Finally, is there any reason this can't live in application code instead (e.g. you create a Proxy from a table that implements the Array-subscript syntax)? I have done this a number of times until I had time to refactor routines handle tables/vectors more efficiently.


I am not intending to come off as obtuse, and I hope my questions make sense in context. I just worry about adding APIs that encourage users to do something convenient at the expense of performance.

@swallez
Copy link
Member Author

swallez commented Oct 1, 2024

Thanks for the extensive explanation @trxcllnt. I totally agree that providing helpers like toArray() that provide the comfort of "classic" data structures but come with some inevitable overhead cause some "it's slow compared to my objects" reactions, and I actually experienced this. So let me elaborate on my use case.

I work at Elastic where we're in the process of adding support for Arrow. The products currently use a JSON array-of-objects representation to exchange data, and have done so for more than a decade, even if deep inside Elasticsearch there is some column-oriented data processing happening. We recently added support for Arrow on the Elasticsearch server side and are now considering to consume it from Kibana, the UI front-end to manage and display data. There is a lot of existing code in Kibana based on array-of-objects and refactoring it to use Arrow vectors will take some time, and every iteration needs to bring some benefits in some area to foster adoption.

So the idea is to start from the edge (data exchange) where we can have significant gains on payload size, memory usage and eliminate parsing time, and then progressively use it more deeply in the rest of the code base. Hence the use of toArray(). That's the extent of the first iteration, and the overhead of toArray() has to be acceptable to not outweight the benefits of data transfer and zero copy.

Now I agree that toArray() and similar helpers are really a double-edged sword that may fool people into thinking that this is the way to use Arrow, and in the end be deceptive because of their overhead. But they are necessary nevertheless for transition scenarios in large code bases like the one I described.

So what about providing helpers as what they are, helpers and not primary features, and make this clear by moving them to a different location, like a TableConversionHelper or something similar? That way we don't expose Table.toArray() as something that looks like a nominal feature (and would then be deprecated), while still having room to provide useful utilities for transition scenarios, with the proper warnings saying that this isn't "the real Arrow way".

I hope this makes sense.

@domoritz
Copy link
Member

domoritz commented Oct 1, 2024

Thanks for the additional context. I really appreciate the effort you put into preparing this pull request and hope that you send more in the future as you find wants to improve the library. Maybe for this specific feature it makes sense to just leave arrow as is (with the iterators and toArray as a compatibility method with all the limitations it comes with) and you can use your array view code in your application (or a separate package)? I think that adding it to this repo just adds to the confusion people experience. Thoughts? @swallez @trxcllnt

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 1, 2024
@swallez
Copy link
Member Author

swallez commented Oct 2, 2024

Ok, I understand your concerns and I'm closing this PR. I will keep experimenting on this topic in my benchmarks project, and may ping you at some point to see if anything interesting could be contributed here.

That being said, this discussion outlined the fact that toArray() is convenient but misleading. And yet this is the very first example on the library's home page. This should probably be updated so that users aren't given a wrong first impression.

I also opened PR #44289 with only the change for StructRow's proxy optimization. I guess that will be an easier one 😉

@swallez swallez closed this Oct 2, 2024
@domoritz
Copy link
Member

domoritz commented Oct 2, 2024

Good point on the examples in the introduction. We should first show how to iterate instead.

@domoritz
Copy link
Member

domoritz commented Oct 3, 2024

I took a stab at encouraging more effective APIs. How's #44292?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[JS] Use a flywheel for struct row
3 participants