js-cpp-refactor #1
… patterns to fix ci
@kou something is failing here, I'm not sure what's different here vs. Travis CI (not sure if this is what's failing the doc build):
```
DOC Building HTML
../arrow-glib-docs.xml:25: warning: failed to load external entity "../xml/gtkdocentities.ent"
%gtkdocentities;
^
Entity: line 1:
%gtkdocentities;
^
../arrow-glib-docs.xml:29: parser error : Entity 'package_name' not defined
<title>&package_name; Reference Manual</title>
^
../arrow-glib-docs.xml:31: parser error : Entity 'package_string' not defined
for &package_string;.
^
warning: failed to load external entity "../xml/basic-array.xml"
../arrow-glib-docs.xml:43: element include: XInclude error : could not load ../xml/basic-array.xml, and no fallback was found
warning: failed to load external entity "../xml/composite-array.xml"
../arrow-glib-docs.xml:44: element include: XInclude error : could not load ../xml/composite-array.xml, and no fallback was found
../xml/array-builder.xml:25: warning: failed to load external entity "../xml/xml/gtkdocentities.ent"
%gtkdocentities;
```
Author: Wes McKinney <wes.mckinney@twosigma.com>
Author: Kouhei Sutou <kou@clear-code.com>
Closes apache#1472 from wesm/fix-gen-apidocs and squashes the following commits:
5b907ac [Wes McKinney] Add explicit instructions for uploading API docs to website
5734a65 [Wes McKinney] Use JDK7 for Java on Ubuntu 16.04
0fcb1e1 [Wes McKinney] Use gcc 4.9 rather than default gcc because of gcc5 ABI issues
dbf8be8 [Kouhei Sutou] Disable auto-reconfigure
b1b5050 [Kouhei Sutou] Fix GLib doc build
8b2d7e4 [Wes McKinney] Fixes for glib doc build
9da9e14 [Wes McKinney] Add BOOST_ROOT
…is alive before enqueue new record when download file. use pyarrow download file will raise queue.Full exceptions sometimes. jira: https://issues.apache.org/jira/browse/ARROW-2002 Author: kmiku7 <kakoimiku@gmail.com> Closes apache#1485 from kmiku7/master and squashes the following commits: 8d5f905 [kmiku7] fix queue.FULL exception when writer thread write data slowly. 722182b [kmiku7] Merge pull request #1 from apache/master
…ET_HOME Author: Korn, Uwe <Uwe.Korn@blue-yonder.com> Closes apache#1477 from xhochy/ARROW-1856 and squashes the following commits: a34ade4 [Korn, Uwe] ARROW-1856: [Python] Auto-detect Parquet ABI version when using PARQUET_HOME
…e, add Reserve method I also relaxed the requirement to pass `const uint8_t*` so that one can pass `const void*` when writing to a `BufferBuilder`. This will not affect any downstream users Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#1486 from wesm/ARROW-2004 and squashes the following commits: 2d6660a [Wes McKinney] Add shrink_to_fit parameter to BufferBuilder::Resize, add Reserve method, relax pointer type in Append
…e/ directory, or is the full path to directory with libjvm Some users ran into a rough edge where they had a non-standard JRE directory (possibly related to some recent changes by Oracle in their JDK installer) Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#1487 from wesm/ARROW-1966 and squashes the following commits: 7e14923 [Wes McKinney] Add note to API documentation about JAVA_HOME f77b31e [Wes McKinney] Accommodate a JAVA_HOME containing the jre/ directory, or an absolute path to directory containing libjvm
…tor-merge_with-table-scan-perf
…-merge_with-table-scan-perf
Wrong base?
@wesm this is the merged version of @TheNeuralBit's branch + my
OK, I was confused why there's ~1100 commits in the PR, something about the way the PR is set up on GitHub
cb6a473 to a993b3b
a993b3b to 614b688
…XXX in plasma protocol. Related to apache#878, add DCHECK for ReadXXX. Author: Yeolar <yeolar@gmail.com> Closes apache#887 from Yeolar/fixtypo_plasma_and_add_DCHECK and squashes the following commits: 4df63bc [Yeolar] clang-format for too long lines. 143d254 [Yeolar] Update, compile passed. 09ff103 [Yeolar] Fix conflicts. b951d8d [Yeolar] Merge pull request #1 from apache/master ebae611 [Yeolar] Fix typo in plasma protocol; add DCHECK for ReadXXX in plasma protocol.
…ties As per apache#872 I am upgrading Jackson to the latest version on the current train (2.7.1 --> 2.7.9) Author: Matt Darwin <(none)> Author: Matt <mattdarwin@yahoo.co.uk> Closes apache#929 from mattdarwin/ARROW-1242-upgrade-jackson and squashes the following commits: d059517 [Matt Darwin] 1242 upgraing jackson to 2.7.9 bc3b6a0 [Matt] Merge pull request #1 from apache/master
NB this commit excludes Jackson and logback upgrades, since they are dealt with in 871 and 872 Author: Matt Darwin <(none)> Author: Matt Darwin <matt.darwin@oracle.com> Author: Matt <mattdarwin@yahoo.co.uk> Closes apache#873 from mattdarwin/upgrade-libs and squashes the following commits: 9b51f46 [Matt Darwin] Merge branch 'master' into upgrade-libs 284a4ce [Matt Darwin] Merge branch 'master' of https://github.com/apache/arrow 79550b1 [Matt Darwin] rolling back lilith to 0.9.44 since 8 doesn't support java 7 c63eef6 [Matt Darwin] Merge branch 'master' into upgrade-libs bc3b6a0 [Matt] Merge pull request #1 from apache/master 8599ba0 [Matt Darwin] backing out guava upgrade 80d81e6 [Matt Darwin] downgrading guava to 20 for java 7 compatibility 806f348 [Matt Darwin] Merge branch 'master' into upgrade-libs 8aafb7e [Matt Darwin] correcting indentation in BaseValueVector 94c1469 [Matt Darwin] upgrading netty to 4.0.49 cff5596 [Matt Darwin] reverting to netty 4.0.41.Final 568737d [Matt Darwin] switching to Collections from Guava for empty iterator c194e48 [Matt Darwin] upgraded hppc to 0.7.2 38be468 [Matt Darwin] upgrading libs except jackson and logback
…(take 2) sorry, this was still not fixed properly. logback version is separately specified in 2 places. Fixed properly this time. Author: Matt Darwin <(none)> Author: Matt <mattdarwin@yahoo.co.uk> Closes apache#960 from mattdarwin/ARROW-1240-upgrade-logback and squashes the following commits: 3492f66 [Matt Darwin] upgrading logback in tools/pom.xml 206b48d [Matt Darwin] Merge branch 'master' into ARROW-1240-upgrade-logback 284a4ce [Matt Darwin] Merge branch 'master' of https://github.com/apache/arrow bc3b6a0 [Matt] Merge pull request #1 from apache/master 3e2f676 [Matt Darwin] Merge branch 'master' into ARROW-1240-upgrade-logback caed163 [Matt Darwin] upgrading slf4j to 1.7.25
…ties (take 2) sorry, PR apache#929 failed to actually change the Jackson version, since the `jackson.version` variable defined in java/pom.xml is not used in java/vector/pom.xml That's now fixed in this PR. Author: Matt Darwin <(none)> Author: Matt <mattdarwin@yahoo.co.uk> Closes apache#957 from mattdarwin/ARROW-1242-upgrade-jackson and squashes the following commits: ad15e5f [Matt Darwin] Merge branch 'master' into ARROW-1242-upgrade-jackson ee29d65 [Matt Darwin] Merge branch 'master' of https://github.com/apache/arrow into ARROW-1242-upgrade-jackson 06d7745 [Matt Darwin] upgrading jackson to 2.7.9 PROPERLY this time... 284a4ce [Matt Darwin] Merge branch 'master' of https://github.com/apache/arrow d059517 [Matt Darwin] 1242 upgraing jackson to 2.7.9 bc3b6a0 [Matt] Merge pull request #1 from apache/master
The master branch of our fork hadn't been updated in quite a while. This should make more sense now. @trxcllnt My open PR is for ccri/table-scan-perf, shouldn't we merge into that rather than ccri/master?
js/src/vector.ts
Outdated
    this.indicies = view.indicies;
    this.dictionary = view.dictionary;
} else if (view instanceof ChunkedView) {
    this.dictionary = (view.chunks[0] as DictionaryVector<T>).dictionary;
Is this making the assumption that the dictionary is the same throughout a chunked vector (i.e. no deltas)?
@TheNeuralBit Ah you're right, I think this is assuming the dictionary of the first chunk is the dictionary with the deltas, but that's not necessarily true. Should probably change this to read the dictionary of the last chunk in the chunked data vectors.
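To make the point about deltas concrete, here is a minimal sketch, assuming that delta dictionary batches only append entries, so the most recently read chunk carries the most complete dictionary. `DictionaryChunk`, `resolveDictionary`, and `getValue` are hypothetical names for illustration and are not the classes in `js/src/vector.ts`.

```typescript
// Hypothetical, simplified shape; not the actual Arrow JS DictionaryVector.
interface DictionaryChunk<T> {
    indices: number[];   // positions into `dictionary`
    dictionary: T[];     // dictionary values known when this chunk was read
}

// Assumption: deltas only append, so the last chunk's dictionary is a
// superset of every earlier chunk's dictionary.
function resolveDictionary<T>(chunks: DictionaryChunk<T>[]): T[] {
    const last = chunks[chunks.length - 1];
    return last ? last.dictionary : [];
}

// Look up element `i` of chunk `chunkIndex` through the resolved dictionary.
function getValue<T>(chunks: DictionaryChunk<T>[], chunkIndex: number, i: number): T | undefined {
    const chunk = chunks[chunkIndex];
    if (!chunk) { return undefined; }
    const index = chunk.indices[i];
    return index === undefined ? undefined : resolveDictionary(chunks)[index];
}
```

Reading from `chunks[0]` instead would miss any entry that only arrived with a later delta, which is the failure mode raised in the question above.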
export class DictionaryVector<T extends DataType = DataType> extends Vector<Dictionary<T>> {
    // @ts-ignore
    public readonly indicies: Vector<Int>;
Would you be opposed to renaming this indices?
I am not -- I only called them that here because that's what they're called in arrow-cpp, but not married to the name
js/src/type.ts
Outdated
export type FlatListType = Utf8 | Binary; // <-- these types have `offset`, `data`, and `validity` buffers
export type FlatType = Bool | PrimitiveType | FlatListType; // <-- these types have `data` and `validity` buffers
export type ListType = List<any> | FixedSizeList<any>; // <-- these types have `offset` and `validity` buffers
It's not true that FixedSizeList has an offset buffer, which is causing some read errors for me. I'm looking into fixing this myself to try to get familiar with the changes.
Ah you're right, looks like that's an oversight. I moved some things around after defining these types the first time, so comments are probably out of date.
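For context, a rough sketch of why a reader should not expect an offset buffer for FixedSizeList: in the Arrow columnar layout a variable-size List locates each slot through its offsets, while a FixedSizeList slot can be computed from the list size alone. The helper names below (`listSlot`, `fixedSizeListSlot`, `sliceValues`) are hypothetical and are not part of the Arrow JS API.

```typescript
// Hypothetical helpers for illustration only.

// Variable-size List: slot i spans [offsets[i], offsets[i + 1]) in the child values.
function listSlot(offsets: Int32Array, i: number): [number, number] {
    return [offsets[i], offsets[i + 1]];
}

// FixedSizeList: no offsets are stored; slot i is derived from the list size.
function fixedSizeListSlot(listSize: number, i: number): [number, number] {
    return [i * listSize, (i + 1) * listSize];
}

// Materialize one slot from the flattened child values.
function sliceValues<T>(values: T[], range: [number, number]): T[] {
    return values.slice(range[0], range[1]);
}

const childValues = [1, 2, 3, 4, 5, 6];
// List with three slots of lengths 2, 0, and 4.
const listOffsets = Int32Array.of(0, 2, 2, 6);
const thirdListSlot = sliceValues(childValues, listSlot(listOffsets, 2));
// FixedSizeList with listSize = 3, so two slots of three values each.
const secondFixedSlot = sliceValues(childValues, fixedSizeListSlot(3, 1));
```

A reader that tries to consume an OFFSET buffer for a FixedSizeList would misalign every buffer that follows, which is consistent with the read errors described above.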
Don't read OFFSET vector for FixedSizeList
…etChildAt(i) method in ChunkedView
Table from struct
… similar to other similarly named fields
Fix exception for empty Table
…lue data
Modified BinaryBuilder::Resize(int64_t) so that when building BinaryArrays with a known size, space is also reserved for value_data_builder_ to prevent internal reallocation.
Author: Panchen Xue <pan.panchen.xue@gmail.com>
Closes apache#1481 from xuepanchen/master and squashes the following commits:
707b67b [Panchen Xue] ARROW-1712: [C++] Fix lint errors
360e601 [Panchen Xue] Merge branch 'master' of https://github.com/xuepanchen/arrow
d4bbd15 [Panchen Xue] ARROW-1712: [C++] Modify test case for BinaryBuilder::ReserveData() and change arguments for offsets_builder_.Resize()
77f8f3c [Panchen Xue] Merge pull request apache#5 from apache/master
bc5db7d [Panchen Xue] ARROW-1712: [C++] Remove unneeded data member in BinaryBuilder and modify test case
5a5b70e [Panchen Xue] Merge pull request #4 from apache/master
8e4c892 [Panchen Xue] Merge pull request #3 from xuepanchen/xuepanchen-arrow-1712
d3c8202 [Panchen Xue] ARROW-1945: [C++] Fix a small typo
0b07895 [Panchen Xue] ARROW-1945: [C++] Add data_capacity_ to track capacity of value data
18f90fb [Panchen Xue] ARROW-1945: [C++] Add data_capacity_ to track capacity of value data
bbc6527 [Panchen Xue] ARROW-1945: [C++] Update test case for BinaryBuild data value space reservation
15e045c [Panchen Xue] Add test case for array-test.cc
5a5593e [Panchen Xue] Update again ReserveData(int64_t) method for BinaryBuilder
9b5e805 [Panchen Xue] Update ReserveData(int64_t) method signature for BinaryBuilder
8dd5eaa [Panchen Xue] Update builder.cc
b002e0b [Panchen Xue] Remove override keyword from ReserveData(int64_t) method for BinaryBuilder
de318f4 [Panchen Xue] Implement ReserveData(int64_t) method for BinaryBuilder
e0434e6 [Panchen Xue] Add ReserveData(int64_t) and value_data_capacity() for methods for BinaryBuilder
5ebfb32 [Panchen Xue] Add capacity() method for TypedBufferBuilder
5b73c1c [Panchen Xue] Update again BinaryBuilder::Resize(int64_t capacity) in builder.cc
d021c54 [Panchen Xue] Merge pull request #2 from xuepanchen/xuepanchen-arrow-1712
232024e [Panchen Xue] Update BinaryBuilder::Resize(int64_t capacity) in builder.cc
c2f8dc4 [Panchen Xue] Merge pull request #1 from apache/master
This PR moves the `Table` class out of the Vector hierarchy and adds optimized dataframe operations to it. Currently implements an optimized `scan()` method, `filter(predicate)`, `count()`, and `countBy(column_name)` (only works on dictionary-encoded columns).
Some usage examples, based on the file generated by `js/test/data/tables/generate.py`:
``` js
> let table = Table.from(...);
> table.count()
1000000
> table.filter(col('lat').gteq(0)).count()
499718
> table.countBy('origin').toJSON()
{ Charlottesville: 166839,
'New York': 166251,
'San Francisco': 166642,
Seattle: 166659,
'Terre Haute': 166756,
'Washington, DC': 166853 }
> table.filter(col('lng').gteq(0)).countBy('origin').toJSON()
{ Charlottesville: 83109,
'New York': 83221,
'San Francisco': 83515,
Seattle: 83362,
'Terre Haute': 83314,
'Washington, DC': 83479 }
```
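As a rough illustration of how operations like these can fit together, the sketch below builds a `col(name).gteq(value)` style predicate and tallies a dictionary-encoded column per dictionary index. It is a simplified, row-oriented stand-in under stated assumptions, not the implementation in this PR: `Row`, `Predicate`, `DictionaryColumn`, and the free functions `count` and `countBy` are hypothetical.

```typescript
// Hypothetical stand-ins for illustration; not the Table API added by this PR.
type Row = { [column: string]: number | string };
type Predicate = (row: Row) => boolean;

// Build predicates in the style of col('lat').gteq(0).
function col(name: string) {
    return {
        gteq: (value: number): Predicate => (row) => (row[name] as number) >= value,
        eq: (value: number | string): Predicate => (row) => row[name] === value,
    };
}

// Count the rows that satisfy a predicate (all rows when none is given).
function count(rows: Row[], predicate: Predicate = () => true): number {
    let matched = 0;
    for (const row of rows) {
        if (predicate(row)) { matched++; }
    }
    return matched;
}

// A dictionary-encoded column stores small integer indices into a value list.
interface DictionaryColumn {
    indices: number[];
    dictionary: string[];
}

// Tally per dictionary index first, then label the tallies with the values,
// so no string comparison happens inside the scan loop.
function countBy(column: DictionaryColumn): { [value: string]: number } {
    const counts = new Uint32Array(column.dictionary.length);
    for (const index of column.indices) {
        counts[index]++;
    }
    const result: { [value: string]: number } = {};
    column.dictionary.forEach((value, k) => { result[value] = counts[k]; });
    return result;
}

const sampleRows: Row[] = [
    { lat: -1.5, origin: "Seattle" },
    { lat: 2.0, origin: "New York" },
    { lat: 0, origin: "Seattle" },
];
const northern = count(sampleRows, col("lat").gteq(0));
const byOrigin = countBy({ indices: [0, 1, 1, 0, 1], dictionary: ["Seattle", "New York"] });
```

Counting by index is one plausible reason a `countBy` of this kind would be limited to dictionary-encoded columns: the dictionary bounds the number of distinct keys up front.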
There are performance tests for the dataframe operations. To run them, you must first generate the test data by running `npm run create:perfdata`.
The PR also includes @trxcllnt's refactor of the JS implementation to make it more closely resemble the C++ implementation. This refactor resolves multiple JIRAs: ARROW-1903, ARROW-1898, ARROW-1502, ARROW-1952 (partially), and ARROW-1985.
Author: Paul Taylor <paul.e.taylor@me.com>
Author: Brian Hulette <brian.hulette@ccri.com>
Author: Brian Hulette <hulettbh@gmail.com>
Closes apache#1482 from TheNeuralBit/table-scan-perf and squashes the following commits:
52f1e0e [Brian Hulette] <, > are not commutative, misc cleanup
04b1838 [Brian Hulette] even more table tests
16b9ccb [Brian Hulette] Merge pull request #4 from trxcllnt/js-cpp-refactor
fe300df [Paul Taylor] fix closure es5/umd toString() iterator
3d5240a [Paul Taylor] fix more externs
10c48ad [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
dbe7f81 [Brian Hulette] Add more Table unit tests
1910962 [Brian Hulette] Add optional bind callback to scan
5bdf17f [Brian Hulette] Fix perf
8cf2473 [Brian Hulette] Merge remote-tracking branch 'origin/master' into table-scan-perf
4a41b18 [Paul Taylor] add src/predicate to the list of exports we should save from uglify
5a91fab [Paul Taylor] add more view, predicate externs
f6adfb3 [Brian Hulette] Create predicate namespace
f7bb0ed [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
e148ee4 [Paul Taylor] Merge branch 'extern-woes' into js-cpp-refactor
25cdc4a [Paul Taylor] add src/predicate to the list of exports we should save from uglify
dc7c728 [Paul Taylor] add more view, predicate externs
25e6af7 [Brian Hulette] Create predicate namespace
579ab1f [Brian Hulette] Merge pull request #2 from trxcllnt/js-cpp-refactor
f3cde1a [Paul Taylor] fix lint
9769773 [Paul Taylor] fix vector perf tests
016ba78 [Brian Hulette] Merge pull request #1 from trxcllnt/js-cpp-refactor
272d293 [Paul Taylor] Merge pull request #4 from ccri/empty-table
7bc7363 [Brian Hulette] Fix exception for empty Table
8ddce0a [Paul Taylor] check bounds in getChildAt(i) to avoid NPEs
f1dead0 [Paul Taylor] compute chunked nested childData list correctly
18807c6 [Paul Taylor] rename ChunkData's fields so it's more clear they're not semantically similar to other similarly named fields
7e43b78 [Paul Taylor] add test:integration npm script
a5f200f [Paul Taylor] Merge pull request #3 from ccri/table-from-struct
c8cd286 [Brian Hulette] Add Table.fromStruct
a00415e [Brian Hulette] Fix perf
54d4f5b [Paul Taylor] lazily allocate table and recordbatch columns, support NestedView's getChildAt(i) method in ChunkedView
40b3638 [Paul Taylor] run integration tests with local data for coverage stats
fe31ee0 [Paul Taylor] slice the flat data values before returning an iterator of them
e537789 [Paul Taylor] make it easier to run all integration tests from local data
c0fd2f9 [Paul Taylor] use the dictionary of the last chunked vector list for chunked dictionary vectors
e33c068 [Paul Taylor] Merge pull request #2 from ccri/fixed-size-list
5bb63af [Brian Hulette] Don't read OFFSET vector for FixedSizeList
614b688 [Paul Taylor] add asEpochMs to date and timestamp vectors
87334a5 [Paul Taylor] Merge branch 'table-scan-perf' of github.com:ccri/arrow into js-cpp-refactor
b7f5bfb [Paul Taylor] rename numRows to length, add table.getColumn()
e81082f [Paul Taylor] export vector views, allow cloning data as another type
700a47c [Paul Taylor] export visitors
e859e13 [Paul Taylor] fix package.json bin entry
0620cfd [Brian Hulette] use Math.fround
0126dc4 [Brian Hulette] Don't recompute total length
e761eee [Brian Hulette] Rename asJSON to toJSON
6c91ed4 [Paul Taylor] Merge branch 'master' of github.com:apache/arrow into js-cpp-refactor-merge_with-table-scan-perf
d2b18d5 [Paul Taylor] Merge remote-tracking branch 'ccri/table-scan-perf' into js-cpp-refactor-merge_with-table-scan-perf
f3f3b86 [Paul Taylor] rename table.ts to recordbatch.ts in preparation for merging latest
e3f629d [Paul Taylor] fix rest of the mangling issues
fa7c17a [Paul Taylor] passing all tests except es5 umd mangler ones
e20decd [Brian Hulette] Add license headers
edcbdbe [Brian Hulette] cleanup
20717d5 [Brian Hulette] Fixed countBy(string)
7244887 [Brian Hulette] Add table unit tests...
6719147 [Brian Hulette] Add DataFrame.countBy operation
2f4a349 [Brian Hulette] Minor tweaks
2e118ab [Brian Hulette] linter
a788db3 [Brian Hulette] Cleanup
a9fff89 [Brian Hulette] Move Table out of the Vector hierarchy
1d60aa1 [Brian Hulette] Moved DataFrame ops to Table. DataFrame is now an interface
e8979ba [Brian Hulette] Refactor DataFrame to extend Vector<StructRow>
6a41d68 [Brian Hulette] clean up table benchmarks
2744c63 [Brian Hulette] Remove Chunked/Simple DataFrame distinction
aa999f8 [Brian Hulette] Add DictionaryVector optimization for equals predicate
4d9e8c0 [Brian Hulette] Add concept of predicates for filtering dataframes
796f45d [Brian Hulette] add DataFrame filter and count ops
30f0330 [Brian Hulette] Add basic DataFrame impl ...
a1edac2 [Brian Hulette] Add perf tests for table scans
d18d915 [Paul Taylor] fix struct and map rows
61dc699 [Paul Taylor] WIP -- refactor types to closer match arrow-cpp
62db338 [Paul Taylor] update dependencies and add es6+ umd targets to jest transform ignore patterns to fix ci
6ff18e9 [Paul Taylor] ship es2015 commonJS in main package to avoid confusion
74e828a [Paul Taylor] fix typings issues (ARROW-1903)