@trxcllnt

Current version of proxy-bench:

Table Iterate "tracks":
length: 1000000
 x 6.88 ops/sec ±0.82% (21 runs sampled)
   avg: 145.4ms
   872.4% of a frame @ 60FPS 

This PR:

Table Iterate "tracks":
length: 1000000
 x 8.36 ops/sec ±1.02% (25 runs sampled)
   avg: 119.58ms
   717.48% of a frame @ 60FPS 
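
The "% of a frame @ 60FPS" figures can be reproduced from the reported averages. The following is a minimal sketch of that arithmetic, not code from proxy-bench; the function name `percent_of_frame` is made up for illustration, and the speedup is derived only from the two ops/sec numbers above.

```python
def percent_of_frame(avg_ms: float, fps: float = 60.0) -> float:
    """Express a duration as a percentage of one frame's time budget."""
    frame_budget_ms = 1000.0 / fps  # about 16.67 ms at 60 FPS
    return avg_ms / frame_budget_ms * 100.0


# Averages reported above for the current version and for this PR.
before = percent_of_frame(145.4)
after = percent_of_frame(119.58)

# Relative throughput change, from the reported ops/sec values.
speedup = 8.36 / 6.88

print(f"before: {before:.2f}% of a frame")
print(f"after:  {after:.2f}% of a frame")
print(f"speedup: {speedup:.3f}x")
```

By this calculation the PR iterates the table roughly 1.2x faster, though a full pass still spans several frames.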

TheNeuralBit pushed a commit that referenced this pull request Mar 21, 2019
I'm sure I'll need some guidance on this one from @sunchao or @liurenjie1024, but I am keen to get Parquet support added for primitive types so that I can actually use DataFusion and Arrow in production at some point.

Author: Andy Grove <andygrove73@gmail.com>
Author: Neville Dipale <nevilledips@gmail.com>
Author: Andy Grove <andygrove@users.noreply.github.com>

Closes apache#3851 from andygrove/ARROW-4466 and squashes the following commits:

3158529 <Andy Grove> add test for reading small batches
549c829 <Andy Grove> Remove hard-coded batch size, fix nits
8d2df06 <Andy Grove> move schema projection function from arrow into datafusion
204db83 <Andy Grove> fix timestamp nano issue
73aa934 <Andy Grove> Remove println from test
25d34ac <Andy Grove> Make INT32/64/96 handling consistent with C++ implementation
9b1308f <Andy Grove> clean up handling of INT96 and DATE/TIME/TIMESTAMP types in schema converter
1ec815b <Andy Grove> Clean up imports
023dc25 <Andy Grove> Merge pull request #2 from nevi-me/ARROW-4466
02b2ed3 <Neville Dipale> fix int96 conversion to read timestamps correctly
2aeea24 <Andy Grove> remove println from tests
9d3047a <Andy Grove> code cleanup
639e13e <Andy Grove> null handling for int96
1503855 <Andy Grove> handle nulls for binary data
80cf303 <Andy Grove> add date support
5a3368c <Andy Grove> Remove unnecessary slice, fix null handling
306d07a <Neville Dipale> fmt
3c711a5 <Neville Dipale> immediately allocate vec
e6cbbaa <Neville Dipale> replace read_column! macro with generic
607a29f <Andy Grove> return result if there are null values
e8aa784 <Andy Grove> revert temp debug change to error messages
6457c36 <Andy Grove> use parquet::reader::schema::parquet_to_arrow_schema
c56510e <Andy Grove> projection takes slice instead of vec
7e1a98f <Andy Grove> remove println and unwrap
dddb7d7 <Andy Grove> update to use partition-aware changes from master
157512e <Andy Grove> Remove invalid TODO comment
debb2fb <Andy Grove> code cleanup
6c3b7e2 <Andy Grove> add support for all primitive parquet types
b4981ed <Andy Grove> implement more parquet column types and tests
5ce3086 <Andy Grove> revert to columnar reads
c3f71d7 <Andy Grove> add integration test
aea9f8a <Andy Grove> convert to use row iter
f46e6f7 <Andy Grove> save
eaddafb <Andy Grove> save
322fc87 <Andy Grove> add test for reading strings from parquet
3a412b1 <Andy Grove> first parquet test passes
ff3e5b7 <Andy Grove> test
10710a2 <Andy Grove> Parquet datasource
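
For context on the INT96 commits listed above, here is a minimal sketch of how a Parquet INT96 timestamp is commonly decoded. It is not the Rust code from this PR. It assumes the widely used layout of 8 little-endian bytes holding nanoseconds within the day, followed by 4 little-endian bytes holding the Julian day number; the function name `int96_to_unix_nanos` is hypothetical.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 1_000_000_000


def int96_to_unix_nanos(raw: bytes) -> int:
    """Convert a 12-byte INT96 timestamp to nanoseconds since the Unix epoch."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # "<qi": little-endian int64 (nanoseconds of day), then int32 (Julian day).
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day


# One second past midnight on the day after the Unix epoch.
example = struct.pack("<qi", 1_000_000_000, JULIAN_DAY_OF_UNIX_EPOCH + 1)
print(int96_to_unix_nanos(example))
```
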
TheNeuralBit pushed a commit that referenced this pull request Jan 22, 2020
This updates the language in `install_arrow()` to follow the README revision that will land in https://github.com/apache/arrow/pull/4948/files#diff-563b2cb2c8c2d51b2ff6b177e2d84286R33.

The [Jira ticket](https://issues.apache.org/jira/browse/ARROW-6142) requested three things; this is `#2` in the list. On `#1`, I defer to the C++ installation docs, which are already included in the install_arrow message, rather than duplicating content here. `#3` is out of scope.

Closes apache#5027 from nealrichardson/no-ppa and squashes the following commits:

80b142e <Neal Richardson> s/arrow/Arrow/
44c9659 <Neal Richardson> Tweak language again
36cfe28 <Neal Richardson> Further linux install revisions
79bd7e0 <Neal Richardson> One more PPurge
63f75bd <Neal Richardson> Revise install_arrow instructions for Linux

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
TheNeuralBit pushed a commit that referenced this pull request May 12, 2025
…n timezone (apache#45051)

### Rationale for this change

If the timezone database is present on the system but does not contain a timezone referenced in an ORC file, the ORC reader will crash with an uncaught C++ exception.

This can happen, for example, on Ubuntu 24.04, where some timezone aliases have been moved from the main `tzdata` package to a `tzdata-legacy` package. If `tzdata-legacy` is not installed, trying to read an ORC file that references e.g. the "US/Pacific" timezone would crash.
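
A quick way to see whether a given alias is likely to be missing is to probe it from Python. This is a rough sketch, not part of this PR: the helper `timezone_available` is hypothetical, and Python's `zoneinfo` module may consult different locations than the ORC C++ library (it also falls back to the `tzdata` PyPI package when that is installed), so a positive result here does not guarantee that the ORC reader will find the zone.

```python
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def timezone_available(name: str) -> bool:
    """Return True if Python's zoneinfo can load the named timezone."""
    try:
        ZoneInfo(name)
    except (ZoneInfoNotFoundError, ValueError):
        # Not found on the search path, or the key is malformed.
        return False
    return True


# "US/Pacific" is the legacy alias mentioned above; "America/Los_Angeles"
# is the corresponding canonical name.
for tz in ("US/Pacific", "America/Los_Angeles"):
    status = "available" if timezone_available(tz) else "missing"
    print(f"{tz}: {status}")
```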

Here is a backtrace excerpt:
```
#12 0x00007f1a3ce23a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f1a3ce39391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f1a3f4accc4 in orc::loadTZDB(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#15 0x00007f1a3f4ad392 in std::call_once<orc::LazyTimezone::getImpl() const::{lambda()#1}>(std::once_flag&, orc::LazyTimezone::getImpl() const::{lambda()#1}&&)::{lambda()#2}::_FUN() () from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#16 0x00007f1a4298bec3 in __pthread_once_slow (once_control=0xa5ca7c8, init_routine=0x7f1a3ce69420 <__once_proxy>) at ./nptl/pthread_once.c:116
#17 0x00007f1a3f4a9ad0 in orc::LazyTimezone::getEpoch() const ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#18 0x00007f1a3f4e76b1 in orc::TimestampColumnReader::TimestampColumnReader(orc::Type const&, orc::StripeStreams&, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#19 0x00007f1a3f4e84ad in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#20 0x00007f1a3f4e8dd7 in orc::StructColumnReader::StructColumnReader(orc::Type const&, orc::StripeStreams&, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#21 0x00007f1a3f4e8532 in orc::buildReader(orc::Type const&, orc::StripeStreams&, bool, bool, bool) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#22 0x00007f1a3f4925e9 in orc::RowReaderImpl::startNextStripe() ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#23 0x00007f1a3f492c9d in orc::RowReaderImpl::next(orc::ColumnVectorBatch&) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
#24 0x00007f1a3e6b251f in arrow::adapters::orc::ORCFileReader::Impl::ReadBatch(orc::RowReaderOptions const&, std::shared_ptr<arrow::Schema> const&, long) ()
   from /tmp/arrow-HEAD.ArqTs/venv-wheel-3.12-manylinux_2_17_x86_64.manylinux2014_x86_64/lib/python3.12/site-packages/pyarrow/libarrow.so.1900
```

### What changes are included in this PR?

Catch C++ exceptions when iterating ORC batches instead of letting them slip through.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* GitHub Issue: apache#40633

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>