Currently we don't have the option to load just a subset of the columns.
This matters, e.g., when decompression is the bottleneck.
For example, create a compressed Arrow file:
```julia
using Arrow
p = tempname();
N = 1000000
tbl = (
a=rand(N),
b=rand(N),
c=rand(N),
d=rand(N),
e=rand(N),
f=[rand(rand(0:100)) for _ in 1:N],
);
Arrow.write(p, tbl; compress=:zstd);
```
Column `f` is by far the largest: it has an expected 50*N elements vs. N for the rest.
Sometimes we only care about some of the other columns. Currently we must
decompress all columns regardless:
```julia
using BenchmarkTools
@btime tbl = Arrow.Table(p); # 359.205 ms (530 allocations: 794.23 MiB)
```
With this commit we can load only some of the columns:
```julia
@btime tbl = Arrow.Table(p; filtercolumns=["a"]); # 6.146 ms (231 allocations: 14.33 MiB)
```
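For completeness, a minimal end-to-end sketch of the proposed behavior, assuming the `filtercolumns` keyword from this PR (its name and accepted values may still change before merge): write a small compressed file, then read back a single column, so only that column's buffers are decompressed.

```julia
using Arrow, Tables

# Write a small zstd-compressed file with two columns.
path = tempname()
Arrow.write(path, (a = rand(10), b = rand(10)); compress = :zstd)

# Read back only column :a, using the `filtercolumns` keyword
# proposed in this PR (not available in released Arrow.jl).
tbl = Arrow.Table(path; filtercolumns = ["a"])

# The resulting table should expose only the requested column.
Tables.columnnames(tbl)
```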
Converting this to draft as I'm working on something that will supersede this.
Codecov Report
❌ Patch coverage is

Additional details and impacted files:
```
@@            Coverage Diff             @@
##             main     #412      +/-   ##
==========================================
- Coverage   87.45%   85.78%    -1.67%
==========================================
  Files          26       26
  Lines        3283     3356       +73
==========================================
+ Hits         2871     2879        +8
- Misses        412      477       +65
```
Does anyone wanna re-run CI? Looks like macOS got stuck.
Done.
Hi, what's the status of this PR? Would love to see what I can do @JoaoAparicio
This would be a very important feature for us, too.
for the API,
We need to rebase on main to proceed with this.