Introduction
This ticket is a weekly summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please feel free to leave comments on this ticket about things that I may have missed or that you think should get wider attention from the community.
Loosely inspired by https://this-week-in-rust.org/
DataFusion Related Blogs
- Not sure
Upcoming Releases
- Release DataFusion 43.0.0 #12470 (thanks @andygrove)
- Release sqlparser-rs version 0.52.0 datafusion-sqlparser-rs#1423 (huge kudos to @iffyio for all the reviews)
Recent Releases
Highlights from last week(s):
(I am sorry if I missed you -- please add a note to this ticket with anything you would like to highlight)
FFI Bindings
- @timsaucer added FFI initial implementation #12920 (FFI --> stable ABI for Table Providers) and a killer new example Example: FFI Table Provider as dynamic module loading #13183
LogicalTypes are coming!
- @notfilippo and @findepi have merged the first phase of logical types: feat(logical-types): add NativeType and LogicalType #12853
Performance Highlights
- @jayzhan @Dandandan @berkaysynnada and @2010YOUY01 improved repartition performance on multicore systems: Round robin polling between tied winners in sort preserving merge #13133
- @Rachelint @jayzhan211 @2010YOUY01 and @Dandandan found another 10% performance improvement in many multi-column aggregate queries: Support vectorized append and compare for multi group by #12996
- Enable reading StringViewArray by default from Parquet (8% improvement for entire ClickBench suite) #13101 (finally!)
Others
- @goldmedal started using the new documentation API: Introduce INFORMATION_SCHEMA.ROUTINES table #13255
- The work for hardening Substrait continues with @akoshchiy, @vbarua, @Blizzara, @LatrecheYasser, and @bvolpato authoring several PRs (more, more, more)
- @Omega359 and @jonathanc-n almost wrapped up the new function documentation work: docs: switch completely to generated docs for scalar and aggregate functions #13161
- @findepi has been on a tear cleaning up with PR after PR after PR
- @jonahgao is nearing the final stages of support for the EXECUTE statement: feat: support logical plan for EXECUTE statement #13194
- @ngli-me started fixing a long-standing rough edge with sort computations: Convert LexOrdering type to struct. #13146
- @eejbyfeldt continues bashing away at bugs / things that prevent a complete TPC-DS run such as this and this
- @LeslieKid added additional aggregate fuzzing test support: feat: Add Time/Interval/Decimal/Utf8View in aggregate fuzz testing #13226
- Thanks to @mnorfolk03 for fix: CSV Infer Schema now properly supports escaped characters. #13214
Major Projects / Discussions under way
- [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821 -- show the world what you can do with focused engineering effort. Thanks to the epic work of @Rachelint, @goldmedal, @jayzhan211, @Dandandan, @XiangpengHao and others.
- Adaptive Parquet Predicate Pushdown Evaluation arrow-rs#5523 - @XiangpengHao and @tustvold are working to make parquet even better
- [DISCUSS] Document criteria for adding new features / what belongs in core DataFusion (e.g. sql syntax, functions, etc) #12357
- Helping make DataFusion more visible: Enhancing DataFusion's Community Engagement and Visibility #13049 @SamSynnada
Looking to get more involved? Try code review!
DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.
We have docs about reviews. TLDR: check for test coverage, whether the change is understandable and well documented, and whether the code can be improved. When you think the PR looks good to merge, try @-mentioning one of the committers.
Help wanted
Please feel free to leave your own comments on this ticket if you are looking for help.
Community
- Weekly Call
- Slack/Discord: info links
Upcoming meetups:
- 2024 Dec 18 Chicago: https://lu.ma/eq5myc5i @adriangb @timsaucer
- TBD: DISCUSSION: January 2025 DataFusion Meetup in Amsterdam / CIDR 2025 #12988
- 2025 Jan 15 Boston
Background:
Previous update:
Andrew's Focus Areas:
- [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821 (thanks to the epic work of @Rachelint, @goldmedal, @jayzhan211, @Dandandan, @XiangpengHao and others, we are quite close)
- [Epic] Unify WindowFunction Interface (remove built in list of BuiltInWindowFunctions) #8709 (very close to finishing thanks @jcsherin @jatin510)
- [EPIC] Automatically generate all function documentation from code #12740 (also almost done thanks to @Omega359 and @jonathanc-n)
- Aggregation fuzz testing #12114 (thanks @LeslieKid for all your help so far)