14.0.0 (2022-11-04)
Breaking changes:
- Improve FieldNotFound errors #4084 [sql] (andygrove)
- Refactor: move `simplify_expression.rs` and `expr_simplifier.rs` to a new mod `simplify_expressions` #3951 (HaoYang670)
- Support for non-u64 types for Window Bound #3916 [sql] (mustafasrepo)
- Expose parquet reader settings using normal DataFusion `ConfigOptions` #3822 (alamb)
- Add `Filter::try_new` with validation #3796 [sql] (andygrove)
- Change public simplify API and add a public coerce API #3758 (alamb)
Implemented enhancements:
- Automatically register tables if ObjectStore root is configured #4094
- Simplify small `InList` expressions #4089
- Support `SET` command #4067
- add uuid() function to generate unique uuid per row #4045
- Publish benchmark crate so that it can be used as a library in Ballista #4016
- Add statistics methods to `TableProvider` trait for use in cost-based optimizations in the logical plan #3983
- Implement `current_time` Function #3982
- Implement `current_date` Function #3981
- Put common code used for testing code into datafusion/test_utils.rs #3960
- Print the configurations of ConfigOptions in an ordered way so that we can directly compare the equality of two ConfigOptions by their debug strings #3952
- Don't make dependants install protoc #3947
- Implement right anti join and support it in HashBuildProbeOrder #3946
- Implement right semi join and support it in HashBuildProbeOrder #3945
- Refactor `simplify_expressions` and `expr_simplifier` #3934
- Implement serialization for `ScalarValue::FixedSizeBinary` #3928
- Support inlining view / dataframes logical plan #3913
- Plans with tables from `TableProviderFactory`s can't be serialized #3906
- Simplify `a AND a` and `a OR a`. #3895
- Allow configuring statistics on TPC-H benchmarks #3888
- CI checks stuck in queued mode #3883
- Multiple optimizer passes #3879
- datafusion-proto does not support view table scan #3874
- TableProviderFactories need to be async and return a Result to be useful #3866
- Factorize common AND factors out of OR predicates to support filterPushDown as possible #3858
- Replace `concat_ws` with `concat` when the delimiter is empty string #3857
- Concatenate contiguous literal arguments of `concat_ws` when doing the expression simplification #3856
- Partition and Sort Enforcement #3854
- Enable mimalloc by default in benchmarks #3851
- Add collect statistics configuration #3847
- [SQL] - Support cache/uncache table syntax #3842
- Filter pushdown doesn't seem to apply for filter on TPC-H Q17 #3839
- Support pushdown multi-columns in PageIndex pruning. #3834
- Consolidate `Expr` manipulation code so it is more discoverable and make it easier to use #3808
- Leverage input array's null buffer for regex replace to optimize sparse arrays #3803
- Improve join cardinality estimation when there is no overlap in the min/max values #3802
- datafusion-cli up to date check is failing on master #3798
- Optimize benchmark q2 subquery filter #3789
- Benchmark should infer schema when running against Parquet #3776
- Allow specialized physical functions to provide hints for the array adapter #3762
- [User Guide] Add `EXPLAIN` to SQL reference #3755
- move `type coercion` for agg/agg udf #3752
- Prevent Cargo.lock for datafusion-cli being out-of-date #3744
- Add example of expr apis including simplification and coercion #3740
- support `type coercion` for ScalarFunction expr in the logical phase #3731
- Add support for DISTINCT projections in `decorrelate_where_exists` #3724
- Add type coercion rule for `CONCAT` and `CONCAT_WS` #3720
- Expose and document a simpler public API for simplify expressions #3709
- Expose + document the type coercion API publicly #3708
- Concatenate contiguous literal arguments of `CONCAT` during the expression simplification. #3683
- DataFusion 13.0.0 Release #3671
- Add division by `0` rules in the expression simplification #3663
- Compressed CSV/JSON Read #3641
- remove type coercion for agg #3623
- extract or clause as predicate for join rels #3577
- Improve performance of `regex_replace` #3518
- Add benchmarks for parquet queries with filter pushdown enabled #3457
- Make type coercion rule more robust #3390
- `ViewTable::scan` ignores filters and limits #3249
- Add `CREATE VIEW` documentation to user guide #3211
- Push additional parquet filtering into the parquet scan [EPIC] #3147
- Remove `core/logical_plan` module #2683
- Datafusion Optimizer Enhancement #2255
- [Optimizer] Eliminate self compare self #2252
- Break datafusion crate into smaller crates #1750
- Benchmark `constellation-rs/amadeus`'s parquet implementation #1341
- Use `parquet2` async reader in `physical_plan/parquet` #1058
- Table Scan Enhancement Plan #944
- Implement parquet page-level skipping with column index, using min/max stats #847
- Support min/max statistics in ParquetTable and ParquetExec #537
Fixed bugs:
- Clippy failing on master #4100
- Panic when the number of partitions of the pipeline that throws the exception is inconsistent with the number of partitions output by the query #4096
- FieldNotFound when field is available #4083
- SingleDistinctToGroupBy being applied too broadly #4082
- single_distinct_to_groupby strips qualifiers from group-by expressions #4049
- Another Internal error when parquet predicate pushdown is enabled "Error evaluating filter predicate: #4046
- Decimal multiplied by Float produces incorrect results #4035
- Cannot query external table - TableScan replaced with EmptyExec #4027
- benchmark q17 produces incorrect result #4026
- benchmark q14 produces incorrect result #4025
- benchmark q11 producing incorrect results #4023
- Internal error when parquet predicate pushdown is enabled "Error evaluating filter predicate:" #4006
- Incorrect results with parquet filtering pushdown enabled #4005
- Wrong results when parquet page index filtering is enabled #4002
- Output schema of semi join has invalid projection added after HashBuildProbeOrder #4001
- `async` deserialization functions are unintuitive and possibly insecure #3977
- `Expr::to_bytes` can produce output that hits `Expr::from_bytes` recursion limit #3968
- Bug on propagating arrow field metadata #3964
- Predicate still has cast when comparing Timestamp(Nano, None) to a timestamp literal, so can't be pushed down or used for pruning #3938
- Error using `IN` list on dictionary encoded data: `InList does not support datatype Dictionary(Int32, Utf8).` #3936
- Internal error in CAST from Timestamp[us] #3922
- ScalarValue not implemented for FixedSizeBinary types #3910
- [DOC] - There are unsupported DDL in the official documentation #3904
- datafusion-proto deserialize with Substring(str [from int] [for int]) fails #3901
- `count(Literal)` gives wrong column name #3891
- `projection_push_down` adds duplicate projections with multiple passes #3881
- Default physical planner generates empty relation for DROP TABLE, CREATE MEMORY TABLE, etc #3873
- Binary expression canonical names are incorrect in some cases #3865
- Using the window function lag causes panic. #3830
- chrono crate : specify 0.4.22 as the minimum version due to spurious build failures #3827
- datafusion-proto deserialize with q16 sql fails #3820
- Filter predicates should not be aliased #3795
- Write csv not save all lines of dataframe #3783
- Regression in simplifying expressions in subqueries #3760
- DataFusionError(Internal("The size of the sorted batch is larger than the size of the input batch: 2120 > 2312")) #3747
- "labeler" PR check is broken #3743
- `DataFrame::select_columns` doesn't work with names containing "." #3733
- TPC-H Query 1 has regressed #3729
- [RUST][Datafusion] What causes "Error: Execution("file size of 4 is less than footer")" error? #3800
- Field names containing periods such as f.c cannot work #3682
- TableProvider implementation for DataFrame does not support filter pushdown #3681
- using Decimal(0) make system panicked #3665
- Cannot query some parquet files in S3, but they work locally #3633
- `col / col` returns `1` when `col = 0` #3615
- register_csv allow space in table_path #3589
- Hardcoded u64 for WindowFrameBound fields #3571
- `docs.rs` cannot build `datafusion-proto` crate #3538
- Row Hash loads whole aggregation state to memory before sending #3460
- approx_percentile_cont return wrong result when scan multi parquet files. #3140
- User guide is incorrect regarding using CLI to register CSV files using schema inference #3001
- Exception: Internal error, Exception: Schema error #2938
- Version 0.6.0 Panic error during SQL execution #2738
- wrong result when operation parquet #2044
- Local object store accepts file:/// as base path, but LocalStore returns meta without the prefix. #1923
- Reading nested parquet files results in `index out of bounds` #1383
- (negation) with NULL literals does not work: can't be evaluated because the expression's type is Utf8, not signed #1192
- Inconsistent cast behavior #957
- single_distinct_to_groupby no longer drops qualifiers #4050 [sql] (andygrove)
Documentation updates:
- Clarify in docs that Identifiers are made lower-case in SQL query #2374
- Fix broken links in contributor guide #3956 (Jefffrey)
- add create view explanation #3925 (retikulum)
- Update `datafusion-examples` README #3814 (alamb)
- Add Seafowl to list of projects using DataFusion #3792 (mildbyte)
Closed issues:
- [QUESTION] How many times should be the function `create_name` called when executing a query? #3900
- Improve the `Expr` string format #3878
- Simplify division by zero (division by one / multiplication by zero / multiplication by one) for Decimal types as well #3643
- InList: merge check branch #2833
- Optimization InList: compare the float data type using OrderedFloat<T> #2831
- Outdated section of the add function of the contribution guide #2560
- Optimize InList implementation with native types rather than ScalarValue #2165
- Improve testing of optimizers using EXPLAIN #1118
- Crash on parsing sql query with Cyrillic letters #184
- [EPIC] Support all TPC-H queries in benchmark #158
- Implement optional second argument to ltrim and rtrim functions #144
- Benchmark crate does not have a SIMD feature #124
- ColumnarValue::into_array should not require batch #113
- [Rust] Parquet data source does not support complex types #83
Merged pull requests:
- Appease new clippy #4101 (alamb)
- minor: Split parquet reader up into smaller modules #4099 (alamb)
- [MINOR] Update `SET` in cli.md #4098 (waitingkuo)
- fix: Scheduler panic routing errors #4097 (yukkit)
- Automatically register tables if ObjectStore root is configured #4095 (avantgardnerio)
- minor: Use Operator::swap #4092 (alamb)
- Simplify small InListExpr #4090 (Dandandan)
- Minor: Add arrow-rs ticket reference and turn some comments into docstrings #4088 (alamb)
- Support Dictionary in InListExpr #4070 (tustvold)
- support `SET` variable #4069 [sql] (waitingkuo)
- Add in list bench #4068 (tustvold)
- Improve Error Handling and Readibility for downcasting `StructArray` #4061 (retikulum)
- Build tests separately from running #4060 (alamb)
- Simplify InListExpr ~20-70% Faster #4057 (tustvold)
- MINOR: Print unoptimized logical plan in execute_query of tpch benchmark #4056 (viirya)
- Minor: clean the code in `eliminate_filter` #4055 (HaoYang670)
- Implement `current_time` scalar function #4054 (naosense)
- Cleanup hash_utils adding support for decimal256 and f16 #4053 (tustvold)
- Fix multicolumn parquet predicate pushdown (#4046) #4048 (tustvold)
- Add CI checks that we can serde all benchmark queries #4047 (andygrove)
- Enable more benchmark verification tests #4044 (andygrove)
- Extract common parquet testing code to `parquet-test-util` crate #4042 (alamb)
- add uuid() function #4041 (Jimexist)
- Update to arrow 26, change timezones #4039 [sql] (tustvold)
- Fix Decimal and Floating type coerce rule #4038 (viirya)
- Reserve the literal expression of `Count` function #4031 [sql] (HaoYang670)
- Implement current_date scalar function #4022 (comphead)
- Fix predicate pushdown bugs: project columns within DatafusionArrowPredicate (#4005) (#4006) #4021 (tustvold)
- minor: remove redundant code/TODO #4019 (jackwener)
- Add CI check to verify that benchmark queries return the expected results #4015 (andygrove)
- Minor: Add TODO and tracking ticket reference #4012 (alamb)
- Add right anti join support and support it in HashBuildProbeOrder #4011 (Dandandan)
- MINOR: Generate expected benchmark query results #4010 (andygrove)
- Minor: remove unecessary clippy allow #4008 (alamb)
- Minor: Do what clippy says and clean up some code #4007 (alamb)
- Improve Error Handling and Readibility for downcasting `Date32Array` #4004 (retikulum)
- Don't add projection for semi joins in HashBuildProbeOrder #4000 (Dandandan)
- Minor: use `DataType::is_nested` #3995 (alamb)
- [minor] bump prettier version #3992 (Jimexist)
- Add parquet predicate pushdown metrics #3989 (alamb)
- Pin datafusion-proto build dependencies #3987 (tustvold)
- Add TableProvider.statistics method #3986 (andygrove)
- Add Pull Request guidelines to contributor guide #3985 (alamb)
- Update protos #3979 (tustvold)
- Revert async changes but keep deltalake working #3978 (avantgardnerio)
- Correctness integration test for parquet filter pushdown #3976 (alamb)
- MINOR: Stop pretty printing batches in benchmark when there are no results #3974 (andygrove)
- MINOR: Re-export Cast struct #3971 (andygrove)
- fix: check recursion limit in `Expr::to_bytes` #3970 (crepererum)
- [Part1] Partition and Sort Enforcement, PhysicalExpr enhancement #3969 (mingmwang)
- Support pushdown multi-columns in PageIndex pruning. #3967 (Ted-Jiang)
- Fix benchmarks README formatting #3966 (Jefffrey)
- Bug fix on DFField to Field conversion: preserve metadata #3965 (metesynnada)
- Informative Error Message for LAG and LEAD functions #3963 (mustafasrepo)
- Minor: Add some docstrings to `FileScanConfig` and `RuntimeEnv` #3962 (alamb)
- Move common code used for testing code into datafusion/test_utils #3961 (alamb)
- Update minimum chrono dependency to 0.4.22 #3959 (alamb)
- Implement right semi join and support in HashBuildProbeorder #3958 (Dandandan)
- Print the configurations of ConfigOptions in an ordered way so that we can directly compare the equality of two ConfigOptions by their debug strings #3953 (yahoNanJing)
- Vendor Generated Protobuf Code (#3947) #3950 (tustvold)
- Implement serialization for ScalarValue::FixedSizeBinary #3943 (retikulum)
- Consolidate physical join code into `datafusion/core/src/physical_plan/joins` #3942 (alamb)
- Add optimizer test for simplifying predicates on timestamps #3939 (alamb)
- Add test for querying predicate on dictionary #3937 (alamb)
- fix: return error for unsupported SQL #3933 (Kikkon)
- doc: fix doc about `CREATE TABLE IF NOT EXISTS` #3932 (jackwener)
- Refactor Expr::Cast to use a struct. #3931 [sql] (jackwener)
- minor: fix some typo. #3930 (jackwener)
- chore: update cranelift-related dependencies #3926 (xudong963)
- Change cast error from Internal to NotImplemented #3924 (alamb)
- Support inlining view / dataframes logical plan #3923 (Dandandan)
- Add test for Simplify redundant predicates #3915 (src255)
- Implement ScalarValue for FixedSizeBinary #3911 (maxburke)
- Add serde for plans with tables from `TableProviderFactory`s #3907 (avantgardnerio)
- Support filter/limit pushdown for views/dataframes #3905 (Dandandan)
- Factorize common AND factors out of OR predicates to support filterPu… #3903 (Ted-Jiang)
- Add `Substring(str [from int] [for int])` support in `datafusion-proto` #3902 (r4ntix)
- Revert "Factorize common AND factors out of OR predicates to supportfilter Pu… (#3859)" #3897 (alamb)
- MINOR: Add notes on Apache Reporter #3893 (andygrove)
- Allow configuring collection of statistics during TPC-H benchmarks #3889 (isidentical)
- Improve formatting of binary expressions #3884 [sql] (andygrove)
- Multiple optimizer passes #3880 (andygrove)
- [MINOR] Update docs with newly added configuration values #3877 (alamb)
- [MINOR] Add a hint about how to resolve the `Cargo.lock` CI check #3876 (alamb)
- Add `LogicalPlan::ViewTable` support in `datafusion-proto` #3875 (r4ntix)
- Optimize the `concat_ws` function #3869 (HaoYang670)
- Implement foundational filter selectivity analysis #3868 (isidentical)
- Update `TableProviderFactory` trait to support real-world use-cases #3867 (avantgardnerio)
- put subquery's equal clause into join on clauses instead of filter cl… #3862 (AssHero)
- Factorize common AND factors out of OR predicates to support filterPu… #3859 (Ted-Jiang)
- Enable mimalloc by default in benchmark #3853 (Dandandan)
- Refactor `Expr::Between` to use a struct #3850 [sql] (b41sh)
- Handle cardinality estimation for disjoint inner and outer joins #3848 (isidentical)
- Add setting for statistics collection #3846 (Dandandan)
- Update to arrow 25.0.0 #3844 [sql] (tustvold)
- Tweak list of optimization rules #3841 (Dandandan)
- Refactor Expr::GetIndexedField to use a struct #3838 [sql] (ygf11)
- Infer the count of maximum distinct values from min/max #3837 (isidentical)
- Refactor `Expr::Like`, `Expr::ILike`, `Expr::SimilarTo` to use a struct #3836 [sql] (b41sh)
- Refactor Expr::BinaryExpr to use a struct #3835 [sql] (zhoudongyan)
- update postgres version to 15 in integration test #3831 (Jimexist)
- Fix the panic when lpad/rpad parameter is negative #3829 (ZuoTiJia)
- MINOR: Document SHOW ALL in the users guide #3826 (alamb)
- MINOR: Add datafusion-cli documentation on showing configuration #3825 (alamb)
- Add/Remove Division Rules #3824 (retikulum)
- Minor: Sort the output of SHOW ALL by config name #3823 [sql] (alamb)
- Add `precision != 0` check when making decimal type #3818 [sql] (HaoYang670)
- Infer schema when running benchmarks against parquet #3817 (andygrove)
- Finish removing deprecated `datafusion::logical_plan` module #3816 (andygrove)
- Clarify initial example with respect to capitalization #3815 (alamb)
- Improve expression simplification by running it twice #3811 (alamb)
- Make expression manipulation consistent and easier to use: `combine/split filter` `conjunction`, etc #3810 (alamb)
- Consolidate expression manipulation functions into `datafusion_optimizer` #3809 (alamb)
- Optimize `regexp_replace` when the input is a sparse array #3804 (isidentical)
- Stop ignoring errors when writing DataFrame to csv, parquet, json #3801 (andygrove)
- Update datafusion-cli Cargo.lock to fix CI check on master #3799 (alamb)
- MINOR: Benchmark regression tests #3790 (andygrove)
- MINOR: Optimizer example and docs, deprecate `Expr::name` #3788 (andygrove)
- Join cardinality computation for cost-based nested join optimizations #3787 (isidentical)
- Optimizer now simplifies multiplication, division, module arg is a literal Decimal zero or one #3782 (drrtuy)
- Implement parquet page-level skipping with column index, using min/ma… #3780 (Ted-Jiang)
- Bump actions/labeler from 4.0.1 to 4.0.2 #3779 (dependabot[bot])
- MINOR: correct `ListingOptions.try_new` docs to include the enabled stat collection #3775 (isidentical)
- Teach a negative NULL expression to return NULL instead of an error #3771 (drrtuy)
- Add benchmarks for testing row filtering #3769 (thinkharderdev)
- move type coercion of agg and agg_udaf to logical phase #3768 (liukun4515)
- User Guide: Add `EXPLAIN` to SQL reference #3767 (unvalley)
- Allow specialized implementations to produce hints for the array adapter #3765 (isidentical)
- Fix optimizer regression with simplifying expressions in subquery filters #3764 (andygrove)
- Run all `datafusion-examples` in CI tests #3761 (alamb)
- MINOR: Remove deprecated module `datafusion::logical_plan::plan` #3759 (andygrove)
- Refactor `Expr::Case` to use a struct #3757 [sql] (andygrove)
- Do not run labeler CI check if it would fail due to permissions #3756 (alamb)
- MINOR: Improvements to `scalar_subquery_to_join` error handling #3754 (andygrove)
- Always track the final size of the in-mem sorted arrays #3753 (isidentical)
- Fix DataFrame::select_columns to handle column names with a period #3751 (zhoudongyan)
- Fix `ListingTableUrl` to decode percent #3750 (unvalley)
- remove `type coercion` for physical ScalarFunction #3749 (liukun4515)
- CI: Add a new run to check whether `datafusion-cli` lock file is up-to-date #3745 (isidentical)
- Add datafusion example of expression apis #3741 (alamb)
- fix subquery where exists distinct #3732 (b41sh)
- Remove some uneeded code in `CommonSubexprEliminate` #3730 (alamb)
- Consolidate and better tests for expression re-rewriting / aliasing #3727 (alamb)
- Fix output schema generated by CommonSubExprEliminate #3726 (alex-natzka)
- Add type coercion rule for `concat` and `concat_ws` #3721 (HaoYang670)
- Expose and document a simpler public API for simplify expressions #3719 (ygf11)
- Remove dead code in `UnwrapCastExprRewriter` that may mask errors #3703 (alamb)
- Fix `DataFrame::with_column` to handle creating column names with a period #3700 (alamb)
- Add simplification rules for the `CONCAT` function #3684 (HaoYang670)
- Compressed CSV/JSON support #3642 [sql] (Licht-T)
- Simplify serialization by removing redundant `PrimitiveScalarValue` #3612 (alamb)
- Pushdown single column predicates from ON join clauses #3578 (AssHero)
- Simplify the serialization of `ScalarValue::List` #3547 (alamb)
- Generate hash aggregation output in smaller record batches #3461 (milenkovicm)
- Improve doc on lowercase treatment of columns on SQL #3385 (nanicpc)