Description
DataFrame performance is relatively slow and can be improved.
As this is a complex issue, it makes sense to split it into several independent steps. This epic is a container for the related changes, so that they are accessible from one place. Here is the list of proposed changes:
Improve Performance of DataFrame Arithmetic Operations
- Improve the speed of binary arithmetic and comparison operations on columns with the same underlying data type.
This can be achieved by improving the PrimitiveDataFrameColumn.Clone method to use memory block copying, and by avoiding the CloneAs method, which involves type conversion, for columns with the same data type.
PR: Improve performance of column cloning inside DataFrame arithmetics #6814
PR: Improve performance of DataFrame binary comparison operations #6869
- Reduce the number of copies in binary operations for columns with different underlying data types (for example, Int32DataFrameColumn + Int16DataFrameColumn).
We make copies of columns in the binary operation APIs mostly to reuse existing code. This is a well-known issue: there are already tasks to eliminate the excessive copying and to address the behavior of binary operations when the types mismatch.
Issue: Reduce the number of copies in binary operations in DataFrame #5663
Issue: Improve PrimitiveDataFrameColumn.BinaryOperations.tt #5665
- Increase the speed of PrimitiveDataFrameColumn initialization by fixing the AppendMany(value, count) method, which is used in all PrimitiveDataFrameColumn constructors.
- Improve Nullable support during arithmetic operations
Issue: Improve Nullable support during dataframe arithmetic operations #6825
- Consider how to implement Nullable support in Elementwise operations without any decrease in performance
Issue: All DataFrame Elementwise methods uncorrectly work with NULL values #6820
- Use SIMD vectorization
- Add performance benchmarks
Issue: Add performance benchmarks for dataframe arithmetic operations #6826
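To illustrate the first item in the list above, here is a minimal sketch of the difference between a same-type clone done as one block copy and a clone that converts each element. It does not use the DataFrame code itself: `CloneSketch`, `CloneSameType` and `CloneAsDouble` are hypothetical names invented for this example, and plain arrays stand in for the column buffers.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Microsoft.Data.Analysis.
public static class CloneSketch
{
    // Same-type clone: the whole buffer is copied in one bulk operation.
    public static T[] CloneSameType<T>(ReadOnlySpan<T> source) where T : unmanaged
    {
        var result = new T[source.Length];
        source.CopyTo(result); // block copy, no per-element work
        return result;
    }

    // Converting clone: every element goes through a type conversion,
    // which is unnecessary when source and target types are the same.
    public static double[] CloneAsDouble(ReadOnlySpan<int> source)
    {
        var result = new double[source.Length];
        for (int i = 0; i < source.Length; i++)
        {
            result[i] = source[i]; // per-element int -> double conversion
        }
        return result;
    }
}
```

For example, `CloneSketch.CloneSameType<int>(values)` returns a new `int[]` with the same contents as `values`. The type argument is written explicitly because the array is implicitly converted to a `ReadOnlySpan<int>`.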
Improve Performance of Filtering
- Faster way to Filter
Improve Performance of Indexing
- Accessing PrimitiveDataFrameColumn elements by index involves converting Memory to Span on each operation, which is very slow. We can consider using unmanaged memory in DataFrameBuffer instead. This would also solve the issues with converting to and from Apache Arrow and with the heavy load on the GC.
Issue: Accessing data in a DataFrameColumn is insanely slow. #5966
Issue: DataFrame GetMutableBuffer method and ReadOnlyBuffer issues #6715
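The access pattern described above can be sketched in isolation. This is not the unmanaged-memory change proposed here and it does not touch DataFrameBuffer. It only contrasts reading `Memory.Span` on every element access with obtaining the span once. `IndexingSketch` and both method names are hypothetical, introduced for this example.

```csharp
using System;

// Hypothetical helper for illustration only; not part of Microsoft.Data.Analysis.
public static class IndexingSketch
{
    // Pattern described in the issue: a Span is created from the Memory
    // on every single element access.
    public static long SumPerElement(ReadOnlyMemory<int> buffer)
    {
        long sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            sum += buffer.Span[i]; // Memory -> Span conversion on each iteration
        }
        return sum;
    }

    // Alternative: convert once, then index the Span inside the loop.
    public static long SumHoistedSpan(ReadOnlyMemory<int> buffer)
    {
        ReadOnlySpan<int> span = buffer.Span; // single conversion
        long sum = 0;
        for (int i = 0; i < span.Length; i++)
        {
            sum += span[i];
        }
        return sum;
    }
}
```

Both methods return the same result. Only the second keeps the conversion out of the loop, which is an option for bulk operations but not for a public indexer that is called one element at a time.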