Improve DataFrame Performance

DataFrame performance is relatively slow and can be improved. 

Because this is a complex issue, it makes sense to split it into several independent steps. This epic is a container for the related changes, keeping them accessible from one place. Here is the list of proposed changes:
 </br>
**Improve Performance of DataFrame Arithmetic Operations**
- [x] Improve the speed of binary arithmetic and comparison operations on columns with the same underlying data type.

  This can be achieved by improving the PrimitiveDataFrameColumn.Clone method to use memory block copying, and by avoiding the CloneAs method, which involves type conversion, for columns with the same data type.

  PR: https://github.com/dotnet/machinelearning/pull/6814
  PR: https://github.com/dotnet/machinelearning/pull/6869
 
- [ ] Reduce the number of copies in binary operations on columns with different underlying data types (for example, Int32DataFrameColumn + Int16DataFrameColumn).

  We make copies of columns in the binary operation APIs mostly to reuse existing code. This is a well-known issue; there are already tasks for eliminating excessive copying and for changing the behavior of binary operations when types mismatch.

  Issue: https://github.com/dotnet/machinelearning/issues/5663
  Issue: https://github.com/dotnet/machinelearning/issues/5665
 
- [x] Increase the speed of PrimitiveDataFrameColumn initialization by fixing the AppendMany(value, count) method, which is used in all PrimitiveDataFrameColumn constructors.

  PR: https://github.com/dotnet/machinelearning/pull/6822

- [x] Improve Nullable support during arithmetic operations
 
  Issue: https://github.com/dotnet/machinelearning/issues/6825

- [ ] Consider how to implement Nullable support in element-wise operations without any decrease in performance

  Issue: https://github.com/dotnet/machinelearning/issues/6820

- [ ] Use SIMD vectorization

  Issue: https://github.com/dotnet/machinelearning/issues/5695

- [x] Add performance benchmarks

  Issue: https://github.com/dotnet/machinelearning/issues/6826
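
To illustrate the first item in this list (block copying instead of per-element conversion when the types already match), here is a minimal sketch. `CloneSketch`, `CloneSameType`, and `CloneWithConversion` are hypothetical names used only for illustration; they are not the actual DataFrame implementation, and plain arrays stand in for the column buffers.

```csharp
using System;

// Hypothetical stand-in for column cloning; not the Microsoft.Data.Analysis code.
public static class CloneSketch
{
    // Same-type clone: a single bulk copy of the underlying values.
    public static T[] CloneSameType<T>(ReadOnlySpan<T> source) where T : unmanaged
    {
        var copy = new T[source.Length];
        source.CopyTo(copy); // block copy, no per-element work
        return copy;
    }

    // Converting clone: every element passes through a conversion delegate,
    // which is the extra cost a same-type clone does not need to pay.
    public static TTo[] CloneWithConversion<TFrom, TTo>(
        ReadOnlySpan<TFrom> source, Func<TFrom, TTo> convert)
    {
        var copy = new TTo[source.Length];
        for (int i = 0; i < source.Length; i++)
        {
            copy[i] = convert(source[i]);
        }
        return copy;
    }
}
```

A call such as `CloneSketch.CloneSameType<int>(values)` takes the bulk path, while `CloneSketch.CloneWithConversion<int, long>(values, v => v)` shows the per-element path that is only needed when the element types differ.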

</br>

**Improve Performance of Filtering**

- [ ] Provide a faster way to filter

  Issue: https://github.com/dotnet/machinelearning/issues/6164
 </br>

**Improve Performance of Indexing** 

- [ ] Accessing PrimitiveDataFrameColumn elements by index involves converting `Memory<byte>` to `Span<T>` on each operation, which is very slow. We can consider using unmanaged memory in DataFrameBuffer instead. This would also address the issue with converting to and from Apache Arrow and the heavy load on the GC.

  Issue: https://github.com/dotnet/machinelearning/issues/5966
  Issue: https://github.com/dotnet/machinelearning/issues/6715
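
The per-access conversion described above can be sketched as follows. This is an illustration only: `IndexingSketch`, `GetAt`, and `Sum` are hypothetical names, and `ReadOnlyMemory<byte>` is assumed here as a stand-in for the buffer storage. The first method re-derives a typed span from the byte buffer on every call; the second derives it once and then indexes directly.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical illustration; not the actual DataFrameBuffer code.
public static class IndexingSketch
{
    // Per-element access: the byte buffer is reinterpreted as a typed span
    // on every single call before the element is read.
    public static T GetAt<T>(ReadOnlyMemory<byte> buffer, int index) where T : unmanaged
    {
        ReadOnlySpan<T> values = MemoryMarshal.Cast<byte, T>(buffer.Span);
        return values[index];
    }

    // Hoisted access: the typed span is obtained once, then reused for the
    // whole loop, so the conversion cost is paid a single time.
    public static long Sum(ReadOnlyMemory<byte> buffer)
    {
        ReadOnlySpan<int> values = MemoryMarshal.Cast<byte, int>(buffer.Span);
        long total = 0;
        foreach (int value in values)
        {
            total += value;
        }
        return total;
    }
}
```

Hoisting the span only helps callers that process many elements at once; a public indexer that serves one element per call still pays the conversion each time, which is the motivation given above for considering a different storage approach.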


Improve DataFrame Performance #6824

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improve DataFrame Performance #6824

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions