Improve performance of column cloning inside DataFrame arithmetics #6814

asmirnov82 · 2023-09-08T19:16:42Z

The goal of this PR is to make arithmetic operations on columns with the same underlying data type approximately 3 times faster.

Details of the changes:

Fix the PrimitiveColumnContainer Clone() method to copy the internal buffer as a memory block instead of appending values one by one (which reallocated memory on every buffer resize). Make the same change to the CloneNullBitMapBuffers() method.
Improve the BinaryOperation.Implementation methods for all arithmetic operations that are not performed in place (the default behavior).
Before the change, the autogenerated code looked like this:

public partial class SingleDataFrameColumn
{
    internal SingleDataFrameColumn AddImplementation(SingleDataFrameColumn column, bool inPlace = false)
    {
        if (column.Length != Length)
        {
            throw new ArgumentException(Strings.MismatchedColumnLengths, nameof(column));
        }
        SingleDataFrameColumn newColumn = inPlace ? this : CloneAsSingleColumn();
        newColumn.ColumnContainer.Add(column.ColumnContainer);
        return newColumn;
    }
}

After PR #6677, CloneAsSingleColumn can be replaced with just this.Clone(). This avoids the unnecessary type conversion that happens inside the CloneAs... methods and uses the fast Clone() method, which bulk-copies the internal buffers. For example, for Single, the CloneAs... path converts and appends element by element:

internal PrimitiveColumnContainer<float> CloneAsFloatContainer()
{
    ...
    for (int i = 0; i < span.Length; i++)
    {
        // Each element is converted and appended individually.
        newBuffer.Append(SingleConverter<T>.Instance.GetSingle(span[i]));
    }
}
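To illustrate why the element-by-element path is slower than a block copy, here is a minimal, self-contained sketch. It does not use the Microsoft.Data.Analysis types: CloneSketch, CloneByAppend and CloneByBlockCopy are hypothetical names, and a plain float array stands in for the column buffer. The growth policy (doubling) is also an assumption made for illustration.

```csharp
using System;

// Hypothetical stand-in for a column buffer; not the Microsoft.Data.Analysis implementation.
public static class CloneSketch
{
    // Slow path: append values one at a time, converting each value
    // and reallocating the backing array whenever it runs out of room.
    public static float[] CloneByAppend(ReadOnlySpan<float> source)
    {
        float[] buffer = Array.Empty<float>();
        int length = 0;
        for (int i = 0; i < source.Length; i++)
        {
            if (length == buffer.Length)
            {
                // Allocates a new array and copies everything written so far.
                Array.Resize(ref buffer, Math.Max(4, buffer.Length * 2));
            }
            // Per-element conversion call (a no-op for float to float).
            buffer[length++] = Convert.ToSingle(source[i]);
        }
        Array.Resize(ref buffer, length); // trim unused capacity
        return buffer;
    }

    // Fast path: allocate the destination once and copy the whole block.
    public static float[] CloneByBlockCopy(ReadOnlySpan<float> source)
    {
        var buffer = new float[source.Length];
        source.CopyTo(buffer);
        return buffer;
    }

    public static void Main()
    {
        float[] data = { 1f, 2f, 3f, 4f, 5f };
        float[] slow = CloneByAppend(data);
        float[] fast = CloneByBlockCopy(data);
        // Both strategies produce the same contents; only the cost differs.
        Console.WriteLine(slow.AsSpan().SequenceEqual(fast.AsSpan()));
    }
}
```

Both methods return equal arrays; the block-copy version simply avoids the repeated reallocations and the per-element call.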

Fix the DataFrameBuffer constructor.
DataFrameBuffer overrides the ReadOnlyBuffer property of its parent, ReadOnlyDataFrameBuffer, to return its own new field _memory instead of the parent's _readOnlyBuffer (so the parent's _readOnlyBuffer is ignored and never used). However, the constructor does not create _memory; instead it calls the base constructor, which allocates _readOnlyBuffer (which is ignored). As a result, the Capacity of a newly created buffer is always 0, regardless of the parameter passed to the constructor, and additional memory is allocated.
With the DataFrameBuffer constructor fixed, the code now uses the constructor that takes a capacity instead of creating an empty DataFrameBuffer and then reallocating memory by calling EnsureCapacity.
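The constructor problem described above can be reproduced in a few lines. The sketch below is a simplified model, not the real classes: ReadOnlyBufferSketch, BuggyBufferSketch and FixedBufferSketch are hypothetical names, and capacity is counted in bytes here for brevity.

```csharp
using System;

// Hypothetical, simplified model of the base/derived buffer relationship.
public class ReadOnlyBufferSketch
{
    protected readonly ReadOnlyMemory<byte> _readOnlyBuffer;

    public ReadOnlyBufferSketch(int capacity)
    {
        _readOnlyBuffer = new byte[capacity];
    }

    public virtual ReadOnlyMemory<byte> ReadOnlyBuffer => _readOnlyBuffer;

    // Capacity is derived from whatever the (possibly overridden) property returns.
    public int Capacity => ReadOnlyBuffer.Length;
}

// Buggy variant: the requested capacity is allocated in the base field,
// but the override returns _memory, which is never allocated.
public class BuggyBufferSketch : ReadOnlyBufferSketch
{
    private Memory<byte> _memory = Memory<byte>.Empty;

    public BuggyBufferSketch(int capacity) : base(capacity) { }

    public override ReadOnlyMemory<byte> ReadOnlyBuffer => _memory;
}

// Fixed variant: the constructor allocates the field that the override returns.
public class FixedBufferSketch : ReadOnlyBufferSketch
{
    private Memory<byte> _memory;

    public FixedBufferSketch(int capacity) : base(0)
    {
        _memory = new byte[capacity];
    }

    public override ReadOnlyMemory<byte> ReadOnlyBuffer => _memory;
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(new BuggyBufferSketch(1024).Capacity); // 0: the base allocation is wasted
        Console.WriteLine(new FixedBufferSketch(1024).Capacity); // 1024
    }
}
```

In the buggy variant the caller pays for an allocation that is never visible, and any later write has to grow the buffer from zero, which is the extra reallocation the fix removes.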

Result:

A simple benchmark over 1 million rows:

[GlobalSetup]
public void SetUp()
{
    var values = Enumerable.Range(0, ItemsCount);
    _column1 = new Int32DataFrameColumn("Column1", values);
    _column2 = new Int32DataFrameColumn("Column2", values);
}

[Benchmark]
public void Sum()
{
    var column = _column1 + _column2;
}

Before PR:
| Method |     Mean |    Error |   StdDev |
|------- |---------:|---------:|---------:|
|    Sum | 18.02 ms | 0.090 ms | 0.080 ms |

After PR:
| Method |     Mean |     Error |    StdDev |
|------- |---------:|----------:|----------:|
|    Sum | 6.896 ms | 0.1363 ms | 0.1673 ms |
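
As a rough check on the numbers above, the speedup implied by the two mean times is:

```latex
\text{speedup} = \frac{18.02\ \text{ms}}{6.896\ \text{ms}} \approx 2.6
```

So for this particular Int32 benchmark the measured gain is somewhat below the 3x target stated at the top; results for other column types or sizes may differ.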

Part of issue #6824.

codecov · 2023-09-08T20:57:10Z

Codecov Report

Merging #6814 (1eb8a35) into main (d692751) will increase coverage by 0.01%.
Report is 2 commits behind head on main.
The diff coverage is 57.29%.

@@            Coverage Diff             @@
##             main    #6814      +/-   ##
==========================================
+ Coverage   68.99%   69.01%   +0.01%     
==========================================
  Files        1237     1237              
  Lines      253558   253556       -2     
  Branches    26542    26540       -2     
==========================================
+ Hits       174944   174984      +40     
+ Misses      71663    71616      -47     
- Partials     6951     6956       +5

| Flag | Coverage Δ |
|------|------------|
| Debug | `69.01% <57.29%> (+0.01%)` ⬆️ |
| production | `63.57% <48.75%> (+0.01%)` ⬆️ |
| test | `88.86% <100.00%> (+<0.01%)` ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|-------|------------|
| ...rosoft.Data.Analysis/ArrowStringDataFrameColumn.cs | `63.54% <100.00%> (ø)` |
| ...lysis/PrimitiveDataFrameColumn.BinaryOperations.cs | `42.50% <100.00%> (+0.50%)` ⬆️ |
| test/Microsoft.Data.Analysis.Tests/BufferTests.cs | `100.00% <100.00%> (ø)` |
| ...st/Microsoft.Data.Analysis.Tests/DataFrameTests.cs | `99.31% <100.00%> (+<0.01%)` ⬆️ |
| ...est/Microsoft.ML.Fairlearn.Tests/GridSearchTest.cs | `100.00% <100.00%> (ø)` |
| ...icrosoft.Data.Analysis/PrimitiveColumnContainer.cs | `86.13% <94.73%> (-0.41%)` ⬇️ |
| ...Microsoft.Data.Analysis/ReadOnlyDataFrameBuffer.cs | `48.71% <66.66%> (+1.34%)` ⬆️ |
| src/Microsoft.Data.Analysis/DataFrameBuffer.cs | `85.18% <80.00%> (+2.20%)` ⬆️ |
| ...eColumn.BinaryOperationImplementations.Exploded.cs | `52.27% <0.00%> (ø)` |

... and 12 files with indirect coverage changes

asmirnov82 · 2023-09-14T14:45:21Z

Using Benchmarks from #6826

asmirnov82 · 2023-09-22T18:00:45Z

@JakeRadMSFT could you please take a look


asmirnov82 added 3 commits September 8, 2023 20:36

Optimize PrimitiveColumnContainer.Clone method

eb2230d

Avoid unnecessary type conversion during binary operations

67af276

Remove using

cbc7c4d

ghost added the community-contribution label Sep 8, 2023

asmirnov82 added 3 commits September 9, 2023 09:05

Fix DataFrameBuffer constructor

63d983a

remove uncorrectly added using

6abf02e

Make DataFrameBuffer Length field protected

1a47ce4

asmirnov82 mentioned this pull request Sep 14, 2023

Improve DataFrame Performance #6824

Open

9 tasks

asmirnov82 added 3 commits September 18, 2023 23:51

Merge remote-tracking branch 'origin/main' into reduce_number_of_copies

0724599

Fix typo

be47f8d

Use RawSpan

d1b0686

Merge remote-tracking branch 'origin/main' into reduce_number_of_copies

1eb8a35

JakeRadMSFT reviewed Sep 26, 2023

src/Microsoft.Data.Analysis/ReadOnlyDataFrameBuffer.cs (review thread resolved)

JakeRadMSFT approved these changes Sep 26, 2023

JakeRadMSFT merged commit 5648c89 into dotnet:main Sep 27, 2023

asmirnov82 deleted the reduce_number_of_copies branch September 28, 2023 08:45

ghost locked as resolved and limited conversation to collaborators Oct 28, 2023
