Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Apply in PrimitiveColumnContainer so it does not change source column #6642

Merged
merged 1 commit into from
May 5, 2023

Conversation

janholo
Copy link
Contributor

@janholo janholo commented May 2, 2023

The Apply method calls SetValidityBit on the source container and not on the target container. That way the null check of the container values gets incorrect.

We are excited to review your PR.

So we can do the best job, please check:

  • There's a descriptive title that will make sense to other developers some time from now.
  • There's associated issues. All PR's should have issue(s) associated - unless a trivial self-evident change such as fixing a typo. You can use the format Fixes #nnnn in your description to cause GitHub to automatically close the issue(s) when your PR is merged.
  • Your change description explains what the change does, why you chose your approach, and anything else that reviewers should know.
  • You have included any necessary tests in the same PR.

@janholo
Copy link
Contributor Author

janholo commented May 2, 2023

@dotnet-policy-service agree

@codecov
Copy link

codecov bot commented May 2, 2023

Codecov Report

Merging #6642 (53b90ae) into main (a18b9cb) will decrease coverage by 0.01%.
The diff coverage is 83.33%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6642      +/-   ##
==========================================
- Coverage   68.53%   68.52%   -0.01%     
==========================================
  Files        1200     1200              
  Lines      250287   250279       -8     
  Branches    26093    26093              
==========================================
- Hits       171526   171509      -17     
- Misses      71929    71936       +7     
- Partials     6832     6834       +2     
Flag Coverage Δ
Debug 68.52% <83.33%> (-0.01%) ⬇️
production 63.00% <83.33%> (-0.01%) ⬇️
test 88.83% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...icrosoft.Data.Analysis/PrimitiveColumnContainer.cs 83.01% <83.33%> (-0.20%) ⬇️

... and 2 files with indirect coverage changes

@JakeRadMSFT
Copy link
Contributor

JakeRadMSFT commented May 4, 2023

@janholo Thanks for your PR!

Could you add a test that fails before your change and passes after?

It would help me better understand the problem and it's solution.

Copy link
Contributor

@JakeRadMSFT JakeRadMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I dug into this a bunch to try and understand the differences and overlap. After doing that investigation - I like your change. It fixes 2 things.

What Does Apply Do?

The goal of Apply is to start with the value from this container ... call func with the value or default for the type ... then assign the result to resultContainer. Since it's possible the func changed a value from null to not null or vice-versa we have to update the validityBit.

Thing 1

This method used to get mutable versions of span and nullBitMap for the SourceContainer or this that we're not changing.

Your change makes it so we don't create mutable versions of source ... since it's not needed. Nothing should change with source.

Thing 2

It also fixes a bug where we were setting the validity bit on the SourceContainer because we were calling this.SetValidityBit instead of resultContainer.SetValidityBit.

DataFrameBuffer<byte> mutableNullBitMapBuffer = DataFrameBuffer<byte>.GetMutableBuffer(NullBitMapBuffers[b]);
NullBitMapBuffers[b] = mutableNullBitMapBuffer;
Span<byte> nullBitMapSpan = mutableNullBitMapBuffer.Span;
ReadOnlyDataFrameBuffer<T> sourceBuffer = Buffers[b];
Copy link
Contributor

@JakeRadMSFT JakeRadMSFT May 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agocke @jaredpar

Just for my own learning purposes. Is this something that would be improved with the span changes in .NET? Would we no longer need to do all the GetMutableBuffer stuff?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically I'm wondering if there would be a better way to refactor out the code like this:

DataFrameBuffer<TResult> resultMutableBuffer = DataFrameBuffer<TResult>.GetMutableBuffer(resultBuffer);
                resultContainer.Buffers[b] = resultMutableBuffer;
                Span<TResult> resultSpan = resultMutableBuffer.Span;
                DataFrameBuffer<byte> resultMutableNullBitMapBuffer = DataFrameBuffer<byte>.GetMutableBuffer(resultContainer.NullBitMapBuffers[b]);
                resultContainer.NullBitMapBuffers[b] = resultMutableNullBitMapBuffer;
                Span<byte> resultNullBitMapSpan = resultMutableNullBitMapBuffer.Span;

It's all over this file in some form.
Clone, Apply, ApplyElementWise, CloneAs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of very questionable stuff going on here. The code is going to great lengths to copy the original buffer to get a mutable version... and then passing the span to IsValid or calling the indexer, which both accept ReadOnlySpan.

And then there's the ReadOnlyDataFrameBuffer<T> type which takes a T : struct, but should be T : unmanaged.

This code could likely be simplified by extracting some helper methods, but I think the above issues indicate that someone should review the actual intent and ensure that the code reflects it.

Copy link
Contributor

@JakeRadMSFT JakeRadMSFT May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's sort of what it felt like but I appreciate having some expert eyes on it! Thanks @agocke!

I might take this up as a pet project in the future :)

Copy link
Contributor

@JakeRadMSFT JakeRadMSFT May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agocke - I'm not sure it's 100% of the way there but I did a pass and opened a different PR.

https://github.com/dotnet/machinelearning/pull/6656/files

Copy link
Contributor

@asmirnov82 asmirnov82 May 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that PR already closed, but I have some comments regarding this GetMutableBuffer staff. It looks it's used for working with readonly memory, when column is created from the Apache Arrow RecordBatch in

public static DataFrame FromArrowRecordBatch(RecordBatch recordBatch)
{
    DataFrame ret = new DataFrame();
    Apache.Arrow.Schema arrowSchema = recordBatch.Schema;
    int fieldIndex = 0;
    IEnumerable<IArrowArray> arrowArrays = recordBatch.Arrays;
    foreach (IArrowArray arrowArray in arrowArrays)
    {
        Field field = arrowSchema.GetFieldByIndex(fieldIndex);
        AppendDataFrameColumnFromArrowArray(field, arrowArray, ret);
        fieldIndex++;
    }
    return ret;
}

However, this implementation looks very suspicious.

RecordBatch is a disposable object. Apache Arrow by default uses NativeMemoryAllocator to allocate unmanaged memory (for example, this default allocator is used in Spark.Net to create RecorBatch and pass it to DataFrame.FromArrowRecordBatch factory method).
So it's up to a DataFrame to hold the link to the RecordBatch and correctly Dispose it. Or we have to copy the unmanaged readonly memory from the RecordBatch into managed buffers (that exactly what is happening in GetMutableBuffer on attempt to edit data), but in this case we can avoid using ReadOnlyBuffers at all.

@michaelgsharp michaelgsharp merged commit 33342a2 into dotnet:main May 5, 2023
@janholo
Copy link
Contributor Author

janholo commented May 11, 2023

Hey @JakeRadMSFT, thanks for taking the time to review and merge the PR.

Sorry for not providing tests. What I still think is strange is that Apply and ApplyElementwise have similar signatures but do fundamentally different things. One changes "this" column and the other operates on an other column.

@ghost ghost locked as resolved and limited conversation to collaborators Jun 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants