-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Apply in PrimitiveColumnContainer so it does not change source column #6642
Conversation
@dotnet-policy-service agree |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #6642 +/- ##
==========================================
- Coverage 68.53% 68.52% -0.01%
==========================================
Files 1200 1200
Lines 250287 250279 -8
Branches 26093 26093
==========================================
- Hits 171526 171509 -17
- Misses 71929 71936 +7
- Partials 6832 6834 +2
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@janholo Thanks for your PR! Could you add a test that fails before your change and passes after? It would help me better understand the problem and it's solution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I dug into this a bunch to try and understand the differences and overlap. After doing that investigation - I like your change. It fixes 2 things.
What Does Apply Do?
The goal of Apply
is to start with the value from this
container ... call func
with the value or default
for the type ... then assign the result to resultContainer
. Since it's possible the func
changed a value from null to not null or vice-versa we have to update the validityBit.
Thing 1
This method used to get mutable versions of span and nullBitMap for the SourceContainer or this
that we're not changing.
Your change makes it so we don't create mutable versions of source ... since it's not needed. Nothing should change with source.
Thing 2
It also fixes a bug where we were setting the validity bit on the SourceContainer because we were calling this.SetValidityBit
instead of resultContainer.SetValidityBit
.
DataFrameBuffer<byte> mutableNullBitMapBuffer = DataFrameBuffer<byte>.GetMutableBuffer(NullBitMapBuffers[b]); | ||
NullBitMapBuffers[b] = mutableNullBitMapBuffer; | ||
Span<byte> nullBitMapSpan = mutableNullBitMapBuffer.Span; | ||
ReadOnlyDataFrameBuffer<T> sourceBuffer = Buffers[b]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it doesn't because it actually does a copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically I'm wondering if there would be a better way to refactor out the code like this:
DataFrameBuffer<TResult> resultMutableBuffer = DataFrameBuffer<TResult>.GetMutableBuffer(resultBuffer);
resultContainer.Buffers[b] = resultMutableBuffer;
Span<TResult> resultSpan = resultMutableBuffer.Span;
DataFrameBuffer<byte> resultMutableNullBitMapBuffer = DataFrameBuffer<byte>.GetMutableBuffer(resultContainer.NullBitMapBuffers[b]);
resultContainer.NullBitMapBuffers[b] = resultMutableNullBitMapBuffer;
Span<byte> resultNullBitMapSpan = resultMutableNullBitMapBuffer.Span;
It's all over this file in some form.
Clone, Apply, ApplyElementWise, CloneAs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of very questionable stuff going on here. The code is going to great lengths to copy the original buffer to get a mutable version... and then passing the span to IsValid
or calling the indexer, which both accept ReadOnlySpan.
And then there's the ReadOnlyDataFrameBuffer<T>
type which takes a T : struct
, but should be T : unmanaged
.
This code could likely be simplified by extracting some helper methods, but I think the above issues indicate that someone should review the actual intent and ensure that the code reflects it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's sort of what it felt like but I appreciate having some expert eyes on it! Thanks @agocke!
I might take this up as a pet project in the future :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@agocke - I'm not sure it's 100% of the way there but I did a pass and opened a different PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that PR already closed, but I have some comments regarding this GetMutableBuffer staff. It looks it's used for working with readonly memory, when column is created from the Apache Arrow RecordBatch in
public static DataFrame FromArrowRecordBatch(RecordBatch recordBatch)
{
DataFrame ret = new DataFrame();
Apache.Arrow.Schema arrowSchema = recordBatch.Schema;
int fieldIndex = 0;
IEnumerable<IArrowArray> arrowArrays = recordBatch.Arrays;
foreach (IArrowArray arrowArray in arrowArrays)
{
Field field = arrowSchema.GetFieldByIndex(fieldIndex);
AppendDataFrameColumnFromArrowArray(field, arrowArray, ret);
fieldIndex++;
}
return ret;
}
However, this implementation looks very suspicious.
RecordBatch is a disposable object. Apache Arrow by default uses NativeMemoryAllocator to allocate unmanaged memory (for example, this default allocator is used in Spark.Net to create RecorBatch and pass it to DataFrame.FromArrowRecordBatch factory method).
So it's up to a DataFrame to hold the link to the RecordBatch and correctly Dispose it. Or we have to copy the unmanaged readonly memory from the RecordBatch into managed buffers (that exactly what is happening in GetMutableBuffer on attempt to edit data), but in this case we can avoid using ReadOnlyBuffers at all.
Hey @JakeRadMSFT, thanks for taking the time to review and merge the PR. Sorry for not providing tests. What I still think is strange is that Apply and ApplyElementwise have similar signatures but do fundamentally different things. One changes "this" column and the other operates on an other column. |
The Apply method calls SetValidityBit on the source container and not on the target container. That way the null check of the container values gets incorrect.
We are excited to review your PR.
So we can do the best job, please check:
Fixes #nnnn
in your description to cause GitHub to automatically close the issue(s) when your PR is merged.