-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Added Block Size Checks for ParquetLoader #120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
catch (OverflowException) | ||
{ | ||
// this exception is thrown when number of blocks exceeds int.MaxValue | ||
throw _loader._host.ExceptParam("ColumnChunkReadSize", "Error due to too many blocks. Try increasing block size."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ColumnChunkReadSize [](start = 53, length = 19)
nameof( #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, along with other places better off with nameof().
int[] blockOrder; | ||
try | ||
{ | ||
numBlocks = checked((int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
decimal [](start = 60, length = 7)
why cast long to decimal? why not just stick to long? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we want the Math.Ceiling
to operate, whereas if we had this be a long it be the equiv to a floor -- that is, unless the number of rows happened to be exactly divisible by the count, then numBlocks
would be 1 lower than it ought to be.
That said: I don't particularly like this cast to decimal
myself and call to Math
to solve what is a fairly straightforward problem. Elsewhere we solve the problem of ceil((decimal)a / b)
by doing instead (a + b -1) / b
on int
or long
types. That said that's not terribly obvious either, though at least it has the virtue of using familiar types. Perhaps an addition to MathUtils
, since this sort of rounded-up-div problem comes up pretty often.
In reply to: 187491573 [](ancestors = 187491573)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really expect the number of rows to be above 2^^63? That seem like a very large number, we are talking Exobytes..
In reply to: 187502257 [](ancestors = 187502257,187491573)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, merely casing to long and doing a normal division won't save you. The issue isn't 2^^63 or any other large number. The issue is that divisions over integers are truncating divisions, which is the opposite of what we want sometimes like now when we're trying to figure out how many containers we'll need of capacity c
if we have r
items to store in them.
Let's consider, row count long r=800
and readerOptions.count
, let's call it long c=100
. In this case, you can fit those into 8 blocks. 800/100 => 8
, (800+100-1)/100=>8
, and (long)Math.Ceiling(((decimal)800/100))=>8
.
Now consider if r=801
. Suddenly we'll need 9
blocks. But 801/100 => 8
. Note however, (801+101-1)/100=>9
and (long)Math.Ceiling(((decimal)801/100))=>9
.
In reply to: 187506601 [](ancestors = 187506601,187502257,187491573)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally the decimal cast was because that's the only way to come out of the division with a decimal point instead of the closest lesser integer, but I've added a method in MathUtils to do the calculation with longs for clarity and to minimize casting instead.
} | ||
try | ||
{ | ||
blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetIdentityPermutation [](start = 55, length = 22)
Now here all you need is to get a random ordering of integers from 0 to N. why do you need to allocate an array, cant you just write a function to return an IEnumerable that would return random sequence of blocks without having to allocate this.?
Take a look here: https://stackoverflow.com/questions/35102073/generating-a-random-non-repeating-sequence-of-all-integers-in-net
If you do this I believe you will be able to avoid all that catch/throwing code.
Now, if you use the IEnumerable that generates a random sequence, it can even use Longs for number of blocks. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a research question of whether the "randomness" from this (what is this, an LCG?) is sufficient. Right now our permutation is generated using a pretty good PRNG generator -- albeit one that uses O(blocks)
memory. Whether we can "get away" with a worse one in situations where shuffling is important strikes me as a research question, one that I'm not sure we should attempt to answer now.
I might choose to delay for this reason: Elsewhere in our codebase we don't insist that a random order without repetitions be accomplished while using no additional memory (as evidenced by the fact that we have a utility for a materialized array, vs. having a utility like you mention, and we've used the first ). So: is there a strong reason why we would insist on it here, especially since there are valid questions in the linked discussion about whether this simple modular arithmetic scheme is random enough for many applications?
In reply to: 187491929 [](ancestors = 187491929)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider having a shuffling scheme based on BigArray<int>
or BigArray<long>
if we want to support these larger cases. (How often does this come up, practically?)
In reply to: 187505463 [](ancestors = 187505463,187491929)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me the last question that Tom raised directs how to move forward. Since I don't have any real insight into what kind of Parquet files will be loaded, I can't really say how powerful the shuffle should be, or if what we have now is sufficient for most cases (maybe someone with more experience could shed some light,) though it seems for a single Parquet file, 300M rows*300M blocks covers a wide range, and can also be expanded if the dataset were saved as multiple Parquet files.
For now I've kept the array shuffle unless the elements don't fit in an array, in which case the data is read sequentially up to the size of int.MaxValue. If we decide to go forward with BigArray or another means of shuffling, I will look into it at a later time.
int[] blockOrder; | ||
try | ||
{ | ||
numBlocks = checked((int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checked [](start = 32, length = 7)
consider using numblocks as long and then just use an if statement if it is above int.max
Why need to throw and then catch? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exception driven computation is kind of one of those mechanisms best left as a last resort. Storing in long and then using a simple comparison is better for sure.
In reply to: 187492091 [](ancestors = 187492091)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
|
||
_dataSetEnumerator = new int[0].GetEnumerator(); // Initialize an empty enumerator to get started | ||
_dataSetEnumerator = new int[0].Cast<int>().GetEnumerator(); // Initialize an empty enumerator to get started |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new int[0].Cast().GetEnumerator(); [](start = 37, length = 39)
There's no need to allocate even an empty array. You may just use Enumerable.Empty<int>().GetEnumerator()
.
// This exception is thrown when attempting to create an array of more than ~300M elements | ||
throw _loader._host.ExceptParam(_chunkSizeShortName, "Error due to too many blocks. Try increasing block size."); | ||
} | ||
_blockEnumerator = blockOrder.Cast<int>().GetEnumerator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blockOrder.Cast() [](start = 35, length = 22)
Please just declare blockOrder
as an IEnumerable<int>
, avoid this Cast
stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
try | ||
{ | ||
blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Utils.GetIdentityPermutation(numBlocks) [](start = 49, length = 39)
Enumerable.Range
. Don't materialize the array in the sequential case... #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess the implementation there just allocates an array and that throws. We should avoid it. Can we at least do some kind of hierarchical randomness where we don't need to allocate the entire array of blocks?
for example, if we don't trust LCG, shy not just get a block of LCG numbers and then shuffle them. If blocks are large enough to be random, but not too big to be a memory hog we might achieve our goal.
In reply to: 187505300 [](ancestors = 187505300)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Gleb, this comment was on the sequential case. There's certainly no need to allocate there. Did you mean this comment to go on the above thread?
Let's talk about it here though just so we don't break threads. Right, so, these questions we're having are, I think, a pretty clear sign of something goes well beyond the scope of what this PR is meant to do. Regarding the linked stack overflow discussion, I think using a PRNG scheme with like LCG, which has a bad reputation verging on infamy, or any close variant, is not something we ought to just accept reflexively. If you feel strongly about it, perhaps an issue should be filed, so whether or not that level of shuffling is sufficient or produces good practical results can be evaluated. In the meantime, is using the existing mechanism that is used elsewhere without incident sufficient, until that hypothetical investigation is finished?
In reply to: 187506292 [](ancestors = 187506292,187505300)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank for the clarification Tom. I agree that this discussion, although good to have, is beyond the scope of this PR. The original implementation was done to conform to the pattern of the other shuffling mechanisms of this library. If there is an agreement on the implementation we're happy to follow guidance. At this moment however, I'm seconding the vote for array shuffling until a sufficient MathUtils replacement is created that supports IEnumerator. Please advise.
catch (OutOfMemoryException) | ||
{ | ||
// This exception is thrown when attempting to create an array of more than ~300M elements | ||
throw _loader._host.ExceptParam(_chunkSizeShortName, "Error due to too many blocks. Try increasing block size."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_loader._host.ExceptParam(_chunkSizeShortName, "Error due to too many blocks. Try increasing block size."); [](start = 26, length = 107)
In the unlikely event that there is a file with more blocks than can fit in an array in .NET, neither should we throw. When some circumstances prevents us from serving up a random cursor in any other implementation of IDataView
we don't throw, we just simply do not serve up a random cursor, but rather just a sequential one. We ought to have the same policy here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly we don't expect our cursor creation anywhere else to throw on valid inputs except in exceptional circumstances like, say, an input file has been deleted.
In reply to: 187505473 [](ancestors = 187505473)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the IEnumerable fix for the sequential case takes away the restriction of the number of elements needing to fit in an array, so one would expect the "next best thing" to a shuffled sequence to be an unshuffled sequence, instead of just throwing. For now ParquetLoader will give a sequential cursor when it's unable to produce a shuffled sequence.
@@ -94,7 +94,8 @@ public sealed class Arguments | |||
private readonly int _columnChunkReadSize; | |||
private readonly Column[] _columnsLoaded; | |||
private readonly DataSet _schemaDataSet; | |||
private const int _defaultColumnChunkReadSize = 100; // Should ideally be close to Rowgroup size | |||
private const int _defaultColumnChunkReadSize = 1000000; | |||
private const string _chunkSizeShortName = "chunkSize"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private const string _chunkSizeShortName = "chunkSize"; [](start = 8, length = 55)
Hi, I guess the idea here is we want error messages to use the short name, but this is inconsistent with other things in the codebase that just use the field name. Also the short name is relevant only to command line users, but is not relevant to ML.NET's API users.) Remove this constand string, and replace all usages with nameof(Arguments.ColumnChunkReadSize)
. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL of nameof(). Fixed! #Closed
…al reading instead of throwing
@mandyshieh, does this replace #12? If so, can you please close that one? |
{ | ||
// if unable to create a shuffled sequence, default to sequential reading. | ||
dataSetOrder = Enumerable.Range(0, ds.RowCount); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems identical code with line 402. Can this be factored out? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
/// <returns></returns> | ||
public static long DivisionCeiling(long numerator, long denomenator) | ||
{ | ||
return (numerator + denomenator - 1) / denomenator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(numerator + denomenator [](start = 19, length = 24)
should this be checked? what if numerator + denumerator > long? Will we get a hidden overflow? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, check added! #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="numerator">Number to be divided.</param> | ||
/// <param name="denomenator">Number with which to divide the numerator.</param> | ||
/// <returns></returns> | ||
public static long DivisionCeiling(long numerator, long denomenator) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DivisionCeiling [](start = 27, length = 15)
FYI we do this a lot throughout the codebase where we could benefit from this instead -- hmm, but I guess I won't make you fix everywhere we do it. I suppose. ;)
var numBlocks = MathUtils.DivisionCeiling((long)parent.GetRowCount(), _readerOptions.Count); | ||
if (numBlocks > int.MaxValue) | ||
{ | ||
throw _loader._host.ExceptParam(nameof(Arguments.ColumnChunkReadSize), "Error due to too many blocks. Try increasing block size."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base class has opened a channel and made it available via Ch
property, should probably use that instead since it's a more specific exception context.
I know it's kind of sillly, but when a channel exits successfully you have to call Refers to: src/Microsoft.ML.Parquet/ParquetLoader.cs:186 in ec1e78a. [](commit_id = ec1e78a, deletion_comment = False) |
order = Enumerable.Range(0, size); | ||
} | ||
return order; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor note (minor because I am fairly certain this condition will never occur), it is somewhat of a corruption of the IDataView
contract` to have an implementation that delivers different results depending on runtime conditions... but since this will never occur and it's for shuffling case only anyway, it is probably not so bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* added ceiling to mathutils and have parquetloader default to sequential reading instead of throwing * factor out sequence creation and add check for overflow in MathUtils
Fixes #118
Originally PR#12
In order to enable shuffling for both the {rows in a block} and {blocks}, both sizes have to (1) be smaller than int.MaxValue, and (2) fit in an int array, making the upper limit for both {rows in a block} and {number of blocks} around 300M. If either value throws an exception when assigned, the library will suggest adjusting the block size as a potential fix.
Also changed the default block size to 1M to minimize the chance of OutOfMemory and Overflow exceptions happening in the first place, and addressed comments in original PR.