
Added Block Size Checks for ParquetLoader #120


Merged: 5 commits into dotnet:master, May 23, 2018

Conversation

mandyshieh
Contributor

Fixes #118
Originally PR #12

In order to enable shuffling for both the {rows in a block} and the {blocks}, both sizes have to (1) be smaller than int.MaxValue, and (2) fit in an int array, making the upper limit for both {rows in a block} and {number of blocks} around 300M. If assigning either value throws an exception, the library will suggest adjusting the block size as a potential fix.

Also changed the default block size to 1M to minimize the chance of OutOfMemory and Overflow exceptions happening in the first place, and addressed comments in original PR.

catch (OverflowException)
{
// this exception is thrown when number of blocks exceeds int.MaxValue
throw _loader._host.ExceptParam("ColumnChunkReadSize", "Error due to too many blocks. Try increasing block size.");
Contributor

@glebuk glebuk May 10, 2018


ColumnChunkReadSize

nameof( #Closed

Contributor Author

Fixed, along with other places better off with nameof().

int[] blockOrder;
try
{
numBlocks = checked((int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count)));
Contributor

@glebuk glebuk May 11, 2018


decimal

why cast long to decimal? why not just stick to long? #Closed

Contributor

Because we want the Math.Ceiling to operate, whereas if this were a long it would be the equivalent of a floor -- that is, unless the number of rows happened to be exactly divisible by the count, numBlocks would be 1 lower than it ought to be.

That said: I don't particularly like this cast to decimal and call into Math myself to solve what is a fairly straightforward problem. Elsewhere we solve the problem of ceil((decimal)a / b) by instead doing (a + b - 1) / b on int or long types. That said, that's not terribly obvious either, though at least it has the virtue of using familiar types. Perhaps an addition to MathUtils, since this sort of rounded-up-div problem comes up pretty often.


In reply to: 187491573

Contributor

@glebuk glebuk May 11, 2018


do we really expect the number of rows to be above 2^63? That seems like a very large number, we are talking exabytes...


In reply to: 187502257

Contributor

Well, merely casting to long and doing a normal division won't save you. The issue isn't 2^63 or any other large number. The issue is that divisions over integers are truncating divisions, which is the opposite of what we want sometimes, like now, when we're trying to figure out how many containers of capacity c we'll need if we have r items to store in them.

Let's consider row count long r=800 and _readerOptions.Count, let's call it long c=100. In this case, you can fit those into 8 blocks: 800/100 => 8, (800+100-1)/100 => 8, and (long)Math.Ceiling((decimal)800/100) => 8.

Now consider if r=801. Suddenly we'll need 9 blocks. But 801/100 => 8. Note however, (801+100-1)/100 => 9 and (long)Math.Ceiling((decimal)801/100) => 9.


In reply to: 187506601
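The arithmetic above can be checked with a quick sketch (in Python here, purely for illustration; the PR itself is C#):

```python
def division_ceiling(numerator, denominator):
    # Round-up integer division: adding (denominator - 1) before the
    # truncating division bumps the quotient whenever there is a remainder.
    return (numerator + denominator - 1) // denominator

# r = 800 rows, c = 100 rows per block: exactly 8 full blocks.
assert 800 // 100 == 8
assert division_ceiling(800, 100) == 8

# r = 801 rows needs a ninth, partially filled block, but plain
# truncating division still says 8.
assert 801 // 100 == 8
assert division_ceiling(801, 100) == 9
```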

Contributor Author

Originally the decimal cast was because that's the only way to come out of the division with a decimal point instead of the closest lesser integer, but I've added a method in MathUtils to do the calculation with longs for clarity and to minimize casting instead.

}
try
{
blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks);
Contributor

@glebuk glebuk May 11, 2018


GetIdentityPermutation

Now here all you need is a random ordering of integers from 0 to N. Why do you need to allocate an array? Can't you just write a function that returns an IEnumerable yielding a random sequence of blocks, without having to allocate this?
Take a look here: https://stackoverflow.com/questions/35102073/generating-a-random-non-repeating-sequence-of-all-integers-in-net
If you do this I believe you will be able to avoid all that catch/throwing code.

Now, if you use the IEnumerable that generates a random sequence, it can even use longs for the number of blocks. #Resolved
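A minimal sketch of the idea in the linked Stack Overflow thread (Python for illustration; the helper name and parameters are hypothetical, not anything in this codebase): a full-period LCG visits every value below a power-of-two modulus exactly once, so skipping out-of-range values yields a permutation in O(1) memory. As the thread itself notes, its statistical quality is far below a real shuffle, which is exactly the concern raised below.

```python
import random

def lazy_permutation(n, seed=None):
    """Yield 0..n-1 in pseudo-random order without allocating an array.

    Hypothetical sketch: run a full-period LCG over the next power of
    two >= n and skip values outside the range ("cycle walking").
    """
    m = 4
    while m < n:
        m <<= 1
    rng = random.Random(seed)
    # Hull-Dobell conditions for full period when m is a power of two:
    # increment c must be odd, multiplier a must satisfy a % 4 == 1.
    a = 4 * rng.randrange(m // 4) + 1
    c = 2 * rng.randrange(m // 2) + 1
    x = rng.randrange(m)
    for _ in range(m):
        x = (a * x + c) % m
        if x < n:
            yield x
```

`sorted(lazy_permutation(n))` always recovers `range(n)`, which is what makes it a permutation; whether the ordering is random enough is the open question in this thread.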

Contributor

This might be a research question of whether the "randomness" from this (what is this, an LCG?) is sufficient. Right now our permutation is generated using a pretty good PRNG -- albeit one that uses O(blocks) memory. Whether we can "get away" with a worse one in situations where shuffling is important strikes me as a research question, one that I'm not sure we should attempt to answer now.

I might choose to delay for this reason: elsewhere in our codebase we don't insist that a random order without repetitions be accomplished while using no additional memory (as evidenced by the fact that we have a utility for a materialized array, vs. having a utility like you mention, and we've used the first). So: is there a strong reason why we would insist on it here, especially since there are valid questions in the linked discussion about whether this simple modular arithmetic scheme is random enough for many applications?


In reply to: 187491929

Contributor

Consider having a shuffling scheme based on BigArray<int> or BigArray<long> if we want to support these larger cases. (How often does this come up, practically?)


In reply to: 187505463

Contributor Author

To me the last question that Tom raised directs how to move forward. Since I don't have any real insight into what kind of Parquet files will be loaded, I can't really say how powerful the shuffle should be, or whether what we have now is sufficient for most cases (maybe someone with more experience could shed some light), though it seems that for a single Parquet file, 300M rows * 300M blocks covers a wide range, and can also be expanded if the dataset were saved as multiple Parquet files.

For now I've kept the array shuffle unless the elements don't fit in an array, in which case the data is read sequentially up to the size of int.MaxValue. If we decide to go forward with BigArray or another means of shuffling, I will look into it at a later time.

@glebuk glebuk requested a review from TomFinley May 11, 2018 00:03
int[] blockOrder;
try
{
numBlocks = checked((int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count)));
Contributor

@glebuk glebuk May 11, 2018


checked

consider using numBlocks as a long and then just use an if statement if it is above int.MaxValue.
Why throw and then catch? #Closed

Contributor

@TomFinley TomFinley May 11, 2018


Yes, exception driven computation is kind of one of those mechanisms best left as a last resort. Storing in long and then using a simple comparison is better for sure.


In reply to: 187492091

Contributor Author

Fixed!


_dataSetEnumerator = new int[0].GetEnumerator(); // Initialize an empty enumerator to get started
_dataSetEnumerator = new int[0].Cast<int>().GetEnumerator(); // Initialize an empty enumerator to get started
Contributor

new int[0].Cast<int>().GetEnumerator();

There's no need to allocate even an empty array. You may just use Enumerable.Empty<int>().GetEnumerator().

// This exception is thrown when attempting to create an array of more than ~300M elements
throw _loader._host.ExceptParam(_chunkSizeShortName, "Error due to too many blocks. Try increasing block size.");
}
_blockEnumerator = blockOrder.Cast<int>().GetEnumerator();
Contributor

@TomFinley TomFinley May 11, 2018


blockOrder.Cast<int>()

Please just declare blockOrder as an IEnumerable<int>, avoid this Cast stuff.

Contributor Author

Done.

}
try
{
blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks);
Contributor

@TomFinley TomFinley May 11, 2018


Utils.GetIdentityPermutation(numBlocks)

Enumerable.Range. Don't materialize the array in the sequential case... #Closed

Contributor

My guess is the implementation there just allocates an array, and that throws. We should avoid it. Can we at least do some kind of hierarchical randomness where we don't need to allocate the entire array of blocks?
For example, if we don't trust LCG, why not just get a block of LCG numbers and then shuffle them? If the blocks are large enough to be random, but not so big as to be a memory hog, we might achieve our goal.


In reply to: 187505300
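The buffered-shuffle suggestion above could look roughly like this (a Python sketch under assumed semantics, not the PR's code): shuffle within a bounded window so memory stays O(buffer) rather than O(blocks).

```python
import random

def blockwise_shuffle(n, buffer_size, seed=None):
    """Yield 0..n-1 with only local shuffling: fill a bounded buffer,
    Fisher-Yates it, emit, repeat. Memory is O(buffer_size), but an
    item never strays outside its own buffer -- the randomness/memory
    trade-off being debated in this thread.
    """
    rng = random.Random(seed)
    for start in range(0, n, buffer_size):
        window = list(range(start, min(start + buffer_size, n)))
        rng.shuffle(window)
        yield from window
```

The output is always a permutation, but an element can move at most buffer_size - 1 positions from where sequential reading would place it, which is why it is weaker than a full shuffle.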

Contributor

@TomFinley TomFinley May 11, 2018


Hi Gleb, this comment was on the sequential case. There's certainly no need to allocate there. Did you mean this comment to go on the above thread?

Let's talk about it here though just so we don't break threads. Right, so, these questions we're having are, I think, a pretty clear sign of something that goes well beyond the scope of what this PR is meant to do. Regarding the linked Stack Overflow discussion, I think using a PRNG scheme like LCG, which has a bad reputation verging on infamy, or any close variant, is not something we ought to just accept reflexively. If you feel strongly about it, perhaps an issue should be filed, so that whether that level of shuffling is sufficient or produces good practical results can be evaluated. In the meantime, is using the existing mechanism that is used elsewhere without incident sufficient, until that hypothetical investigation is finished?


In reply to: 187506292

Contributor

Thanks for the clarification, Tom. I agree that this discussion, although good to have, is beyond the scope of this PR. The original implementation was done to conform to the pattern of the other shuffling mechanisms in this library. If there is agreement on an implementation, we're happy to follow guidance. At this moment, however, I'm seconding the vote for array shuffling until a sufficient MathUtils replacement is created that supports IEnumerator. Please advise.

catch (OutOfMemoryException)
{
// This exception is thrown when attempting to create an array of more than ~300M elements
throw _loader._host.ExceptParam(_chunkSizeShortName, "Error due to too many blocks. Try increasing block size.");
Contributor

@TomFinley TomFinley May 11, 2018


_loader._host.ExceptParam(_chunkSizeShortName, "Error due to too many blocks. Try increasing block size.");

In the unlikely event that there is a file with more blocks than can fit in a .NET array, neither should we throw. When some circumstance prevents us from serving up a random cursor in any other implementation of IDataView, we don't throw; we simply do not serve up a random cursor, but rather a sequential one. We ought to have the same policy here.

Contributor

Certainly we don't expect our cursor creation anywhere else to throw on valid inputs except in exceptional circumstances like, say, an input file has been deleted.


In reply to: 187505473

Contributor Author

Right, the IEnumerable fix for the sequential case takes away the restriction of the number of elements needing to fit in an array, so one would expect the "next best thing" to a shuffled sequence to be an unshuffled sequence, instead of just throwing. For now ParquetLoader will give a sequential cursor when it's unable to produce a shuffled sequence.
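The resulting policy can be sketched like this (Python for illustration; the function name and shape are hypothetical, not the loader's actual API):

```python
import random

def block_order(num_blocks, rng=None):
    """Prefer a materialized random permutation when shuffling is
    requested; if the permutation cannot be allocated, fall back to
    sequential order instead of throwing, mirroring how other
    IDataView implementations quietly decline a random cursor.
    """
    if rng is None:
        return range(num_blocks)  # sequential case: no allocation at all
    try:
        order = list(range(num_blocks))
        rng.shuffle(order)
        return order
    except MemoryError:
        return range(num_blocks)  # graceful degradation to sequential
```

Note the sequential path returns a lazy range rather than an array, which is the "don't materialize in the sequential case" point from the review.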

@@ -94,7 +94,8 @@ public sealed class Arguments
private readonly int _columnChunkReadSize;
private readonly Column[] _columnsLoaded;
private readonly DataSet _schemaDataSet;
private const int _defaultColumnChunkReadSize = 100; // Should ideally be close to Rowgroup size
private const int _defaultColumnChunkReadSize = 1000000;
private const string _chunkSizeShortName = "chunkSize";
Contributor

@TomFinley TomFinley May 12, 2018


private const string _chunkSizeShortName = "chunkSize";

Hi, I guess the idea here is that we want error messages to use the short name, but this is inconsistent with other things in the codebase that just use the field name. (Also, the short name is relevant only to command-line users, not to ML.NET's API users.) Remove this constant string, and replace all usages with nameof(Arguments.ColumnChunkReadSize). #Closed

Contributor Author

@mandyshieh mandyshieh May 14, 2018


TIL of nameof(). Fixed! #Closed

@markusweimer
Member

@mandyshieh, does this replace #12? If so, can you please close that one?

{
// if unable to create a shuffled sequence, default to sequential reading.
dataSetOrder = Enumerable.Range(0, ds.RowCount);
}
Contributor

@glebuk glebuk May 17, 2018


This seems identical to the code at line 402. Can it be factored out? #Closed

Contributor Author

Done.

/// <returns></returns>
public static long DivisionCeiling(long numerator, long denomenator)
{
return (numerator + denomenator - 1) / denomenator;
Contributor

@glebuk glebuk May 17, 2018


(numerator + denomenator

should this be checked? What if numerator + denomenator > long.MaxValue? Will we get a hidden overflow? #Closed

Contributor Author

@mandyshieh mandyshieh May 17, 2018


Good point, check added! #Closed
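To see why the concern matters: `numerator + denomenator - 1` can exceed `long.MaxValue` even when the quotient itself would fit. A Python sketch simulating 64-bit `checked` arithmetic (Python ints never overflow, so the bound is checked explicitly; this is an illustration, not the PR's C# fix):

```python
LONG_MAX = 2**63 - 1  # Int64.MaxValue

def division_ceiling_checked(numerator, denominator):
    # The intermediate sum, not the final quotient, is what can
    # overflow Int64 in the C# version without a `checked` context.
    if numerator + denominator - 1 > LONG_MAX:
        raise OverflowError("numerator + denominator - 1 exceeds Int64")
    return (numerator + denominator - 1) // denominator
```

An overflow-free alternative is `numerator // denominator + (1 if numerator % denominator else 0)`, which never forms the large intermediate sum at all.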

Contributor

@glebuk glebuk left a comment


:shipit:

/// <param name="numerator">Number to be divided.</param>
/// <param name="denomenator">Number with which to divide the numerator.</param>
/// <returns></returns>
public static long DivisionCeiling(long numerator, long denomenator)
Contributor

DivisionCeiling

FYI we do this a lot throughout the codebase, where we could benefit from this instead -- hmm, but I guess I won't make you fix everywhere we do it. I suppose. ;)

var numBlocks = MathUtils.DivisionCeiling((long)parent.GetRowCount(), _readerOptions.Count);
if (numBlocks > int.MaxValue)
{
throw _loader._host.ExceptParam(nameof(Arguments.ColumnChunkReadSize), "Error due to too many blocks. Try increasing block size.");
Contributor


The base class has opened a channel and made it available via Ch property, should probably use that instead since it's a more specific exception context.

@TomFinley
Contributor

        }

I know it's kind of silly, but when a channel exits successfully you have to call ch.Done()... so that we can tell the difference. (Because, I guess, the big freakin' exception that must result in that condition wasn't a big enough clue. :D)


Refers to: src/Microsoft.ML.Parquet/ParquetLoader.cs:186 in ec1e78a.

order = Enumerable.Range(0, size);
}
return order;
}
}
Contributor

A minor note (minor because I am fairly certain this condition will never occur): it is somewhat of a corruption of the IDataView contract to have an implementation that delivers different results depending on runtime conditions... but since this will never occur and it's for the shuffling case only anyway, it is probably not so bad.

Contributor

@TomFinley TomFinley left a comment


:shipit:

@shauheen shauheen merged commit 7a5b303 into dotnet:master May 23, 2018
eerhardt pushed a commit to eerhardt/machinelearning that referenced this pull request Jul 27, 2018
* added ceiling to mathutils and have parquetloader default to sequential reading instead of throwing

* factor out sequence creation and add check for overflow in MathUtils
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022
Successfully merging this pull request may close these issues.

ParquetLoader Exceptions Due to Block Size
6 participants