Added Block Size Checks for ParquetLoader #120

mandyshieh · 2018-05-10T21:58:17Z

Fixes #118
Originally PR#12

In order to enable shuffling for both the {rows in a block} and {blocks}, both sizes have to (1) be smaller than int.MaxValue, and (2) fit in an int array, making the upper limit for both {rows in a block} and {number of blocks} around 300M. If either value throws an exception when assigned, the library will suggest adjusting the block size as a potential fix.

Also changed the default block size to 1M to minimize the chance of OutOfMemory and Overflow exceptions happening in the first place, and addressed comments in original PR.

glebuk · 2018-05-10T23:59:15Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                catch (OverflowException)
+                {
+                    // this exception is thrown when number of blocks exceeds int.MaxValue
+                    throw _loader._host.ExceptParam("ColumnChunkReadSize", "Error due to too many blocks. Try increasing block size.");


ColumnChunkReadSize [](start = 53, length = 19)

nameof( #Closed

Fixed, along with other places better off with nameof().

glebuk · 2018-05-11T00:00:35Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                int[] blockOrder;
+                try
+                {
+                    numBlocks = checked((int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count)));


decimal [](start = 60, length = 7)

why cast long to decimal? why not just stick to long? #Closed

Because we want the Math.Ceiling to operate, whereas if we had this be a long it be the equiv to a floor -- that is, unless the number of rows happened to be exactly divisible by the count, then numBlocks would be 1 lower than it ought to be.

That said: I don't particularly like this cast to decimal myself and call to Math to solve what is a fairly straightforward problem. Elsewhere we solve the problem of ceil((decimal)a / b) by doing instead (a + b -1) / b on int or long types. That said that's not terribly obvious either, though at least it has the virtue of using familiar types. Perhaps an addition to MathUtils, since this sort of rounded-up-div problem comes up pretty often.

In reply to: 187491573 [](ancestors = 187491573)

do we really expect the number of rows to be above 2^^63? That seem like a very large number, we are talking Exobytes..

In reply to: 187502257 [](ancestors = 187502257,187491573)

Well, merely casing to long and doing a normal division won't save you. The issue isn't 2^^63 or any other large number. The issue is that divisions over integers are truncating divisions, which is the opposite of what we want sometimes like now when we're trying to figure out how many containers we'll need of capacity c if we have r items to store in them.

Let's consider, row count long r=800 and readerOptions.count, let's call it long c=100. In this case, you can fit those into 8 blocks. 800/100 => 8, (800+100-1)/100=>8, and (long)Math.Ceiling(((decimal)800/100))=>8.

Now consider if r=801. Suddenly we'll need 9 blocks. But 801/100 => 8. Note however, (801+101-1)/100=>9 and (long)Math.Ceiling(((decimal)801/100))=>9.

In reply to: 187506601 [](ancestors = 187506601,187502257,187491573)

Originally the decimal cast was because that's the only way to come out of the division with a decimal point instead of the closest lesser integer, but I've added a method in MathUtils to do the calculation with longs for clarity and to minimize casting instead.

glebuk · 2018-05-11T00:03:23Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                }
+                try
+                {
+                    blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks);


GetIdentityPermutation [](start = 55, length = 22)

Now here all you need is to get a random ordering of integers from 0 to N. why do you need to allocate an array, cant you just write a function to return an IEnumerable that would return random sequence of blocks without having to allocate this.?
Take a look here: https://stackoverflow.com/questions/35102073/generating-a-random-non-repeating-sequence-of-all-integers-in-net
If you do this I believe you will be able to avoid all that catch/throwing code.

Now, if you use the IEnumerable that generates a random sequence, it can even use Longs for number of blocks. #Resolved

This might be a research question of whether the "randomness" from this (what is this, an LCG?) is sufficient. Right now our permutation is generated using a pretty good PRNG generator -- albeit one that uses O(blocks) memory. Whether we can "get away" with a worse one in situations where shuffling is important strikes me as a research question, one that I'm not sure we should attempt to answer now.

I might choose to delay for this reason: Elsewhere in our codebase we don't insist that a random order without repetitions be accomplished while using no additional memory (as evidenced by the fact that we have a utility for a materialized array, vs. having a utility like you mention, and we've used the first ). So: is there a strong reason why we would insist on it here, especially since there are valid questions in the linked discussion about whether this simple modular arithmetic scheme is random enough for many applications?

In reply to: 187491929 [](ancestors = 187491929)

Consider having a shuffling scheme based on BigArray<int> or BigArray<long> if we want to support these larger cases. (How often does this come up, practically?)

In reply to: 187505463 [](ancestors = 187505463,187491929)

To me the last question that Tom raised directs how to move forward. Since I don't have any real insight into what kind of Parquet files will be loaded, I can't really say how powerful the shuffle should be, or if what we have now is sufficient for most cases (maybe someone with more experience could shed some light,) though it seems for a single Parquet file, 300M rows*300M blocks covers a wide range, and can also be expanded if the dataset were saved as multiple Parquet files.

For now I've kept the array shuffle unless the elements don't fit in an array, in which case the data is read sequentially up to the size of int.MaxValue. If we decide to go forward with BigArray or another means of shuffling, I will look into it at a later time.

glebuk · 2018-05-11T00:04:38Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                int[] blockOrder;
+                try
+                {
+                    numBlocks = checked((int)Math.Ceiling(((decimal)parent.GetRowCount() / _readerOptions.Count)));


checked [](start = 32, length = 7)

consider using numblocks as long and then just use an if statement if it is above int.max
Why need to throw and then catch? #Closed

Yes, exception driven computation is kind of one of those mechanisms best left as a last resort. Storing in long and then using a simple comparison is better for sure.

In reply to: 187492091 [](ancestors = 187492091)

TomFinley · 2018-05-11T01:47:53Z

src/Microsoft.ML.Parquet/ParquetLoader.cs


-                _dataSetEnumerator = new int[0].GetEnumerator(); // Initialize an empty enumerator to get started
+                _dataSetEnumerator = new int[0].Cast<int>().GetEnumerator(); // Initialize an empty enumerator to get started


new int[0].Cast().GetEnumerator(); [](start = 37, length = 39)

There's no need to allocate even an empty array. You may just use Enumerable.Empty<int>().GetEnumerator().

TomFinley · 2018-05-11T02:10:59Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                    // This exception is thrown when attempting to create an array of more than ~300M elements
+                    throw _loader._host.ExceptParam(_chunkSizeShortName, "Error due to too many blocks. Try increasing block size.");
+                }
+                _blockEnumerator = blockOrder.Cast<int>().GetEnumerator();


blockOrder.Cast() [](start = 35, length = 22)

Please just declare blockOrder as an IEnumerable<int>, avoid this Cast stuff.

TomFinley · 2018-05-11T02:11:21Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                }
+                try
+                {
+                    blockOrder = _rand == null ? Utils.GetIdentityPermutation(numBlocks) : Utils.GetRandomPermutation(rand, numBlocks);


Utils.GetIdentityPermutation(numBlocks) [](start = 49, length = 39)

Enumerable.Range. Don't materialize the array in the sequential case... #Closed

My guess the implementation there just allocates an array and that throws. We should avoid it. Can we at least do some kind of hierarchical randomness where we don't need to allocate the entire array of blocks?
for example, if we don't trust LCG, shy not just get a block of LCG numbers and then shuffle them. If blocks are large enough to be random, but not too big to be a memory hog we might achieve our goal.

In reply to: 187505300 [](ancestors = 187505300)

Hi Gleb, this comment was on the sequential case. There's certainly no need to allocate there. Did you mean this comment to go on the above thread?

Let's talk about it here though just so we don't break threads. Right, so, these questions we're having are, I think, a pretty clear sign of something goes well beyond the scope of what this PR is meant to do. Regarding the linked stack overflow discussion, I think using a PRNG scheme with like LCG, which has a bad reputation verging on infamy, or any close variant, is not something we ought to just accept reflexively. If you feel strongly about it, perhaps an issue should be filed, so whether or not that level of shuffling is sufficient or produces good practical results can be evaluated. In the meantime, is using the existing mechanism that is used elsewhere without incident sufficient, until that hypothetical investigation is finished?

In reply to: 187506292 [](ancestors = 187506292,187505300)

Thank for the clarification Tom. I agree that this discussion, although good to have, is beyond the scope of this PR. The original implementation was done to conform to the pattern of the other shuffling mechanisms of this library. If there is an agreement on the implementation we're happy to follow guidance. At this moment however, I'm seconding the vote for array shuffling until a sufficient MathUtils replacement is created that supports IEnumerator. Please advise.

TomFinley · 2018-05-11T02:13:12Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                catch (OutOfMemoryException)
+                {
+                    // This exception is thrown when attempting to create an array of more than ~300M elements
+                    throw _loader._host.ExceptParam(_chunkSizeShortName, "Error due to too many blocks. Try increasing block size.");


_loader._host.ExceptParam(_chunkSizeShortName, "Error due to too many blocks. Try increasing block size."); [](start = 26, length = 107)

In the unlikely event that there is a file with more blocks than can fit in an array in .NET, neither should we throw. When some circumstances prevents us from serving up a random cursor in any other implementation of IDataView we don't throw, we just simply do not serve up a random cursor, but rather just a sequential one. We ought to have the same policy here.

Certainly we don't expect our cursor creation anywhere else to throw on valid inputs except in exceptional circumstances like, say, an input file has been deleted.

In reply to: 187505473 [](ancestors = 187505473)

Right, the IEnumerable fix for the sequential case takes away the restriction of the number of elements needing to fit in an array, so one would expect the "next best thing" to a shuffled sequence to be an unshuffled sequence, instead of just throwing. For now ParquetLoader will give a sequential cursor when it's unable to produce a shuffled sequence.

TomFinley · 2018-05-12T16:29:25Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

@@ -94,7 +94,8 @@ public sealed class Arguments
        private readonly int _columnChunkReadSize;
        private readonly Column[] _columnsLoaded;
        private readonly DataSet _schemaDataSet;
-        private const int _defaultColumnChunkReadSize = 100; // Should ideally be close to Rowgroup size
+        private const int _defaultColumnChunkReadSize = 1000000;
+        private const string _chunkSizeShortName = "chunkSize";


private const string _chunkSizeShortName = "chunkSize"; [](start = 8, length = 55)

Hi, I guess the idea here is we want error messages to use the short name, but this is inconsistent with other things in the codebase that just use the field name. Also the short name is relevant only to command line users, but is not relevant to ML.NET's API users.) Remove this constand string, and replace all usages with nameof(Arguments.ColumnChunkReadSize). #Closed

TIL of nameof(). Fixed! #Closed

…al reading instead of throwing

markusweimer · 2018-05-15T23:52:42Z

@mandyshieh, does this replace #12? If so, can you please close that one?

glebuk · 2018-05-17T01:07:22Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                    {
+                        // if unable to create a shuffled sequence, default to sequential reading.
+                        dataSetOrder = Enumerable.Range(0, ds.RowCount);
+                    }


This seems identical code with line 402. Can this be factored out? #Closed

glebuk · 2018-05-17T01:09:13Z

src/Microsoft.ML.Core/Utilities/MathUtils.cs

+        /// <returns></returns>
+        public static long DivisionCeiling(long numerator, long denomenator)
+        {
+            return (numerator + denomenator - 1) / denomenator;


(numerator + denomenator [](start = 19, length = 24)

should this be checked? what if numerator + denumerator > long? Will we get a hidden overflow? #Closed

Good point, check added! #Closed

glebuk

TomFinley · 2018-05-22T22:10:45Z

src/Microsoft.ML.Core/Utilities/MathUtils.cs

+        /// <param name="numerator">Number to be divided.</param>
+        /// <param name="denomenator">Number with which to divide the numerator.</param>
+        /// <returns></returns>
+        public static long DivisionCeiling(long numerator, long denomenator)


DivisionCeiling [](start = 27, length = 15)

FYI we do this a lot throughout the codebase where we could benefit from this instead -- hmm, but I guess I won't make you fix everywhere we do it. I suppose. ;)

TomFinley · 2018-05-22T22:12:22Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                var numBlocks = MathUtils.DivisionCeiling((long)parent.GetRowCount(), _readerOptions.Count);
+                if (numBlocks > int.MaxValue)
+                {
+                    throw _loader._host.ExceptParam(nameof(Arguments.ColumnChunkReadSize), "Error due to too many blocks. Try increasing block size.");


The base class has opened a channel and made it available via Ch property, should probably use that instead since it's a more specific exception context.

TomFinley · 2018-05-22T22:13:57Z

I know it's kind of sillly, but when a channel exits successfully you have to call ch.Done()... so that we can tell the difference. (Because, I guess, the big freakin' exception that must result in that condition wasn't a big enough of a clue I guess. :D )

Refers to: src/Microsoft.ML.Parquet/ParquetLoader.cs:186 in ec1e78a. [](commit_id = ec1e78a, deletion_comment = False)

TomFinley · 2018-05-22T22:15:03Z

src/Microsoft.ML.Parquet/ParquetLoader.cs

+                    order = Enumerable.Range(0, size);
+                }
+                return order;
+            }
        }


A minor note (minor because I am fairly certain this condition will never occur), it is somewhat of a corruption of the IDataView contract` to have an implementation that delivers different results depending on runtime conditions... but since this will never occur and it's for shuffling case only anyway, it is probably not so bad.

TomFinley

* added ceiling to mathutils and have parquetloader default to sequential reading instead of throwing * factor out sequence creation and add check for overflow in MathUtils

mandyshieh added 3 commits May 3, 2018 17:49

block size exception

49bb968

catch exception when style

64de2d6

address comments

6e4d46c

glebuk reviewed May 10, 2018

View reviewed changes

glebuk reviewed May 11, 2018

View reviewed changes

glebuk requested a review from TomFinley May 11, 2018 00:03

glebuk reviewed May 11, 2018

View reviewed changes

TomFinley reviewed May 11, 2018

View reviewed changes

TomFinley reviewed May 12, 2018

View reviewed changes

added ceiling to mathutils and have parquetloader default to sequenti…

9967d12

…al reading instead of throwing

mandyshieh mentioned this pull request May 16, 2018

Added block size exception #12

Closed

glebuk reviewed May 17, 2018

View reviewed changes

factor out sequence creation and add check for overflow in MathUtils

ec1e78a

glebuk approved these changes May 17, 2018

View reviewed changes

tyclintw approved these changes May 22, 2018

View reviewed changes

TomFinley reviewed May 22, 2018

View reviewed changes

TomFinley approved these changes May 22, 2018

View reviewed changes

shauheen merged commit 7a5b303 into dotnet:master May 23, 2018

ghost locked as resolved and limited conversation to collaborators Mar 30, 2022


		_dataSetEnumerator = new int[0].GetEnumerator(); // Initialize an empty enumerator to get started
		_dataSetEnumerator = new int[0].Cast<int>().GetEnumerator(); // Initialize an empty enumerator to get started

Added Block Size Checks for ParquetLoader #120

Added Block Size Checks for ParquetLoader #120

Uh oh!

Conversation

mandyshieh commented May 10, 2018

Uh oh!

glebuk May 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

glebuk May 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

glebuk May 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

glebuk May 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

glebuk May 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomFinley May 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomFinley May 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomFinley May 11, 2018 • edited by glebuk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomFinley May 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomFinley May 11, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomFinley May 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mandyshieh May 14, 2018 • edited by glebuk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

glebuk May 10, 2018 •

edited

Loading

glebuk May 11, 2018 •

edited

Loading

glebuk May 11, 2018 •

edited

Loading

glebuk May 11, 2018 •

edited

Loading

glebuk May 11, 2018 •

edited

Loading

TomFinley May 11, 2018 •

edited

Loading

TomFinley May 11, 2018 •

edited

Loading

TomFinley May 11, 2018 •

edited by glebuk

Loading

TomFinley May 11, 2018 •

edited

Loading

TomFinley May 11, 2018 •

edited

Loading

TomFinley May 12, 2018 •

edited

Loading

mandyshieh May 14, 2018 •

edited by glebuk

Loading

glebuk May 17, 2018 •

edited

Loading

glebuk May 17, 2018 •

edited

Loading

mandyshieh May 17, 2018 •

edited by glebuk

Loading