Skip to content

[ENH] Add new methods of generating bounded random integers #1979

Closed
@MatthiasKohl

Description

@MatthiasKohl

Describe the solution you'd like
There have been recent developments in generating bounded random integers, in particular see swiftlang/swift#39143

This should be added to RAFT and then used in cugraph (in particular algorithms like random walks), for the reasons outlined in context below.

Additional context
There are generally two "simple" methods of generating bounded random integers:

  1. Generate some number of random bits (usually 32 or 64), interpret as unsigned integer and take modulo bound.
  2. Generate random uniform floating-point number (curand generates in interval ]0, 1]) and multiply by ~bound, then center this range and cast to integer.

Method 1 is slow (due to modulo) and can have fairly high bias towards the lower values in [0, bound[ since the bound generally does not divide the number of possible integers generated.
Method 2 seems better for lower values of bound but there are a number of issues:

  1. It is actually not that easy to get right, since we need to center correctly: we have to multiply by bound, otherwise we don't get the right range, but curand may generate 1, so floor(x * bound) may generate the value bound. We either have to do floor((x - std::numeric_limits<float>::epsilon()) * bound) or floor(x * bound) then check for bound and subtract 1 introducing slight bias towards the largest value. Multiplying by bound - 1 or bound - 0.5 always offsets the range with some values having more or less bias.
  2. For large values of bound, we run into similar (if not bigger) problems with bias than the modulo-based method: with 32-bit float, there are only 2^24 possible values in ]0, 1]. That's only ~17M. For a large enough bound, we don't have enough floating point numbers to evenly distribute over the possible outputs: for example for a bound of 10M, some integers will get hit twice but some only once meaning that some integers will be 50% more likely to be drawn than others.
  3. We could solve the problems above by always generating 64-bit floats for which most possible values of bound are small (as long as they are much smaller than 2^53). But generating uniform floating point numbers is slow itself and 64-bit floats are very slow, plus the arithmetic needed afterwards are slower, too.

Examples of issues with floating-point centering:
[current method in random walks] floor(x * (bound - 1)) : this has a very low likelihood to generate the value bound - 1. For bound := 3, this means 2 is only generated with a probability of ~2^-24
[centered method]: floor(x * (bound - 1) + 0.5) : For bound := 3, this would transform the range to ]0.5, 2.5] meaning that 1 is twice more likely to be generated than both 0 and 2.

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    ? - Needs TriageNeed team to review and classify

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions