[BUG] Overflow potentially corrupting hashes in hash_vocab implementation #12403

Open
@vyasr

Description

Describe the bug
The hash vocab test in cudf currently triggers an overflow warning. This is easy to observe by running the test with pytest configured to promote warnings to errors.

Steps/Code to reproduce bug
Execute pytest -W error python/cudf/cudf/tests/test_hash_vocab.py::test_correct_bert_base_vocab_hash from the root of the repository.

The output should include a traceback like this:

test_correct_bert_base_vocab_hash ____________________________________________________________________________________

datadir = '/home/vyasr/local/rapids/cudf/python/cudf/cudf/tests/data/subword_tokenizer_data/bert_base_cased_sampled', tmpdir = local('/tmp/pytest-of-rapids/pytest-2/test_correct_bert_base_vocab_h0')

    def test_correct_bert_base_vocab_hash(datadir, tmpdir):
        # The vocabulary is drawn from bert-base-cased
        vocab_path = os.path.join(datadir, "vocab.txt")
    
        groundtruth_path = os.path.join(datadir, "vocab-hash.txt")
        output_path = tmpdir.join("cudf-vocab-hash.txt")
>       hash_vocab(vocab_path, output_path)

python/cudf/cudf/tests/test_hash_vocab.py:23: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python/cudf/cudf/utils/hash_vocab_utils.py:269: in hash_vocab
    ) = _perfect_hash(keys, 10)
python/cudf/cudf/utils/hash_vocab_utils.py:129: in _perfect_hash
    internal_table, coeff_a, coeff_b = _find_hash_for_internal(b)
python/cudf/cudf/utils/hash_vocab_utils.py:102: in _find_hash_for_internal
    bins = _make_bins(hash_bin, new_length, a, b)
python/cudf/cudf/utils/hash_vocab_utils.py:60: in _make_bins
    bins[_hash_func(item, a, b, num_bins)].append(item)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

k = 233297689050786, a = 21608458564245, b = 116, size = 6

    def _hash_func(k, a, b, size):
        k = np.uint64(k)
        a = np.uint64(a)
        b = np.uint64(b)
        size = np.uint64(size)
>       return ((a * k + b) % PRIME) % size
E       RuntimeWarning: overflow encountered in ulong_scalars

python/cudf/cudf/utils/hash_vocab_utils.py:49: RuntimeWarning
------------------------------------------------------------------------------------------ Captured stdout call -------------------------------------------------------------------------------------------
Attempting to build table using 1.500000n space
Longest bin was 11
Processing bin 0 / 875 of size = 6
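For anyone poking at this without the test data: the warning can be reproduced directly from the values in the traceback above. A minimal sketch, assuming the `PRIME` value below matches the constant defined in `hash_vocab_utils.py`:

```python
import warnings

import numpy as np

# Values copied from the traceback; prime is assumed to match hash_vocab_utils.py.
k, a, b = np.uint64(233297689050786), np.uint64(21608458564245), np.uint64(116)
size, prime = np.uint64(6), np.uint64(281474976710677)

warnings.simplefilter("error")  # mirrors pytest -W error

# a * k exceeds 2**64, so numpy wraps the result and emits a RuntimeWarning,
# which the filter above promotes to an exception.
print(((a * k + b) % prime) % size)
```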

Expected behavior
No overflows should occur. The overflow happens because all inputs to _hash_func are converted to np.uint64, which is limited to 64 bits and wraps on overflow, rather than kept as primitive Python ints, which have unlimited precision.

I attempted the naive fix of simply removing the np.uint64 conversions (which also requires rewriting some call sites to convert explicitly, since they index into numpy arrays or add numpy ints to Python ints), but that quick conversion made the test fail outright. I didn't verify my changes thoroughly, so I may have made an error. Either way, we should determine which of the following is true: the numpy integer wraparound is a property the algorithm implicitly depends on; it is a bug that users could actually hit and that we need to fix; or it is expected behavior and the warning can simply be silenced.
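To make the discrepancy concrete, here is a sketch (hypothetical helper names; `PRIME` again assumed to match `hash_vocab_utils.py`) comparing the current np.uint64 semantics, where every operation wraps modulo 2**64, against the unlimited-precision Python-int version:

```python
PRIME = 281474976710677  # assumed: the modulus from hash_vocab_utils.py
MASK64 = (1 << 64) - 1   # np.uint64 arithmetic is arithmetic modulo 2**64


def hash_wrapped(k, a, b, size):
    # Mirrors the current behavior: each intermediate result wraps at
    # 64 bits, exactly as np.uint64 does (minus the warning).
    prod = (a * k) & MASK64
    total = (prod + b) & MASK64
    return (total % PRIME) % size


def hash_exact(k, a, b, size):
    # The naive Python-int rewrite: unlimited precision, no wraparound.
    return ((a * k + b) % PRIME) % size


k, a, b, size = 233297689050786, 21608458564245, 116, 6  # from the traceback
print(hash_wrapped(k, a, b, size), hash_exact(k, a, b, size))
# If these two values differ, the wraparound is feeding into the computed
# hashes, which would explain why the naive conversion broke the test.
```

If the conclusion is instead that the wraparound is intended (i.e. the hash is deliberately defined over 64-bit arithmetic), the warning could be suppressed narrowly rather than globally, for example:

```python
import numpy as np

PRIME = np.uint64(281474976710677)  # assumed module-level constant


def _hash_func(k, a, b, size):
    k, a, b, size = (np.uint64(x) for x in (k, a, b, size))
    # Hypothetically treating wraparound as intentional, silence only
    # numpy's overflow warning for this one expression.
    with np.errstate(over="ignore"):
        return ((a * k + b) % PRIME) % size
```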

Labels

0 - Backlog (In queue waiting for assignment) · Python (Affects Python cuDF API.) · bug (Something isn't working) · strings (strings issues (C++ and Python)) · tests (Unit testing for project)
