[BUG] Overflow potentially corrupting hashes in hash_vocab implementation #12403
**Describe the bug**
The hash vocab test in cudf currently emits an overflow warning. This is easy to observe by running the test with pytest configured to raise warnings as errors.
**Steps/Code to reproduce bug**
From the root of the repository, execute:

```
pytest -W error python/cudf/cudf/tests/test_hash_vocab.py::test_correct_bert_base_vocab_hash
```
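Alternatively, here is a minimal sketch that triggers the same warning directly, without the pytest harness. It assumes `hash_vocab` is importable from `cudf.utils.hash_vocab_utils` (as the traceback below shows) and that the sampled bert-base-cased vocabulary shipped with the cudf tests is available at its usual repo-relative path:

```python
import warnings

from cudf.utils.hash_vocab_utils import hash_vocab

# Path assumes the bert-base-cased sample data used by the test suite.
vocab_path = (
    "python/cudf/cudf/tests/data/subword_tokenizer_data/"
    "bert_base_cased_sampled/vocab.txt"
)

with warnings.catch_warnings():
    # Promote warnings to errors, mirroring `pytest -W error`.
    warnings.simplefilter("error", RuntimeWarning)
    # Should raise: RuntimeWarning: overflow encountered in ulong_scalars
    hash_vocab(vocab_path, "cudf-vocab-hash.txt")
```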
The output should include a traceback like this:
```
____________________________________ test_correct_bert_base_vocab_hash ____________________________________

datadir = '/home/vyasr/local/rapids/cudf/python/cudf/cudf/tests/data/subword_tokenizer_data/bert_base_cased_sampled', tmpdir = local('/tmp/pytest-of-rapids/pytest-2/test_correct_bert_base_vocab_h0')

    def test_correct_bert_base_vocab_hash(datadir, tmpdir):
        # The vocabulary is drawn from bert-base-cased
        vocab_path = os.path.join(datadir, "vocab.txt")
        groundtruth_path = os.path.join(datadir, "vocab-hash.txt")
        output_path = tmpdir.join("cudf-vocab-hash.txt")
>       hash_vocab(vocab_path, output_path)

python/cudf/cudf/tests/test_hash_vocab.py:23:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python/cudf/cudf/utils/hash_vocab_utils.py:269: in hash_vocab
    ) = _perfect_hash(keys, 10)
python/cudf/cudf/utils/hash_vocab_utils.py:129: in _perfect_hash
    internal_table, coeff_a, coeff_b = _find_hash_for_internal(b)
python/cudf/cudf/utils/hash_vocab_utils.py:102: in _find_hash_for_internal
    bins = _make_bins(hash_bin, new_length, a, b)
python/cudf/cudf/utils/hash_vocab_utils.py:60: in _make_bins
    bins[_hash_func(item, a, b, num_bins)].append(item)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
k = 233297689050786, a = 21608458564245, b = 116, size = 6

    def _hash_func(k, a, b, size):
        k = np.uint64(k)
        a = np.uint64(a)
        b = np.uint64(b)
        size = np.uint64(size)
>       return ((a * k + b) % PRIME) % size
E       RuntimeWarning: overflow encountered in ulong_scalars

python/cudf/cudf/utils/hash_vocab_utils.py:49: RuntimeWarning
------------------------------------ Captured stdout call ------------------------------------
Attempting to build table using 1.500000n space
Longest bin was 11
Processing bin 0 / 875 of size = 6
```
**Expected behavior**
We should not have overflows occurring. The overflow happens because all of the inputs to `_hash_func` are converted to `np.uint64` (limited to 64 bits) rather than left as primitive Python ints (which have unlimited precision). I attempted the naive modification of simply removing the `np.uint64` conversions here (which also requires rewriting some call sites to do conversions, since they index into numpy arrays or add numpy ints to Python ints), but my quick conversion led to the test failing outright. I didn't check my work all that thoroughly, so it's possible I made an error, but we should make sure we understand whether the numpy integer overflow here is a property we implicitly depend on, a bug that users could actually hit and that we need to fix, or expected behavior whose warning can simply be silenced.
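For illustration, a minimal sketch of the wraparound, using the `k` and `a` values from the traceback above (the exact warning wording varies across NumPy versions):

```python
import warnings

import numpy as np

# Values taken from the failing _hash_func call in the traceback above.
k = np.uint64(233297689050786)  # roughly 2**47.7
a = np.uint64(21608458564245)   # roughly 2**44.3

# a * k needs ~92 bits, so uint64 arithmetic wraps modulo 2**64 and
# NumPy emits a RuntimeWarning for the scalar overflow.
with warnings.catch_warnings():
    warnings.simplefilter("error", RuntimeWarning)
    try:
        a * k
    except RuntimeWarning as exc:
        print("numpy:", exc)  # e.g. "overflow encountered in ulong_scalars"

# Plain Python ints have unlimited precision, so the same product is exact.
print("python:", int(a) * int(k))
```

Since `_hash_func` computes `((a * k + b) % PRIME) % size` on the wrapped 64-bit value, the bin index it returns can differ from the arbitrary-precision result, which is why it matters whether the wraparound is load-bearing or a genuine bug.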