[BUG] Overflow potentially corrupting hashes in hash_vocab implementation #12403
**Describe the bug**
The hash vocab test in cudf currently emits an overflow warning. This is easy to observe by running the test with pytest configured to raise warnings as errors.
**Steps/Code to reproduce bug**
From the root of the repository, execute:

```
pytest -W error python/cudf/cudf/tests/test_hash_vocab.py::test_correct_bert_base_vocab_hash
```
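Alternatively, here is a minimal sketch that triggers the same warning directly, without the pytest harness. It assumes `hash_vocab` is importable from `cudf.utils.hash_vocab_utils` (as the traceback below shows) and that the sampled bert-base-cased vocabulary shipped with the cudf tests is available at its usual repo-relative path:

```python
import warnings

from cudf.utils.hash_vocab_utils import hash_vocab

# Path assumes the bert-base-cased sample data used by the test suite.
vocab_path = (
    "python/cudf/cudf/tests/data/subword_tokenizer_data/"
    "bert_base_cased_sampled/vocab.txt"
)

with warnings.catch_warnings():
    # Promote warnings to errors, mirroring `pytest -W error`.
    warnings.simplefilter("error", RuntimeWarning)
    # Should raise: RuntimeWarning: overflow encountered in ulong_scalars
    hash_vocab(vocab_path, "cudf-vocab-hash.txt")
```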
The output should include a traceback like this:
```
____________________________________ test_correct_bert_base_vocab_hash ____________________________________

datadir = '/home/vyasr/local/rapids/cudf/python/cudf/cudf/tests/data/subword_tokenizer_data/bert_base_cased_sampled', tmpdir = local('/tmp/pytest-of-rapids/pytest-2/test_correct_bert_base_vocab_h0')

    def test_correct_bert_base_vocab_hash(datadir, tmpdir):
        # The vocabulary is drawn from bert-base-cased
        vocab_path = os.path.join(datadir, "vocab.txt")
        groundtruth_path = os.path.join(datadir, "vocab-hash.txt")
        output_path = tmpdir.join("cudf-vocab-hash.txt")
>       hash_vocab(vocab_path, output_path)

python/cudf/cudf/tests/test_hash_vocab.py:23:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python/cudf/cudf/utils/hash_vocab_utils.py:269: in hash_vocab
    ) = _perfect_hash(keys, 10)
python/cudf/cudf/utils/hash_vocab_utils.py:129: in _perfect_hash
    internal_table, coeff_a, coeff_b = _find_hash_for_internal(b)
python/cudf/cudf/utils/hash_vocab_utils.py:102: in _find_hash_for_internal
    bins = _make_bins(hash_bin, new_length, a, b)
python/cudf/cudf/utils/hash_vocab_utils.py:60: in _make_bins
    bins[_hash_func(item, a, b, num_bins)].append(item)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
k = 233297689050786, a = 21608458564245, b = 116, size = 6

    def _hash_func(k, a, b, size):
        k = np.uint64(k)
        a = np.uint64(a)
        b = np.uint64(b)
        size = np.uint64(size)
>       return ((a * k + b) % PRIME) % size
E       RuntimeWarning: overflow encountered in ulong_scalars

python/cudf/cudf/utils/hash_vocab_utils.py:49: RuntimeWarning
------------------------------------ Captured stdout call ------------------------------------
Attempting to build table using 1.500000n space
Longest bin was 11
Processing bin 0 / 875 of size = 6
```
**Expected behavior**
We should not have overflows occurring. The overflow happens because all of the inputs to `_hash_func` are converted to `np.uint64` (limited to 64 bits) rather than left as primitive Python ints (which have unlimited precision). I attempted the naive modification of simply removing the `np.uint64` conversions here (which also requires rewriting some call sites to do conversions, since they index into numpy arrays or add numpy ints to Python ints), but my quick conversion led to the test failing outright. I didn't check my work all that thoroughly, so it's possible I made an error, but we should make sure we understand whether the numpy integer overflow here is a property we implicitly depend on, a bug that users could actually hit and that we need to fix, or expected behavior whose warning can simply be silenced.
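For illustration, a minimal sketch of the wraparound, using the `k` and `a` values from the traceback above (the exact warning wording varies across NumPy versions):

```python
import warnings

import numpy as np

# Values taken from the failing _hash_func call in the traceback above.
k = np.uint64(233297689050786)  # roughly 2**47.7
a = np.uint64(21608458564245)   # roughly 2**44.3

# a * k needs ~92 bits, so uint64 arithmetic wraps modulo 2**64 and
# NumPy emits a RuntimeWarning for the scalar overflow.
with warnings.catch_warnings():
    warnings.simplefilter("error", RuntimeWarning)
    try:
        a * k
    except RuntimeWarning as exc:
        print("numpy:", exc)  # e.g. "overflow encountered in ulong_scalars"

# Plain Python ints have unlimited precision, so the same product is exact.
print("python:", int(a) * int(k))
```

Since `_hash_func` computes `((a * k + b) % PRIME) % size` on the wrapped 64-bit value, the bin index it returns can differ from the arbitrary-precision result, which is why it matters whether the wraparound is load-bearing or a genuine bug.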