Open
Description
Here is a reproducer:
import lance
import pyarrow as pa
import tempfile
uri = tempfile.mkdtemp()
ds = lance.write_dataset(pa.table({"text": ["this", "is", "a", "bug"]}), uri)
ds.create_scalar_index("text", index_type="INVERTED")
# Fragment 2: 10 rows containing "hello", appended after the index was built (unindexed)
ds = lance.write_dataset(pa.table({"text": [f"hello_{i}" for i in range(10)]}), uri, mode="append")
# FTS should find all 10 "hello" rows
expected = 10
actual = ds.to_table(full_text_query="hello").num_rows
print(f"Expected: {expected}, Actual: {actual}")
assert actual == expected, f"BUG: FTS missed {expected - actual} unindexed rows"
Result:
Expected: 10, Actual: 7
Traceback (most recent call last):
  File "/home/wyatt/work/lance-vibecheck/fts_repro.py", line 17, in <module>
    assert actual == expected, f"BUG: FTS missed {expected - actual} unindexed rows"
           ^^^^^^^^^^^^^^^^^^
AssertionError: BUG: FTS missed 3 unindexed rows
The result seems sensitive to the number of rows in the initial write. The fetched data is:
text: [["hello_3","hello_4","hello_5","hello_6","hello_7","hello_8","hello_9"]]
_score: [[0.4924765,0.43078294,0.38299227,0.3448405,0.31365758,0.28768212,0.2657031]]
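To make the gap easier to see, here is a rough sketch that lists which appended rows are absent from the result above, plus a helper for varying the size of the initial write. `missing_rows` and `count_hits` are hypothetical names introduced for illustration, not part of lance. The `filler_{i}` rows are also an assumption: they stand in for the four original words and may tokenize differently. Only lance calls already used in the reproducer appear here.

```python
import tempfile


def missing_rows(expected, returned):
    """Return the expected values that are absent from the query result, in order."""
    seen = set(returned)
    return [value for value in expected if value not in seen]


def count_hits(initial_rows, appended_rows=10):
    """Rebuild the reproducer with `initial_rows` indexed rows; return the FTS hit count.

    Hypothetical helper: the filler text is an assumption, not the original data.
    """
    import lance  # imported lazily so missing_rows works without lance installed
    import pyarrow as pa

    uri = tempfile.mkdtemp()
    # Fragment 1: rows covered by the inverted index, none containing "hello"
    filler = pa.table({"text": [f"filler_{i}" for i in range(initial_rows)]})
    ds = lance.write_dataset(filler, uri)
    ds.create_scalar_index("text", index_type="INVERTED")
    # Fragment 2: rows appended after the index was built (unindexed)
    hello = pa.table({"text": [f"hello_{i}" for i in range(appended_rows)]})
    ds = lance.write_dataset(hello, uri, mode="append")
    return ds.to_table(full_text_query="hello").num_rows


# Rows actually returned in the output above
returned = ["hello_3", "hello_4", "hello_5", "hello_6", "hello_7", "hello_8", "hello_9"]
expected = [f"hello_{i}" for i in range(10)]
print(missing_rows(expected, returned))  # ['hello_0', 'hello_1', 'hello_2']

# To probe the sensitivity to the initial write size (requires lance):
#   for n in (1, 2, 4, 8):
#       print(n, count_hits(n))
```

The missing rows are the first three of the appended fragment. I have not run the `count_hits` sweep, so no results are claimed for it.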