1 change: 1 addition & 0 deletions docs/src/config.md
@@ -56,6 +56,7 @@ The following features require the Lance Spark SQL extension to be enabled:
- [UPDATE COLUMNS with backfill](operations/dml/update-columns.md) - Update existing columns using data from a source
- [OPTIMIZE](operations/ddl/optimize.md) - Compact table fragments for improved query performance
- [VACUUM](operations/ddl/vacuum.md) - Remove old versions and reclaim storage space
- [Full-Text Search](operations/dql/fts.md) - `lance_match`, `lance_phrase`, and `lance_multi_match` SQL functions for querying FTS indexes

## Basic Setup

3 changes: 3 additions & 0 deletions docs/src/operations/ddl/create-index.md
@@ -77,6 +77,9 @@ For the `fts` method, the following options are required:

For advanced tokenizer configuration, refer to the [Lance FTS documentation](https://lance.org/format/table/index/scalar/fts/#tokenizers).

!!! tip "Querying FTS Indexes"
    Once an FTS index is created, query it using the `lance_match`, `lance_phrase`, and `lance_multi_match` SQL functions. See [Full-Text Search](../dql/fts.md) for function signatures, options, and examples.

### FTS Format Version

Lance FTS index format v2 is selected by the Lance runtime environment variable `LANCE_FTS_FORMAT_VERSION=2`. Configure it on both the Spark driver and executors before creating the index.
1 change: 1 addition & 0 deletions docs/src/operations/dql/.pages
@@ -1,6 +1,7 @@
title: DQL
nav:
- select.md
- fts.md
- vector-search.md
- search.md
- hybrid-search.md
125 changes: 125 additions & 0 deletions docs/src/operations/dql/fts.md
@@ -0,0 +1,125 @@
# Full-Text Search

Query Lance tables using full-text search (FTS) with the `lance_match`, `lance_phrase`, and `lance_multi_match` SQL functions. These functions push FTS predicates down to the Lance inverted index for efficient text search.

!!! warning "Prerequisites"
    - The Lance Spark SQL extension must be enabled. See [Spark SQL Extensions](../../config.md#spark-sql-extensions).
    - The Lance catalog must be the session's default catalog (`spark.sql.defaultCatalog`). Spark resolves unqualified function names against the default catalog; if it points elsewhere, calls to `lance_match`, `lance_phrase`, and `lance_multi_match` fail with a "function not found" error.
    - An FTS index must exist on the target column(s). See [CREATE INDEX: FTS Options](../ddl/create-index.md#fts-options).
    - `lance_phrase` requires the FTS index to be built with `with_position = true`.
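
The first two prerequisites come down to two session settings. The sketch below collects them in a plain dictionary and renders them as `spark-submit` arguments. It is illustrative only: the extension class name is taken from the Scala sources in this change, the catalog name `lance` is an assumption based on the `lance.db.documents` examples on this page, and `to_submit_args` is a hypothetical helper. The setting that registers the catalog implementation itself is deliberately left out; see the configuration page for it.

```python
# Sketch only: Spark settings implied by the prerequisites above.
# Assumption: the Lance catalog is registered under the name "lance".
LANCE_CATALOG = "lance"

spark_conf = {
    # Extension class defined in package org.lance.spark.extensions.
    "spark.sql.extensions": "org.lance.spark.extensions.LanceSparkSessionExtensions",
    # Unqualified lance_match / lance_phrase / lance_multi_match resolve here.
    "spark.sql.defaultCatalog": LANCE_CATALOG,
}


def to_submit_args(conf):
    """Render a conf dict as spark-submit `--conf key=value` arguments (hypothetical helper)."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


print(" ".join(to_submit_args(spark_conf)))
```

The same keys can be passed to `SparkSession.builder.config(key, value)` when the session is created programmatically.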

## lance_match

Keyword search against a single FTS-indexed column.

```sql
lance_match(column, query)
lance_match(column, query, 'key=value,...')
```

**Arguments:**

| Position | Name | Type | Description |
|----------|---------|--------|------------------------------------|
| 1 | column | String | Column name with an FTS index. |
| 2 | query | String | Search query text. |
| 3 | options | String | Optional comma-separated key=value options. |

**Options:**

| Key | Type | Default | Constraints | Description |
|------------------|---------|---------|------------------|----------------------------------------------------------------|
| `fuzziness` | Integer | not set (auto) | ≥ 0 | Edit distance for fuzzy matching. When omitted, the engine determines fuzziness by token length. |
| `operator` | String | `OR` | `AND` or `OR` | Whether all terms must match (`AND`) or any term (`OR`). |
| `boost` | Float | `1.0` | any number | Scoring multiplier (affects ranking, not row selection). |
| `prefix_length`  | Integer | `0`     | ≥ 0              | Number of leading characters that must match exactly during fuzzy matching. |
| `max_expansions` | Integer | `50` | ≥ 1 | Maximum fuzzy term expansions. |
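
To make the options format concrete, here is a minimal Python sketch of how a `'key=value,...'` string could be parsed and checked against the constraints in the table above. It is not the extension's actual parser: `parse_match_options` is a hypothetical name, and treating `operator` case-insensitively is an assumption.

```python
def parse_match_options(options):
    """Parse 'key=value,...' into a validated dict (illustrative only)."""
    # Each key maps to (converter, validity check), mirroring the table above.
    specs = {
        "fuzziness": (int, lambda v: v >= 0),
        "operator": (str.upper, lambda v: v in ("AND", "OR")),  # assumption: case-insensitive
        "boost": (float, lambda v: True),
        "prefix_length": (int, lambda v: v >= 0),
        "max_expansions": (int, lambda v: v >= 1),
    }
    # Documented defaults; fuzziness stays unset so the engine can choose it.
    parsed = {"operator": "OR", "boost": 1.0, "prefix_length": 0, "max_expansions": 50}
    for item in filter(None, (part.strip() for part in options.split(","))):
        key, sep, raw = item.partition("=")
        key = key.strip()
        if not sep or key not in specs:
            raise ValueError(f"unrecognized option: {item!r}")
        convert, is_valid = specs[key]
        value = convert(raw.strip())  # int()/float() raise ValueError on bad input
        if not is_valid(value):
            raise ValueError(f"invalid value for {key}: {raw!r}")
        parsed[key] = value
    return parsed


print(parse_match_options("fuzziness=1,operator=AND"))
```

Rejecting unknown keys up front, as this sketch does, surfaces typos such as `fuziness=1` instead of silently ignoring them.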

**Examples:**

```sql
-- Basic keyword search
SELECT * FROM lance.db.documents WHERE lance_match(body, 'machine learning');

-- Fuzzy search with AND operator
SELECT * FROM lance.db.documents
WHERE lance_match(body, 'machin lerning', 'fuzziness=1,operator=AND');

-- Boosted search with prefix length
SELECT * FROM lance.db.documents
WHERE lance_match(body, 'spark', 'boost=2.0,prefix_length=2');
```
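
The fuzzy example above matches because each misspelled term is one edit away from the intended word. The sketch below shows that arithmetic with plain Levenshtein distance plus a `prefix_length` check. It is an assumption that the engine's notion of edit distance is plain Levenshtein; a production index typically expands fuzzy terms with automata rather than comparing terms pairwise, and `fuzzy_term_matches` is a hypothetical helper.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep when equal
            ))
        prev = curr
    return prev[-1]


def fuzzy_term_matches(query_term, indexed_term, fuzziness, prefix_length=0):
    """True if the terms share the required prefix and differ by at most `fuzziness` edits."""
    if query_term[:prefix_length] != indexed_term[:prefix_length]:
        return False
    return edit_distance(query_term, indexed_term) <= fuzziness


print(edit_distance("machin", "machine"), edit_distance("lerning", "learning"))
```

With `prefix_length=2`, a candidate such as `apark` is rejected for the query `spark` before any distance is computed, which is how a prefix requirement narrows the set of fuzzy expansions.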

## lance_phrase

Phrase search against a single FTS-indexed column. Returns rows where the query terms appear consecutively (or within a `slop` distance).

!!! note
    The FTS index must be built with `with_position = true` for phrase queries.

```sql
lance_phrase(column, query)
lance_phrase(column, query, slop)
```

**Arguments:**

| Position | Name | Type | Description |
|----------|--------|---------|----------------------------------------------------------|
| 1 | column | String | Column name with a positional FTS index. |
| 2 | query | String | Phrase to search for. |
| 3 | slop | Integer | Optional. Max positional distance between terms (default: 0 = exact). Must be ≥ 0. |

**Examples:**

```sql
-- Exact phrase match
SELECT * FROM lance.db.documents WHERE lance_phrase(body, 'machine learning');

-- Phrase match allowing up to 2 words between terms
SELECT * FROM lance.db.documents WHERE lance_phrase(body, 'deep networks', 2);
```
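
As a rough mental model of `slop`, the sketch below treats it as the maximum number of extra tokens allowed inside the matched span, with terms required to appear in query order. This is a simplification under stated assumptions: tokenization here is lowercase whitespace splitting, `phrase_matches` is a hypothetical helper, and the engine's real slop semantics may be more permissive (for example, allowing reordered terms).

```python
def phrase_matches(text, phrase, slop=0):
    """In-order phrase match allowing at most `slop` extra tokens inside the span."""
    tokens = text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return False

    def search(term_idx, start, first_pos):
        # Every term has been placed within the slop budget.
        if term_idx == len(terms):
            return True
        for pos in range(start, len(tokens)):
            if tokens[pos] != terms[term_idx]:
                continue
            origin = pos if first_pos is None else first_pos
            # Tokens inside the span so far that are not phrase terms.
            gaps = (pos - origin) - term_idx
            if gaps > slop:
                break  # later positions only widen the span
            if search(term_idx + 1, pos + 1, origin):
                return True
        return False

    return search(0, 0, None)


print(phrase_matches("Apache unified spark processing framework", "apache spark", slop=1))
```

Under this model, `'apache spark'` with `slop = 0` matches "Apache Spark is a unified analytics engine" but not "Apache unified spark processing framework", which needs `slop >= 1`.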

## lance_multi_match

Search across multiple FTS-indexed columns with a single query. A row matches if **any** column matches (OR, default) or **all** columns match (AND).

```sql
lance_multi_match(query, col1, col2, ...)
lance_multi_match(query, 'operator=AND|OR', col1, col2, ...)
```

**Arguments:**

| Position | Name | Type | Description |
|----------|----------|--------|--------------------------------------------------------------------------|
| 1 | query | String | Search query text. |
| 2 | options | String | Optional. `'operator=AND'` or `'operator=OR'`. If omitted, default is OR. |
| 2+ | columns | String | Two or more column names, each with an FTS index. Columns start at position 3 when an options string is supplied. |

When the second argument is a string literal containing `=` with a recognized option key, it is treated as an options string. Otherwise it is treated as the first column name.
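
The disambiguation rule can be sketched in a few lines of Python. This is an illustration, not the extension's code: `split_multi_match_args` is a hypothetical name, `operator` is assumed to be the only recognized key, and every argument is modeled as a plain string, so the sketch cannot tell a SQL string literal from a column reference the way the planner can.

```python
RECOGNIZED_OPTION_KEYS = {"operator"}  # assumption: the only key documented above


def split_multi_match_args(args):
    """Return (query, operator, columns) from lance_multi_match-style positional arguments."""
    if len(args) < 2:
        raise ValueError("expected a query followed by column names")
    query, rest = args[0], list(args[1:])
    operator = "OR"  # documented default
    key, sep, value = rest[0].partition("=")
    if sep and key.strip().lower() in RECOGNIZED_OPTION_KEYS:
        # Second argument is an options string, so it is not a column.
        operator = value.strip().upper()
        if operator not in ("AND", "OR"):
            raise ValueError(f"operator must be AND or OR, got {value!r}")
        rest = rest[1:]
    if len(rest) < 2:
        raise ValueError("two or more columns are required")
    return query, operator, rest


print(split_multi_match_args(["machine learning", "operator=AND", "title", "body"]))
```

A second argument such as `title` contains no `=`, so it falls through to the column list and the operator keeps its default.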

**Examples:**

```sql
-- Search across title and body (OR — row matches if either column matches)
SELECT * FROM lance.db.documents
WHERE lance_multi_match('machine learning', title, body);

-- AND operator — row must match in all columns
SELECT * FROM lance.db.documents
WHERE lance_multi_match('machine learning', 'operator=AND', title, body);
```

## Known Limitations

- **No global relevance ordering.** FTS predicates act as WHERE filters and return all matching rows. There is no `lance_score()` function and no relevance-based `ORDER BY`; rows are returned in storage order, not by score. In particular, `SELECT * FROM t WHERE lance_match(...) LIMIT N` returns N matching rows chosen by Spark task scheduling rather than by BM25 rank, so it is not a "top-N by relevance" query.
- **Column names must match the schema exactly.** Column references are resolved at planning time; aliases or expressions are not supported.
- **`lance_phrase` requires a positional index.** The FTS index must be built with `with_position = true`. Without it, phrase queries will fail.
- **WHERE-filter uses full BM25 scoring (no WAND early stopping).** Every row matching the query is evaluated — `wand_factor` is not exposed because SQL WHERE semantics require returning all matching rows.
- **One FTS predicate per query.** Only a single FTS function call is allowed per `WHERE` clause. For multi-column search, use `lance_multi_match` instead of combining multiple `lance_match` calls with OR.
- **FTS predicates cannot appear inside OR.** `WHERE lance_match(a, 'x') OR other_condition` is not supported — OR semantics cannot be preserved when pushing a single FTS query to the scanner.

## Cost Model

Each Spark task scans one fragment and independently opens all committed FTS index segments to build a global BM25 scorer. Total segment-open cost per query is proportional to `N_fragments × M_index_segments`. For example, a 500-fragment dataset with 20 index segments causes 10,000 segment opens per query. Running `OPTIMIZE` periodically compacts data fragments, reducing the fragment count and therefore the total number of segment opens — this is the primary user-facing mitigation for latency-sensitive workloads.
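
The arithmetic behind this cost model is simple enough to check by hand. The sketch below reproduces the figure from the paragraph above and compares it with a hypothetical post-compaction layout; the 50-fragment number is an illustrative assumption, not a measured result, and `segment_opens` is a hypothetical helper.

```python
def segment_opens(n_fragments, n_index_segments):
    """Index-segment opens per query: one scan task per fragment, each opening every segment."""
    return n_fragments * n_index_segments


before = segment_opens(500, 20)  # figures from the paragraph above
after = segment_opens(50, 20)    # hypothetical: OPTIMIZE compacts 500 fragments into 50
print(before, after, before // after)
```

Because the cost is a product, reducing either factor helps proportionally: a tenfold reduction in fragment count cuts segment opens tenfold even when the number of index segments stays the same.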
146 changes: 146 additions & 0 deletions integration-tests/test_lance_spark.py
@@ -3112,5 +3112,151 @@ def test_update_preserves_row_ids(self, spark, test_table):
assert values == {1: 100, 2: 201, 3: 301, 4: 401, 5: 501}


class TestDQLFullTextSearch:
    """Test FTS query functions: lance_match, lance_phrase, lance_multi_match."""

    @pytest.fixture(autouse=True)
    def fts_table(self, spark):
        """Create a table with FTS indexes on body (with_position=true) and title columns."""
        spark.sql("""
            CREATE TABLE default.fts_docs (
                id INT,
                title STRING,
                body STRING
            )
        """)
        data = [
            (1, "Introduction to Spark", "Apache Spark is a unified analytics engine"),
            (2, "Lance Format", "Lance is a columnar data format for ML"),
            (3, "Full Text Search", "Full text search enables keyword queries"),
            (4, "Spark Integration", "Lance integrates with Apache Spark for analytics"),
            (5, "Machine Learning", "Deep learning and machine learning pipelines"),
            (6, "Slop Test", "Apache unified spark processing framework"),
        ]
        df = spark.createDataFrame(data, ["id", "title", "body"])
        df.writeTo("default.fts_docs").append()

        # FTS index on body with positions (required for lance_phrase)
        spark.sql("""
            ALTER TABLE default.fts_docs
            CREATE INDEX fts_body USING fts (body) WITH (
                base_tokenizer = 'simple', language = 'English',
                max_token_length = 40, lower_case = true,
                stem = false, remove_stop_words = false,
                ascii_folding = false, with_position = true
            )
        """)
        # FTS index on title
        spark.sql("""
            ALTER TABLE default.fts_docs
            CREATE INDEX fts_title USING fts (title) WITH (
                base_tokenizer = 'simple', language = 'English',
                max_token_length = 40, lower_case = true,
                stem = false, remove_stop_words = false,
                ascii_folding = false, with_position = true
            )
        """)
        yield
        spark.sql("DROP TABLE IF EXISTS default.fts_docs PURGE")

    def test_lance_match_basic(self, spark):
        """lance_match returns rows matching keyword query."""
        rows = spark.sql(
            "SELECT id FROM default.fts_docs WHERE lance_match(body, 'spark')"
        ).collect()
        ids = sorted([r.id for r in rows])
        # "spark" appears in body of rows 1, 4, and 6; this test only checks rows 1 and 4
        assert 1 in ids
        assert 4 in ids

    def test_lance_match_with_options(self, spark):
        """lance_match accepts options string."""
        rows = spark.sql(
            "SELECT id FROM default.fts_docs "
            "WHERE lance_match(body, 'spark', 'operator=AND,boost=1.5')"
        ).collect()
        ids = sorted([r.id for r in rows])
        # "spark" is a single term; AND on a single term behaves like OR: rows 1, 4, and 6
        assert ids == [1, 4, 6], f"Expected [1, 4, 6], got {ids}"

    def test_lance_phrase_basic(self, spark):
        """lance_phrase returns rows with exact phrase match."""
        rows = spark.sql(
            "SELECT id FROM default.fts_docs WHERE lance_phrase(body, 'apache spark')"
        ).collect()
        ids = [r.id for r in rows]
        assert 1 in ids

    def test_lance_phrase_with_slop(self, spark):
        """lance_phrase with slop allows distance between terms.

        Row 6 body is "Apache unified spark processing framework": "apache" and
        "spark" are separated by one intervening token ("unified"), so slop=0
        (exact phrase) must not match row 6, while slop>=1 must match it.
        """
        # slop=0 (exact phrase): must NOT include row 6
        exact_rows = spark.sql(
            "SELECT id FROM default.fts_docs WHERE lance_phrase(body, 'apache spark')"
        ).collect()
        exact_ids = {r.id for r in exact_rows}
        assert 6 not in exact_ids, (
            f"slop=0 must not match row 6 (terms separated by intervening token), got {sorted(exact_ids)}"
        )
        # Rows 1 and 4 have the exact phrase "Apache Spark"
        assert 1 in exact_ids
        assert 4 in exact_ids

        # slop=1: must include row 6 (one intervening token)
        slop_rows = spark.sql(
            "SELECT id FROM default.fts_docs WHERE lance_phrase(body, 'apache spark', 1)"
        ).collect()
        slop_ids = {r.id for r in slop_rows}
        assert 6 in slop_ids, (
            f"slop=1 must match row 6 (one intervening token), got {sorted(slop_ids)}"
        )
        # slop=1 is a superset of slop=0
        assert exact_ids.issubset(slop_ids), (
            f"slop=1 results must be a superset of slop=0. slop=0={sorted(exact_ids)}, slop=1={sorted(slop_ids)}"
        )

    def test_lance_multi_match_or(self, spark):
        """lance_multi_match with default OR returns rows matching in any column."""
        rows = spark.sql(
            "SELECT id FROM default.fts_docs "
            "WHERE lance_multi_match('spark', title, body)"
        ).collect()
        ids = sorted([r.id for r in rows])
        # "spark" appears in title of rows 1, 4 and body of rows 1, 4, 6
        assert 1 in ids
        assert 4 in ids

    def test_lance_multi_match_with_operator(self, spark):
        """lance_multi_match with explicit operator=OR option."""
        rows = spark.sql(
            "SELECT id FROM default.fts_docs "
            "WHERE lance_multi_match('spark', 'operator=OR', title, body)"
        ).collect()
        ids = sorted([r.id for r in rows])
        # "spark" appears in title of rows 1, 4 and body of rows 1, 4, 6
        assert ids == [1, 4, 6], f"Expected [1, 4, 6], got {ids}"

    def test_show_functions_lists_fts(self, spark):
        """SHOW FUNCTIONS returns all three FTS function names."""
        functions = spark.sql("SHOW FUNCTIONS").collect()
        fn_names = {r[0] for r in functions}
        # Function names may be prefixed with catalog (e.g. "lance.lance_match");
        # strip prefix and compare bare names exactly to avoid substring false positives.
        fts_names = {"lance_match", "lance_phrase", "lance_multi_match"}
        found = set()
        for fn in fn_names:
            bare = fn.split(".")[-1].lower()
            if bare in fts_names:
                found.add(bare)
        assert fts_names == found, (
            f"Expected FTS functions {fts_names} in SHOW FUNCTIONS output, "
            f"found {found}. All functions: {sorted(fn_names)}"
        )


if __name__ == "__main__":
    pytest.main([__file__, "-v"])
@@ -0,0 +1,26 @@
/*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.sql.catalyst.optimizer

import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
import org.apache.spark.sql.util.CaseInsensitiveStringMap

/** Spark 3.4 concrete subclass — 5-arg DataSourceV2Relation.copy(). */
class LanceFtsPredicateRule extends AbstractLanceFtsPredicateRule {

  override protected def rewriteRelation(
      rel: DataSourceV2Relation,
      newOptions: CaseInsensitiveStringMap): DataSourceV2Relation =
    rel.copy(options = newOptions)
}
@@ -16,7 +16,7 @@ package org.lance.spark.extensions
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
import org.apache.spark.sql.catalyst.optimizer.{LanceBlobSourceContextRule, LanceFragmentAwareJoinRule}
import org.apache.spark.sql.catalyst.optimizer.{LanceBlobSourceContextRule, LanceFragmentAwareJoinRule, LanceFtsPredicateRule}
import org.apache.spark.sql.catalyst.parser.extensions.LanceSparkSqlExtensionsParser
import org.apache.spark.sql.execution.datasources.v2.LanceDataSourceV2Strategy
import org.lance.spark.search.LanceSearchTableFunctions
@@ -30,6 +30,9 @@ class LanceSparkSessionExtensions extends (SparkSessionExtensions => Unit) {
    // optimizer rules for fragment-aware joins
    extensions.injectOptimizerRule(_ => LanceFragmentAwareJoinRule())

    // optimizer rule for FTS predicate pushdown
    extensions.injectOptimizerRule(_ => new LanceFtsPredicateRule())

    // propagate blob source credentials from read scans to the write side
    extensions.injectOptimizerRule(_ => LanceBlobSourceContextRule())

@@ -0,0 +1,16 @@
/*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.lance.spark.read;

public class FtsAdvancedOptionsTest extends BaseFtsAdvancedOptionsTest {}
@@ -0,0 +1,16 @@
/*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.lance.spark.read;

public class FtsMatchPushdownTest extends BaseFtsMatchPushdownTest {}