lance-format · ivscheianu · May 4, 2026 · May 4, 2026 · May 10, 2026 · May 12, 2026
diff --git a/docs/src/config.md b/docs/src/config.md
@@ -56,6 +56,7 @@ The following features require the Lance Spark SQL extension to be enabled:
 - [UPDATE COLUMNS with backfill](operations/dml/update-columns.md) - Update existing columns using data from a source
 - [OPTIMIZE](operations/ddl/optimize.md) - Compact table fragments for improved query performance
 - [VACUUM](operations/ddl/vacuum.md) - Remove old versions and reclaim storage space
+- [Full-Text Search](operations/dql/fts.md) - `lance_match`, `lance_phrase`, and `lance_multi_match` SQL functions for querying FTS indexes
 
 ## Basic Setup
 

diff --git a/docs/src/operations/ddl/create-index.md b/docs/src/operations/ddl/create-index.md
@@ -77,6 +77,9 @@ For the `fts` method, the following options are required:
 
 For advanced tokenizer configuration, refer to the [Lance FTS documentation](https://lance.org/format/table/index/scalar/fts/#tokenizers).
 
+!!! tip "Querying FTS Indexes"
+    Once an FTS index is created, query it using the `lance_match`, `lance_phrase`, and `lance_multi_match` SQL functions. See [Full-Text Search](../dql/fts.md) for function signatures, options, and examples.
+
 ### FTS Format Version
 
 Lance FTS index format v2 is selected by the Lance runtime environment variable `LANCE_FTS_FORMAT_VERSION=2`. Configure it on both the Spark driver and executors before creating the index.

diff --git a/docs/src/operations/dql/.pages b/docs/src/operations/dql/.pages
@@ -1,6 +1,7 @@
 title: DQL
 nav:
   - select.md
+  - fts.md
   - vector-search.md
   - search.md
   - hybrid-search.md
diff --git a/docs/src/operations/dql/fts.md b/docs/src/operations/dql/fts.md
@@ -0,0 +1,125 @@
+# Full-Text Search
+
+Query Lance tables using full-text search (FTS) with the `lance_match`, `lance_phrase`, and `lance_multi_match` SQL functions. These functions push FTS predicates down to the Lance inverted index for efficient text search.
+
+!!! warning "Prerequisites"
+    - The Lance Spark SQL extension must be enabled. See [Spark SQL Extensions](../../config.md#spark-sql-extensions).
+    - The Lance catalog must be the session's default catalog (`spark.sql.defaultCatalog`). Spark resolves unqualified function names against the default catalog; if it points elsewhere, calls to `lance_match`, `lance_phrase`, and `lance_multi_match` fail with a "function not found" error.
+    - An FTS index must exist on the target column(s). See [CREATE INDEX — FTS Options](../ddl/create-index.md#fts-options).
+    - `lance_phrase` requires the FTS index to be built with `with_position = true`.
+
+## lance_match
+
+Keyword search against a single FTS-indexed column.
+
+```sql
+lance_match(column, query)
+lance_match(column, query, 'key=value,...')
+```
+
+**Arguments:**
+
+| Position | Name    | Type   | Description                        |
+|----------|---------|--------|------------------------------------|
+| 1        | column  | String | Column name with an FTS index.     |
+| 2        | query   | String | Search query text.                 |
+| 3        | options | String | Optional comma-separated key=value options. |
+
+**Options:**
+
+| Key              | Type    | Default | Constraints      | Description                                                    |
+|------------------|---------|---------|------------------|----------------------------------------------------------------|
+| `fuzziness`      | Integer | not set (auto) | ≥ 0       | Edit distance for fuzzy matching. When omitted, the engine determines fuzziness by token length. |
+| `operator`       | String  | `OR`    | `AND` or `OR`    | Whether all terms must match (`AND`) or any term (`OR`).       |
+| `boost`          | Float   | `1.0`   | any number       | Scoring multiplier (affects ranking, not row selection).        |
+| `prefix_length`  | Integer | `0`     | ≥ 0              | Number of leading characters that must match exactly during fuzzy matching. |
+| `max_expansions` | Integer | `50`    | ≥ 1              | Maximum fuzzy term expansions.                                 |
+
+**Examples:**
+
+```sql
+-- Basic keyword search
+SELECT * FROM lance.db.documents WHERE lance_match(body, 'machine learning');
+
+-- Fuzzy search with AND operator
+SELECT * FROM lance.db.documents
+WHERE lance_match(body, 'machin lerning', 'fuzziness=1,operator=AND');
+
+-- Boosted search with prefix length
+SELECT * FROM lance.db.documents
+WHERE lance_match(body, 'spark', 'boost=2.0,prefix_length=2');
+```
+
+## lance_phrase
+
+Phrase search against a single FTS-indexed column. Returns rows where the query terms appear consecutively (or within a `slop` distance).
+
+!!! note
+    The FTS index must be built with `with_position = true` for phrase queries.
+
+```sql
+lance_phrase(column, query)
+lance_phrase(column, query, slop)
+```
+
+**Arguments:**
+
+| Position | Name   | Type    | Description                                              |
+|----------|--------|---------|----------------------------------------------------------|
+| 1        | column | String  | Column name with a positional FTS index.                 |
+| 2        | query  | String  | Phrase to search for.                                    |
+| 3        | slop   | Integer | Optional. Max positional distance between terms (default: 0 = exact). Must be ≥ 0. |
+
+**Examples:**
+
+```sql
+-- Exact phrase match
+SELECT * FROM lance.db.documents WHERE lance_phrase(body, 'machine learning');
+
+-- Phrase match allowing 2 words between terms
+SELECT * FROM lance.db.documents WHERE lance_phrase(body, 'deep networks', 2);
+```
+
+## lance_multi_match
+
+Search across multiple FTS-indexed columns with a single query. A row matches if **any** column matches (OR, default) or **all** columns match (AND).
+
+```sql
+lance_multi_match(query, col1, col2, ...)
+lance_multi_match(query, 'operator=AND|OR', col1, col2, ...)
+```
+
+**Arguments:**
+
+| Position | Name     | Type   | Description                                                              |
+|----------|----------|--------|--------------------------------------------------------------------------|
+| 1        | query    | String | Search query text.                                                       |
+| 2        | options  | String | Optional. `'operator=AND'` or `'operator=OR'`. If omitted, default is OR. |
+| 2+ (3+ with options) | columns  | String | Two or more column names, each with an FTS index. Columns start at position 3 when an options string is given. |
+
+When the second argument is a string literal containing `=` with a recognized option key, it is treated as an options string. Otherwise it is treated as the first column name.
+
+**Examples:**
+
+```sql
+-- Search across title and body (OR — row matches if either column matches)
+SELECT * FROM lance.db.documents
+WHERE lance_multi_match('machine learning', title, body);
+
+-- AND operator — row must match in all columns
+SELECT * FROM lance.db.documents
+WHERE lance_multi_match('machine learning', 'operator=AND', title, body);
+```
+
+## Known Limitations
+
+- **No global relevance ordering.** FTS predicates act as WHERE filters and return all matching rows. There is no `lance_score()` function or relevance-based `ORDER BY`, so rows come back in scan order rather than ranked by relevance. In particular, `SELECT * FROM t WHERE lance_match(...) LIMIT N` returns whichever N matching rows the scan produces first, which depends on Spark task scheduling and not on BM25 rank. It is not a "top-N by relevance" query.
+- **Column names must match the schema exactly.** Column references are resolved at planning time; aliases or expressions are not supported.
+- **`lance_phrase` requires positional index.** The FTS index must be built with `with_position = true`. Without it, phrase queries will fail.
+- **WHERE-filter uses full BM25 scoring (no WAND early stopping).** Every row matching the query is evaluated — `wand_factor` is not exposed because SQL WHERE semantics require returning all matching rows.
+- **One FTS predicate per `WHERE` clause.** Only a single FTS function call is allowed in a `WHERE` clause. For multi-column search, use `lance_multi_match` instead of combining multiple `lance_match` calls with OR.
+- **FTS predicates cannot appear inside OR.** `WHERE lance_match(a, 'x') OR other_condition` is not supported — OR semantics cannot be preserved when pushing a single FTS query to the scanner.
+
+## Cost Model
+
+Each Spark task scans one fragment and independently opens all committed FTS index segments to build a global BM25 scorer. Total segment-open cost per query is proportional to `N_fragments × M_index_segments`. For example, a 500-fragment dataset with 20 index segments causes 10,000 segment opens per query. Running `OPTIMIZE` periodically compacts data fragments, reducing the fragment count and therefore the total number of segment opens — this is the primary user-facing mitigation for latency-sensitive workloads.
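The Cost Model arithmetic above can be sketched in a few lines of Python. This is only an illustration of the stated `N_fragments × M_index_segments` relationship; the function name `segment_opens` and the post-compaction fragment count of 50 are hypothetical and not taken from the PR.

```python
def segment_opens(n_fragments: int, n_index_segments: int) -> int:
    """Total FTS index segment opens for one query under the documented model.

    Each Spark task scans one fragment and opens every committed index
    segment, so the total is the product of the two counts.
    """
    if n_fragments < 0 or n_index_segments < 0:
        raise ValueError("counts must be non-negative")
    return n_fragments * n_index_segments


# The example from the docs: 500 fragments, 20 index segments.
before = segment_opens(500, 20)

# Hypothetical: OPTIMIZE compacts the data down to 50 fragments,
# leaving the number of index segments unchanged.
after = segment_opens(50, 20)

print(f"before compaction: {before} segment opens")
print(f"after compaction:  {after} segment opens")
```

Under this model, compaction changes only the fragment factor, so the number of segment opens falls in proportion to the fragment count.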
diff --git a/integration-tests/test_lance_spark.py b/integration-tests/test_lance_spark.py
@@ -3112,5 +3112,151 @@ def test_update_preserves_row_ids(self, spark, test_table):
         assert values == {1: 100, 2: 201, 3: 301, 4: 401, 5: 501}
 
 
+class TestDQLFullTextSearch:
+    """Test FTS query functions: lance_match, lance_phrase, lance_multi_match."""
+
+    @pytest.fixture(autouse=True)
+    def fts_table(self, spark):
+        """Create a table with FTS indexes (with_position=true) on the body and title columns."""
+        spark.sql("""
+            CREATE TABLE default.fts_docs (
+                id INT,
+                title STRING,
+                body STRING
+            )
+        """)
+        data = [
+            (1, "Introduction to Spark", "Apache Spark is a unified analytics engine"),
+            (2, "Lance Format", "Lance is a columnar data format for ML"),
+            (3, "Full Text Search", "Full text search enables keyword queries"),
+            (4, "Spark Integration", "Lance integrates with Apache Spark for analytics"),
+            (5, "Machine Learning", "Deep learning and machine learning pipelines"),
+            (6, "Slop Test", "Apache unified spark processing framework"),
+        ]
+        df = spark.createDataFrame(data, ["id", "title", "body"])
+        df.writeTo("default.fts_docs").append()
+
+        # FTS index on body with positions (required for lance_phrase)
+        spark.sql("""
+            ALTER TABLE default.fts_docs
+            CREATE INDEX fts_body USING fts (body) WITH (
+                base_tokenizer = 'simple', language = 'English',
+                max_token_length = 40, lower_case = true,
+                stem = false, remove_stop_words = false,
+                ascii_folding = false, with_position = true
+            )
+        """)
+        # FTS index on title
+        spark.sql("""
+            ALTER TABLE default.fts_docs
+            CREATE INDEX fts_title USING fts (title) WITH (
+                base_tokenizer = 'simple', language = 'English',
+                max_token_length = 40, lower_case = true,
+                stem = false, remove_stop_words = false,
+                ascii_folding = false, with_position = true
+            )
+        """)
+        yield
+        spark.sql("DROP TABLE IF EXISTS default.fts_docs PURGE")
+
+    def test_lance_match_basic(self, spark):
+        """lance_match returns rows matching keyword query."""
+        rows = spark.sql(
+            "SELECT id FROM default.fts_docs WHERE lance_match(body, 'spark')"
+        ).collect()
+        ids = sorted([r.id for r in rows])
+        # "spark" appears in body of rows 1, 4, and 6; this test only checks rows 1 and 4
+        assert 1 in ids
+        assert 4 in ids
+
+    def test_lance_match_with_options(self, spark):
+        """lance_match accepts options string."""
+        rows = spark.sql(
+            "SELECT id FROM default.fts_docs "
+            "WHERE lance_match(body, 'spark', 'operator=AND,boost=1.5')"
+        ).collect()
+        ids = sorted([r.id for r in rows])
+        # "spark" is a single term; AND on a single term behaves like OR — rows 1, 4, and 6
+        assert ids == [1, 4, 6], f"Expected [1, 4, 6], got {ids}"
+
+    def test_lance_phrase_basic(self, spark):
+        """lance_phrase returns rows with exact phrase match."""
+        rows = spark.sql(
+            "SELECT id FROM default.fts_docs WHERE lance_phrase(body, 'apache spark')"
+        ).collect()
+        ids = [r.id for r in rows]
+        assert 1 in ids
+
+    def test_lance_phrase_with_slop(self, spark):
+        """lance_phrase with slop allows distance between terms.
+
+        Row 6 body is "Apache unified spark processing framework" — "apache" and
+        "spark" are separated by one intervening token ("unified"), so slop=0
+        (exact phrase) must not match row 6, while slop>=1 must match it.
+        """
+        # slop=0 (exact phrase) — must NOT include row 6
+        exact_rows = spark.sql(
+            "SELECT id FROM default.fts_docs WHERE lance_phrase(body, 'apache spark')"
+        ).collect()
+        exact_ids = {r.id for r in exact_rows}
+        assert 6 not in exact_ids, (
+            f"slop=0 must not match row 6 (terms separated by intervening token), got {sorted(exact_ids)}"
+        )
+        # Rows 1 and 4 have the exact phrase "Apache Spark"
+        assert 1 in exact_ids
+        assert 4 in exact_ids
+
+        # slop=1 — must include row 6 (one intervening token)
+        slop_rows = spark.sql(
+            "SELECT id FROM default.fts_docs WHERE lance_phrase(body, 'apache spark', 1)"
+        ).collect()
+        slop_ids = {r.id for r in slop_rows}
+        assert 6 in slop_ids, (
+            f"slop=1 must match row 6 (one intervening token), got {sorted(slop_ids)}"
+        )
+        # slop=1 is a superset of slop=0
+        assert exact_ids.issubset(slop_ids), (
+            f"slop=1 results must be a superset of slop=0. slop=0={sorted(exact_ids)}, slop=1={sorted(slop_ids)}"
+        )
+
+    def test_lance_multi_match_or(self, spark):
+        """lance_multi_match with default OR returns rows matching in any column."""
+        rows = spark.sql(
+            "SELECT id FROM default.fts_docs "
+            "WHERE lance_multi_match('spark', title, body)"
+        ).collect()
+        ids = sorted([r.id for r in rows])
+        # "spark" in title of rows 1,4 and body of rows 1,4,6
+        assert 1 in ids
+        assert 4 in ids
+
+    def test_lance_multi_match_with_operator(self, spark):
+        """lance_multi_match with explicit operator=OR option."""
+        rows = spark.sql(
+            "SELECT id FROM default.fts_docs "
+            "WHERE lance_multi_match('spark', 'operator=OR', title, body)"
+        ).collect()
+        ids = sorted([r.id for r in rows])
+        # "spark" appears in title of rows 1,4 and body of rows 1,4,6
+        assert ids == [1, 4, 6], f"Expected [1, 4, 6], got {ids}"
+
+    def test_show_functions_lists_fts(self, spark):
+        """SHOW FUNCTIONS returns all three FTS function names."""
+        functions = spark.sql("SHOW FUNCTIONS").collect()
+        fn_names = {r[0] for r in functions}
+        # Function names may be prefixed with catalog (e.g. "lance.lance_match");
+        # strip prefix and compare bare names exactly to avoid substring false positives.
+        fts_names = {"lance_match", "lance_phrase", "lance_multi_match"}
+        found = set()
+        for fn in fn_names:
+            bare = fn.split(".")[-1].lower()
+            if bare in fts_names:
+                found.add(bare)
+        assert fts_names == found, (
+            f"Expected FTS functions {fts_names} in SHOW FUNCTIONS output, "
+            f"found {found}. All functions: {sorted(fn_names)}"
+        )
+
+
 if __name__ == "__main__":
     pytest.main([__file__, "-v"])
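The slop behaviour that `test_lance_phrase_with_slop` asserts can be illustrated with a toy, pure-Python model. This is a sketch under stated assumptions and not Lance's implementation: it assumes whitespace tokenization with lowercasing, in-order terms, and that `slop` bounds the number of intervening tokens between each pair of consecutive query terms. The helper name `phrase_matches` is hypothetical.

```python
def phrase_matches(text: str, phrase: str, slop: int = 0) -> bool:
    """Toy phrase matcher (assumed semantics, not Lance's actual algorithm).

    Each query term must appear after the previous one with at most
    `slop` intervening tokens. slop=0 means the terms are adjacent.
    """
    if slop < 0:
        raise ValueError("slop must be >= 0")
    tokens = text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return False

    def search(term_idx: int, prev_pos: int) -> bool:
        # All terms placed: the phrase matched.
        if term_idx == len(terms):
            return True
        # The next term may sit 1..slop+1 positions after the previous one.
        last = min(prev_pos + slop + 2, len(tokens))
        for pos in range(prev_pos + 1, last):
            if tokens[pos] == terms[term_idx] and search(term_idx + 1, pos):
                return True
        return False

    # Try every occurrence of the first term as a starting point.
    return any(
        tok == terms[0] and search(1, start)
        for start, tok in enumerate(tokens)
    )


# Bodies of rows 1 and 6 from the fts_docs fixture.
row1 = "Apache Spark is a unified analytics engine"
row6 = "Apache unified spark processing framework"

print(phrase_matches(row1, "apache spark"))     # adjacent terms
print(phrase_matches(row6, "apache spark"))     # one intervening token, slop=0
print(phrase_matches(row6, "apache spark", 1))  # one intervening token, slop=1
```

In this model row 6 fails at `slop=0` and passes at `slop=1`, which mirrors what the integration test expects. The real engine may count slop differently, for example as a total budget across the phrase.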
diff --git a/...4_2.12/src/main/scala/org/apache/spark/sql/catalyst/optimizer/LanceFtsPredicateRule.scala b/...4_2.12/src/main/scala/org/apache/spark/sql/catalyst/optimizer/LanceFtsPredicateRule.scala
@@ -0,0 +1,26 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.util.CaseInsensitiveStringMap
+
+/** Spark 3.4 concrete subclass — 5-arg DataSourceV2Relation.copy(). */
+class LanceFtsPredicateRule extends AbstractLanceFtsPredicateRule {
+
+  override protected def rewriteRelation(
+      rel: DataSourceV2Relation,
+      newOptions: CaseInsensitiveStringMap): DataSourceV2Relation =
+    rel.copy(options = newOptions)
+}
diff --git a/...park-3.4_2.12/src/main/scala/org/lance/spark/extensions/LanceSparkSessionExtensions.scala b/...park-3.4_2.12/src/main/scala/org/lance/spark/extensions/LanceSparkSessionExtensions.scala
@@ -16,7 +16,7 @@ package org.lance.spark.extensions
 import org.apache.spark.sql.SparkSessionExtensions
 import org.apache.spark.sql.catalyst.FunctionIdentifier
 import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
-import org.apache.spark.sql.catalyst.optimizer.{LanceBlobSourceContextRule, LanceFragmentAwareJoinRule}
+import org.apache.spark.sql.catalyst.optimizer.{LanceBlobSourceContextRule, LanceFragmentAwareJoinRule, LanceFtsPredicateRule}
 import org.apache.spark.sql.catalyst.parser.extensions.LanceSparkSqlExtensionsParser
 import org.apache.spark.sql.execution.datasources.v2.LanceDataSourceV2Strategy
 import org.lance.spark.search.LanceSearchTableFunctions
@@ -30,6 +30,9 @@ class LanceSparkSessionExtensions extends (SparkSessionExtensions => Unit) {
     // optimizer rules for fragment-aware joins
     extensions.injectOptimizerRule(_ => LanceFragmentAwareJoinRule())
 
+    // optimizer rule for FTS predicate pushdown
+    extensions.injectOptimizerRule(_ => new LanceFtsPredicateRule())
+
     // propagate blob source credentials from read scans to the write side
     extensions.injectOptimizerRule(_ => LanceBlobSourceContextRule())
 

diff --git a/lance-spark-3.4_2.12/src/test/java/org/lance/spark/read/FtsAdvancedOptionsTest.java b/lance-spark-3.4_2.12/src/test/java/org/lance/spark/read/FtsAdvancedOptionsTest.java
@@ -0,0 +1,16 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.lance.spark.read;
+
+public class FtsAdvancedOptionsTest extends BaseFtsAdvancedOptionsTest {}
diff --git a/lance-spark-3.4_2.12/src/test/java/org/lance/spark/read/FtsMatchPushdownTest.java b/lance-spark-3.4_2.12/src/test/java/org/lance/spark/read/FtsMatchPushdownTest.java
@@ -0,0 +1,16 @@
+/*
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.lance.spark.read;
+
+public class FtsMatchPushdownTest extends BaseFtsMatchPushdownTest {}