feat: optimize the full-text index by adding datasetId to the compound index #4970

shikaiwei1 · 2025-06-06T06:26:08Z

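While running my application I found a problem: every query scans almost as many documents as the entire knowledge base (the whole DB) contains, rather than only the documents of a single dataset. Digging further, I found that the index here does not include the datasetId field in the compound index, so every full-text search is compared against all texts.
After adding the datasetId field to the compound index and testing locally, my findings are as follows:

Below is a before/after comparison of a query against one dataset containing roughly 100,000 records.
With datasetId added as a field of the compound index, full-text search becomes more efficient. The effect is as follows:

Below is a comparison of the slow query logs before and after the index change.

The comparison shows that both the number of documents scanned and the amount of data read drop substantially.
The optimization helps most when a single teamId owns many knowledge bases and the target knowledge base holds only a small share of the total text.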

Add datasetId as a field of the compound index to improve full-text index efficiency.
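As a rough illustration of the change described above, the two index key specifications can be written as plain objects. The key order `teamId`, `datasetId`, then the text key follows the `planSummary` shown in the slow query logs in this thread. The text field name `fullTextToken` and the helper `prefixFields` are assumptions for illustration and are not taken from this PR. With Mongoose, an object like `indexAfter` would typically be passed to `schema.index(...)`.

```typescript
// Index key specs as plain objects. Key order matters in a compound index.
// ASSUMPTION: "fullTextToken" is a placeholder name for the tokenized text field.
type IndexSpec = Record<string, 1 | -1 | "text">;

const indexBefore: IndexSpec = { teamId: 1, fullTextToken: "text" };
const indexAfter: IndexSpec = { teamId: 1, datasetId: 1, fullTextToken: "text" };

// Hypothetical helper: list the fields that come before the text key.
function prefixFields(spec: IndexSpec): string[] {
  const out: string[] = [];
  for (const [field, kind] of Object.entries(spec)) {
    if (kind === "text") break;
    out.push(field);
  }
  return out;
}

console.log(prefixFields(indexBefore));
console.log(prefixFields(indexAfter));
```

The longer prefix is what lets the index narrow a search to one dataset before any text keys are examined.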

gru-agent · 2025-06-06T06:26:26Z

TestGru Assignment

Summary

Link	CommitId	Status	Reason
Detail	`bce36fd`	🚫 Skipped	No files need to be tested {"packages/service/core/dataset/data/dataTextSchema.ts":"File path does not match include patterns."}

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input

github-actions · 2025-06-06T06:26:58Z

Preview mcp_server Image: ghcr.io/labring/fastgpt-pr:fatsgpt_mcp_server_bce36fd09c328edf75d32ff934d0c39d13ec8706

shikaiwei1 · 2025-06-06T06:26:59Z

Addendum: below are the slow query logs for the same database and the same search terms.

After the index optimization

{"op":"command","ns":"fastgpt-dev-ent.dataset_data_texts","hasSortStage":true,"planSummary":"IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }","nreturned":100,"keysExaminedBySizeInBytes":162529790,"responseLength":8938,"storage":{"data":{"timeReadingMicros":{"$numberLong":"318539"},"bytesRead":{"$numberLong":"166513881"}}},"locks":{"Database":{"acquireCount":{"r":{"$numberLong":"3056"}}},"Collection":{"acquireCount":{"r":{"$numberLong":"3056"}}},"Mutex":{"acquireCount":{"r":{"$numberLong":"12"}}},"ReplicationStateTransition":{"acquireCount":{"w":{"$numberLong":"3057"}}},"Global":{"acquireCount":{"r":{"$numberLong":"3057"}}}},"flowControl":{},"command":{"pipeline":[{"$match":{"$text":{"$search":"法国 比利时 癌症 研究 中 年轻 患者 生活 质量 
发现"},"teamId":{"$oid":"6710d3a18980a3369f98c2a2"},"datasetId":{"$in":[{"$oid":"67b709f9460b03446665ac40"}]}}},{"$sort":{"score":{"$meta":"textScore"}}},{"$limit":100},{"$project":{"score":{"$meta":"textScore"},"dataId":1,"_id":1,"collectionId":1}}],"cursor":{},"lsid":{"id":{"$binary":"WVBEdXbnTqeXI1kGtXdc2w==","$type":"04"}},"$readPreference":{"mode":"secondaryPreferred"},"$db":"fastgpt-dev-ent","$clusterTime":{"clusterTime":{"$timestamp":{"t":1749190017,"i":1}},"signature":{"keyId":{"$numberLong":"7468306811817295874"},"hash":{"$binary":"4waZcTZ+poYrxson9xiiFBjE7Lo=","$type":"00"}}},"readConcern":{"level":"local"},"aggregate":"dataset_data_texts"},"queryHash":"D73E05E4","protocol":"op_msg","keysExamined":294100,"planCacheKey":"9D5F8967","numYield":3044,"replRole":{"stateStr":"SECONDARY","_id":3},"docsExamined":95600,"docsExaminedBySizeInBytes":2150100720,"cursorExhausted":true,"millis":1730}

Before the index optimization

{"op":"command","ns":"fastgpt-dev-ent.dataset_data_texts","hasSortStage":true,"planSummary":"IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }","nreturned":100,"keysExaminedBySizeInBytes":214051440,"responseLength":8938,"storage":{"data":{"timeReadingMicros":{"$numberLong":"2305063"},"bytesRead":{"$numberLong":"1193457705"}}},"locks":{"Database":{"acquireCount":{"r":{"$numberLong":"6000"}}},"Collection":{"acquireCount":{"r":{"$numberLong":"6000"}}},"Mutex":{"acquireCount":{"r":{"$numberLong":"12"}}},"ReplicationStateTransition":{"acquireCount":{"w":{"$numberLong":"6001"}}},"Global":{"acquireCount":{"r":{"$numberLong":"6001"}}}},"flowControl":{},"command":{"pipeline":[{"$match":{"$text":{"$search":"法国 比利时 癌症 研究 中 年轻 患者 生活 质量 
发现"},"teamId":{"$oid":"6710d3a18980a3369f98c2a2"},"datasetId":{"$in":[{"$oid":"67b709f9460b03446665ac40"}]}}},{"$sort":{"score":{"$meta":"textScore"}}},{"$limit":100},{"$project":{"score":{"$meta":"textScore"},"dataId":1,"_id":1,"collectionId":1}}],"cursor":{},"lsid":{"id":{"$binary":"dMgVKcAXRBi+vL+XjWIpGA==","$type":"04"}},"$readPreference":{"mode":"secondaryPreferred"},"$db":"fastgpt-dev-ent","$clusterTime":{"clusterTime":{"$timestamp":{"t":1749183856,"i":1}},"signature":{"keyId":{"$numberLong":"7468306811817295874"},"hash":{"$binary":"1Fr6C71g1ucdSvKLngVpGV6LDGk=","$type":"00"}}},"readConcern":{"level":"local"},"aggregate":"dataset_data_texts"},"queryHash":"D73E05E4","protocol":"op_msg","keysExamined":518187,"planCacheKey":"37CA3561","numYield":5988,"replRole":{"stateStr":"SECONDARY","_id":3},"docsExamined":494778,"docsExaminedBySizeInBytes":6165993669,"cursorExhausted":true,"millis":5516}
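To make the two log entries easier to compare, the sketch below computes the percentage reduction for a few counters. The numbers are copied from the logs above; the `LogStats` type and `reductionPct` helper are illustrative names only.

```typescript
// Figures copied from the two slow query log entries above.
interface LogStats {
  keysExamined: number;
  docsExamined: number;
  bytesRead: number;
  millis: number;
}

const before: LogStats = { keysExamined: 518187, docsExamined: 494778, bytesRead: 1193457705, millis: 5516 };
const after: LogStats = { keysExamined: 294100, docsExamined: 95600, bytesRead: 166513881, millis: 1730 };

// Percentage reduction going from `b` (before) to `a` (after), rounded to one decimal place.
function reductionPct(b: number, a: number): number {
  return Math.round((1 - a / b) * 1000) / 10;
}

for (const key of Object.keys(before) as (keyof LogStats)[]) {
  console.log(`${key}: ${before[key]} -> ${after[key]} (${reductionPct(before[key], after[key])}% lower)`);
}
```

On these two runs, documents examined drop by roughly 81% and elapsed time by roughly 69%. Each log is a single run on a secondary, so the timing figures are indicative rather than a benchmark.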

github-actions · 2025-06-06T06:28:28Z

Preview sandbox Image: ghcr.io/labring/fastgpt-pr:fatsgpt_sandbox_bce36fd09c328edf75d32ff934d0c39d13ec8706

github-actions · 2025-06-06T06:32:22Z

Preview fastgpt Image: ghcr.io/labring/fastgpt-pr:fatsgpt_bce36fd09c328edf75d32ff934d0c39d13ec8706

c121914yu · 2025-06-06T07:28:00Z

datasetId cannot support the `$in` mode. With multiple knowledge bases, this would require multiple lookups, and the scores would not be correct.
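For context, MongoDB's documentation states that when a compound text index has keys before the text key, a `$text` query must include equality match conditions on those preceding keys. The sketch below models that rule in plain TypeScript; it is not MongoDB's query planner. Treating a single-value `$in` as equality matches the logged query above, which uses `$in` with one id. Treating a multi-value `$in` as unsupported is an assumption drawn from the comment above. The types and `canUsePrefixedTextIndex` are hypothetical.

```typescript
// A simplified filter model: each field is matched by equality or by a list of values.
type Condition = { eq: string } | { in: string[] };
type Filter = Record<string, Condition>;

// ASSUMPTION: a one-element `in` list behaves like an equality match.
function isEquality(cond: Condition | undefined): boolean {
  if (!cond) return false;
  if ("eq" in cond) return true;
  return cond.in.length === 1;
}

// True when every index prefix field has an equality condition in the filter.
function canUsePrefixedTextIndex(prefix: string[], filter: Filter): boolean {
  return prefix.every((field) => isEquality(filter[field]));
}

const prefix = ["teamId", "datasetId"];
const singleDataset: Filter = { teamId: { eq: "t1" }, datasetId: { in: ["d1"] } };
const multiDataset: Filter = { teamId: { eq: "t1" }, datasetId: { in: ["d1", "d2"] } };

console.log(canUsePrefixedTextIndex(prefix, singleDataset)); // true
console.log(canUsePrefixedTextIndex(prefix, multiDataset)); // false
```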

cursor-com · 2025-06-06T07:29:16Z

🚨 BugBot failed to run

Remote branch not found for this Pull Request. It may have been merged or deleted (requestId: serverGenReqId_da2d575b-9d48-4ae5-9194-b411e8239cfa).

github-actions · 2025-06-06T07:29:40Z

Preview mcp_server Image: ghcr.io/labring/fastgpt-pr:fatsgpt_mcp_server_bce36fd09c328edf75d32ff934d0c39d13ec8706

github-actions · 2025-06-06T07:31:08Z

Preview sandbox Image: ghcr.io/labring/fastgpt-pr:fatsgpt_sandbox_bce36fd09c328edf75d32ff934d0c39d13ec8706

github-actions · 2025-06-06T07:35:25Z

Preview fastgpt Image: ghcr.io/labring/fastgpt-pr:fatsgpt_bce36fd09c328edf75d32ff934d0c39d13ec8706

shikaiwei1 · 2025-06-06T10:00:07Z

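@c121914yu This optimization noticeably improves full-text query performance, especially when the database holds many knowledge bases. Could you consider a different query approach that lets datasetId be part of the index, so that the number of documents scanned and the volume of data read are effectively reduced?

feat: optimize the full-text index by adding datasetId to the compound index

bce36fd

Add datasetId as a field of the compound index to improve full-text index efficiency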
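One possible reading of "a different query approach" is to run one equality-filtered search per datasetId and merge the ranked results in application code. The sketch below shows only the merge step as a pure function. `Hit`, `mergeTopK`, and the sample data are hypothetical, and the sketch assumes that scores from separate queries are comparable, which is the concern the maintainer raised and would need to be verified before relying on it.

```typescript
// One search result; `score` stands in for the text score returned by each query.
interface Hit {
  id: string;
  datasetId: string;
  score: number;
}

// Merge the per-dataset result lists and keep the k highest-scoring hits.
function mergeTopK(perDataset: Hit[][], k: number): Hit[] {
  const all = ([] as Hit[]).concat(...perDataset);
  return all.sort((a, b) => b.score - a.score).slice(0, k);
}

// Made-up sample data for two datasets.
const merged = mergeTopK(
  [
    [
      { id: "a", datasetId: "d1", score: 2.5 },
      { id: "b", datasetId: "d1", score: 1.0 },
    ],
    [
      { id: "c", datasetId: "d2", score: 3.1 },
      { id: "d", datasetId: "d2", score: 0.4 },
    ],
  ],
  3
);

console.log(merged.map((h) => h.id)); // ["c", "a", "b"]
```

This approach still issues one query per dataset, so it trades fewer scanned documents for more round trips.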

pull-request-size bot added the size/XS label Jun 6, 2025

c121914yu closed this Jun 6, 2025

c121914yu reopened this Jun 6, 2025
