feat: Optimize the full-text index by adding datasetId to the compound index #4970

Open
wants to merge 1 commit into
base: main

shikaiwei1
Contributor

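While running my application, I found a problem: each query scans almost as many entries as there are texts in the entire knowledge base (the whole DB), rather than only the texts of a single dataset. Further investigation showed that the index here does not include the datasetId field in the compound index, so every full-text search scans and compares all texts.
After adding the datasetId field to the compound index and testing locally, the results are as follows: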

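Below is a before/after comparison of queries against a single dataset of roughly 100,000 entries.
datasetId was added as a field of the compound index to improve full-text index efficiency. The effect is as follows:
Query time before optimization
Query time after optimization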

Below is a comparison of the slow-query logs before and after the index optimization:
(image)

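The comparison above shows that both the number of documents scanned and the amount of data retrieved drop substantially.
The optimization is most effective when a single teamId owns many knowledge bases but the knowledge base being queried holds only a small share of the total texts.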
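The reasoning above can be illustrated with a toy model. The two key patterns are copied from the `planSummary` fields of the slow-query logs in this thread; everything else (`TextDoc`, `prefixFields`, `candidateCount`, `makeDocs`, and the sample data) is hypothetical and only mimics how equality on the fields preceding the text key narrows the set of candidates. It is a sketch, not MongoDB's actual execution logic. In mongoose, the change itself would presumably be a `schema.index(...)` call listing `teamId`, `datasetId`, and the text field, whose name is not shown in this thread.

```typescript
// Toy model only: it mimics how equality on the prefix fields of a compound
// text index narrows the candidate set. It is not MongoDB's real executor.
type IndexKeys = { [field: string]: 1 | -1 | "text" };

// Key patterns copied from the planSummary in the slow-query logs.
const indexBefore: IndexKeys = { teamId: 1, _fts: "text", _ftsx: 1 };
const indexAfter: IndexKeys = { teamId: 1, datasetId: 1, _fts: "text", _ftsx: 1 };

// Hypothetical document shape: only the fields relevant to the prefix.
interface TextDoc {
  teamId: string;
  datasetId: string;
}

// Fields listed before the text key act as the equality prefix.
function prefixFields(keys: IndexKeys): string[] {
  const fields: string[] = [];
  for (const field of Object.keys(keys)) {
    if (keys[field] === "text") break;
    fields.push(field);
  }
  return fields;
}

// Count the documents that still have to be checked against the search terms.
function candidateCount(docs: TextDoc[], keys: IndexKeys, filter: TextDoc): number {
  const prefix = prefixFields(keys) as (keyof TextDoc)[];
  return docs.filter((d) => prefix.every((f) => d[f] === filter[f])).length;
}

// Hypothetical data: one team with a small and a large knowledge base.
function makeDocs(teamId: string, datasetId: string, n: number): TextDoc[] {
  const docs: TextDoc[] = [];
  for (let i = 0; i < n; i++) docs.push({ teamId, datasetId });
  return docs;
}

const allDocs = makeDocs("team1", "small", 100).concat(makeDocs("team1", "large", 900));
const searchFilter: TextDoc = { teamId: "team1", datasetId: "small" };

const scannedBefore = candidateCount(allDocs, indexBefore, searchFilter); // every text of the team
const scannedAfter = candidateCount(allDocs, indexAfter, searchFilter); // only the target dataset
```

In this made-up example the team-only prefix leaves all 1,000 texts of the team as candidates, while the prefix with `datasetId` leaves only the 100 texts of the target dataset, which matches the observation that the gain grows as the target dataset's share of the team's texts shrinks.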

Add datasetId as a field of the compound index to improve full-text index efficiency
Contributor

gru-agent bot commented Jun 6, 2025

TestGru Assignment

Summary

Link CommitId Status Reason
Detail bce36fd 🚫 Skipped No files need to be tested {"packages/service/core/dataset/data/dataTextSchema.ts":"File path does not match include patterns."}

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input


github-actions bot commented Jun 6, 2025

Preview mcp_server Image: ghcr.io/labring/fastgpt-pr:fatsgpt_mcp_server_bce36fd09c328edf75d32ff934d0c39d13ec8706

Contributor Author

shikaiwei1 commented Jun 6, 2025

Addendum: below are the slow-query logs for the same database and the same search term.

After the index optimization

{"op":"command","ns":"fastgpt-dev-ent.dataset_data_texts","hasSortStage":true,"planSummary":"IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, datasetId: 1, _fts: \"text\", _ftsx: 1 }","nreturned":100,"keysExaminedBySizeInBytes":162529790,"responseLength":8938,"storage":{"data":{"timeReadingMicros":{"$numberLong":"318539"},"bytesRead":{"$numberLong":"166513881"}}},"locks":{"Database":{"acquireCount":{"r":{"$numberLong":"3056"}}},"Collection":{"acquireCount":{"r":{"$numberLong":"3056"}}},"Mutex":{"acquireCount":{"r":{"$numberLong":"12"}}},"ReplicationStateTransition":{"acquireCount":{"w":{"$numberLong":"3057"}}},"Global":{"acquireCount":{"r":{"$numberLong":"3057"}}}},"flowControl":{},"command":{"pipeline":[{"$match":{"$text":{"$search":"法国 比利时 癌症 研究 中 年轻 患者 生活 质量 
发现"},"teamId":{"$oid":"6710d3a18980a3369f98c2a2"},"datasetId":{"$in":[{"$oid":"67b709f9460b03446665ac40"}]}}},{"$sort":{"score":{"$meta":"textScore"}}},{"$limit":100},{"$project":{"score":{"$meta":"textScore"},"dataId":1,"_id":1,"collectionId":1}}],"cursor":{},"lsid":{"id":{"$binary":"WVBEdXbnTqeXI1kGtXdc2w==","$type":"04"}},"$readPreference":{"mode":"secondaryPreferred"},"$db":"fastgpt-dev-ent","$clusterTime":{"clusterTime":{"$timestamp":{"t":1749190017,"i":1}},"signature":{"keyId":{"$numberLong":"7468306811817295874"},"hash":{"$binary":"4waZcTZ+poYrxson9xiiFBjE7Lo=","$type":"00"}}},"readConcern":{"level":"local"},"aggregate":"dataset_data_texts"},"queryHash":"D73E05E4","protocol":"op_msg","keysExamined":294100,"planCacheKey":"9D5F8967","numYield":3044,"replRole":{"stateStr":"SECONDARY","_id":3},"docsExamined":95600,"docsExaminedBySizeInBytes":2150100720,"cursorExhausted":true,"millis":1730}

Before the index optimization

{"op":"command","ns":"fastgpt-dev-ent.dataset_data_texts","hasSortStage":true,"planSummary":"IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }, IXSCAN { teamId: 1, _fts: \"text\", _ftsx: 1 }","nreturned":100,"keysExaminedBySizeInBytes":214051440,"responseLength":8938,"storage":{"data":{"timeReadingMicros":{"$numberLong":"2305063"},"bytesRead":{"$numberLong":"1193457705"}}},"locks":{"Database":{"acquireCount":{"r":{"$numberLong":"6000"}}},"Collection":{"acquireCount":{"r":{"$numberLong":"6000"}}},"Mutex":{"acquireCount":{"r":{"$numberLong":"12"}}},"ReplicationStateTransition":{"acquireCount":{"w":{"$numberLong":"6001"}}},"Global":{"acquireCount":{"r":{"$numberLong":"6001"}}}},"flowControl":{},"command":{"pipeline":[{"$match":{"$text":{"$search":"法国 比利时 癌症 研究 中 年轻 患者 生活 质量 
发现"},"teamId":{"$oid":"6710d3a18980a3369f98c2a2"},"datasetId":{"$in":[{"$oid":"67b709f9460b03446665ac40"}]}}},{"$sort":{"score":{"$meta":"textScore"}}},{"$limit":100},{"$project":{"score":{"$meta":"textScore"},"dataId":1,"_id":1,"collectionId":1}}],"cursor":{},"lsid":{"id":{"$binary":"dMgVKcAXRBi+vL+XjWIpGA==","$type":"04"}},"$readPreference":{"mode":"secondaryPreferred"},"$db":"fastgpt-dev-ent","$clusterTime":{"clusterTime":{"$timestamp":{"t":1749183856,"i":1}},"signature":{"keyId":{"$numberLong":"7468306811817295874"},"hash":{"$binary":"1Fr6C71g1ucdSvKLngVpGV6LDGk=","$type":"00"}}},"readConcern":{"level":"local"},"aggregate":"dataset_data_texts"},"queryHash":"D73E05E4","protocol":"op_msg","keysExamined":518187,"planCacheKey":"37CA3561","numYield":5988,"replRole":{"stateStr":"SECONDARY","_id":3},"docsExamined":494778,"docsExaminedBySizeInBytes":6165993669,"cursorExhausted":true,"millis":5516}
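To make the two log entries above easier to compare, the sketch below computes the relative change for a few of their counters. The numbers are copied from the logs; the names `SlowQueryStats`, `statsBefore`, `statsAfter`, and `reduction` are hypothetical helpers for this illustration. Each log is a single run, so the ratios are indicative rather than a benchmark.

```typescript
// Counters copied from the two slow-query log entries in this thread.
interface SlowQueryStats {
  keysExamined: number;
  docsExamined: number;
  bytesRead: number;
  millis: number;
}

const statsBefore: SlowQueryStats = {
  keysExamined: 518187,
  docsExamined: 494778,
  bytesRead: 1193457705,
  millis: 5516,
};

const statsAfter: SlowQueryStats = {
  keysExamined: 294100,
  docsExamined: 95600,
  bytesRead: 166513881,
  millis: 1730,
};

// Fraction by which a counter shrank, e.g. 0.5 means "half as much".
function reduction(before: number, after: number): number {
  return (before - after) / before;
}

const logComparison = {
  keysExamined: reduction(statsBefore.keysExamined, statsAfter.keysExamined),
  docsExamined: reduction(statsBefore.docsExamined, statsAfter.docsExamined),
  bytesRead: reduction(statsBefore.bytesRead, statsAfter.bytesRead),
  // How many times faster the query ran after the index change.
  speedup: statsBefore.millis / statsAfter.millis,
};
```

For these two entries this works out to roughly 43% fewer index keys examined, 81% fewer documents examined, 86% fewer bytes read, and a query about 3.2 times faster.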


github-actions bot commented Jun 6, 2025

Preview sandbox Image: ghcr.io/labring/fastgpt-pr:fatsgpt_sandbox_bce36fd09c328edf75d32ff934d0c39d13ec8706


github-actions bot commented Jun 6, 2025

Preview fastgpt Image: ghcr.io/labring/fastgpt-pr:fatsgpt_bce36fd09c328edf75d32ff934d0c39d13ec8706

Collaborator

c121914yu commented Jun 6, 2025

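datasetId cannot support the $in mode. With multiple knowledge bases, this would require multiple lookups, and the correct scores could not be obtained.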
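The concern in this comment can be pictured with a small sketch. Assuming the `datasetId` prefix only admits equality matches, a search across several knowledge bases would need one `$text` query per `datasetId`, followed by a merge of the ranked lists on the application side. `Hit`, `mergeTopK`, and the sample scores are hypothetical; the merge simply sorts by raw score, and whether scores from separate queries are comparable is exactly the question the comment raises.

```typescript
// Hypothetical shape of one result row from a per-dataset $text query.
interface Hit {
  dataId: string;
  datasetId: string;
  score: number;
}

// Merge the ranked lists of several per-dataset queries into one top-K list.
// Assumption: raw scores are compared directly, which may not be valid.
function mergeTopK(perDataset: Hit[][], limit: number): Hit[] {
  const all: Hit[] = [];
  for (const hits of perDataset) {
    for (const hit of hits) all.push(hit);
  }
  all.sort((x, y) => y.score - x.score); // highest score first
  return all.slice(0, limit);
}

// Made-up results standing in for two separate lookups.
const fromDatasetA: Hit[] = [
  { dataId: "a1", datasetId: "A", score: 1.5 },
  { dataId: "a2", datasetId: "A", score: 0.5 },
];
const fromDatasetB: Hit[] = [{ dataId: "b1", datasetId: "B", score: 1.0 }];

const topTwo = mergeTopK([fromDatasetA, fromDatasetB], 2);
```

With N knowledge bases this costs N round trips instead of one, which is the "multiple lookups" part of the objection.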

c121914yu closed this Jun 6, 2025
c121914yu reopened this Jun 6, 2025

cursor-com bot commented Jun 6, 2025

🚨 BugBot failed to run

Remote branch not found for this Pull Request. It may have been merged or deleted (requestId: serverGenReqId_da2d575b-9d48-4ae5-9194-b411e8239cfa).

Contributor Author

shikaiwei1 commented Jun 6, 2025

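@c121914yu This optimization clearly improves full-text query performance, especially when the database contains many knowledge bases. Would you consider handling the query differently so that datasetId can be part of the index, reducing both the number of documents scanned and the volume of data read?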
