Optimize/builtin knowledge text splitter #757
Merged: zerob13 merged 17 commits into ThinkInAIXYZ:dev from hllshiro:optimize/builtin-knowledge-text-splitter on Aug 20, 2025
Commits
9b83248 feat: enhance file handling by adding MIME type extraction and file n… (hllshiro)
e2cdc17 refactor: update sanitizeText function documentation and improve whit… (hllshiro)
188f87a fix: update chunk ID generation to include file ID for better tracking (hllshiro)
eba1d41 feat: add methods to get supported languages and separators for progr… (hllshiro)
440083a Merge remote-tracking branch 'upstream/dev' into optimize/builtin-kno… (hllshiro)
79f3bd8 feat: update separators handling and add localization for separators … (hllshiro)
9cf6ce1 perf: builtin knowledge support custom separators (hllshiro)
b71b850 perf: RecursiveCharacterTextSplitter load custom separators (hllshiro)
468bd49 feat: implement FileValidationService with MIME type validation and s… (hllshiro)
826a21b feat: integrate FileValidationService into FilePresenter for file val… (hllshiro)
ea7d1a8 feat: add file validation methods to KnowledgePresenter for supported… (hllshiro)
529fd27 feat: dynamically load supported file extensions and enhance file upl… (hllshiro)
7a4f97a feat: update file support messages and improve error handling for uns… (hllshiro)
19b7d57 feat: update icon in settings and reset separators value on config ad… (hllshiro)
aea2881 feat: adjust popover width and update text color in BuiltinKnowledgeS… (hllshiro)
8295006 fix: correct placeholder attribute casing in BuiltinKnowledgeSettings… (hllshiro)
192c91e fix: update text color for language selection and handle empty separa… (hllshiro)
227 changes: 227 additions & 0 deletions
src/main/presenter/filePresenter/FileValidationService.ts
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,227 @@ | ||
| import { FileAdapterConstructor } from './FileAdapterConstructor' | ||
| import { getMimeTypeAdapterMap, detectMimeType } from './mime' | ||
| import { UnsupportFileAdapter } from './UnsupportFileAdapter' | ||
| import * as mimeTypes from 'mime-types' | ||
| | ||
| export interface FileValidationResult { | ||
| isSupported: boolean | ||
| mimeType?: string | ||
| adapterType?: string | ||
| error?: string | ||
| suggestedExtensions?: string[] | ||
| } | ||
| | ||
| export interface IFileValidationService { | ||
| validateFile(filePath: string): Promise<FileValidationResult> | ||
| getSupportedExtensions(): string[] | ||
| getSupportedMimeTypes(): string[] | ||
| } | ||
| | ||
| export class FileValidationService implements IFileValidationService { | ||
| private excludedAdapters = [ | ||
| 'AudioFileAdapter', | ||
| 'ImageFileAdapter', | ||
| 'UnsupportFileAdapter', | ||
| 'DirectoryAdapter' | ||
| ] | ||
| | ||
| constructor() { | ||
| // Constructor kept for future extensibility | ||
| } | ||
| | ||
| /** | ||
| * Validates if a file is supported for knowledge base processing | ||
| * @param filePath Path to the file to validate | ||
| * @returns FileValidationResult with validation details | ||
| */ | ||
| async validateFile(filePath: string): Promise<FileValidationResult> { | ||
| try { | ||
| // Detect MIME type from file content | ||
| const mimeType = await detectMimeType(filePath) | ||
| | ||
| if (!mimeType) { | ||
| return { | ||
| isSupported: false, | ||
| error: 'Could not determine file type', | ||
| suggestedExtensions: this.getSupportedExtensions() | ||
| } | ||
| } | ||
| | ||
| // Get adapter map and find appropriate adapter | ||
| const adapterMap = getMimeTypeAdapterMap() | ||
| const AdapterConstructor = this.findAdapterForMimeType(mimeType, adapterMap) | ||
| | ||
| if (!AdapterConstructor) { | ||
| return { | ||
| isSupported: false, | ||
| mimeType, | ||
| error: 'No adapter found for this file type', | ||
| suggestedExtensions: this.getSupportedExtensions() | ||
| } | ||
| } | ||
| | ||
| // Check if adapter is supported (not in excluded list) | ||
| const isSupported = this.isAdapterSupported(AdapterConstructor) | ||
| const adapterType = AdapterConstructor.name | ||
| | ||
| if (!isSupported) { | ||
| return { | ||
| isSupported: false, | ||
| mimeType, | ||
| adapterType, | ||
| error: 'File type not supported', | ||
| suggestedExtensions: this.getSupportedExtensions() | ||
| } | ||
| } | ||
| | ||
| return { | ||
| isSupported: true, | ||
| mimeType, | ||
| adapterType | ||
| } | ||
| } catch (error) { | ||
| return { | ||
| isSupported: false, | ||
| error: `Error validating file: ${error instanceof Error ? error.message : 'Unknown error'}`, | ||
| suggestedExtensions: this.getSupportedExtensions() | ||
| } | ||
| } | ||
| } | ||
| | ||
| /** | ||
| * Checks if an adapter is supported for knowledge base processing | ||
| * @param adapterConstructor The adapter constructor to check | ||
| * @returns true if adapter is supported, false otherwise | ||
| */ | ||
| private isAdapterSupported(adapterConstructor: FileAdapterConstructor): boolean { | ||
| const adapterName = adapterConstructor.name | ||
| return !this.excludedAdapters.includes(adapterName) | ||
| } | ||
| | ||
| /** | ||
| * Finds the appropriate adapter for a given MIME type | ||
| * @param mimeType The MIME type to find an adapter for | ||
| * @param adapterMap Map of MIME types to adapter constructors | ||
| * @returns The adapter constructor or undefined if not found | ||
| */ | ||
| private findAdapterForMimeType( | ||
| mimeType: string, | ||
| adapterMap: Map<string, FileAdapterConstructor> | ||
| ): FileAdapterConstructor | undefined { | ||
| // First try exact match | ||
| const exactMatch = adapterMap.get(mimeType) | ||
| if (exactMatch) { | ||
| return exactMatch | ||
| } | ||
| | ||
| // Try wildcard match | ||
| const type = mimeType.split('/')[0] | ||
| const wildcardMatch = adapterMap.get(`${type}/*`) | ||
| | ||
| if (wildcardMatch) { | ||
| return wildcardMatch | ||
| } | ||
| | ||
| // Return UnsupportFileAdapter as fallback | ||
| return UnsupportFileAdapter | ||
| } | ||
| | ||
| /** | ||
| * Gets all supported file extensions for knowledge base processing | ||
| * @returns Array of supported file extensions (without dots) | ||
| */ | ||
| getSupportedExtensions(): string[] { | ||
| try { | ||
| const adapterMap = getMimeTypeAdapterMap() | ||
| const supportedExtensions = new Set<string>() | ||
| | ||
| // Iterate through all MIME types in the adapter map | ||
| for (const [mimeType, AdapterConstructor] of adapterMap.entries()) { | ||
| // Skip excluded adapters and wildcard entries | ||
| if (!this.isAdapterSupported(AdapterConstructor) || mimeType.includes('*')) { | ||
| continue | ||
| } | ||
| | ||
| // Get the default extension for this MIME type (mimeTypes.extension returns one extension or false) | ||
| const extension = mimeTypes.extension(mimeType) | ||
| if (extension) { | ||
| supportedExtensions.add(extension) | ||
| } | ||
| } | ||
| | ||
| // Add some common extensions that might not be in the MIME type map | ||
| const commonExtensions = ['md', 'markdown', 'txt', 'json', 'yaml', 'yml', 'xml'] | ||
| commonExtensions.forEach((ext) => supportedExtensions.add(ext)) | ||
| | ||
| return Array.from(supportedExtensions).sort() | ||
| } catch (error) { | ||
| // Fallback to common extensions if adapter map fails | ||
| console.error('Error getting supported extensions:', error) | ||
| return [ | ||
| 'txt', | ||
| 'md', | ||
| 'markdown', | ||
| 'pdf', | ||
| 'docx', | ||
| 'pptx', | ||
| 'xlsx', | ||
| 'csv', | ||
| 'json', | ||
| 'yaml', | ||
| 'yml', | ||
| 'xml', | ||
| 'js', | ||
| 'ts', | ||
| 'py', | ||
| 'java', | ||
| 'cpp', | ||
| 'c', | ||
| 'h', | ||
| 'css', | ||
| 'html' | ||
| ].sort() | ||
| } | ||
| } | ||
| | ||
| /** | ||
| * Gets all supported MIME types for knowledge base processing | ||
| * @returns Array of supported MIME types | ||
| */ | ||
| getSupportedMimeTypes(): string[] { | ||
| try { | ||
| const adapterMap = getMimeTypeAdapterMap() | ||
| const supportedMimeTypes: string[] = [] | ||
| | ||
| // Iterate through all MIME types in the adapter map | ||
| for (const [mimeType, AdapterConstructor] of adapterMap.entries()) { | ||
| // Skip excluded adapters and wildcard entries | ||
| if (!this.isAdapterSupported(AdapterConstructor) || mimeType.includes('*')) { | ||
| continue | ||
| } | ||
| | ||
| supportedMimeTypes.push(mimeType) | ||
| } | ||
| | ||
| return supportedMimeTypes.sort() | ||
| } catch (error) { | ||
| // Fallback to common MIME types if adapter map fails | ||
| console.error('Error getting supported MIME types:', error) | ||
| return [ | ||
| 'text/plain', | ||
| 'text/markdown', | ||
| 'application/pdf', | ||
| 'application/msword', | ||
| 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', | ||
| 'application/vnd.ms-powerpoint', | ||
| 'application/vnd.openxmlformats-officedocument.presentationml.presentation', | ||
| 'application/vnd.ms-excel', | ||
| 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', | ||
| 'text/csv', | ||
| 'application/json', | ||
| 'application/javascript', | ||
| 'text/html', | ||
| 'text/css' | ||
| ].sort() | ||
| } | ||
| } | ||
| } | ||
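
The lookup order used by `findAdapterForMimeType` (exact match, then `type/*` wildcard, then `UnsupportFileAdapter`) combined with the `excludedAdapters` check can be sketched in isolation as follows. This is a minimal illustration, not the project's code: the adapter classes here are empty stand-ins, the map contents are invented for the example, and `findAdapter` and `isSupported` are hypothetical free functions that mirror the private methods in the diff.

```typescript
// Hypothetical stand-ins for the real adapter classes; only their names matter here.
class TextFileAdapter {}
class PdfFileAdapter {}
class ImageFileAdapter {}
class UnsupportFileAdapter {}

type AdapterConstructor = new () => object

// Assumed map contents, for illustration only; the real map comes from getMimeTypeAdapterMap().
const adapterMap = new Map<string, AdapterConstructor>([
  ['application/pdf', PdfFileAdapter],
  ['text/*', TextFileAdapter],
  ['image/*', ImageFileAdapter]
])

// Same exclusion list as in the diff.
const excludedAdapters = [
  'AudioFileAdapter',
  'ImageFileAdapter',
  'UnsupportFileAdapter',
  'DirectoryAdapter'
]

// Exact match first, then the "type/*" wildcard, then the unsupported fallback.
function findAdapter(mimeType: string): AdapterConstructor {
  const exact = adapterMap.get(mimeType)
  if (exact) return exact
  const wildcard = adapterMap.get(`${mimeType.split('/')[0]}/*`)
  return wildcard ?? UnsupportFileAdapter
}

// A MIME type is supported when its resolved adapter is not in the exclusion list.
function isSupported(mimeType: string): boolean {
  return !excludedAdapters.includes(findAdapter(mimeType).name)
}

console.log(isSupported('application/pdf')) // true: exact match
console.log(isSupported('text/x-python')) // true: matched via text/*
console.log(isSupported('image/png')) // false: ImageFileAdapter is excluded
console.log(isSupported('video/mp4')) // false: falls back to UnsupportFileAdapter
```

Two observations follow from this sketch. Because the lookup always falls back to `UnsupportFileAdapter`, it never returns `undefined`, so the `'No adapter found for this file type'` branch in `validateFile` appears to be unreachable as written. And since the exclusion check compares `constructor.name` strings, it would likely stop matching if a build step mangles class names.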