perf(data): optimize struct marshaling/unmarshaling with caching and … #1117
Merged
pinglin added a commit that referenced this pull request Sep 22, 2025
#1117)

Because

- The data marshaling/unmarshaling framework was performing expensive operations repeatedly, including:
  - Reflection type computations on every operation
  - Regex pattern compilation for validation on each use
  - Instill tag parsing without caching
  - Time format slice creation for every time parsing operation
  - File type checking with repeated type slice creation
- These operations were causing performance bottlenecks in high-throughput scenarios
- Memory allocations were not optimized, leading to unnecessary garbage collection overhead

**This commit**

- **Implements reflection type caching**: Pre-computes `reflect.Type` instances for commonly used types (`time.Time`, `time.Duration`, `format.Value`, etc.) to avoid repeated `reflect.TypeOf()` calls
- **Adds regex pattern caching**: Implements thread-safe LRU cache for compiled regex patterns with `sync.RWMutex` protection to eliminate repeated compilation overhead
- **Introduces tag parsing cache**: Caches parsed instill tag results to avoid expensive string parsing operations on every field processing
- **Pre-compiles time formats**: Stores common time parsing formats in global variables to eliminate repeated slice creation
- **Pre-computes file type slices**: Maintains global file type arrays for efficient type checking operations
- **Enhances JSON string to struct conversion**: Adds fast pre-check for JSON-like strings to minimize unnecessary parsing attempts
- **Consolidates test suite**: Combines all benchmarks into `struct_test.go` for comprehensive performance tracking

## Performance Improvements (Benchmarked)

| Optimization                    | Before                      | After                     | Improvement      | Memory Impact                |
| ------------------------------- | --------------------------- | ------------------------- | ---------------- | ---------------------------- |
| **Reflection Type Caching**     | 5.146 ns/op                 | 0.2555 ns/op              | **20.1x faster** | 0 B/op (no allocations)      |
| **Regex Pattern Caching**       | Repeated compilation        | LRU cached compilation    | **12.6x faster** | 100% memory reduction        |
| **Tag Parsing Optimization**    | String parsing every access | Cached parsing results    | **13.0x faster** | 100% memory reduction        |
| **Time Format Pre-compilation** | Format slice creation       | Pre-compiled arrays       | **1.03x faster** | Eliminates slice allocations |
| **File Type Checking**          | Type slice creation         | Pre-computed global array | **6.2x faster**  | Eliminates reflection calls  |

### Overall Performance Metrics

- **Complete Struct Unmarshaling**: 1635 ns/op, 496 B/op, 14 allocs/op
- **Complete Struct Marshaling**: 1476 ns/op, 1184 B/op, 23 allocs/op
- **Concurrent Access**: Regex cache (114.4 ns/op), Tag cache (98.70 ns/op)

## Technical Implementation Details

### 🏗️ **Architecture Enhancements**

1. **Pre-computed Global Variables**:

   ```go
   var (
       timeTimeType     = reflect.TypeOf(time.Time{})
       timeDurationType = reflect.TypeOf(time.Duration(0))
       formatValueType  = reflect.TypeOf((*format.Value)(nil)).Elem()
       // ... more pre-computed types
   )
   ```

2. **Thread-Safe Caching**:

   ```go
   type regexCache struct {
       cache map[string]*regexp.Regexp
       mu    sync.RWMutex
   }
   ```

3. **LRU Cache Implementation**:
   - Automatic eviction of least recently used entries
   - Configurable cache sizes for different use cases
   - Double-checked locking for optimal performance

4. **Fast JSON Pre-check**:

   ```go
   if len(stringValue) > 1 && (stringValue[0] == '{' || stringValue[0] == '[') {
       // Only attempt JSON parsing for JSON-like strings
   }
   ```

### 🧪 **Comprehensive Test Coverage**

- **287 unit tests** covering all functionality
- **9 benchmark suites** measuring performance improvements
- **Edge case testing** for concurrent access patterns
- **Memory allocation profiling** to prevent regressions
- **Performance regression detection** through continuous benchmarking

### 🔒 **Thread Safety & Reliability**

- All caches use `sync.RWMutex` for concurrent access
- Double-checked locking patterns for initialization
- Graceful fallback for cache misses
- Zero breaking changes to existing APIs
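The thread-safe caching and double-checked locking described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `regexCache` struct matches the one shown in the commit message, but the constructor `newRegexCache` and the method `get` are hypothetical names, and LRU eviction and size limits are left out to keep the locking pattern visible.

```go
package main

import (
	"regexp"
	"sync"
)

// regexCache has the same fields as the struct in the commit message.
type regexCache struct {
	cache map[string]*regexp.Regexp
	mu    sync.RWMutex
}

// newRegexCache is a hypothetical constructor (name assumed, not from the PR).
func newRegexCache() *regexCache {
	return &regexCache{cache: make(map[string]*regexp.Regexp)}
}

// get returns the compiled form of pattern, compiling it on the first
// request only. The method name and signature are assumptions.
func (c *regexCache) get(pattern string) (*regexp.Regexp, error) {
	// First check: shared read lock, so concurrent hits do not block each other.
	c.mu.RLock()
	re, ok := c.cache[pattern]
	c.mu.RUnlock()
	if ok {
		return re, nil
	}

	// Miss: take the exclusive lock and check again, because another
	// goroutine may have stored the pattern between the two locks.
	c.mu.Lock()
	defer c.mu.Unlock()
	if re, ok := c.cache[pattern]; ok {
		return re, nil
	}

	re, err := regexp.Compile(pattern)
	if err != nil {
		// Invalid patterns are not cached; the caller sees the compile error.
		return nil, err
	}
	c.cache[pattern] = re
	return re, nil
}
```

Calling `get` twice with the same pattern returns the same `*regexp.Regexp` pointer, so the compilation cost is paid once. A bounded variant would also need to track recency and evict entries under the write lock.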
jvallesm added a commit to instill-ai/instill-core that referenced this pull request Sep 23, 2025
Because

- The version of the pipeline-backend service is not updated in the instill-core repository.

This commit

- updates the `PIPELINE_BACKEND_VERSION` in the `.env` file to `1b4cd1f`.
- updates the `pipelineBackend.image.tag` in the helm chart values.yaml file to `1b4cd1f`.

## Changes in pipeline-backend

- fix(text): correct positions on duplicate markdown chunks (instill-ai/pipeline-backend#1120)
- refactor(component,generic,http): replace env-based URL validation with constructor injection (instill-ai/pipeline-backend#1121)
- fix(usage): add missing error filtering for users/admin (instill-ai/pipeline-backend#1119)
- feat(component,ai,gemini): implement File API support for large files… (instill-ai/pipeline-backend#1118)
- perf(data): optimize struct marshaling/unmarshaling with caching and … (instill-ai/pipeline-backend#1117)
- feat(data): enhance unmarshaler with JSON string to struct conversion (instill-ai/pipeline-backend#1116)
- feat(data): implement time types support with pattern validation (instill-ai/pipeline-backend#1115)
- feat(component,ai,gemini): add multimedia support with unified format… (instill-ai/pipeline-backend#1114)
- ci(workflows): adopt GitHub-hosted runner (instill-ai/pipeline-backend#1113)
- perf(data): enhance comprehensive format coverage and optimize test performance (instill-ai/pipeline-backend#1112)
- ci(workflows): adopt loarger runner for coverage test (instill-ai/pipeline-backend#1111)
- perf(component,operator,document): optimize unit tests and fix LibreOffice dependency failures (instill-ai/pipeline-backend#1110)
- perf(component,operator,video): optimize unit test performance by 59.7% (instill-ai/pipeline-backend#1109)
- perf(component,operator,image): optimize unit tests for 98.5% faster … (instill-ai/pipeline-backend#1107)
- ci(docker): optimize Dockerfiles with multi-stage builds for faster build times (instill-ai/pipeline-backend#1108)
- perf(data): implement automatic field naming convention detection with LRU caching (instill-ai/pipeline-backend#1105)
- feat(component,ai,gemini): enhance streaming to output all fields (instill-ai/pipeline-backend#1106)
- fix(component,ai,gemini): correct text-based documents logic (instill-ai/pipeline-backend#1103)
- test(component,generic,http): replace external httpbin.org dependency with local test server (instill-ai/pipeline-backend#1101)
- ci(docker): add GitHub fallback for ffmpeg installation (instill-ai/pipeline-backend#1102)

Co-authored-by: jvallesm <3977183+jvallesm@users.noreply.github.com>