Closed
Description
Prerequisites
- I have written a descriptive issue title
- I have verified that I am running the latest version of ImageSharp
- I have searched open and closed issues to ensure it has not already been reported
The problem
- There are many places in the library where we operate on relatively small matrices known at compile time using a dynamically allocated heap array (usually wrapped in `Fast2DArray`):
  - `ConvolutionProcessor`-s
  - `PatternBrush`-es
  - `ResizeProcessor` weight windows
- This approach is easy to code, but has implicit overheads with a significant negative effect on performance:
  - Indexing heap arrays increases the probability of cache misses
  - The JIT is not able to apply optimized register allocation with dynamic arrays. The data is always accessed by "read from memory" (`mov`?) CPU instructions.
  - Calculations are iterating using nested for loops instead of having a simple expression
- All in all: the generated machine code is more complex, longer, slower to execute than it should be
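For illustration, the heap-based pattern described above looks roughly like the following C# sketch. `HeapKernel` and `HeapConvolution` are hypothetical stand-ins written for this example; they are not ImageSharp's actual `Fast2DArray` or `ConvolutionProcessor` code.

```csharp
using System;

// Hypothetical stand-in for a heap-backed kernel (not ImageSharp's Fast2DArray API).
public sealed class HeapKernel
{
    private readonly float[] data;

    public HeapKernel(int width, int height, float[] data)
    {
        if (data.Length != width * height)
        {
            throw new ArgumentException("Data length must equal width * height.", nameof(data));
        }

        this.Width = width;
        this.Height = height;
        this.data = data;
    }

    public int Width { get; }

    public int Height { get; }

    // Every read goes through the heap array, using a size that is only known at run time.
    public float this[int row, int column] => this.data[(row * this.Width) + column];
}

public static class HeapConvolution
{
    // Weighted sum of the kernel-sized window whose top-left corner is (x, y).
    // The loop bounds are run-time values, so the work stays a nested loop.
    public static float Apply(HeapKernel kernel, float[] source, int sourceWidth, int x, int y)
    {
        float sum = 0F;
        for (int ky = 0; ky < kernel.Height; ky++)
        {
            for (int kx = 0; kx < kernel.Width; kx++)
            {
                sum += kernel[ky, kx] * source[((y + ky) * sourceWidth) + x + kx];
            }
        }

        return sum;
    }
}
```

Nothing in this code tells the compiler that the kernel might always be, say, 3x3, which is the information the proposal below tries to expose.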
The solution
- Define `KernelMatrix{N}x{M}` structures for common kernel sizes (like `KernelMatrix9x3`, `KernelMatrix4x1`, etc.)
- Use specific kernel matrix structures (known at compile time!) as generic arguments in implementation-related classes (`ConvolutionProcessor<TColor, TKernelMatrix>`!)
  - Using `KernelMatrix9x1`-like constructs as resampler windows could bring significant performance improvements for `ResizeProcessor` (see Improve ResizeProcessor performance #139)
- Implement all kernel-related calculations as simple expressions without for loops, etc. (like the implementations of `Matrix4x4` methods)
- Use a general matrix for unknown sizes (something similar to `Fast2DArray<float>`)
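A minimal sketch of how this could look, under some assumptions: the `IKernelMatrix` interface, the member layout of `KernelMatrix3x3`, and the `ConvolutionSketch<TKernelMatrix>` class are all hypothetical names chosen for illustration, and the sketch works on a single `float` channel instead of `TColor`. Because the .NET runtime compiles a separate specialization of generic code for each value-type argument, the call to `Apply` can be resolved at JIT time.

```csharp
using System;

// Hypothetical contract for fixed-size kernels (an assumption, not an ImageSharp type).
public interface IKernelMatrix
{
    int Width { get; }

    int Height { get; }

    // Weighted sum of the window whose top-left corner is (x, y).
    float Apply(float[] source, int sourceWidth, int x, int y);
}

// Value-type 3x3 kernel: the weights are plain fields, the size is a constant.
public struct KernelMatrix3x3 : IKernelMatrix
{
    public float M11, M12, M13;
    public float M21, M22, M23;
    public float M31, M32, M33;

    public int Width => 3;

    public int Height => 3;

    // One flat expression instead of nested loops, in the style of Matrix4x4 methods.
    public float Apply(float[] source, int sourceWidth, int x, int y)
    {
        int row0 = (y * sourceWidth) + x;
        int row1 = row0 + sourceWidth;
        int row2 = row1 + sourceWidth;

        return (this.M11 * source[row0]) + (this.M12 * source[row0 + 1]) + (this.M13 * source[row0 + 2])
             + (this.M21 * source[row1]) + (this.M22 * source[row1 + 1]) + (this.M23 * source[row1 + 2])
             + (this.M31 * source[row2]) + (this.M32 * source[row2 + 1]) + (this.M33 * source[row2 + 2]);
    }
}

// The kernel type is a generic argument constrained to struct, so it is known at compile time.
public sealed class ConvolutionSketch<TKernelMatrix>
    where TKernelMatrix : struct, IKernelMatrix
{
    // Deliberately not readonly: calling members on a readonly struct field may copy the struct.
    private TKernelMatrix kernel;

    public ConvolutionSketch(TKernelMatrix kernel)
    {
        this.kernel = kernel;
    }

    // Applies the kernel at every position where it fits entirely inside the source.
    public float[] Apply(float[] source, int width, int height)
    {
        int outWidth = width - this.kernel.Width + 1;
        int outHeight = height - this.kernel.Height + 1;
        var result = new float[outWidth * outHeight];

        for (int y = 0; y < outHeight; y++)
        {
            for (int x = 0; x < outWidth; x++)
            {
                result[(y * outWidth) + x] = this.kernel.Apply(source, width, x, y);
            }
        }

        return result;
    }
}
```

A caller would construct it as `new ConvolutionSketch<KernelMatrix3x3>(kernel)`. Whether this actually beats the heap-based version is exactly what the benchmark in the TODO list is meant to establish.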
TODO
- Implement a prototype/emulation benchmark comparing 2 Convolution implementations: heap matrix VS value matrix
- Design and implement stuff based on conclusions
Remarks
The concept is very similar to the idea behind Block8x8F in Jpeg. Utilizing information that is known at compile time is a very efficient technique to improve performance.