Description
Feature Request: Allow "Best-Effort" Optimization for Custom Models via ipex.llm.optimize
Motivation:
The `ipex.llm.optimize` API is a powerful entry point for accelerated inference with supported LLM families on Intel XPUs. However, its current design seems tightly coupled to these specific, verified architectures, and the generic `ipex.optimize` path is often limited for models outside that list.
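For reference, a minimal sketch of how the API is typically used today with a supported model family (the model name and exact keyword arguments are illustrative; see the IPEX LLM examples for the authoritative signature):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

# Load a model from one of the verified, supported LLM families.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
).eval().to("xpu")

# Apply the LLM-specific optimizations; this works because the model type is recognized.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, device="xpu", inplace=True)
```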
Problem Description:
Developers working with custom decoder models face challenges when trying to leverage `ipex.llm.optimize`.
These models include:
- Smaller, domain-specific decoders tailored for particular tasks.
- Decoder components within larger Vision Language Models (VLMs).
- Novel architectures developed during research.
Currently, applying `ipex.llm.optimize` to such models often requires non-trivial workarounds, such as modifying the model's `config.json` or monkey-patching the model to make it appear as one of the supported types. This process is indirect, adds development overhead, and isn't guaranteed to apply optimizations correctly.
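As an illustration, such a workaround looks roughly like the following sketch (the custom model name and the overridden `model_type` value are hypothetical and depend on which supported family the custom model most resembles):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

# A custom decoder whose architecture is not on the verified list.
model = AutoModelForCausalLM.from_pretrained(
    "my-org/custom-decoder", trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().to("xpu")

# Workaround: pretend to be a supported family so ipex.llm.optimize takes the LLM path.
# This is fragile -- the optimizations applied may not match the real architecture.
model.config.model_type = "llama"  # hypothetical override

model = ipex.llm.optimize(model, dtype=torch.bfloat16, device="xpu", inplace=True)
```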
Proposed Solution:
Introduce a pathway for `ipex.llm.optimize` to apply optimizations on a "best-effort" basis to models not explicitly listed as supported. This could involve:
- An Opt-in Mechanism: A boolean flag like `attempt_optimization_on_unsupported=True` could allow users to explicitly request optimization, acknowledging it might not be fully tuned or guaranteed (see the usage sketch after this list).
- Heuristic-Based Optimization: The optimizer could inspect the provided `torch.nn.Module` and apply optimizations known to be generally applicable to transformer decoder blocks (e.g., optimizing linear layers, specific activation functions, or KV caching if the relevant patterns are detected) without relying on exact model family identification.
- User Hints (Optional): Potentially allow users to provide basic hints about the model structure if needed (though a fully automatic approach is preferred).
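A minimal sketch of what the opt-in path could look like from the caller's side; the `attempt_optimization_on_unsupported` flag is the hypothetical one proposed above and does not exist in IPEX today:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "my-org/custom-decoder", trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().to("xpu")

# Hypothetical opt-in: apply generic decoder optimizations (fused linear layers,
# activation fusion, KV-cache handling where detected) on a best-effort basis,
# with no guarantee that the result is fully tuned for this architecture.
model = ipex.llm.optimize(
    model,
    dtype=torch.bfloat16,
    device="xpu",
    inplace=True,
    attempt_optimization_on_unsupported=True,  # proposed flag, not an existing API
)
```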
Benefits:
- Reduced Friction: Lowers the barrier for developers to experiment with IPEX optimizations on custom models.
- Faster Iteration: Enables quicker testing and deployment of optimized custom architectures.
- Broader Applicability: Extends the reach and utility of IPEX optimizations beyond the core supported model list.
- Flexibility: Allows optimizing components (like VLM decoders) independently.
Conclusion:
Providing a mechanism, even if experimental or "best-effort," to apply `ipex.llm.optimize` to a wider range of decoder-like models would be a valuable addition for the community building and deploying custom AI solutions on Intel hardware.