I'm working on adding LLaVA to Bumblebee as a learning exercise.
I need some guidance on a few things:
- From the official implementation of LLaVA as seen here, they use ClipVisionModel from the Hugging Face transformers package to extract image features. Should I reimplement this, or reuse the existing ClipVisionModel implementation already in Bumblebee? (The first sketch after this list shows what I have in mind.)
- The implementations have a params_mapping section, for example for LLaMA here. How do I go about identifying the layers of the model and what they map to in the Axon model? (My current guess is sketched in the second block below.)
- I would also appreciate some guidance on implementing the core logic of the model (my rough understanding is outlined in the last sketch below).
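For the first point, my inclination is to reuse Bumblebee's existing CLIP vision model rather than reimplementing it. Something along these lines is what I have in mind; the module name Bumblebee.Vision.ClipVision and the checkpoint are just my assumptions for illustration:

```elixir
# Sketch: load Bumblebee's existing CLIP vision model to use as the LLaVA
# vision tower. The checkpoint and module name are assumptions on my part.
{:ok, clip_vision} =
  Bumblebee.load_model({:hf, "openai/clip-vit-large-patch14-336"},
    module: Bumblebee.Vision.ClipVision
  )

# clip_vision.model is the Axon graph and clip_vision.params are the loaded
# weights; the hidden state of this model would be the image features that
# get fed into the LLaVA projector.
```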
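For the params_mapping question, my current reading (which may well be wrong) is that it maps Axon layer names to the corresponding parameter prefixes in the PyTorch checkpoint, with {n} standing for the block index, and that it is declared in a Bumblebee.HuggingFace.Transformers.Model defimpl. A hypothetical fragment for a LLaVA-style spec, where the module name and every concrete layer name are made up by me:

```elixir
defmodule Bumblebee.Multimodal.Llava do
  # ... spec struct and Axon model definition elided ...

  defimpl Bumblebee.HuggingFace.Transformers.Model do
    # Hypothetical mapping: Axon layer names (left) to checkpoint parameter
    # prefixes (right); {n} is expanded per decoder block. All concrete names
    # here are guesses for illustration only.
    def params_mapping(_spec) do
      %{
        "embedder.token_embedding" => "language_model.model.embed_tokens",
        "decoder.blocks.{n}.self_attention.query" =>
          "language_model.model.layers.{n}.self_attn.q_proj",
        "decoder.blocks.{n}.self_attention.key" =>
          "language_model.model.layers.{n}.self_attn.k_proj",
        "projector.dense_1" => "multi_modal_projector.linear_1",
        "projector.dense_2" => "multi_modal_projector.linear_2"
      }
    end
  end
end
```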
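For the core logic, my understanding from the LLaVA paper and the reference implementation is: the CLIP vision tower produces patch features, a projector (a single linear layer in the original paper, a small MLP in later versions) maps them into the language model's embedding space, and the projected image tokens are concatenated with the text token embeddings before being run through the LLaMA decoder. A minimal Axon sketch of just that projection/concatenation step, where the shapes, layer names, and the MLP projector are illustrative assumptions:

```elixir
# Minimal sketch of the LLaVA projector + embedding concatenation in Axon.
# Shapes and layer names are illustrative, not the real implementation.
image_features = Axon.input("image_features", shape: {nil, 576, 1024})
token_embeddings = Axon.input("token_embeddings", shape: {nil, nil, 4096})

projected_image =
  image_features
  |> Axon.dense(4096, name: "projector.dense_1")
  |> Axon.activation(:gelu)
  |> Axon.dense(4096, name: "projector.dense_2")

# Prepend the projected image tokens to the text embeddings along the
# sequence axis; the combined sequence would then go through the LLaMA decoder.
combined = Axon.concatenate([projected_image, token_embeddings], axis: 1)
```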
The transformers package has not added support for LLaVA yet; there is an ongoing PR that can be found here, but it has not been merged.
Thanks.