# LlamaIndex Multi-Modal-LLMs Integration: Hugging Face

This project integrates Hugging Face's multimodal language models into the LlamaIndex framework, enabling advanced multimodal capabilities for various AI applications.

## Features

- Seamless integration of Hugging Face multimodal models with LlamaIndex
- Support for multiple state-of-the-art vision-language models and their **fine-tuned variants**:
  - [Qwen2 Vision](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
  - [Florence2](https://huggingface.co/collections/microsoft/florence-6669f44df0d87d9c3bfb76de)
  - [Phi-3.5 Vision](https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3)
  - [PaLI-Gemma](https://huggingface.co/collections/google/paligemma-release-6643a9ffbf57de2ae0448dda)
- Easy-to-use interface for multimodal tasks like image captioning and visual question answering
- Configurable model parameters (device, precision, and generation settings)

---

## Author of this Integration: [GitHub](https://github.com/g-hano) | [LinkedIn](https://www.linkedin.com/in/chanyalcin/) | [Email](mailto:mcihan.yalcin@outlook.com)

## Installation

```bash
pip install llama-index-multi-modal-llms-huggingface
```

Make sure to set your Hugging Face API token as an environment variable:

```bash
export HF_TOKEN=your_huggingface_token_here
```
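
If you prefer to authenticate from Python rather than the shell, the `huggingface_hub` package (pulled in as a dependency of `transformers`) can log in programmatically. This is a minimal sketch; reading the token from the `HF_TOKEN` environment variable is just one convention:

```python
import os

from huggingface_hub import login

# Authenticate against the Hugging Face Hub with the token from the environment.
login(token=os.environ["HF_TOKEN"])
```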

## Usage

Here's a basic example of how to use the Hugging Face multimodal integration:

```python
from llama_index.core.schema import ImageDocument
from llama_index.multi_modal_llms.huggingface import HuggingFaceMultiModal

# Initialize the model
model = HuggingFaceMultiModal.from_model_name("Qwen/Qwen2-VL-2B-Instruct")

# Prepare your image and prompt
image_document = ImageDocument(image_path="path/to/your/image.jpg")
prompt = "Describe this image in detail."

# Generate a response
response = model.complete(prompt, image_documents=[image_document])

print(response.text)
```
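
For long outputs you can stream tokens as they are generated instead of waiting for the full completion. This sketch assumes the synchronous `stream_complete` method of LlamaIndex's multi-modal LLM interface (async streaming is not supported, see Limitations below):

```python
# Reuses `model`, `prompt`, and `image_document` from the example above.
for chunk in model.stream_complete(prompt, image_documents=[image_document]):
    print(chunk.delta, end="", flush=True)
```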

You can also refer to the [example notebook](examples/huggingface_multimodal.ipynb).

## Supported Models

1. Qwen2VisionMultiModal
2. Florence2MultiModal
3. Phi35VisionMultiModal
4. PaliGemmaMultiModal

Each model has its unique capabilities and can be selected based on your specific use case.
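
Which wrapper you get is determined by the checkpoint you pass to `from_model_name` (shown in the usage example above). The repo IDs below are illustrative examples of checkpoints from each family, not an exhaustive list; fine-tuned variants with the same architecture are expected to work the same way:

```python
from llama_index.multi_modal_llms.huggingface import HuggingFaceMultiModal

# Each family is handled by one of the wrapper classes listed above.
model = HuggingFaceMultiModal.from_model_name("Qwen/Qwen2-VL-2B-Instruct")  # Qwen2VisionMultiModal
# model = HuggingFaceMultiModal.from_model_name("microsoft/Florence-2-base")  # Florence2MultiModal
# model = HuggingFaceMultiModal.from_model_name("microsoft/Phi-3.5-vision-instruct")  # Phi35VisionMultiModal
# model = HuggingFaceMultiModal.from_model_name("google/paligemma-3b-mix-224")  # PaliGemmaMultiModal
```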

## Configuration

You can configure various parameters when initializing a model:

```python
import torch

from llama_index.multi_modal_llms.huggingface import HuggingFaceMultiModal

model = HuggingFaceMultiModal(
    model_name="Qwen/Qwen2-VL-2B-Instruct",
    device="cuda",  # or "cpu"
    torch_dtype=torch.float16,  # model precision
    max_new_tokens=100,  # maximum length of the generated response
    temperature=0.7,  # sampling temperature
)
```
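
With the model configured, visual question answering works the same way as image captioning: pass a targeted question together with the image. This is a minimal sketch that reuses the `complete` call from the usage example above; the image path is a placeholder:

```python
from llama_index.core.schema import ImageDocument

# Ask a targeted question about the image instead of requesting a caption.
image_doc = ImageDocument(image_path="path/to/your/image.jpg")
answer = model.complete(
    "How many people are visible in this image?",
    image_documents=[image_doc],
)
print(answer.text)
```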

## Limitations

- Async streaming is not supported for any of the models.
- Some models have specific requirements or limitations. Please refer to the individual model classes for details.