Closed
Labels
good first issue
Description
Hugging Face model: https://huggingface.co/microsoft/Phi-3.5-vision-instruct
Fine-tuning dataset: https://huggingface.co/datasets/linxy/LaTeX_OCR
Fine-tuning a multimodal large model usually means training it on a custom dataset. Here we walk through a runnable demo.
Before fine-tuning, make sure your environment is set up:
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e '.[llm]'
Inference
# ModelScope
CUDA_VISIBLE_DEVICES=0 swift infer \
--model_type phi3_5-vision-instruct \
--use_flash_attn false
# HuggingFace
USE_HF=1 CUDA_VISIBLE_DEVICES=0 swift infer \
--model_type phi3_5-vision-instruct \
--model_id_or_path microsoft/Phi-3.5-vision-instruct \
--use_flash_attn false
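The two commands differ only in the `USE_HF=1` environment variable and the `--model_id_or_path` flag. As a minimal sketch, a hypothetical helper (`phi35_infer_cmd` is not part of ms-swift) can print the matching command for either hub, using only the flags from the commands above:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ms-swift): print the `swift infer`
# command for the chosen hub. Argument: "hf" for Hugging Face, anything
# else (default) for ModelScope.
phi35_infer_cmd() {
  local hub="${1:-modelscope}"
  local envs="CUDA_VISIBLE_DEVICES=0"
  local args="--model_type phi3_5-vision-instruct"
  if [ "$hub" = "hf" ]; then
    # Hugging Face: set USE_HF=1 and point at the HF repo id
    envs="USE_HF=1 $envs"
    args="$args --model_id_or_path microsoft/Phi-3.5-vision-instruct"
  fi
  echo "$envs swift infer $args --use_flash_attn false"
}
```

For example, `eval "$(phi35_infer_cmd hf)"` would launch the Hugging Face variant. `eval` is used rather than piping into `bash` so that the interactive prompt still reads from the terminal.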
Results
<<< who are you
I am Phi, an AI developed by Microsoft to assist with providing information, answering questions, and helping users find solutions to their queries. How can I assist you today?
--------------------------------------------------
<<< <image>please describe the image.
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
The image features a close-up of a kitten with striking blue eyes and a white and grey striped coat. The kitten's fur is soft and fluffy, and it appears to be looking directly at the camera with a curious and innocent expression. The background is blurred, which puts the focus entirely on the kitten's face.
--------------------------------------------------
<<< <image>What is the result of the calculation?
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/math.png
The result of the calculation 1452 + 45304 is 46756.
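The addition the model reports in its last answer can be checked with shell arithmetic expansion. This checks only the sum of the numbers the model read, not whether it read the image correctly; `model_answer` is an illustrative variable name:

```shell
#!/usr/bin/env bash
# Compare the model's reported sum against the exact value of 1452 + 45304.
model_answer=46756
exact=$((1452 + 45304))
if [ "$exact" -eq "$model_answer" ]; then
  echo "correct: $exact"
else
  echo "mismatch: expected $exact, model said $model_answer"
fi
```

This prints `correct: 46756`.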
GPU Memory:
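To measure GPU memory yourself, one option is to sample `nvidia-smi` from a second terminal while `swift infer` is running. This is a minimal sketch that assumes an NVIDIA driver with `nvidia-smi` and the coreutils `timeout` command on the PATH. `peak_mem_mib` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one memory.used value (MiB) per line from stdin
# and print the largest value seen once the input ends (0 if no input).
peak_mem_mib() {
  awk 'NR == 1 || $1 > max { max = $1 } END { print max + 0 }'
}

# Usage: sample GPU index 0 (as numbered by nvidia-smi) once per second for
# 60 seconds; timeout stops nvidia-smi so awk reaches END and prints the peak.
#   timeout 60 nvidia-smi -i 0 --query-gpu=memory.used \
#     --format=csv,noheader,nounits -l 1 | peak_mem_mib
```

nvidia-smi index 0 is usually the device selected by `CUDA_VISIBLE_DEVICES=0`. CUDA and nvidia-smi can number GPUs differently on multi-GPU machines, however, unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.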
