
Commit

Update readme.md
kijai authored Dec 17, 2024
1 parent 879b039 commit 8a27384
Showing 1 changed file with 3 additions and 6 deletions.
readme.md (9 changes: 3 additions & 6 deletions)
@@ -29,22 +29,19 @@ Use the original `xtuner/llava-llama-3-8b-v1_1-transformers` model which include

 **Note:** It's recommended to offload the text encoder since the vision tower requires additional VRAM.

-## Step 2: Set Model Type
-Set the `lm_type` to `vision_language`.
-
-## Step 3: Load and Connect Image
+## Step 2: Load and Connect Image
 - Use the comfy native node to load the image.
 - Connect the loaded image to the `Hunyuan TextImageEncode` node.
 - You can connect up to 2 images to this node.

-## Step 4: Prompting with Images
+## Step 3: Prompting with Images
 - Reference the image in your prompt by including `<image>`.
 - The number of `<image>` tags should match the number of images provided to the sampler.
 - Example prompt: `Describe this <image> in great detail.`

 You can also choose to give CLIP a prompt that does not reference the image separately.

-## Step 5: Advanced Configuration - `image_token_selection_expression`
+## Step 4: Advanced Configuration - `image_token_selection_expression`
 This expression is for advanced users and serves as a boolean mask to select which part of the image hidden state will be used for conditioning. Here are some details and recommendations:

 - The hidden state sequence length (or number of tokens) per image in llava-llama-3 is 576.
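For the prompting rule shown above (Step 3 in the updated readme), here is a minimal illustrative sketch of the one-`<image>`-tag-per-image convention; the file names and the standalone script are hypothetical and not part of the node's API:

```python
# Illustrative only: the prompt should contain exactly one <image> placeholder
# per image connected to the Hunyuan TextImageEncode node (up to 2 images).
images = ["reference_a.png", "reference_b.png"]  # hypothetical input images
prompt = "Describe this <image> and this <image> in great detail."

assert prompt.count("<image>") == len(images), "one <image> tag per input image"
```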
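The `image_token_selection_expression` in Step 4 is described as a boolean mask over the per-image hidden state. As a rough conceptual sketch only (assuming NumPy, an illustrative hidden size, and a hand-built mask in place of the node's actual expression parser), selecting a subset of the 576 image tokens works like this:

```python
import numpy as np

tokens_per_image = 576        # hidden-state sequence length per image in llava-llama-3
hidden_dim = 4096             # assumed hidden size, for illustration only
hidden_state = np.random.randn(tokens_per_image, hidden_dim)

# Boolean mask over the token indices; as an example, keep every 4th token.
idx = np.arange(tokens_per_image)
mask = (idx % 4 == 0)

conditioning_tokens = hidden_state[mask]  # only the selected tokens are used for conditioning
print(conditioning_tokens.shape)          # (144, 4096)
```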
