Encoding Image and Text Together

#66
by dawn17 - opened

In the processor's process_images function, I see that a hardcoded string, "Describe the image.", is passed along with the image.
https://huggingface.co/jinaai/jina-embeddings-v4/blob/main/modeling_jina_embeddings_v4.py#L62

If a document contains both an image and text, is there a way to encode them together, instead of encoding the image and text separately and fusing them later?
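For reference, the separate-then-fuse baseline mentioned above can be sketched as a simple late fusion of the two embeddings. This is not the library's API — the vectors below are stand-ins for whatever the model's text and image encoders return:

```python
import numpy as np

def fuse_embeddings(text_emb: np.ndarray, image_emb: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Late fusion: weighted average of L2-normalized text and image embeddings."""
    t = text_emb / np.linalg.norm(text_emb)
    i = image_emb / np.linalg.norm(image_emb)
    fused = alpha * t + (1 - alpha) * i
    return fused / np.linalg.norm(fused)

# Toy vectors; in practice these would come from encoding the document's
# text and image separately with the model.
text_emb = np.array([1.0, 0.0, 0.0])
image_emb = np.array([0.0, 1.0, 0.0])
doc_emb = fuse_embeddings(text_emb, image_emb)
print(doc_emb)
```

The downside, and the motivation for this question, is that the model never sees text and image in the same context window, so any interaction between them is lost.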

I don't see any way to do that right now. Maybe the hardcoded string could be replaced with the document's own text, but the function get_single_vector_embeddings doesn't consider text while pooling, so I'm not sure how well that would work. Alternatively, we could do it the way Qwen2.5-VL-3B-Instruct does:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
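To illustrate the idea, here is a minimal sketch of how such a messages structure could be flattened into one interleaved input sequence. This is a simplified stand-in for what a processor's chat template does, and the placeholder token name is an assumption, not the model's actual token:

```python
IMAGE_PLACEHOLDER = "<|image_pad|>"  # assumed placeholder; real token depends on the model

def render_messages(messages) -> str:
    """Flatten Qwen-style multimodal messages into a single interleaved
    prompt string, replacing each image entry with a placeholder token so
    that text and image share one input sequence."""
    parts = []
    for msg in messages:
        for item in msg["content"]:
            if item["type"] == "image":
                parts.append(IMAGE_PLACEHOLDER)
            elif item["type"] == "text":
                parts.append(item["text"])
    return " ".join(parts)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(render_messages(messages))  # -> "<|image_pad|> Describe this image."
```

With an interleaved sequence like this, pooling over the hidden states would cover both the text tokens and the image tokens in one forward pass.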

Any comment/direction might help. Thanks so much :)

Another similar question:
If a document (doc) contains both TEXT and an IMAGE, are the relevance scores between the QUERY embedding and the embeddings of the TEXT and the IMAGE comparable?
If the modality gap were eliminated, they should theoretically be comparable.
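Concretely, both scores live on the same cosine scale, but that alone doesn't make them comparable. A toy illustration with stand-in embeddings (real ones would come from the model):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for model outputs.
query = np.array([0.9, 0.1, 0.0])
doc_text = np.array([1.0, 0.0, 0.0])
doc_image = np.array([0.0, 0.2, 1.0])

score_text = cosine(query, doc_text)
score_image = cosine(query, doc_image)
# Both scores are in [-1, 1], but if image embeddings cluster in a
# different region of the space (the modality gap), image scores are
# systematically shifted relative to text scores, so ranking them
# against each other can be misleading.
print(score_text, score_image)
```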

