Encoding Image and Text Together

#66
by dawn17 - opened

In the processor's process_images function, I see that a hardcoded string, "Describe the image.", is passed along with the image.
https://huggingface.co/jinaai/jina-embeddings-v4/blob/main/modeling_jina_embeddings_v4.py#L62

If a document contains both an image and text, is there a way to encode them together, instead of encoding the image and text separately and fusing them later?
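For reference, the separate-then-fuse baseline mentioned above can be sketched as a simple late fusion of the two embeddings. This is not the library's API — the vectors below are stand-ins for whatever the model's text and image encoders return:

```python
import numpy as np

def fuse_embeddings(text_emb: np.ndarray, image_emb: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Late fusion: weighted average of L2-normalized text and image embeddings."""
    t = text_emb / np.linalg.norm(text_emb)
    i = image_emb / np.linalg.norm(image_emb)
    fused = alpha * t + (1 - alpha) * i
    return fused / np.linalg.norm(fused)

# Toy vectors; in practice these would come from encoding the document's
# text and image separately with the model.
text_emb = np.array([1.0, 0.0, 0.0])
image_emb = np.array([0.0, 1.0, 0.0])
doc_emb = fuse_embeddings(text_emb, image_emb)
print(doc_emb)
```

The downside, and the motivation for this question, is that the model never sees text and image in the same context window, so any interaction between them is lost.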

I don't see any way to do that right now. Maybe the hardcoded string could be replaced with the document's own text, but the function get_single_vector_embeddings doesn't consider text while pooling, so I'm not sure how well that would work. Alternatively, we could do it the way Qwen2.5-VL-3B-Instruct does:

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
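To illustrate the idea, here is a minimal sketch of how such a messages structure could be flattened into one interleaved input sequence. This is a simplified stand-in for what a processor's chat template does, and the placeholder token name is an assumption, not the model's actual token:

```python
IMAGE_PLACEHOLDER = "<|image_pad|>"  # assumed placeholder; real token depends on the model

def render_messages(messages) -> str:
    """Flatten Qwen-style multimodal messages into a single interleaved
    prompt string, replacing each image entry with a placeholder token so
    that text and image share one input sequence."""
    parts = []
    for msg in messages:
        for item in msg["content"]:
            if item["type"] == "image":
                parts.append(IMAGE_PLACEHOLDER)
            elif item["type"] == "text":
                parts.append(item["text"])
    return " ".join(parts)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(render_messages(messages))  # -> "<|image_pad|> Describe this image."
```

With an interleaved sequence like this, pooling over the hidden states would cover both the text tokens and the image tokens in one forward pass.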

Any comment/direction might help. Thanks so much :)

Another similar question:
If a document (doc) contains both TEXT and an IMAGE, are the relevance scores between the QUERY embedding and the embeddings of the TEXT and the IMAGE comparable?
If the modality gap were eliminated, they should theoretically be comparable.
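Concretely, both scores live on the same cosine scale, but that alone doesn't make them comparable. A toy illustration with stand-in embeddings (real ones would come from the model):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for model outputs.
query = np.array([0.9, 0.1, 0.0])
doc_text = np.array([1.0, 0.0, 0.0])
doc_image = np.array([0.0, 0.2, 1.0])

score_text = cosine(query, doc_text)
score_image = cosine(query, doc_image)
# Both scores are in [-1, 1], but if image embeddings cluster in a
# different region of the space (the modality gap), image scores are
# systematically shifted relative to text scores, so ranking them
# against each other can be misleading.
print(score_text, score_image)
```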

