Commit 44077eb
Parent(s): 25ca911
docs: minor README fixes
README.md CHANGED
@@ -148,17 +148,17 @@ Multimodal embeddings enable searching and understanding data across different m
 Built upon [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) and our recently released [`jina-embeddings-v3`](https://huggingface.co/jinaai/jina-embeddings-v3), `jina-clip-v2` features several significant improvements:
 
 * **Improved Performance**: v2 shows a 3% performance improvement over v1 in both text-image and text-text retrieval tasks. Similar to v1, v2's text encoder can serve as an effective multilingual long-context dense retriever. It performs on par with our frontier model `jina-embeddings-v3` (currently the best multilingual embeddings under 1B parameters on MTEB).
-* **Multilingual Support**:
+* **Multilingual Support**: Using the same backbone as `jina-embeddings-v3` for the text tower, `jina-clip-v2` supports 89 languages for multilingual-image retrieval, showing up to 4% improvement compared to `nllb-clip-large-siglip` on multilingual image retrieval tasks.
 * **Higher Image Resolution**: v2 now supports 512x512 input image resolution, a significant increase from v1's 224x224. This higher resolution enables better processing of detailed images, improved feature extraction, and more accurate recognition of fine-grained visual elements.
 * **Matryoshka Representations**: v2 allows users to truncate the output dimensions of both text and image embeddings from 1024 down to 64, reducing storage and processing overhead while maintaining strong performance.
 
 Measuring 0.9B parameters, `jina-clip-v2` combines two powerful encoders:
-* the text encoder `
+* the text encoder `Jina-XLM-RoBERTa` (the backbone of `jina-embeddings-v3`) and
 * the vision encoder `EVA02-L14` (an efficient vision Transformer developed by BAAI).
 
 | FEATURE               | TEXT ENCODER            | IMAGE ENCODER    |
 |-----------------------|-------------------------|------------------|
-| Base Model | Jina
+| Base Model            | Jina-XLM-RoBERTa        | EVA02-L          |
 | Parameters            | 561M                    | 304M             |
 | Input Specification   | 8,192 tokens (max)      | 512×512 pixels   |
 | Min Output Dimensions | 64                      | 64               |
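The Matryoshka representations added in this hunk can be used by simply slicing the embedding matrix. A minimal sketch, assuming NumPy output from `model.encode(...)` (the random array below is a stand-in for real model output, and renormalizing after truncation is an assumption for cosine-similarity search, not something the README prescribes):

```python
import numpy as np

# Stand-in for real model output: 2 embeddings of 1024 dims each
embeddings = np.random.randn(2, 1024).astype(np.float32)

# Matryoshka truncation: keep only the leading 64 dimensions
truncated = embeddings[:, :64]

# Renormalize so cosine similarity is again a plain dot product
truncated /= np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)  # (2, 64)
```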
@@ -330,12 +330,16 @@ sentences = [
 image_urls = ['https://i.ibb.co/nQNGqL0/beach1.jpg', 'https://i.ibb.co/r5w8hG8/beach2.jpg']
 
 # Encode text and images
-text_embeddings = model.encode(sentences)
-image_embeddings = model.encode(
+text_embeddings = model.encode(sentences, normalize_embeddings=True)
+image_embeddings = model.encode(
+    image_urls, normalize_embeddings=True
+) # also accepts PIL.Image.Image, local filenames, dataURI
 
 # Encode query text
 query = 'beautiful sunset over the beach' # English
-query_embeddings = model.encode(
+query_embeddings = model.encode(
+    query, prompt_name='retrieval.query', normalize_embeddings=True
+)
 ```
 </details>
 
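With `normalize_embeddings=True` as in the new version of this snippet, ranking the two images against the query reduces to a dot product. A short follow-on sketch using the variable names above; it assumes `model` was constructed earlier in the README (e.g. via sentence-transformers with `trust_remote_code=True`) and that `encode` returns array-like embeddings:

```python
import numpy as np

# Cosine similarity == dot product for L2-normalized embeddings
scores = np.asarray(image_embeddings) @ np.asarray(query_embeddings)

best = int(np.argmax(scores))
print(f'best match: {image_urls[best]} (score={scores[best]:.3f})')
```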
@@ -388,7 +392,7 @@ _, _, text_embeddings, image_embeddings = output
 
 ## License
 
-
+This model is licensed to download and run under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en). It is available for commercial use via the [Jina Embeddings API](https://jina.ai/embeddings/), [AWS](https://aws.amazon.com/marketplace/pp/prodview-bfbctuqmky676), [Azure](https://azuremarketplace.microsoft.com/en-gb/marketplace/apps/jinaai.jina-clip-v2-vm?tab=Overview), and [GCP](https://console.cloud.google.com/marketplace/browse?hl=en&inv=1&invt=AbiFWQ&q=jina). To download for commercial use, please [contact us](https://jina.ai/contact-sales).
 
 
 ## Contact