YuPeng0214 committed on
Commit
a7d7526
·
verified ·
1 Parent(s): 4b7309e

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -40,3 +40,9 @@ image-16.png filter=lfs diff=lfs merge=lfs -text
40
  image-18.png filter=lfs diff=lfs merge=lfs -text
41
  image-9.png filter=lfs diff=lfs merge=lfs -text
42
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
43
+ assets/image-1.png filter=lfs diff=lfs merge=lfs -text
44
+ assets/image-10.png filter=lfs diff=lfs merge=lfs -text
45
+ assets/image-11.png filter=lfs diff=lfs merge=lfs -text
46
+ assets/image-16.png filter=lfs diff=lfs merge=lfs -text
47
+ assets/image-18.png filter=lfs diff=lfs merge=lfs -text
48
+ assets/image-9.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,214 @@
1
+ ---
2
+ license: apache-2.0
3
+ tags:
4
+ - sentence-transformers
5
+ - sentence-similarity
6
+ - mteb
7
+ - retriever
8
+ - text-embeddings-inference
9
+ ---
10
+ # QZhou-Embedding
11
+ <div align="center">
12
+ <img src="assets/image-1.png" width="800" height="300"></img>
13
+ </div>
14
+
15
+ ## Introduction
16
+ We have released <a href="https://huggingface.co/Kingsoft-LLM/QZhou-Embedding">QZhou-Embedding</a> ("Qingzhou Embedding"), a general-purpose text embedding model that excels at a range of embedding tasks: retrieval, reranking, sentence similarity, and classification. It builds on the general language capabilities that its base model acquired from pre-training on massive amounts of text. QZhou-Embedding is further trained on millions of samples from high-quality open-source embedding datasets and over 5 million high-quality synthetic samples (produced with two synthesis techniques: rewriting and expansion). A first stage of retrieval training gives the model a foundation in query-document semantic matching. A later stage of multi-dimensional training, covering STS, clustering, and other tasks, improves performance across task types. QZhou-Embedding has 7B parameters and can embed texts of up to 8k tokens. It achieved the highest average score on the mteb/cmteb evaluation benchmarks, as well as the highest average scores on the clustering, sentence-pair classification, reranking, and STS tasks.
17
+ ## Basic Features
18
+
19
+ - Powerful text embedding capabilities;
20
+ - Long context: up to 8k context length;
21
+ - 7B parameter size
22
+
23
+
24
+ ## Technical Introduction
25
+ ### Unified Task Modeling Framework
26
+ We cast text embedding objectives as three modeling and optimization problems, and propose a unified scheme for structuring training data together with a matching training mechanism. With this approach, most open-source data can be used as retrieval training sets. Data can be structured as follows:
27
+ - Retrieval
28
+     - title-body
29
+     - title-abstract
30
+     - Question answering datasets
31
+     - Reading comprehension
32
+     - ...
33
+
34
+ - STS
35
+     - text pair + label in {true, false} or {yes, no}
36
+     - text pair + score (such as 0.2, 3.1, 4.8, etc.)
37
+     - NLI dataset: text pair + label in {'entailment', 'neutral', 'contradiction'}
38
+
39
+ - CLS
40
+     - text + class label
41
+
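As a rough illustration of the three data shapes listed above, the records could be represented as below. This is only a sketch: the class names, field names, and sample texts are hypothetical and chosen here for illustration, since the README does not specify the actual on-disk training format.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class RetrievalExample:
    """A query paired with a relevant document (title -> body, question -> answer, ...)."""
    query: str
    positive: str
    negatives: List[str] = field(default_factory=list)  # optional hard negatives


@dataclass
class STSExample:
    """A text pair with either a categorical label or a graded similarity score."""
    text_a: str
    text_b: str
    label: Union[str, float]


@dataclass
class CLSExample:
    """A single text with its class label."""
    text: str
    label: str


examples = [
    RetrievalExample(
        query="Who invented the telephone?",
        positive="Alexander Graham Bell is credited with inventing the first practical telephone.",
    ),
    STSExample("A man is playing a guitar.", "Someone is playing an instrument.", "entailment"),
    STSExample("A cat is sleeping.", "A dog is running.", 0.2),
    CLSExample("The battery drains within an hour.", "negative"),
]
```

Keeping the three shapes separate makes it straightforward to route each record to the loss that matches its task type.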
42
+ <div align="center"><img src="assets/image-18.png" width="1000" height="600"></img></div>
43
+ <div align="center"><img src="assets/image-16.png" width="1000" height="550"></img></div>
44
+
45
+ ### Training Objectives
46
+
47
+ - Retrieval: Apply the InfoNCE contrastive loss and, following gte/qwen3-embedding, add query-query negatives to the denominator.<br>
48
+ $$
49
+ L_{ret}=-\frac{1}{n}\sum_{i}\log\frac{e^{sim(q_i,d_i^+)/\tau}}{e^{sim(q_i,d_i^+)/\tau}+\sum_{j}e^{sim(q_i,d_j^-)/\tau}+\sum_{j\neq i}e^{sim(q_i,q_j)/\tau}}
50
+ $$
51
+
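The retrieval objective above can be sketched in plain Python as follows. This is a minimal illustration, not the training code: it assumes `sim` is cosine similarity and that the negatives $d_j^-$ form a single pool shared by every query in the batch. The function names and the default `tau=0.05` are chosen here for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieval_loss(queries, positives, negatives, tau=0.05):
    """InfoNCE loss with extra query-query negatives in the denominator.

    queries[i] and positives[i] form a matching pair; `negatives` is a pool
    of negative documents shared by all queries (an assumption of this sketch).
    """
    n = len(queries)
    total = 0.0
    for i, q in enumerate(queries):
        pos = math.exp(cosine(q, positives[i]) / tau)
        neg_docs = sum(math.exp(cosine(q, d) / tau) for d in negatives)
        # Other queries in the batch act as additional negatives (j != i).
        neg_queries = sum(
            math.exp(cosine(q, queries[j]) / tau) for j in range(n) if j != i
        )
        total += -math.log(pos / (pos + neg_docs + neg_queries))
    return total / n
```

Dropping the `neg_queries` term recovers the standard InfoNCE loss over documents only.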
52
+ - STS: Apply the CoSENT loss:
53
+ $$
54
+ L_{cosent}=\log\bigg(1+\sum_{sim(i,j)>sim(k,l)}\exp\Big(\frac{sim(x_k,x_l)-sim(x_i,x_j)}{\tau}\Big)\bigg)
55
+ $$
56
+
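A small sketch of the CoSENT objective above, in plain Python. It assumes the condition under the sum refers to the gold labels (pair $(i,j)$ is annotated as more similar than pair $(k,l)$), while the terms inside the exponent are the model's predicted similarities. The function name and the default `tau=0.05` are illustrative choices, not taken from the training code.

```python
import math


def cosent_loss(pred_sims, gold_scores, tau=0.05):
    """CoSENT ranking loss over a batch of text pairs.

    pred_sims[p]   -- model similarity for pair p (e.g. cosine similarity)
    gold_scores[p] -- annotated similarity or label for pair p
    Every ordered pair of pairs whose gold scores differ contributes a term
    that penalises the model for ranking them the wrong way round.
    """
    total = 0.0
    for s_hi, g_hi in zip(pred_sims, gold_scores):
        for s_lo, g_lo in zip(pred_sims, gold_scores):
            if g_hi > g_lo:  # gold says the first pair is more similar
                total += math.exp((s_lo - s_hi) / tau)
    return math.log1p(total)  # log(1 + sum)
```

Because only the relative order of gold scores matters, the same function covers binary labels, NLI labels mapped to ranks, and graded scores.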
57
+ - CLS: Apply the same InfoNCE loss as for retrieval. Because in-batch negatives are very likely to include samples of the same class as the anchor, a mask removes same-class samples from the negatives shared across samples.
58
+ $$
59
+ L_{cls}=-\frac{1}{n}\sum_{i}\log\frac{e^{sim(t_i,t_i^+)/\tau}}{e^{sim(t_i,t_i^+)/\tau}+\sum_{n}MASK(t_i,t_{i,n}^-)\cdot e^{sim(t_i,t_{i,n}^-)/\tau}+\sum_{j\neq i}MASK(t_i,t_j)\cdot e^{sim(t_i,t_j)/\tau}+\sum_{j\neq i}\sum_{n}MASK(t_i,t_{j,n}^-)\cdot e^{sim(t_i,t_{j,n}^-)/\tau}}
60
+ $$
61
+ $$
62
+ \text{where}\quad C_{t_i}=C_{t_i^+}
63
+ $$
64
+ $$
65
+ MASK(t_i, t_j)=
66
+ \begin{cases}
67
+ 0 & \quad \text{if } C_{t_i}=C_{t_j}, \\
68
+ 1 & \quad \text{otherwise}
69
+ \end{cases}
70
+ $$
71
+ Here $C_{t_i}$ denotes the class label of sample $t_i$, and $n$ in the inner sums indexes the negative samples of a single data point.
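The masked classification objective above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: `sim` is taken to be cosine similarity, and the dictionary keys (`anchor`, `label`, `positive`, `negatives`) and the default `tau=0.05` are hypothetical names chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mask(label_a, label_b):
    """MASK = 0 for same-class samples (ignored as negatives), 1 otherwise."""
    return 0.0 if label_a == label_b else 1.0


def cls_loss(samples, tau=0.05):
    """InfoNCE over a batch, with same-class candidates masked out of the negatives.

    Each sample is a dict with keys:
      'anchor'    -- embedding of the text t_i
      'label'     -- its class label C_{t_i}
      'positive'  -- embedding of a same-class text t_i^+
      'negatives' -- list of (embedding, label) pairs
    """
    total = 0.0
    for i, s in enumerate(samples):
        pos = math.exp(cosine(s["anchor"], s["positive"]) / tau)
        denom = pos
        for j, other in enumerate(samples):
            if j != i:
                # Other anchors in the batch serve as in-batch negatives.
                denom += mask(s["label"], other["label"]) * math.exp(
                    cosine(s["anchor"], other["anchor"]) / tau
                )
            # Own negatives (j == i) and negatives of other samples (j != i).
            for vec, label in other["negatives"]:
                denom += mask(s["label"], label) * math.exp(
                    cosine(s["anchor"], vec) / tau
                )
        total += -math.log(pos / denom)
    return total / len(samples)
```

Without the mask, two anchors of the same class would be pushed apart as if they were negatives, which works against the classification signal.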
72
+ ### Feature Enhancement Data Synthesis Technology
73
+ Building on the strong language and writing capabilities of LLMs, we use LLM APIs to synthesize training data. To address limited data and narrow topic and feature coverage in existing training sets, we propose two synthesis techniques: rewriting and expansion. To make negative examples harder during training, we also designed an LLM-based hard-negative synthesis technique and combine it with hard-negative sampling from an existing strong retriever. These techniques are illustrated below:
74
+ <div align="center"><img src="assets/image-9.png" width="930" height="290"></img></div>
75
+ <div align="center"><img src="assets/image-10.png" width="880" height="220"></img></div>
76
+ <div align="center"><img src="assets/image-11.png" width="880" height="210"></img></div>
77
+
78
+ For more details, including how to reproduce the evaluation results, the instruction content, and how instructions are added, please refer to our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a> repo.
79
+
80
+ ## Evaluation Results
81
+ ### mteb details
82
+ <div align="center"><img src="assets/image-7.png" width="1100" height="260"></img></div>
83
+
84
+ ### cmteb details
85
+ <div align="center"><img src="assets/image-8.png" width="1000" height="260"></img></div>
86
+
87
+ ## Usage
88
+ ### Completely reproduce the benchmark results
89
+ We provide the exact parameters and environment configuration, including dependency versions and model arguments, so that you can reproduce the mteb leaderboard results on your own machine.
90
+ #### Requirements
91
+ - Python: 3.10.12
92
+ - Sentence Transformers: 3.4.1
93
+ - Transformers: 4.51.1
94
+ - PyTorch: 2.7.1
95
+ - Accelerate: 1.3.0
96
+ - Datasets: 3.2.0
97
+ - Tokenizers: 0.21.2
98
+ #### Transformers model load arguments
99
+ torch_dtype=torch.bfloat16<br>
100
+ attn_implementation='sdpa'<br>
101
+ **NOTE:** The leaderboard results were obtained with sdpa. Other attention implementations ('eager', 'flash_attention_2') may produce slightly different numbers, but overall performance remains consistent.
102
+ #### Instruction Adding Rules
103
+ Details can be found on our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.
104
+ #### Evaluation code usage
105
+ Our benchmark evaluation code is available on <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>. The mteb benchmark script is **run_mteb_all_v2.py**, and the cmteb benchmark script is **run_cmteb_all.py**. Run the following commands:
106
+ ```
107
+ POOLING_MODE=mean
108
+ normalize=true
109
+ use_instruction=true
110
+ export TOKENIZERS_PARALLELISM=true
111
+
112
+ model_name_or_path=<model dir>
113
+
114
+ python3 ./run_cmteb_all.py \
115
+ --model_name_or_path ${model_name_or_path} \
116
+ --pooling_mode ${POOLING_MODE} \
117
+ --normalize ${normalize} \
118
+ --use_instruction ${use_instruction} \
119
+ --output_dir <output dir>
120
+
121
+ python3 ./run_mteb_all_v2.py \
122
+ --model_name_or_path ${model_name_or_path} \
123
+ --pooling_mode ${POOLING_MODE} \
124
+ --normalize ${normalize} \
125
+ --use_instruction ${use_instruction} \
126
+ --output_dir <output dir>
127
+ ```
128
+ Replace each `<...>` placeholder with your actual setting.<br>
129
+ This is a general-purpose script that can also evaluate other Hugging Face embedding models, provided the pooling mode and other settings are configured correctly for each model.
130
+
131
+ ### Sentence-transformers
132
+
133
+ ```
134
+ from sentence_transformers import SentenceTransformer
135
+
136
+ # Simplest load with default settings: model = SentenceTransformer("QZhou-Embedding")
137
+
138
+ model = SentenceTransformer(
139
+ "QZhou-Embedding",
140
+ model_kwargs={"device_map": "auto", "trust_remote_code": True},
141
+ tokenizer_kwargs={"padding_side": "left", "trust_remote_code": True},
142
+ trust_remote_code=True
143
+ )
144
+
145
+ queries = [
146
+ "What is photosynthesis?",
147
+ "Who invented the telephone?",
148
+ ]
149
+ documents = [
150
+ "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
151
+ "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device."
152
+ ]
153
+
154
+ query_embeddings = model.encode(queries, prompt_name="query", normalize_embeddings=True)
155
+ document_embeddings = model.encode(documents, normalize_embeddings=True)
156
+
157
+ similarity = model.similarity(query_embeddings, document_embeddings)
158
+ ```
159
+
160
+ ### Huggingface Transformers
161
+
162
+ ```
163
+ import torch
164
+ import torch.nn.functional as F
165
+
166
+ from torch import Tensor
167
+ from transformers import AutoTokenizer, AutoModel
168
+
169
+
170
+ def last_token_pool(last_hidden_states: Tensor,
171
+                     attention_mask: Tensor) -> Tensor:
172
+     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])  # True when every sequence ends in a real token
173
+     if left_padding:
174
+         return last_hidden_states[:, -1]
175
+     else:
176
+         sequence_lengths = attention_mask.sum(dim=1) - 1  # index of the last real token per sequence
177
+         batch_size = last_hidden_states.shape[0]
178
+         return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
179
+
180
+
181
+ def get_detailed_instruct(task_description: str, query: str) -> str:
182
+     return f'Instruct: {task_description}\nQuery:{query}'
183
+
184
+ task = 'Given a web search query, retrieve relevant passages that answer the query'
185
+
186
+ queries = [
187
+ get_detailed_instruct(task, 'What is photosynthesis?'),
188
+ get_detailed_instruct(task, 'Who invented the telephone?')
189
+ ]
190
+
191
+ documents = [
192
+ "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
193
+ "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device."
194
+ ]
195
+
196
+ input_texts = queries + documents
197
+
198
+ tokenizer = AutoTokenizer.from_pretrained('QZhou-Embedding', padding_side='left', trust_remote_code=True)
199
+ model = AutoModel.from_pretrained('QZhou-Embedding', trust_remote_code=True, device_map='auto')
200
+
201
+ batch_dict = tokenizer(
202
+ input_texts,
203
+ padding=True,
204
+ truncation=True,
205
+ max_length=8192,
206
+ return_tensors="pt",
207
+ )
208
+ batch_dict.to(model.device)
209
+ outputs = model(**batch_dict)
210
+ embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
211
+
212
+ embeddings = F.normalize(embeddings, p=2, dim=1)
213
+ scores = (embeddings[:2] @ embeddings[2:].T)
214
+ ```
README_zh.md ADDED
@@ -0,0 +1,215 @@
1
+ ---
2
+ license: apache-2.0
3
+ tags:
4
+ - sentence-transformers
5
+ - sentence-similarity
6
+ - mteb
7
+ - retriever
8
+ - text-embeddings-inference
9
+ ---
10
+ # QZhou-Embedding
11
+ <div align="center">
12
+ <img src="assets/image-1.png" width="800" height="300"></img>
13
+ </div>
14
+
15
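+ ## Introduction
16
+ We have released <a href="https://huggingface.co/Kingsoft-LLM/QZhou-Embedding">QZhou-Embedding</a> ("Qingzhou Embedding" 😈😈😈), a general-purpose text embedding model that excels at a range of embedding tasks: retrieval, reranking, sentence-pair similarity, and classification. Thanks to the general language capabilities that its base model acquired from pre-training on massive amounts of text, QZhou-Embedding produces stronger text embedding representations. It is further trained on million-scale high-quality open-source retrieval data and over 5 million high-quality synthetic samples (produced with two synthesis techniques: rewriting and expansion). A first stage of retrieval training gives the model a foundation in query-doc semantic matching, and a second stage of multi-dimensional training, covering STS, clustering, and other tasks, improves performance across scenarios. QZhou-Embedding has 7B parameters and can embed long texts of up to 8k tokens. It achieved the highest average score on the mteb/cmteb evaluation benchmarks, as well as the highest average scores on the clustering, sentence-pair classification, reranking, and STS tasks.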
17
+
18
+ ## Basic Features of QZhou-Embedding
19
+
20
+ - Powerful text embedding capabilities;
21
+ - Long context: up to 8k;
22
+ - 7B parameters
23
+
24
+
25
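+ ## Technical Introduction
26
+ ### Unified Task Modeling Framework
27
+ We cast text embedding objectives as three modeling and optimization problems, and propose a unified scheme for structuring training data together with a matching training mechanism. With this approach, most open-source data can be used as retrieval training sets. Data can be structured as follows:
28
+ - Retrieval
29
+     - title-body
30
+     - title-abstract
31
+     - Question answering data
32
+     - Reading comprehension
33
+     - ...
34
+
35
+ - STS
36
+     - text pair + label in {true, false} or {yes, no}
37
+     - text pair + score (such as 0.2, 3.1, 4.8, etc.)
38
+     - NLI data: text pair + label in {'entailment', 'neutral', 'contradiction'}
39
+
40
+ - CLS
41
+     - sentence + class label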
42
+
43
+ <div align="center"><img src="assets/image-18.png" width="1000" height="600"></img></div>
44
+ <div align="center"><img src="assets/image-16.png" width="1000" height="550"></img></div>
45
+
46
+ ### Training Objectives
47
+
48
+ - Retrieval: Apply the InfoNCE contrastive loss and, following gte/qwen3-embedding, add a penalty for query-query negatives.<br>
49
+ $$
50
+ L_{ret}=-\frac{1}{n}\sum_{i}\log\frac{e^{sim(q_i,d_i^+)/\tau}}{e^{sim(q_i,d_i^+)/\tau}+\sum_{j}e^{sim(q_i,d_j^-)/\tau}+\sum_{j\neq i}e^{sim(q_i,q_j)/\tau}}
51
+ $$
52
+
53
+ - STS: Apply the CoSENT loss:
54
+ $$
55
+ L_{cosent}=\log\bigg(1+\sum_{sim(i,j)>sim(k,l)}\exp\Big(\frac{sim(x_k,x_l)-sim(x_i,x_j)}{\tau}\Big)\bigg)
56
+ $$
57
+
58
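+ - CLS: Apply the same InfoNCE loss as for retrieval. Because in-batch negatives are very likely to include samples of the same class as the anchor, a mask removes same-class samples from the negatives shared across samples.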
59
+ $$
60
+ L_{cls}=-\frac{1}{n}\sum_{i}\log\frac{e^{sim(t_i,t_i^+)/\tau}}{e^{sim(t_i,t_i^+)/\tau}+\sum_{n}MASK(t_i,t_{i,n}^-)\cdot e^{sim(t_i,t_{i,n}^-)/\tau}+\sum_{j\neq i}MASK(t_i,t_j)\cdot e^{sim(t_i,t_j)/\tau}+\sum_{j\neq i}\sum_{n}MASK(t_i,t_{j,n}^-)\cdot e^{sim(t_i,t_{j,n}^-)/\tau}}
61
+ $$
62
+ $$
63
+ \text{where}\quad C_{t_i}=C_{t_i^+}
64
+ $$
65
+ $$
66
+ MASK(t_i, t_j)=
67
+ \begin{cases}
68
+ 0 & \quad \text{if } C_{t_i}=C_{t_j}, \\
69
+ 1 & \quad \text{otherwise}
70
+ \end{cases}
71
+ $$
72
+ Here $C_{t_i}$ denotes the class label of sample $t_i$, and $n$ in the inner sums indexes the negative samples of a single data point.
73
+
74
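+ ### Feature Enhancement Data Synthesis Technology
75
+ Building on the strong language and writing capabilities of today's LLMs, we use LLM APIs to design data synthesis techniques. To address limited data and narrow topic coverage in training sets, we propose rewriting and expansion synthesis techniques. To make negative examples harder during training, we also use LLM-based hard-negative synthesis on top of existing hard-negative sampling with a strong embedding model. These techniques are illustrated below: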
76
+ <div align="center"><img src="assets/image-9.png" width="930" height="290"></img></div>
77
+ <div align="center"><img src="assets/image-10.png" width="880" height="220"></img></div>
78
+ <div align="center"><img src="assets/image-11.png" width="880" height="210"></img></div>
79
+
80
+ For more information (such as evaluation scripts and the instruction format), please visit our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a> repo.
81
+
82
+ ## Evaluation Results
83
+ ### mteb details
84
+ <div align="center"><img src="assets/image-7.png" width="1100" height="260"></img></div>
85
+
86
+ ### cmteb details
87
+ <div align="center"><img src="assets/image-8.png" width="1000" height="260"></img></div>
88
+
89
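+ ## Usage
90
+ ### Completely reproduce the benchmark results
91
+ We provide the exact parameters and environment configuration, including dependency versions and model arguments, so that you can reproduce the leaderboard results on your own machine.
92
+ #### Requirements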
93
+ - Python: 3.10.12
94
+ - Sentence Transformers: 3.4.1
95
+ - Transformers: 4.51.1
96
+ - PyTorch: 2.7.1
97
+ - Accelerate: 1.3.0
98
+ - Datasets: 3.2.0
99
+ - Tokenizers: 0.21.2
100
+ #### Model load arguments
101
+ torch_dtype=torch.bfloat16<br>
102
+ attn_implementation='sdpa'<br>
103
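+ **NOTE:** The leaderboard results were obtained with sdpa. Other modes ('eager', 'flash_attention_2') produce slightly different numbers, but overall performance is not affected.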
104
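+ #### Instruction Adding Rules
105
+ Details can be found on our <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>.
106
+ #### Evaluation code usage
107
+ Our evaluation code is available on <a href="https://github.com/Kingsoft-LLM/QZhou-Embedding">GitHub</a>. The mteb evaluation script is **run_mteb_all_v2.py**, and the cmteb evaluation script is **run_cmteb_all.py**. Run the following commands: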
108
+ ```
109
+ POOLING_MODE=mean
110
+ normalize=true
111
+ use_instruction=true
112
+ export TOKENIZERS_PARALLELISM=true
113
+
114
+ model_name_or_path=<model dir>
115
+
116
+ python3 ./run_cmteb_all.py \
117
+ --model_name_or_path ${model_name_or_path} \
118
+ --pooling_mode ${POOLING_MODE} \
119
+ --normalize ${normalize} \
120
+ --use_instruction ${use_instruction} \
121
+ --output_dir <output dir>
122
+
123
+ python3 ./run_mteb_all_v2.py \
124
+ --model_name_or_path ${model_name_or_path} \
125
+ --pooling_mode ${POOLING_MODE} \
126
+ --normalize ${normalize} \
127
+ --use_instruction ${use_instruction} \
128
+ --output_dir <output dir>
129
+ ```
130
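+ This is a general-purpose script that can also evaluate other Hugging Face embedding models, provided the pooling mode and other settings are configured correctly.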
131
+
132
+ ### Sentence Transformers
133
+
134
+ ```
135
+ from sentence_transformers import SentenceTransformer
136
+
137
+ # Simplest load with default settings: model = SentenceTransformer("QZhou-Embedding")
138
+
139
+ model = SentenceTransformer(
140
+ "QZhou-Embedding",
141
+ model_kwargs={"device_map": "auto", "trust_remote_code": True},
142
+ tokenizer_kwargs={"padding_side": "left", "trust_remote_code": True},
143
+ trust_remote_code=True
144
+ )
145
+
146
+ queries = [
147
+ "What is photosynthesis?",
148
+ "Who invented the telephone?",
149
+ ]
150
+ documents = [
151
+ "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
152
+ "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device."
153
+ ]
154
+
155
+ query_embeddings = model.encode(queries, prompt_name="query", normalize_embeddings=True)
156
+ document_embeddings = model.encode(documents, normalize_embeddings=True)
157
+
158
+ similarity = model.similarity(query_embeddings, document_embeddings)
159
+ ```
160
+
161
+ ### Huggingface Transformers
162
+
163
+ ```
164
+ import torch
165
+ import torch.nn.functional as F
166
+
167
+ from torch import Tensor
168
+ from transformers import AutoTokenizer, AutoModel
169
+
170
+
171
+ def last_token_pool(last_hidden_states: Tensor,
172
+                     attention_mask: Tensor) -> Tensor:
173
+     left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])  # True when every sequence ends in a real token
174
+     if left_padding:
175
+         return last_hidden_states[:, -1]
176
+     else:
177
+         sequence_lengths = attention_mask.sum(dim=1) - 1  # index of the last real token per sequence
178
+         batch_size = last_hidden_states.shape[0]
179
+         return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
180
+
181
+
182
+ def get_detailed_instruct(task_description: str, query: str) -> str:
183
+     return f'Instruct: {task_description}\nQuery:{query}'
184
+
185
+ task = 'Given a web search query, retrieve relevant passages that answer the query'
186
+
187
+ queries = [
188
+ get_detailed_instruct(task, 'What is photosynthesis?'),
189
+ get_detailed_instruct(task, 'Who invented the telephone?')
190
+ ]
191
+
192
+ documents = [
193
+ "Photosynthesis is the process by which green plants use sunlight, carbon dioxide, and water to produce glucose and oxygen. This biochemical reaction occurs in chloroplasts.",
194
+ "Alexander Graham Bell is credited with inventing the first practical telephone in 1876, receiving US patent number 174,465 for his device."
195
+ ]
196
+
197
+ input_texts = queries + documents
198
+
199
+ tokenizer = AutoTokenizer.from_pretrained('QZhou-Embedding', padding_side='left', trust_remote_code=True)
200
+ model = AutoModel.from_pretrained('QZhou-Embedding', trust_remote_code=True, device_map='auto')
201
+
202
+ batch_dict = tokenizer(
203
+ input_texts,
204
+ padding=True,
205
+ truncation=True,
206
+ max_length=8192,
207
+ return_tensors="pt",
208
+ )
209
+ batch_dict.to(model.device)
210
+ outputs = model(**batch_dict)
211
+ embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
212
+
213
+ embeddings = F.normalize(embeddings, p=2, dim=1)
214
+ scores = (embeddings[:2] @ embeddings[2:].T)
215
+ ```
assets/image-1.png ADDED

Git LFS Details

  • SHA256: 5176a7a58a1d0cf04d6cd81c58129a1417f81f6a4ab8a25481b5d8de2baa5da6
  • Pointer size: 131 Bytes
  • Size of remote file: 704 kB
assets/image-10.png ADDED

Git LFS Details

  • SHA256: 6e73984905fa9f64e512b0bf73f2fafeaeca2314a4e0964aee00960328b78df4
  • Pointer size: 131 Bytes
  • Size of remote file: 167 kB
assets/image-11.png ADDED

Git LFS Details

  • SHA256: 3773c291e1f56cc0fa0f0a74bec06dacd36b993a3c2085b5f9a367847f3ae90f
  • Pointer size: 131 Bytes
  • Size of remote file: 142 kB
assets/image-16.png ADDED

Git LFS Details

  • SHA256: c2c9e9dc7dd496eb41a796b453ee77a6760778c52a70f091afecae4808e05c5c
  • Pointer size: 131 Bytes
  • Size of remote file: 190 kB
assets/image-18.png ADDED

Git LFS Details

  • SHA256: 7dcfdd868f2c5640d91dfde973a413f5ba220e89d59528570836f844317314fc
  • Pointer size: 131 Bytes
  • Size of remote file: 159 kB
assets/image-7.png ADDED
assets/image-8.png ADDED
assets/image-9.png ADDED

Git LFS Details

  • SHA256: 02be8d68c2d949b09958fb3aff46db77141422ca2ca51ac591c480d46df1cdce
  • Pointer size: 131 Bytes
  • Size of remote file: 191 kB
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:dae144628ac262a2e3e585e2a04a08f285efa689f41c1a71432cec49db080263
3
+ size 4877660152
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:890cf237c6407600e7b86ef0a4645614ccebbff609b338e268379e662e18f039
3
+ size 4932750280
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3184d6e94fe1e62efb858cfb4336a7ccfb6b2e2620333bda0c6ac62123c3726f
3
+ size 4330864528