Update README.md
README.md CHANGED
@@ -5,7 +5,7 @@ license: cc-by-nc-sa-4.0

## Latest Update

-
+ July 27, 2025: Released the Prompt Injection Attack Detection Module of ChangeWay Guardrails.

## 1. Overview

@@ -141,7 +141,7 @@ Example:

### 2.3 Generation Algorithms for Prompt Injection Attacks

-
+ As technical means of inducing AI systems to generate harmful content, researchers have proposed a wide variety of prompt injection generation algorithms, and these continue to evolve rapidly. Notable influential algorithms include Rewrite Attack[<sup>[Andriushchenko2024]</sup>](#Andriushchenko2024a), PAIR[<sup>[Chao2025]</sup>](#Chao2025), GCG[<sup>[Zou2023]</sup>](#Zou2023), AutoDAN[<sup>[Liu2024]</sup>](#Liu2024), TAP[<sup>[Mehrotra2024]</sup>](#Mehrotra2024), Overload Attack[<sup>[Dong2024]</sup>](#Dong2024), ArtPrompt[<sup>[Jiang2024]</sup>](#Jiang2024), DeepInception[<sup>[Li2023]</sup>](#Li2023), GPT4-Cipher[<sup>[Yuan2025]</sup>](#Yuan2025), SCAV[<sup>[Xu2024]</sup>](#Xu2024), RandomSearch[<sup>[Andriushchenko2024]</sup>](#Andriushchenko2024b), ICA[<sup>[Wei2023]</sup>](#Wei2023), Cold Attack[<sup>[Guo2024]</sup>](#Guo2024), GPTFuzzer[<sup>[Yu2023]</sup>](#Yu2023), and ReNeLLM[<sup>[Ding2023]</sup>](#Ding2023), among others.

## 3. Dataset Construction

@@ -159,7 +159,7 @@ Currently, the dataset contains over 53,000 samples, including more than 17,000

#### 3.1.1 Distribution of Harmful Samples

-
+ To address security risks related to content compliance, discrimination, and other categories, we employed multiple advanced attack-sample generation algorithms, including TAP[<sup>[Mehrotra2024]</sup>](#Mehrotra2024), AutoDAN[<sup>[Liu2024]</sup>](#Liu2024), GPTFuzzer[<sup>[Yu2023]</sup>](#Yu2023), and GCG[<sup>[Zou2023]</sup>](#Zou2023). With these methods, we systematically constructed high-quality harmful samples spanning attack categories such as assumption-based, attention-diversion, and privilege-escalation attacks.

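The pooling step described in this hunk (merging the outputs of several generation algorithms into one labeled sample set) can be sketched as follows. This is a minimal illustration only: the record fields (`prompt`, `algorithm`, `category`, `risk_type`), the `pool_samples` helper, and the toy inputs are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one generated attack sample; the real
# dataset's schema is not specified in this README.
@dataclass
class AttackSample:
    prompt: str     # the generated adversarial prompt text
    algorithm: str  # generator used, e.g. "TAP", "AutoDAN", "GPTFuzzer", "GCG"
    category: str   # e.g. "assumption-based", "attention diversion"
    risk_type: str  # e.g. "content compliance", "discrimination"

def pool_samples(generated: dict[str, list[str]],
                 category: str, risk_type: str) -> list[AttackSample]:
    """Merge per-algorithm outputs into one labeled sample pool."""
    pool = []
    for algorithm, prompts in generated.items():
        for prompt in prompts:
            pool.append(AttackSample(prompt, algorithm, category, risk_type))
    return pool

# Toy inputs standing in for real generator outputs.
samples = pool_samples(
    {"TAP": ["p1", "p2"], "GCG": ["p3"]},
    category="privilege escalation",
    risk_type="content compliance",
)
print(len(samples))  # prints 3
```

Keeping the generating algorithm and risk category on every record is what makes the per-category distribution reported in this section straightforward to compute later.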
@@ -303,7 +303,7 @@ Detailed Evaluation Results Across Individual Datasets:

| harmful samples | panda-guard | **0.6592** | 0.4485 | 0.5446 | 0.0162 | 0.4485 |
| benign samples | awesome-chatgpt-prompts | **0.9951** | 0.9852 | 0.9901 | 0.9901 | 0.9704 |
| benign samples | StrongReject-Benign | 0.6455 | 0.6911 | 0.5982 | **0.8721** | 0.7408 |
-
+ | benign samples | COIG-CQIA | 0.9967 | 0.9996 | 0.8888 | **1.0** | 0.9999 |

The evaluation indicator reported above is accuracy.

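Since the table reports accuracy per dataset, the computation can be sketched as below: on a harmful-sample set a verdict is correct when the detector flags the prompt, and on a benign set when it does not. The function name and the toy verdict lists are illustrative assumptions, not the project's actual evaluation code.

```python
def accuracy(predictions: list[bool], expected: bool) -> float:
    """Fraction of detector verdicts matching the expected label.

    For harmful-sample sets the expected verdict is True (attack detected);
    for benign-sample sets it is False (no attack).
    """
    correct = sum(1 for p in predictions if p == expected)
    return correct / len(predictions)

# Toy detector outputs (True = flagged as prompt injection).
harmful_verdicts = [True, True, False, True]          # on a harmful set
benign_verdicts = [False, False, True, False, False]  # on a benign set

print(f"{accuracy(harmful_verdicts, True):.4f}")  # prints 0.7500
print(f"{accuracy(benign_verdicts, False):.4f}")  # prints 0.8000
```

Reporting the two sample types separately, as the table does, matters because a detector can trade one off against the other: flagging everything scores 1.0 on harmful sets and 0.0 on benign ones.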
@@ -412,7 +412,7 @@ We welcome feedback and contributions from the community!

<a name="Dong2024"></a>
[[Dong2024](https://arxiv.org/abs/2410.04190)] Dong, Yiting, Guobin Shen, Dongcheng Zhao, Xiang He, and Yi Zeng. "Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models." arXiv preprint arXiv:2410.04190 (2024).

-
+ * ArtPrompt

<a name="Jiang2024"></a>
[[Jiang2024](https://aclanthology.org/2024.acl-long.809/)] Jiang, Fengqing, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs." In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15157-15173. 2024.

@@ -595,7 +595,7 @@ We welcome feedback and contributions from the community!

<a name="He2023"></a>
[[He2023](https://arxiv.org/abs/2111.09543)] He, Pengcheng, Jianfeng Gao, and Weizhu Chen. "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing." In The Eleventh International Conference on Learning Representations. 2023.

-
+ * Transformer

<a name="Vaswani2017"></a>
[[Vaswani2017](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)] Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (2017).
|