IIIllll committed · verified
Commit cef5671 · 1 Parent(s): af46575

Update README.md

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -5,7 +5,7 @@ license: cc-by-nc-sa-4.0
 
 ## Latest Update
 
- + July 15, 2025: Released the Prompt Injection Attack Detection Module of ChangeWay Guardrails.
+ + July 27, 2025: Released the Prompt Injection Attack Detection Module of ChangeWay Guardrails.
 
 ## 1.Overview
 
@@ -141,7 +141,7 @@ Example:
 
 ### 2.3 Generation Algorithms for Prompt Injection Attacks
 
- In terms of technical approaches to induce AI systems into generating harmful content, researchers have proposed a variety of prompt injection generation algorithms, which continue to evolve rapidly. Notable influential algorithms include: Rewrite Attack[<sup>[Andriushchenko2024]</sup>](#Andriushchenko2024a)、PAIR[<sup>[Chao2025]</sup>](#Chao2025)、GCG[<sup>[Zou2023]</sup>](#Zou2023)、AutoDAN[<sup>[Liu2024]</sup>](#Liu2024)、TAP[<sup>[Mehrotra2024]</sup>](#Mehrotra2024)、Overload Attack[<sup>[Dong2024]</sup>](#Dong2024)、ArtPropmt[<sup>[Jiang2024]</sup>](#Jiang2024)、DeepInception[<sup>[Li2023]</sup>](#Li2023)、GPT4-Cipher[<sup>[Yuan2025]</sup>](#Yuan2025)、SCAV[<sup>[Xu2024]</sup>](#Xu2024)、RandomSearch[<sup>[Andriushchenko2024]</sup>](#Andriushchenko2024b)、ICA[<sup>[Wei2023]</sup>](#Wei2023)、Cold Attack[<sup>[Guo2024]</sup>](#Guo2024)、GPTFuzzer[<sup>[Yu2023]</sup>](#Yu2023)、ReNeLLM[<sup>[Ding2023]</sup>](#Ding2023), among others.
+ In terms of technical approaches to induce AI systems into generating harmful content, researchers have proposed a variety of prompt injection generation algorithms, which continue to evolve rapidly. Notable influential algorithms include: Rewrite Attack[<sup>[Andriushchenko2024]</sup>](#Andriushchenko2024a)、PAIR[<sup>[Chao2025]</sup>](#Chao2025)、GCG[<sup>[Zou2023]</sup>](#Zou2023)、AutoDAN[<sup>[Liu2024]</sup>](#Liu2024)、TAP[<sup>[Mehrotra2024]</sup>](#Mehrotra2024)、Overload Attack[<sup>[Dong2024]</sup>](#Dong2024)、ArtPrompt[<sup>[Jiang2024]</sup>](#Jiang2024)、DeepInception[<sup>[Li2023]</sup>](#Li2023)、GPT4-Cipher[<sup>[Yuan2025]</sup>](#Yuan2025)、SCAV[<sup>[Xu2024]</sup>](#Xu2024)、RandomSearch[<sup>[Andriushchenko2024]</sup>](#Andriushchenko2024b)、ICA[<sup>[Wei2023]</sup>](#Wei2023)、Cold Attack[<sup>[Guo2024]</sup>](#Guo2024)、GPTFuzzer[<sup>[Yu2023]</sup>](#Yu2023)、ReNeLLM[<sup>[Ding2023]</sup>](#Ding2023), among others.
 
 ## 3. Dataset Construction
 
@@ -159,7 +159,7 @@ Currently, the dataset contains over 53,000 samples, including more than 17,000
 
 #### 3.1.1 Distribution of Harmful Samples
 
- To address security risks related to content compliance, discrimination, and other categories, we comprehensively employed multiple advanced attack sample generation algorithms, including TAP[<sup>[Mehrotra2024]</sup>](#Mehrotra2024)、AutoDAN[<sup>[Liu2024]</sup>](#Liu2024)、GPTFuzz[<sup>[Yu2023]</sup>](#Yu2023)、GCG[<sup>[Zou2023]</sup>](#Zou2023). Through these methods, we systematically constructed high-quality harmful samples spanning different attack categories such as assumption-based, attention diversion, and privilege escalation attacks.
+ To address security risks related to content compliance, discrimination, and other categories, we comprehensively employed multiple advanced attack sample generation algorithms, including TAP[<sup>[Mehrotra2024]</sup>](#Mehrotra2024)、AutoDAN[<sup>[Liu2024]</sup>](#Liu2024)、GPTFuzzer[<sup>[Yu2023]</sup>](#Yu2023)、GCG[<sup>[Zou2023]</sup>](#Zou2023). Through these methods, we systematically constructed high-quality harmful samples spanning different attack categories such as assumption-based, attention diversion, and privilege escalation attacks.
 
 
 
@@ -303,7 +303,7 @@ Detailed Evaluation Results Across Individual Datasets:
 | harmful samples | panda-guard | **0.6592** | 0.4485 | 0.5446 | 0.0162 | 0.4485 |
 | benign samples | awesome-chatgpt-prompts | **0.9951** | 0.9852 | 0.9901 | 0.9901 | 0.9704 |
 | benign samples | StrongReject-Benign | 0.6455 | 0.6911 | 0.5982 | **0.8721** | 0.7408 |
- | benign samples | COIG-COIA | 0.9967 | 0.9996 | 0.8888 | **1.0** | 0.9999 |
+ | benign samples | COIG-CQIA | 0.9967 | 0.9996 | 0.8888 | **1.0** | 0.9999 |
 
 The above evaluation indicators are accuracy
 
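Note: the rows in the hunk above report per-dataset accuracy, i.e. the fraction of a benchmark's prompts that a detector labels correctly (harmful benchmarks are scored against the "harmful" label, benign benchmarks against "benign"). A minimal sketch of that computation is shown below; the `classify` callable, label strings, and loader names are illustrative assumptions, not the actual ChangeWay Guardrails API.

```python
# Illustrative sketch: per-dataset accuracy for a binary prompt-injection detector.
from typing import Callable, Iterable

def dataset_accuracy(prompts: Iterable[str],
                     expected_label: str,
                     classify: Callable[[str], str]) -> float:
    """Fraction of prompts whose predicted label matches the dataset's
    ground-truth label ("harmful" for attack sets, "benign" otherwise)."""
    prompts = list(prompts)
    hits = sum(1 for p in prompts if classify(p) == expected_label)
    return hits / len(prompts)

# Hypothetical usage (names assumed for illustration):
# acc = dataset_accuracy(load_prompts("COIG-CQIA"), "benign", guardrail.classify)
```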
@@ -412,7 +412,7 @@ We welcome feedback and contributions from the community!
 <a name="Dong2024"></a>
 [[Dong2024](https://arxiv.org/abs/2410.04190)] Dong, Yiting, Guobin Shen, Dongcheng Zhao, Xiang He, and Yi Zeng. "Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models." arXiv preprint arXiv:2410.04190 (2024).
 
- * ArtPropmt
+ * ArtPrompt
 
 <a name="Jiang2024"></a>
 [[Jiang2024](https://aclanthology.org/2024.acl-long.809/)] Jiang, Fengqing, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. "Artprompt: Ascii art-based jailbreak attacks against aligned llms." In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15157-15173. 2024.
@@ -595,7 +595,7 @@ We welcome feedback and contributions from the community!
 <a name="He2023"></a>
 [[He2023](https://arxiv.org/abs/2111.09543)] He, Pengcheng, Jianfeng Gao, and Weizhu Chen. "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing." In The Eleventh International Conference on Learning Representations.
 
- * Transfomer
+ * Transformer
 
 <a name="Vaswani2017"></a>
 [[Vaswani2017](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
 