Update README.md
README.md CHANGED
@@ -5,7 +5,7 @@ license: cc-by-nc-sa-4.0

## Latest Update

-
+ July 27, 2025: Released the Prompt Injection Attack Detection Module of ChangeWay Guardrails.

## 1. Overview

@@ -141,7 +141,7 @@ Example:

### 2.3 Generation Algorithms for Prompt Injection Attacks

-
+ As technical means of inducing AI systems to generate harmful content, researchers have proposed a wide variety of prompt injection generation algorithms, and these continue to evolve rapidly. Notable influential algorithms include Rewrite Attack[<sup>[Andriushchenko2024]</sup>](#Andriushchenko2024a), PAIR[<sup>[Chao2025]</sup>](#Chao2025), GCG[<sup>[Zou2023]</sup>](#Zou2023), AutoDAN[<sup>[Liu2024]</sup>](#Liu2024), TAP[<sup>[Mehrotra2024]</sup>](#Mehrotra2024), Overload Attack[<sup>[Dong2024]</sup>](#Dong2024), ArtPrompt[<sup>[Jiang2024]</sup>](#Jiang2024), DeepInception[<sup>[Li2023]</sup>](#Li2023), GPT4-Cipher[<sup>[Yuan2025]</sup>](#Yuan2025), SCAV[<sup>[Xu2024]</sup>](#Xu2024), RandomSearch[<sup>[Andriushchenko2024]</sup>](#Andriushchenko2024b), ICA[<sup>[Wei2023]</sup>](#Wei2023), Cold Attack[<sup>[Guo2024]</sup>](#Guo2024), GPTFuzzer[<sup>[Yu2023]</sup>](#Yu2023), and ReNeLLM[<sup>[Ding2023]</sup>](#Ding2023), among others.

## 3. Dataset Construction

@@ -159,7 +159,7 @@ Currently, the dataset contains over 53,000 samples, including more than 17,000

#### 3.1.1 Distribution of Harmful Samples

-
+ To address security risks related to content compliance, discrimination, and other categories, we employed multiple advanced attack-sample generation algorithms, including TAP[<sup>[Mehrotra2024]</sup>](#Mehrotra2024), AutoDAN[<sup>[Liu2024]</sup>](#Liu2024), GPTFuzzer[<sup>[Yu2023]</sup>](#Yu2023), and GCG[<sup>[Zou2023]</sup>](#Zou2023). With these methods, we systematically constructed high-quality harmful samples spanning attack categories such as assumption-based, attention-diversion, and privilege-escalation attacks.

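The pooling step described in this hunk (merging the outputs of several generation algorithms into one labeled sample set) can be sketched as follows. This is a minimal illustration only: the record fields (`prompt`, `algorithm`, `category`, `risk_type`), the `pool_samples` helper, and the toy inputs are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for one generated attack sample; the real
# dataset's schema is not specified in this README.
@dataclass
class AttackSample:
    prompt: str     # the generated adversarial prompt text
    algorithm: str  # generator used, e.g. "TAP", "AutoDAN", "GPTFuzzer", "GCG"
    category: str   # e.g. "assumption-based", "attention diversion"
    risk_type: str  # e.g. "content compliance", "discrimination"

def pool_samples(generated: dict[str, list[str]],
                 category: str, risk_type: str) -> list[AttackSample]:
    """Merge per-algorithm outputs into one labeled sample pool."""
    pool = []
    for algorithm, prompts in generated.items():
        for prompt in prompts:
            pool.append(AttackSample(prompt, algorithm, category, risk_type))
    return pool

# Toy inputs standing in for real generator outputs.
samples = pool_samples(
    {"TAP": ["p1", "p2"], "GCG": ["p3"]},
    category="privilege escalation",
    risk_type="content compliance",
)
print(len(samples))  # prints 3
```

Keeping the generating algorithm and risk category on every record is what makes the per-category distribution reported in this section straightforward to compute later.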
@@ -303,7 +303,7 @@ Detailed Evaluation Results Across Individual Datasets:

| harmful samples | panda-guard | **0.6592** | 0.4485 | 0.5446 | 0.0162 | 0.4485 |
| benign samples | awesome-chatgpt-prompts | **0.9951** | 0.9852 | 0.9901 | 0.9901 | 0.9704 |
| benign samples | StrongReject-Benign | 0.6455 | 0.6911 | 0.5982 | **0.8721** | 0.7408 |
-
+ | benign samples | COIG-CQIA | 0.9967 | 0.9996 | 0.8888 | **1.0** | 0.9999 |

The evaluation indicator reported above is accuracy.

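Since the table reports accuracy per dataset, the computation can be sketched as below: on a harmful-sample set a verdict is correct when the detector flags the prompt, and on a benign set when it does not. The function name and the toy verdict lists are illustrative assumptions, not the project's actual evaluation code.

```python
def accuracy(predictions: list[bool], expected: bool) -> float:
    """Fraction of detector verdicts matching the expected label.

    For harmful-sample sets the expected verdict is True (attack detected);
    for benign-sample sets it is False (no attack).
    """
    correct = sum(1 for p in predictions if p == expected)
    return correct / len(predictions)

# Toy detector outputs (True = flagged as prompt injection).
harmful_verdicts = [True, True, False, True]          # on a harmful set
benign_verdicts = [False, False, True, False, False]  # on a benign set

print(f"{accuracy(harmful_verdicts, True):.4f}")  # prints 0.7500
print(f"{accuracy(benign_verdicts, False):.4f}")  # prints 0.8000
```

Reporting the two sample types separately, as the table does, matters because a detector can trade one off against the other: flagging everything scores 1.0 on harmful sets and 0.0 on benign ones.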
@@ -412,7 +412,7 @@ We welcome feedback and contributions from the community!

<a name="Dong2024"></a>
[[Dong2024](https://arxiv.org/abs/2410.04190)] Dong, Yiting, Guobin Shen, Dongcheng Zhao, Xiang He, and Yi Zeng. "Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models." arXiv preprint arXiv:2410.04190 (2024).

-
+ * ArtPrompt

<a name="Jiang2024"></a>
[[Jiang2024](https://aclanthology.org/2024.acl-long.809/)] Jiang, Fengqing, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs." In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15157-15173. 2024.

@@ -595,7 +595,7 @@ We welcome feedback and contributions from the community!

<a name="He2023"></a>
[[He2023](https://arxiv.org/abs/2111.09543)] He, Pengcheng, Jianfeng Gao, and Weizhu Chen. "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing." In The Eleventh International Conference on Learning Representations. 2023.

-
+ * Transformer

<a name="Vaswani2017"></a>
[[Vaswani2017](https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html)] Vaswani, Ashish, et al. "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (2017).
|