Update README.md
---
language: he
license: mit
library_name: transformers
tags:
- hebrew
- ner
- pii-detection
- token-classification
- xlm-roberta
- privacy
- data-anonymization
- golemguard
datasets:
- CordwainerSmith/GolemGuard
model-index:
- name: GolemPII-v1
  results:
  - task:
      name: Token Classification
    metrics:
    - value: 0.9982
---

# GolemPII-v1 - Hebrew PII Detection Model

This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data to achieve high accuracy in identifying and classifying various types of PII.

## Model Details
- Based on xlm-roberta-base
- Fine-tuned on the GolemGuard: Hebrew Privacy Information Detection Corpus
- Optimized for token classification tasks in Hebrew text

## Intended Uses & Limitations

This model is intended for:

* **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy (see the masking sketch below).
* **Data Anonymization:** Automating the de-identification of Hebrew documents in legal, medical, and other sensitive contexts.
* **Research:** Supporting research in Hebrew natural language processing and PII detection.
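
The following is a minimal masking sketch built on the standard `transformers` pipeline API. The `mask_pii` helper and the `[LABEL]` placeholder format are illustrative assumptions, not code shipped with the model:

```python
from transformers import pipeline

# Token-classification pipeline; aggregation_strategy="simple" merges
# B-/I- subword predictions into whole entity spans with start/end offsets.
pii_detector = pipeline(
    "token-classification",
    model="CordwainerSmith/GolemPII-v1",
    aggregation_strategy="simple",
)

def mask_pii(text: str) -> str:
    """Replace each detected PII span with its entity label, e.g. [PHONE]."""
    entities = pii_detector(text)
    # Replace from the end of the string so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text
```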

## Training Parameters

* **Batch Size:** 32
* **Learning Rate:** 2e-5 with linear warmup and decay
* **Optimizer:** AdamW
* **Hardware:** Trained on a single NVIDIA A100 GPU
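
Expressed as Hugging Face `TrainingArguments`, these settings look roughly like the sketch below. The batch size, learning rate, scheduler, optimizer, and epoch count follow the card (the results table ends at epoch 5); the warmup ratio and output path are placeholders:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="golempii-v1",        # placeholder path
    per_device_train_batch_size=32,  # Batch Size: 32
    learning_rate=2e-5,              # Learning Rate: 2e-5
    lr_scheduler_type="linear",      # linear warmup and decay
    warmup_ratio=0.1,                # assumed warmup fraction
    num_train_epochs=5,              # final epoch in the results table
    optim="adamw_torch",             # AdamW
)
```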

## Dataset Details

* **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
* **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)
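
The corpus can be pulled from the Hub for inspection as shown below (split names and column layout are assumptions; check the dataset card):

```python
from datasets import load_dataset

golem_guard = load_dataset("CordwainerSmith/GolemGuard")
print(golem_guard)              # available splits and sizes
print(golem_guard["train"][0])  # first example, assuming a "train" split
```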

## Performance Metrics

### Final Evaluation Results

eval_accuracy: 0.999795

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868 | 0.997747 | 0.999750 |
| 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191 | 0.998086 | 0.999795 |

## Model Architecture

The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.
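
For illustration, the fine-tuning starting point is just the unmodified base checkpoint with a fresh token-classification head; `NUM_LABELS` below is a placeholder for the size of the GolemGuard tag set:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

BASE = "FacebookAI/xlm-roberta-base"
NUM_LABELS = 37  # placeholder; the real value is the size of the label set

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForTokenClassification.from_pretrained(BASE, num_labels=NUM_LABELS)
```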

## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Only the first and last lines of this example are fixed by the card;
# the body is a standard token-classification inference flow and should
# be treated as a sketch rather than verbatim card code.
model_name = "CordwainerSmith/GolemPII-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

text = "..."  # Hebrew text to analyze

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions]

for token, label in zip(tokens, labels):
    print(f"Token: {token}, Label: {label}")
```

## License

The GolemPII-v1 model is released under the MIT License with the following additional terms:

```
MIT License

Copyright (c) 2024 Liran Baba

Permission is hereby granted, free of charge, to any person obtaining a copy
of this dataset and associated documentation files (the "Dataset"), to deal
in the Dataset without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Dataset, and to permit persons to whom the Dataset is
furnished to do so, subject to the following conditions:

1. The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Dataset.

2. Any academic or professional work that uses this Dataset must include an
appropriate citation as specified below.

THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
DATASET.
```

### How to Cite

If you use this model in your research, project, or application, please include the following citation:

For informal usage (e.g., blog posts, documentation):
```
GolemPII-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-v1)
```