Update README.md
README.md

metrics:
- accuracy
mask_token: "[MASK]"
widget:
- text: "京都 大学 で 自然 言語 処理 を [MASK] する 。"
---

# Model Card for Japanese DeBERTa V2 large

## Model description

This is a Japanese DeBERTa V2 large model pre-trained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.

## How to use

You can use this model for masked language modeling as follows:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v2-large-japanese')
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v2-large-japanese')
```
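
The `[MASK]` position can then be filled in roughly as follows. This is a minimal sketch that assumes `torch` is installed and reuses the `tokenizer` and `model` loaded above; the input must already be segmented into words by Juman++ (see the Tokenization section below).

```python
# Minimal masked-LM sketch (assumes torch; the input is already segmented by Juman++).
import torch

sentence = '京都 大学 で 自然 言語 処理 を [MASK] する 。'
inputs = tokenizer(sentence, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```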

You can also fine-tune this model on downstream tasks.

## Tokenization

The input text should be segmented into words by [Juman++](https://github.com/ku-nlp/jumanpp) in advance. [Juman++ 2.0.0-rc3](https://github.com/ku-nlp/jumanpp/releases/tag/v2.0.0-rc3) was used for pre-training. Each word is tokenized into subwords by [sentencepiece](https://github.com/google/sentencepiece).
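
Concretely, raw text can be pre-segmented along these lines. This is an illustrative sketch that assumes the [pyknp](https://github.com/ku-nlp/pyknp) wrapper and a local Juman++ installation; it is not part of the original pre-processing.

```python
# Sketch: segment raw text into words with Juman++ via pyknp before tokenization.
# Assumes Juman++ and pyknp are installed; recent pyknp uses the jumanpp binary by default.
from pyknp import Juman

juman = Juman()
raw_text = '京都大学で自然言語処理を研究する。'
segmented = ' '.join(m.midasi for m in juman.analysis(raw_text).mrph_list())
print(segmented)  # e.g. '京都 大学 で 自然 言語 処理 を 研究 する 。'
```

The space-separated string can then be passed directly to the tokenizer shown above.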

## Training data

We used the following corpora for pre-training:

- Japanese portion of OSCAR (54GB, 326M sentences, 25M documents)

Note that we filtered out documents annotated with "header", "footer", or "noisy" tags in OSCAR.
Also note that Japanese Wikipedia was duplicated 10 times to make the total size of the corpus comparable to that of CC-100 and OSCAR. As a result, the total size of the training data is 171GB.
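
As an illustration of this filtering step, documents could be dropped along the following lines, assuming the OSCAR dump is stored locally as JSON lines with quality annotations under a `meta.annotations` field (both the path and the field layout are assumptions, not the exact format used).

```python
# Sketch of the OSCAR quality filtering described above.
# The JSONL path and the 'meta.annotations' layout are assumptions.
import json

DROP_TAGS = {'header', 'footer', 'noisy'}

def keep_document(record: dict) -> bool:
    """Keep a document only if none of its annotations match the dropped tags."""
    annotations = record.get('meta', {}).get('annotations') or []
    return not (DROP_TAGS & set(annotations))

with open('oscar_ja.jsonl', encoding='utf-8') as src, \
        open('oscar_ja.filtered.jsonl', 'w', encoding='utf-8') as dst:
    for line in src:
        if keep_document(json.loads(line)):
            dst.write(line)
```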
## Training procedure

We first segmented texts in the corpora into words using [Juman++](https://github.com/ku-nlp/jumanpp).
Then, we built a sentencepiece model with 32000 tokens including words ([JumanDIC](https://github.com/ku-nlp/JumanDIC)) and subwords induced by the unigram language model of [sentencepiece](https://github.com/google/sentencepiece).
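
In outline, the vocabulary construction can be reproduced with the sentencepiece Python API. The snippet below is a rough sketch with assumed file names and default options; it does not reproduce the exact pre-training configuration or the seeding of the vocabulary with JumanDIC entries.

```python
# Rough sketch: train a 32000-token unigram sentencepiece model on the
# word-segmented corpus, then tokenize segmented text into subwords.
# File names and options are assumptions, not the original configuration.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='segmented_corpus.txt',          # one Juman++-segmented sentence per line (assumed path)
    model_prefix='deberta_v2_japanese_sp', # assumed output prefix
    vocab_size=32000,
    model_type='unigram',                  # subwords induced by the unigram language model
)

sp = spm.SentencePieceProcessor(model_file='deberta_v2_japanese_sp.model')
print(sp.encode('京都 大学 で 自然 言語 処理 を 研究 する 。', out_type=str))
```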

We tokenized the segmented corpora into subwords using the sentencepiece model and trained the Japanese DeBERTa model using the [transformers](https://github.com/huggingface/transformers) library.
The training took 36 days using 8 NVIDIA A100-SXM4-40GB GPUs.

The following hyperparameters were used during pre-training:

## Fine-tuning on NLU tasks

We fine-tuned the following models and evaluated them on the dev set of JGLUE.
We tuned the learning rate and the number of training epochs for each model and task following [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).

| Model                         | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
|-------------------------------|-------------|--------------|---------------|----------|-----------|-----------|------------|
| Waseda RoBERTa base           | 0.965       | 0.913        | 0.876         | 0.905    | 0.853     | 0.916     | 0.853      |
| Waseda RoBERTa large (seq512) | 0.969       | 0.925        | 0.890         | 0.928    | 0.910     | 0.955     | 0.900      |
| LUKE Japanese base*           | 0.965       | 0.916        | 0.877         | 0.912    | -         | -         | 0.842      |
| LUKE Japanese large*          | 0.965       | 0.932        | 0.902         | 0.927    | -         | -         | 0.893      |
| DeBERTaV2 base                | 0.970       | 0.922        | 0.886         | 0.922    | 0.899     | 0.951     | 0.873      |
| DeBERTaV2 large               | 0.968       | 0.925        | 0.892         | 0.924    | 0.912     | 0.959     | 0.890      |

*The scores of LUKE are from [the official repository](https://github.com/studio-ousia/luke).
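
A minimal outline of such a fine-tuning run is sketched below with the transformers `Trainer`. The toy premise/hypothesis pairs, the label scheme, and the hyperparameters are placeholders, not the tuned settings behind the scores above.

```python
# Sketch: fine-tune the model for a JNLI-style sentence-pair classification task.
# The toy dataset, label mapping, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = 'ku-nlp/deberta-v2-large-japanese'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Word-segmented toy examples standing in for the JGLUE training data.
raw = Dataset.from_dict({
    'premise': ['猫 が 寝て いる 。', '男性 が 走って いる 。'],
    'hypothesis': ['動物 が 寝て いる 。', '男性 が 座って いる 。'],
    'label': [0, 2],  # assumed mapping: 0 = entailment, 2 = contradiction
})

def preprocess(batch):
    return tokenizer(batch['premise'], batch['hypothesis'],
                     truncation=True, padding='max_length', max_length=128)

encoded = raw.map(preprocess, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='finetuned-deberta-v2-large-japanese',
                           num_train_epochs=1, per_device_train_batch_size=2,
                           learning_rate=2e-5),
    train_dataset=encoded,
)
trainer.train()
```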
## Acknowledgments

This work was supported by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) through General Collaboration Project no. jh221004, "Developing a Platform for Constructing and Sharing of Large-Scale Japanese Language Models".
For training models, we used the mdx: a platform for the data-driven future.