Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/KB/albert-base-swedish-cased-alpha/README.md

---
language: sv
---

# Swedish BERT Models

The National Library of Sweden / KBLab releases three pretrained language models based on BERT and ALBERT. The models are trained on approximately 15-20 GB of text (200M sentences, 3,000M tokens) from a variety of sources (books, news, government publications, Swedish Wikipedia and internet forums), with the aim of providing a representative BERT model for Swedish text. A more complete description will be published later on.

The following three models are currently available:

- **bert-base-swedish-cased** (*v1*) - A BERT trained with the same hyperparameters as first published by Google.
- **bert-base-swedish-cased-ner** (*experimental*) - A BERT fine-tuned for NER using SUC 3.0.
- **albert-base-swedish-cased-alpha** (*alpha*) - A first attempt at an ALBERT for Swedish.

All models are cased and trained with whole word masking.

## Files

| **name** | **files** |
|---------------------------------|-----------|
| bert-base-swedish-cased | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased/pytorch_model.bin) |
| bert-base-swedish-cased-ner | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/config.json), [vocab](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/vocab.txt), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/bert-base-swedish-cased-ner/pytorch_model.bin) |
| albert-base-swedish-cased-alpha | [config](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/config.json), [sentencepiece model](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/spiece.model), [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/KB/albert-base-swedish-cased-alpha/pytorch_model.bin) |

TensorFlow model weights will be released soon.

## Usage requirements / installation instructions

The examples below require Huggingface Transformers 2.4.1 and PyTorch 1.3.1 or greater. For Transformers < 2.4.0 the tokenizer must be instantiated manually, with the `do_lower_case` parameter set to `False` and `keep_accents` set to `True` (for ALBERT).

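On those older versions, the manual instantiation might look roughly like the sketch below. The exact calls are our illustration rather than part of the original card; `BertTokenizer` and `AlbertTokenizer` accept these flags in Transformers 2.x.

```python
# Only needed on Transformers < 2.4.0; newer versions read these
# settings from the hosted tokenizer configuration automatically.
from transformers import AlbertTokenizer, BertTokenizer

bert_tok = BertTokenizer.from_pretrained(
    'KB/bert-base-swedish-cased', do_lower_case=False)
albert_tok = AlbertTokenizer.from_pretrained(
    'KB/albert-base-swedish-cased-alpha',
    do_lower_case=False, keep_accents=True)
```
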
To create an environment in which the examples can be run, run the following in a terminal on your OS of choice:

```bash
git clone https://github.com/Kungbib/swedish-bert-models
cd swedish-bert-models
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

### BERT Base Swedish

A standard BERT base for Swedish trained on a variety of sources. The vocabulary size is ~50k. Using Huggingface Transformers, the model can be loaded in Python as follows:

```python
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
```

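Once loaded, the model can be used as a feature extractor. The snippet below is a minimal sketch under Transformers 2.x semantics; the example sentence and variable names are ours, not from the original card:

```python
import torch

# Encode a Swedish sentence and extract the contextual embeddings
# from the final layer. In Transformers 2.x the model returns a
# tuple whose first element is the last hidden state.
ids = torch.tensor([tok.encode('Kalle springer till biblioteket.')])
with torch.no_grad():
    last_hidden_state = model(ids)[0]
print(last_hidden_state.shape)  # (1, sequence_length, 768)
```
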
### BERT base fine-tuned for Swedish NER

This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline, the model can be instantiated easily. For Transformers < 2.4.1 it seems the tokenizer must be loaded separately to disable lower-casing of input strings:

```python
from transformers import pipeline

nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')

nlp('Idag släpper KB tre språkmodeller.')
```

Running the Python code above should produce something like the result below. The entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events and `ORG` for organisations. These labels are subject to change.

```python
[ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
  { 'word': 'KB',   'score': 0.9814832210540771, 'entity': 'ORG' } ]
```

The BERT tokenizer often splits words into multiple tokens, with the subparts starting with `##`; for example, the string `Engelbert kör Volvo till Herrängens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##ängens fotbolls ##klubb`. To glue the parts back together one can use something like this:

```python
text = 'Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården IF ' +\
       'som spelar fotboll i VM klockan två på kvällen.'

l = []
for token in nlp(text):
    if token['word'].startswith('##'):
        l[-1]['word'] += token['word'][2:]
    else:
        l += [ token ]

print(l)
```

Which should result in the following (though less cleanly formatted):

```python
[ { 'word': 'Engelbert',  'score': 0.99..., 'entity': 'PRS'},
  { 'word': 'Volvon',     'score': 0.99..., 'entity': 'OBJ'},
  { 'word': 'Tele2',      'score': 0.99..., 'entity': 'LOC'},
  { 'word': 'Arena',      'score': 0.99..., 'entity': 'LOC'},
  { 'word': 'Djurgården', 'score': 0.99..., 'entity': 'ORG'},
  { 'word': 'IF',         'score': 0.99..., 'entity': 'ORG'},
  { 'word': 'VM',         'score': 0.99..., 'entity': 'EVN'},
  { 'word': 'klockan',    'score': 0.99..., 'entity': 'TME'},
  { 'word': 'två',        'score': 0.99..., 'entity': 'TME'},
  { 'word': 'på',         'score': 0.99..., 'entity': 'TME'},
  { 'word': 'kvällen',    'score': 0.54..., 'entity': 'TME'} ]
```

### ALBERT base

Again, the easiest way to load the model is with Huggingface Transformers:

```python
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
```

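As with the BERT model above, the loaded ALBERT can serve as a feature extractor. A sketch follows; the mean-pooling strategy and the example sentence are our additions, a common heuristic rather than anything prescribed by the original card:

```python
import torch

# Mean-pool the final-layer token embeddings into a single
# fixed-size sentence vector (hidden size 768 for this model).
ids = torch.tensor([tok.encode('KB ligger i Humlegården i Stockholm.')])
with torch.no_grad():
    hidden = model(ids)[0]            # (1, sequence_length, 768)
sentence_vector = hidden.mean(dim=1)  # (1, 768)
```
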
## Acknowledgements ❤️

- Resources from Stockholm University, Umeå University and the Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
- Model pretraining was done partly in-house at KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
- Models are hosted on S3 by Huggingface 🤗