Fix broken URLs
README.md CHANGED
@@ -50,14 +50,14 @@ model-index:
 
 # SpanMarker with roberta-large on FewNERD
 
-This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [roberta-large](https://huggingface.co/
+This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [roberta-large](https://huggingface.co/roberta-large) as the underlying encoder. See [train.py](train.py) for the training script.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
-- **Encoder:** [roberta-large](https://huggingface.co/
+- **Encoder:** [roberta-large](https://huggingface.co/roberta-large)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
 - **Training Dataset:** [FewNERD](https://huggingface.co/datasets/DFKI-SLT/few-nerd)
@@ -179,7 +179,7 @@ trainer.save_model("tomaarsen/span-marker-roberta-large-fewnerd-fine-super-finet
 </details>
 
 ### ⚠️ Tokenizer Warning
-The [roberta-large](https://huggingface.co/
+The [roberta-large](https://huggingface.co/roberta-large) tokenizer distinguishes between punctuation directly attached to a word and punctuation separated from a word by a space. For example, `Paris.` and `Paris .` are tokenized into different tokens. During training, this model is only exposed to the latter style, i.e. all words are separated by a space. Consequently, the model may perform worse when the inference text is in the former style.
 
 In short, it is recommended to preprocess your inference text such that all words and punctuation are separated by a space. Some potential approaches to convert regular text into this format are NLTK [`word_tokenize`](https://www.nltk.org/api/nltk.tokenize.word_tokenize.html) or spaCy [`Doc`](https://spacy.io/api/doc#iter) and join the resulting words with a space.
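For reference, the model card being fixed describes a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) NER model. Below is a minimal inference sketch using the `span_marker` library from that repository; the full repo id is truncated in the second hunk header above, so the id used here is a placeholder assumption:

```python
# Minimal SpanMarker inference sketch. The repo id below is a placeholder:
# the real id is truncated in the diff hunk header above.
from span_marker import SpanMarkerModel

MODEL_ID = "tomaarsen/span-marker-roberta-large-fewnerd-fine-super"  # placeholder id

# Load the trained model (roberta-large encoder, 256-token max sequence length).
model = SpanMarkerModel.from_pretrained(MODEL_ID)

# Per the tokenizer warning in the card, separate all words and punctuation
# with spaces before predicting.
text = "Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic ."

# predict() returns a list of entity dicts with span text, label, and score.
for entity in model.predict(text):
    print(entity["span"], entity["label"], round(entity["score"], 3))
```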

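The tokenizer warning restored above recommends space-separating words and punctuation before inference. A rough sketch of the NLTK `word_tokenize` approach the card mentions, assuming `nltk` is installed and its tokenizer data is available:

```python
# Convert regular text into the space-separated style the model saw in training.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer data; newer NLTK releases may also need "punkt_tab"

def space_separate(text: str) -> str:
    """Rejoin word_tokenize output with single spaces, e.g. 'Paris.' -> 'Paris .'"""
    return " ".join(word_tokenize(text))

print(space_separate("The model was trained in Paris."))
# -> The model was trained in Paris .
```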