---
id: CardioNER.nl_128xtokenWindow
name: CardioNER.nl_128xtokenWindow
description: >-
  CardioBERTa.nl_clinical finetuned for multilabel NER task with tokenwindow of 128
license: gpl-3.0
language: nl
tags:
  - lexical semantic
  - span classification
  - science
  - biology
  - clinical ner
  - biomedical
  - ner,medical
  - bionlp
base_model: UMCU/CardioBERTa.nl_clinical
pipeline_tag: token-classification
datasets:
  - DT4H/CardioCCC
  - UMCU/cardioccc_dutch
---

# Model Card for Cardioner.nl 128

This is a UMCU/CardioBERTa.nl_clinical base model finetuned for span classification. For this model we used IOB-tagging, which facilitates the aggregation of predictions over sequences. This specific model was trained on a batch of about 500 span-labeled documents.

This version was trained with context windows of 128 tokens. For the chunking we used a paragraph-based splitter. Training was performed with 10-fold cross-validation, with weight averaging of the best epochs per fold.

### Expected input and output

The input should be a string with **Dutch** clinical text related to **cardiology**.

CardioNER.nl_128 is a multiclass span classification model. The classes that can be predicted are
* **procedure**,
* **medication**,
* **disease**,
* **symptom**.

#### Extracting span classification from CardioNER.nl_128xtokenWindow

The following script converts a string of <128 tokens to a list of span predictions.

```python
from transformers import pipeline

# `model` is the Hub ID or local path of this model (also used for the tokenizer).
le_pipe = pipeline('ner', model=model, tokenizer=model,
                   aggregation_strategy="simple", device=-1)  # device=-1: CPU
named_ents = le_pipe(SOME_TEXT)
```

To process a string of *arbitrary length* you can split the string into sentences or paragraphs using e.g. pysbd or spacy(sentencizer) and iteratively parse the resulting list with the span-classification pipe.
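
The split-and-iterate approach could look roughly like the sketch below. It is an illustration under stated assumptions, not the splitter used for training: `split_paragraphs` is a hypothetical regex-based stand-in for pysbd or spaCy that splits on blank lines, and `predict_long_text` is a hypothetical helper that expects a callable returning dictionaries with `start` and `end` character offsets, as the pipeline does with `aggregation_strategy="simple"`. The offsets of each chunk are shifted back so they index into the full text.

```python
import re
from typing import Callable, Dict, Iterator, List, Tuple


def split_paragraphs(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (start_offset, paragraph) pairs, splitting on blank lines.

    Hypothetical stand-in for a pysbd or spaCy splitter.
    """
    pos = 0
    # The capturing group keeps the separators, so offsets stay exact.
    for block in re.split(r"(\n\s*\n)", text):
        if block.strip():  # skip separators and empty blocks
            yield pos, block
        pos += len(block)


def predict_long_text(
    text: str,
    ner_pipe: Callable[[str], List[Dict]],
    splitter: Callable[[str], Iterator[Tuple[int, str]]] = split_paragraphs,
) -> List[Dict]:
    """Run ner_pipe on each chunk and map spans back to the full text."""
    entities = []
    for offset, chunk in splitter(text):
        for ent in ner_pipe(chunk):
            shifted = dict(ent)  # copy, so the pipeline output is not mutated
            shifted["start"] = ent["start"] + offset
            shifted["end"] = ent["end"] + offset
            entities.append(shifted)
    return entities


# Usage with the pipeline defined above (requires the model to be loaded):
# named_ents = predict_long_text(SOME_TEXT, le_pipe)
```

Note that a single paragraph can still exceed 128 tokens; in that case a finer splitter (for example sentence-level) would be needed for that chunk.
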
You can also use the strider built into the transformers pipeline, although this is limited to non-overlapping strides, requires a fast tokenizer, and does not work with `aggregation_strategy=None`:

```python
named_ents = le_pipe(SOME_TEXT, stride=256)
```

# Data description

CardioCCC: manually labeled cardiology discharge letters, with the labels procedure, medication, disease, and symptom.

# Acknowledgement

This is part of the [DT4H project](https://www.datatools4heart.eu/).

# Doi and reference

For more details about training/eval and other scripts, see the CardioNER [github repo](https://github.com/DataTools4Heart/CardioNER). For more information on the background, see Datatools4Heart [Huggingface](https://huggingface.co/DT4H)/[Website](https://www.datatools4heart.eu/).