---
license: mit
datasets:
- Riksarkivet/mini_cleaned_diachronic_swe
language:
- sv
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om.

train-eval-index:
- config: Riksarkivet/mini_cleaned_diachronic_swe
  task: fill-mask
  task_id: fill-mask
  splits:
    eval_split: test
  col_mapping:
    text: text

model-index:
- name: bert-base-cased-swe-historical
  results:
  - task:
      type: fill-mask
      name: fill-mask
    dataset:
      name: Riksarkivet/mini_cleaned_diachronic_swe
      type: Riksarkivet/mini_cleaned_diachronic_swe
      split: test
    metrics:
    - type: perplexity
      value: 3.42
      name: Perplexity (WIP)
---

# Historical Swedish BERT Model

**WORK IN PROGRESS** (will be updated with larger datasets soon; new OCR output will extend the dataset even further)

This is a historical Swedish BERT model released by the Swedish National Archives (Riksarkivet) to better generalise to historical Swedish text. Researchers are well aware that the Swedish language has changed over time, which makes models trained only from a present-day point of view less ideal candidates for the job.
This model, however, can be used to interpret and analyse historical textual material, and it can be fine-tuned for different downstream tasks.

## Intended uses & limitations
This model is primarily intended to be fine-tuned further on downstream tasks.

Fill-mask inference with Hugging Face Transformers in Python:

```python
from transformers import pipeline

# Load the fill-mask pipeline with the historical Swedish BERT model
fill_mask = pipeline("fill-mask", model="Riksarkivet/bert-base-cased-swe-historical")

historical_text = """Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om."""

# Print the top predictions for the masked token
print(fill_mask(historical_text))
```
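
Beyond fill-mask inference, the checkpoint can be loaded as an encoder and fine-tuned with a task-specific head. Below is a minimal sketch using the standard Transformers `Auto*` classes; the token-classification task and the `num_labels` value are illustrative assumptions, not part of the released model:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Tokenizer and encoder weights come from the released checkpoint;
# the token-classification head is newly initialised and must be trained.
tokenizer = AutoTokenizer.from_pretrained("Riksarkivet/bert-base-cased-swe-historical")
model = AutoModelForTokenClassification.from_pretrained(
    "Riksarkivet/bert-base-cased-swe-historical",
    num_labels=5,  # illustrative label count, e.g. a small NER tag set
)

# From here, fine-tune with Trainer or a plain PyTorch loop on labelled data.
```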
					
						

## Model Description
The training procedure can be recreated from here: [Src_code](https://github.com/Borg93/kbuhist2/tree/main).
The preprocessing procedure can be recreated from here: [Src_code](https://github.com/Borg93/kbuhist2/tree/main).
					
						

**Model**:
The following hyperparameters were used during training (see the sketch after this list):
- learning_rate: 3e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 0
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 6
- fp16: False
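
As a rough guide, these settings translate into Hugging Face `TrainingArguments` roughly as sketched below. The `output_dir` is an assumption, the per-device batch sizes stand in for the listed train/eval batch sizes, and `gradient_accumulation_steps` (listed as 0 above) is left at the `Trainer` default of 1:

```python
from transformers import TrainingArguments

# Sketch of the listed hyperparameters as TrainingArguments.
# output_dir is assumed; gradient_accumulation_steps is omitted because
# Trainer expects a value of at least 1.
training_args = TrainingArguments(
    output_dir="bert-base-cased-swe-historical",  # assumed
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=6,
    fp16=False,
)
```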
					
						

**Dataset (WIP)**:
- [Khubist2](https://huggingface.co/datasets/Riksarkivet/mini_cleaned_diachronic_swe), which has been cleaned and chunked. **(will be further extended)**
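
To inspect the training data, the corpus can be loaded with the `datasets` library; the split layout is whatever the dataset repository defines:

```python
from datasets import load_dataset

# Load the cleaned and chunked historical Swedish corpus from the Hub
dataset = load_dataset("Riksarkivet/mini_cleaned_diachronic_swe")
print(dataset)  # shows the available splits and row counts
```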
					
						

## Acknowledgements
We gratefully acknowledge [EuroHPC](https://eurohpc-ju.europa.eu) for funding this research by providing the computing resources of the HPC system [Vega](https://www.izum.si), and [SWE-clarin](https://sweclarin.se/) for the datasets.
					
						

## Citation Information

Eva Pettersson and Lars Borin (2022). Swedish Diachronic Corpus. In Darja Fišer & Andreas Witt (eds.), CLARIN. The Infrastructure for Language Resources. Berlin: De Gruyter. https://degruyter.com/document/doi/10.1515/9783110767377-022/html
					
						