---
license: mit
datasets:
- Riksarkivet/mini_cleaned_diachronic_swe
language:
- sv
metrics:
- perplexity
pipeline_tag: fill-mask
widget:
- text: Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om.

train-eval-index:
- config: Riksarkivet/mini_cleaned_diachronic_swe
  task: fill-mask
  task_id: fill-mask
  splits:
    eval_split: test
  col_mapping:
    text: text

model-index:
- name: bert-base-cased-swe-historical
  results:
  - task:
      type: fill-mask
      name: fill-mask
    dataset:
      name: Riksarkivet/mini_cleaned_diachronic_swe
      type: Riksarkivet/mini_cleaned_diachronic_swe
      split: test
    metrics:
    - type: perplexity
      value: 3.42
      name: Perplexity (WIP)
---

# Historical Swedish BERT Model

**WORK IN PROGRESS** (will be updated with larger datasets soon; new OCR output will extend the dataset even further)

This is a historical Swedish BERT model released by the Swedish National Archives (Riksarkivet) to better generalise to historical Swedish text. Researchers are well aware that the Swedish language has changed over time, which makes models trained only from a present-day point of view less ideal candidates for the job.
This model, however, can be used to interpret and analyse historical textual material, and it can be fine-tuned for different downstream tasks.

## Intended uses & limitations
This model is primarily intended to be fine-tuned further on downstream tasks.

Fill-mask inference with Hugging Face Transformers in Python:

```python
from transformers import pipeline

# Load the fill-mask pipeline with the historical Swedish BERT model
fill_mask = pipeline("fill-mask", model="Riksarkivet/bert-base-cased-swe-historical")

historical_text = """Det vore [MASK] häller nödvändigt att bita af tungan än berättat hvad jag varit med om."""

# Print the top predictions for the masked token
print(fill_mask(historical_text))
```
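
Beyond fill-mask inference, the checkpoint can be loaded as an encoder and fine-tuned with a task-specific head. Below is a minimal sketch using the standard Transformers `Auto*` classes; the token-classification task and the `num_labels` value are illustrative assumptions, not part of the released model:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Tokenizer and encoder weights come from the released checkpoint;
# the token-classification head is newly initialised and must be trained.
tokenizer = AutoTokenizer.from_pretrained("Riksarkivet/bert-base-cased-swe-historical")
model = AutoModelForTokenClassification.from_pretrained(
    "Riksarkivet/bert-base-cased-swe-historical",
    num_labels=5,  # illustrative label count, e.g. a small NER tag set
)

# From here, fine-tune with Trainer or a plain PyTorch loop on labelled data.
```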
					
						

## Model Description
The training procedure can be recreated from here: [Src_code](https://github.com/Borg93/kbuhist2/tree/main).
The preprocessing procedure can be recreated from here: [Src_code](https://github.com/Borg93/kbuhist2/tree/main).
					
						

**Model**:
The following hyperparameters were used during training (see the sketch after this list):
- learning_rate: 3e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 0
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 6
- fp16: False
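
As a rough guide, these settings translate into Hugging Face `TrainingArguments` roughly as sketched below. The `output_dir` is an assumption, the per-device batch sizes stand in for the listed train/eval batch sizes, and `gradient_accumulation_steps` (listed as 0 above) is left at the `Trainer` default of 1:

```python
from transformers import TrainingArguments

# Sketch of the listed hyperparameters as TrainingArguments.
# output_dir is assumed; gradient_accumulation_steps is omitted because
# Trainer expects a value of at least 1.
training_args = TrainingArguments(
    output_dir="bert-base-cased-swe-historical",  # assumed
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=6,
    fp16=False,
)
```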
					
						

**Dataset (WIP)**:
- [Khubist2](https://huggingface.co/datasets/Riksarkivet/mini_cleaned_diachronic_swe), which has been cleaned and chunked. **(will be further extended)**
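
To inspect the training data, the corpus can be loaded with the `datasets` library; the split layout is whatever the dataset repository defines:

```python
from datasets import load_dataset

# Load the cleaned and chunked historical Swedish corpus from the Hub
dataset = load_dataset("Riksarkivet/mini_cleaned_diachronic_swe")
print(dataset)  # shows the available splits and row counts
```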
					
						

## Acknowledgements
We gratefully acknowledge [EuroHPC](https://eurohpc-ju.europa.eu) for funding this research by providing the computing resources of the HPC system [Vega](https://www.izum.si), and [SWE-clarin](https://sweclarin.se/) for the datasets.
					
						

## Citation Information

Eva Pettersson and Lars Borin (2022). Swedish Diachronic Corpus. In Darja Fišer & Andreas Witt (eds.), CLARIN. The Infrastructure for Language Resources. Berlin: De Gruyter. https://degruyter.com/document/doi/10.1515/9783110767377-022/html
					
						