---
license: mit
---
# gated-deltanet-swa-0.4B-10B
Gated DeltaNet + sliding-window attention (0.4B params, 10B tokens)
## Overview
* **Training**: gated-deltanet-swa-0.4B-10B was trained on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), which is released under [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/)
* **Parameters**: 0.4B
* **Task**: Language modeling
* **Framework**: HuggingFace, [flash-linear-attention](https://github.com/fla-org/flash-linear-attention)
* **Output structure**: [batch_size, sequence_length, num_logits]
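The output structure above can be illustrated with dummy values. This is a minimal sketch using NumPy; the concrete sizes (batch of 2, 8 tokens, vocabulary of 32000) are placeholders for illustration, not values taken from the model config:

```python
import numpy as np

# Hypothetical shapes for illustration only
batch_size, sequence_length, num_logits = 2, 8, 32000
logits = np.zeros((batch_size, sequence_length, num_logits))

# Greedy next-token prediction reads the logits at the last position
next_token_ids = logits[:, -1, :].argmax(axis=-1)

print(logits.shape)          # one row of logits per input token
print(next_token_ids.shape)  # one predicted token id per batch element
```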
## Performance
Benchmark results vary by task; see the paper cited below for full evaluations.
## Running Code
* Minimal code to instantiate the model and perform inference:
```python
# Requires flash-linear-attention (https://github.com/fla-org/flash-linear-attention)
import fla  # registers the Gated DeltaNet architecture with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

path_to_model = "..."  # local path or Hugging Face repo id of this model
model = AutoModelForCausalLM.from_pretrained(path_to_model).cuda()
tokenizer = AutoTokenizer.from_pretrained(path_to_model)  # tokenizers stay on CPU
input_ids = tokenizer("All human beings are", return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## License
This model is released under the [MIT License](LICENSE.txt).
## Citation
If you find our work useful, please cite the following publication:
```bibtex
@misc{he_alleviating_2025,
  title = {Alleviating {Forgetfulness} of {Linear} {Attention} by {Hybrid} {Sparse} {Attention} and {Contextualized} {Learnable} {Token} {Eviction}},
  url = {http://arxiv.org/abs/2510.20787},
  doi = {10.48550/arXiv.2510.20787},
  publisher = {arXiv},
  author = {He, Mutian and Garner, Philip N.},
  month = oct,
  year = {2025},
  note = {arXiv:2510.20787 [cs]},
}
```