---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
language:
- en
base_model:
- yuchenxie/ArlowGPT-Tokenizer
library_name: transformers
---
# ArlowGPT Tokenizer
This repository contains a custom-trained BPE tokenizer for ArlowGPT, created by Yuchen Xie.
## Tokenizer Details
- Type: BPE (Byte-Pair Encoding)
- Vocabulary Size: 131,072 tokens
- Special Tokens:
  - Start of Text: `<|startoftext|>`
  - End of Text: `<|endoftext|>`
  - Padding: `<|pad|>`
  - Unknown: `<|unk|>`
  - Mask: `<|mask|>`
  - Message Start: `<|im_start|>`
  - Message End: `<|im_end|>`
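
The message start and end tokens can delimit turns in a conversation. Below is a minimal sketch of how a chat prompt might be assembled as plain text before tokenization. The `format_chat` helper is hypothetical, and the "role, newline, content" layout is an assumption borrowed from ChatML-style templates; this repository does not specify a chat template.

```python
# Message delimiter tokens as listed in this README.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def format_chat(messages):
    """Render messages as text, wrapping each one in the message start/end tokens.

    Assumption: the "role, newline, content" layout follows ChatML-style
    templates. It is not defined by this repository.
    """
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    return "".join(parts)


prompt = format_chat([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there."},
])
print(prompt)
```

The resulting string can then be passed to the tokenizer like any other text.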
## Usage
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("yuchenxie/arlowgpt-tokenizer-v2")
```
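
Once loaded, the tokenizer can encode text to token IDs and decode them back. The following is a rough sketch rather than an official example: `load_tokenizer` and `roundtrip` are hypothetical helper names, and the repository ID is simply the one used in the snippet above.

```python
def load_tokenizer(repo_id="yuchenxie/arlowgpt-tokenizer-v2"):
    """Load the tokenizer from the Hub (needs network access on first use)."""
    # Imported lazily so that roundtrip() below has no hard dependency.
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(repo_id)


def roundtrip(tokenizer, text):
    """Encode text to token IDs, then decode the IDs back to a string."""
    # add_special_tokens=False keeps any automatically added special tokens
    # out of the ID list, so only the text itself is encoded.
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids)
    return ids, decoded
```

For example, `ids, decoded = roundtrip(load_tokenizer(), "Hello, world!")` returns the token IDs along with the decoded string, which should typically match the input text.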
## Training Details
This tokenizer was trained on 10B randomly shuffled GPT-2 tokens using a custom script written by Yuchen Xie. It is compatible with the Hugging Face Transformers `AutoTokenizer` class.