---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
language:
- en
base_model:
- yuchenxie/ArlowGPT-Tokenizer
library_name: transformers
---

# ArlowGPT Tokenizer

This repository contains a custom-trained BPE tokenizer for ArlowGPT, created by Yuchen Xie.

## Tokenizer Details

- Type: BPE (Byte-Pair Encoding)
- Vocabulary Size: 131,072 tokens
- Special Tokens (see the prompt sketch after this list):
  - Start of Text: `<|startoftext|>`
  - End of Text: `<|endoftext|>`
  - Padding: `<|pad|>`
  - Unknown: `<|unk|>`
  - Mask: `<|mask|>`
  - Message Start: `<|im_start|>`
  - Message End: `<|im_end|>`
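
The two message-delimiter tokens follow the ChatML convention. As a hedged sketch, a single-turn prompt built from them might look like this (the role names and newline placement are assumptions; the exact template ArlowGPT expects is not documented in this repository):

```python
# ChatML-style prompt assembled from the delimiter tokens listed above.
# The role names ("user", "assistant") and newline placement are assumptions.
prompt = (
    "<|im_start|>user\n"
    "Hello, who are you?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```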

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer directly from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("yuchenxie/arlowgpt-tokenizer-v2")
```
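
Continuing from the snippet above, a quick sanity check (the comments note expected values; whether `<|pad|>` is registered as the pad token is an assumption based on the special-token list):

```python
text = "ArlowGPT uses a byte-pair encoding tokenizer."
ids = tokenizer(text)["input_ids"]

print(len(tokenizer))         # full vocabulary size, expected to be 131,072
print(tokenizer.pad_token)    # expected to be <|pad|>
print(tokenizer.decode(ids))  # byte-level BPE round-trips back to the input
```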

## Training Details

This tokenizer was trained on 10B randomly shuffled tokens (as counted by the GPT-2 tokenizer) from FineWeb, using a custom script written by Yuchen Xie. It is compatible with the Hugging Face Transformers `AutoTokenizer` class.
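
The training script itself is not published here. As a rough illustration only, a byte-level BPE tokenizer with the same vocabulary size and special tokens could be trained with the `tokenizers` library along these lines (the FineWeb streaming setup, batching, and pre-tokenization choices are assumptions, not the actual script):

```python
from datasets import load_dataset
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Stream FineWeb so the corpus never has to fit in memory.
dataset = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

def text_iterator(batch_size=1000):
    batch = []
    for row in dataset:
        batch.append(row["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Byte-level BPE, matching the tokenizer type described above.
tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=131072,  # matches the 131,072-token vocabulary listed above
    special_tokens=[
        "<|startoftext|>", "<|endoftext|>", "<|pad|>",
        "<|unk|>", "<|mask|>", "<|im_start|>", "<|im_end|>",
    ],
)

tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```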