is `~` being tokenized to unk?
#134
by sorryhyun - opened
I found this while tokenizing kaguya-sama_wa_kokurasetai_~tensai-tachi_no_renai_zunousen~
Bash(uv run python - <<'EOF'
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("library/anima/configs/t5_old")
# Test the tilde
for ch in ["~", "βΌ", "γ", "ο½", "β", "β", "β’", "β
"]:
ids = tok(ch, add_special_tokens=False)["input_ids"]
toks = tok.convert_ids_to_tokens(ids)
print(f"{ch!r}: ids={ids} toks={toks}")
# Test on the failing string
s = "kaguya-sama wa kokurasetai ~tensai-tachi no renai zunousen~"
ids = tok(s, add_special_tokens=False)["input_ids"]
print(f"\nfull string: n_tok={len(ids)} <unk> count={ids.count(2)}")
EOF
echo "---"
grep -l '~' post_image_dataset/*.txt | wc -l
echo "txt files containing ~")
βΏ '~': ids=[3, 2] toks=['β', '<unk>']
'βΌ': ids=[3, 2] toks=['β', '<unk>']
'γ': ids=[3, 2] toks=['β', '<unk>']
'ο½': ids=[3, 2] toks=['β', '<unk>']
'β': ids=[3, 104] toks=['β', 'β']
'β': ids=[3, 318] toks=['β', 'β']
'β’': ids=[1697] toks=['ββ’']
'β
': ids=[3, 2] toks=['β', '<unk>']
so I guess anime title involving ~ would be tokenized to unk, not bothering if title sequence was trained well, but I was wondering ~ could be added user_vocab so that title sequences be trained better?
sorryhyun changed discussion title from is `~` tokenized to unk? to is `~` being tokenized to unk?