is `~` being tokenized to unk?

#134

by sorryhyun - opened 8 days ago

I found this while tokenizing `kaguya-sama_wa_kokurasetai_~tensai-tachi_no_renai_zunousen~`:

Bash(uv run python - <<'EOF'
      from transformers import AutoTokenizer
      tok = AutoTokenizer.from_pretrained("library/anima/configs/t5_old")
      # Test the tilde
      for ch in ["~", "∼", "〜", "～", "–", "—", "•", "★"]:
          ids = tok(ch, add_special_tokens=False)["input_ids"]
          toks = tok.convert_ids_to_tokens(ids)
          print(f"{ch!r}: ids={ids} toks={toks}")
      # Test on the failing string
      s = "kaguya-sama wa kokurasetai ~tensai-tachi no renai zunousen~"
      ids = tok(s, add_special_tokens=False)["input_ids"]
      print(f"\nfull string: n_tok={len(ids)} <unk> count={ids.count(2)}")
      EOF
      echo "---"
      grep -l '~' post_image_dataset/*.txt | wc -l
      echo "txt files containing ~")
  ⎿  '~': ids=[3, 2] toks=['▁', '<unk>']
     '∼': ids=[3, 2] toks=['▁', '<unk>']
     '〜': ids=[3, 2] toks=['▁', '<unk>']
     '～': ids=[3, 2] toks=['▁', '<unk>']
     '–': ids=[3, 104] toks=['▁', '–']
     '—': ids=[3, 318] toks=['▁', '—']
     '•': ids=[1697] toks=['▁•']
     '★': ids=[3, 2] toks=['▁', '<unk>']
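
The per-character check above could be generalized to scan a whole caption for characters that fall back to `<unk>`. This is only a minimal sketch: `unk_chars` and its `encode` argument are hypothetical names, and the default `unk_id=2` is taken from the output above rather than from the tokenizer config.

```python
from typing import Callable, List


def unk_chars(text: str, encode: Callable[[str], List[int]], unk_id: int = 2) -> List[str]:
    """Return the distinct non-space characters of `text` whose encoding contains `unk_id`.

    `encode` is any callable mapping a string to a list of token ids
    (hypothetical helper argument, not a transformers API).
    """
    checked = set()  # characters already tested, so each one is encoded once
    found = []       # characters that produced the unk id, in first-seen order
    for ch in text:
        if ch.isspace() or ch in checked:
            continue
        checked.add(ch)
        if unk_id in encode(ch):
            found.append(ch)
    return found
```

With the tokenizer from the snippet above, `encode` would be something like `lambda ch: tok(ch, add_special_tokens=False)["input_ids"]`.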

So I guess any anime title containing `~` gets tokenized to `<unk>`. That's not a big deal if the title sequences were trained well anyway, but I was wondering whether `~` could be added to user_vocab so that title sequences train better?
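
Until the vocab changes, one possible workaround is to normalize the tilde variants before tokenizing. This is a minimal sketch, assuming it is acceptable to replace them with the en dash `–`, which maps to a real token (id 104) in the output above. `TILDE_VARIANTS` and `normalize_tildes` are hypothetical names, not part of any library.

```python
# Tilde-like characters that tokenized to <unk> in the output above.
TILDE_VARIANTS = "~∼〜～"

# Assumption: the en dash is an acceptable stand-in, since it has its own token.
_TILDE_TABLE = str.maketrans({ch: "–" for ch in TILDE_VARIANTS})


def normalize_tildes(text: str) -> str:
    """Replace every tilde variant in `text` with an en dash."""
    return text.translate(_TILDE_TABLE)


caption = "kaguya-sama wa kokurasetai ~tensai-tachi no renai zunousen~"
normalized = normalize_tildes(caption)
```

This changes the caption text itself, so it would have to be applied the same way at training and at inference time, and it loses the distinction between a tilde and a dash.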

sorryhyun changed discussion title from is `~` tokenized to unk? to is `~` being tokenized to unk? 8 days ago
