Diffusion Single File
comfyui

is `~` being tokenized to unk?

#134
by sorryhyun - opened

I found this while tokenizing `kaguya-sama_wa_kokurasetai_~tensai-tachi_no_renai_zunousen~`.

```shell
uv run python - <<'EOF'
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("library/anima/configs/t5_old")

# Test the tilde and a few other punctuation characters
for ch in ["~", "∼", "〜", "~", "–", "—", "•", "★"]:
    ids = tok(ch, add_special_tokens=False)["input_ids"]
    toks = tok.convert_ids_to_tokens(ids)
    print(f"{ch!r}: ids={ids} toks={toks}")

# Test on the failing string
s = "kaguya-sama wa kokurasetai ~tensai-tachi no renai zunousen~"
ids = tok(s, add_special_tokens=False)["input_ids"]
print(f"\nfull string: n_tok={len(ids)} <unk> count={ids.count(2)}")
EOF
echo "---"
grep -l '~' post_image_dataset/*.txt | wc -l
echo "txt files containing ~"
```
Output:

```
'~': ids=[3, 2] toks=['▁', '<unk>']
'∼': ids=[3, 2] toks=['▁', '<unk>']
'〜': ids=[3, 2] toks=['▁', '<unk>']
'~': ids=[3, 2] toks=['▁', '<unk>']
'–': ids=[3, 104] toks=['▁', '–']
'—': ids=[3, 318] toks=['▁', '—']
'•': ids=[1697] toks=['▁•']
'★': ids=[3, 2] toks=['▁', '<unk>']
```
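As a dataset-side stop-gap, the tilde variants could be folded into punctuation the vocab does cover before tokenizing. This is only a sketch; `normalize_caption`, `TILDE_LIKE`, and the choice of `-` as replacement are my own, not anything from the repo:

```python
# Hypothetical workaround: fold tilde-like characters that the T5
# SentencePiece vocab maps to <unk> into ASCII punctuation it does know.
TILDE_LIKE = {"~", "\u223c", "\u301c", "\uff5e"}  # ~, ∼, 〜, ～

def normalize_caption(text: str, replacement: str = "-") -> str:
    """Replace tilde-like characters so no caption character hits <unk>."""
    return "".join(replacement if ch in TILDE_LIKE else ch for ch in text)

print(normalize_caption("kaguya-sama wa kokurasetai ~tensai-tachi no renai zunousen~"))
# → kaguya-sama wa kokurasetai -tensai-tachi no renai zunousen-
```

This loses the original punctuation, of course, which is why adding `~` to the vocab would be the nicer fix.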

So I guess any anime title involving `~` gets tokenized to `<unk>`. Maybe it doesn't matter if those title sequences were never trained well anyway, but I was wondering: could `~` be added to the user vocab so that title sequences can be trained better?
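For the user-vocab idea: with a Hugging Face tokenizer the SentencePiece model doesn't need retraining; `add_tokens` can register `~` as an added token. A sketch, assuming the public `t5-small` tokenizer as a stand-in for the repo's local `t5_old` config (which I can't load here); both use the T5 SentencePiece vocab in which `~` is out-of-vocabulary:

```python
from transformers import AutoTokenizer

# "t5-small" stands in for the repo's local t5_old tokenizer (assumption).
tok = AutoTokenizer.from_pretrained("t5-small")
assert tok.unk_token_id in tok("~", add_special_tokens=False)["input_ids"]

tok.add_tokens(["~"])  # registers "~" as an added (non-special) token
ids = tok("~", add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids))  # "~" now gets its own id, not <unk>

# If the text encoder is fine-tuned afterwards, its embedding matrix
# has to grow to match the new vocab size:
# model.resize_token_embeddings(len(tok))
```

The new embedding row starts untrained, so this only helps if the text encoder is actually fine-tuned on captions containing `~`.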

sorryhyun changed discussion title from is `~` tokenized to unk? to is `~` being tokenized to unk?
