Luca Di Liello
lucadiliello
7 followers · 5 following
https://lucadiliello.github.io
lucadiliello
AI & ML interests
Applied Scientist II in Amazon AGI
Recent Activity
reacted to Norod78's post with 🔥 7 days ago
Multilingual Tokenization Showdown: Analyzing 12 LLM Tokenizers Across 204 Languages.
First, I've created a dataset with Wikipedia's "Cat" article text in 272 languages: https://huggingface.co/datasets/Norod78/WikiCat-Multilingual
For each language entry with at least 100 words, I tokenized the text using 12 tokenizers and calculated the "characters per token" and "words per token" ratios. The higher these ratios are, the more information each token represents on average for that language (perhaps allowing the LLM to learn more per parameter if trained on a dataset in that language).
You can see a slideshow summary of the results here: https://norod.github.io/wikicat-tokenizer-eval/tokenizer-slideshow.html
I hope I interpreted the results correctly. I've made the code available on GitHub so you can re-create the raw results JSONL with this repo: https://github.com/Norod/wikicat-tokenizer-eval
Post on X: https://x.com/Norod78/status/1984366900550266999
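The two ratios the post describes can be illustrated with a minimal sketch. This is not the code from the linked repository: `tokenization_ratios` and `toy_tokenize` are hypothetical names, words are approximated by whitespace splitting, and the toy tokenizer (fixed-size pieces of each word) is only a stand-in for a real LLM tokenizer.

```python
from typing import Callable, Dict, List


def tokenization_ratios(text: str, tokenize: Callable[[str], List[str]]) -> Dict[str, float]:
    """Compute characters-per-token and words-per-token for one text.

    `tokenize` is any callable that maps a string to a list of tokens.
    Words are approximated by whitespace splitting (an assumption).
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    words = text.split()
    return {
        "chars_per_token": len(text) / len(tokens),
        "words_per_token": len(words) / len(tokens),
    }


def toy_tokenize(text: str, piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into fixed-size pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


sample = "The cat is a small domesticated carnivorous mammal"
ratios = tokenization_ratios(sample, toy_tokenize)
print(ratios)
```

To compare real tokenizers, any function that returns a token list for a string could be passed in place of `toy_tokenize`; a higher ratio means fewer tokens are needed for the same text.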
reacted to Norod78's post with 👍 7 days ago
liked a dataset about 1 month ago: nvidia/Nemotron-Pretraining-Dataset-sample
Organizations
None yet
lucadiliello's datasets (28)
lucadiliello/STORIES • Updated Jul 18, 2023 • 947k • 576 • 11
lucadiliello/fever • Updated Jul 17, 2023 • 185k • 8
lucadiliello/cc_news • Updated Jun 20, 2023 • 150M • 193 • 2
lucadiliello/hotpotqa • Updated Jun 6, 2023 • 78.8k • 21 • 2
lucadiliello/newsqa • Updated Jun 6, 2023 • 78.4k • 1.15k • 9
lucadiliello/bioasqqa • Updated Jun 6, 2023 • 1.5k • 14
lucadiliello/duorc.paraphrasercqa • Updated Jun 6, 2023 • 1.5k • 12
lucadiliello/naturalquestionsshortqa • Updated Jun 6, 2023 • 117k • 52 • 3
lucadiliello/dropqa • Updated Jun 6, 2023 • 1.5k • 16 • 3
lucadiliello/searchqa • Updated Jun 6, 2023 • 134k • 166 • 1
lucadiliello/raceqa • Updated Jun 6, 2023 • 674 • 15
lucadiliello/relationextractionqa • Updated Jun 6, 2023 • 2.95k • 18 • 4
lucadiliello/textbookqa • Updated Jun 6, 2023 • 1.5k • 40 • 3
lucadiliello/squadqa • Updated Jun 6, 2023 • 97.1k • 12
lucadiliello/triviaqa • Updated Jun 6, 2023 • 69.5k • 249 • 6
lucadiliello/wikiqa_grouped • Updated May 30, 2023 • 3.41k • 8
lucadiliello/squad_as2 • Updated May 20, 2023 • 496k • 10
lucadiliello/wikipedia_512_pretraining • Updated Mar 24, 2023 • 6.9M • 126 • 14
lucadiliello/trecqa • Updated Dec 5, 2022 • 65k • 35
lucadiliello/wikiqa • Updated Dec 5, 2022 • 32.7k • 13
lucadiliello/asnq • Updated Dec 5, 2022 • 21.3M • 14
lucadiliello/bookcorpusopen • Updated Dec 4, 2022 • 17.9k • 1.11k • 8
lucadiliello/english_wikipedia • Updated Dec 4, 2022 • 4.18M • 536 • 7
lucadiliello/news_as2 • Updated Nov 29, 2022 • 1.94M • 8
lucadiliello/search_as2 • Updated Nov 29, 2022 • 3.76M • 11
lucadiliello/trivia_as2 • Updated Nov 29, 2022 • 2.08M • 10
lucadiliello/hotpot_as2 • Updated Nov 29, 2022 • 539k • 8
lucadiliello/mnli • Updated Nov 10, 2022 • 432k • 17