50 65 197

Nick Doiron

monsoon-nlp

https://mapmeld.com/plant-based-llms/

AI & ML interests

biology and multilingual models

Recent Activity

upvoted a paper 18 days ago

METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

updated a model 20 days ago

monsoon-nlp/tomatotomato-gLM2-150M-v0.1

posted an update 20 days ago

Bio LLMs train on many genomes, but can we encode differences within a species? TomatoTomato adds pangenome tokens to represent a domestic tomato and a wild tomato in one sequence 🍅 🧬 https://huggingface.co/monsoon-nlp/tomatotomato-gLM2-150M-v0.1

View all activity

Organizations

upvoted a paper 18 days ago

METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Paper • 2501.02045 • Published Jan 3 • 23

updated a model 20 days ago

monsoon-nlp/tomatotomato-gLM2-150M-v0.1

Fill-Mask • 0.2B • Updated 20 days ago • 22

posted an update 20 days ago

Post

424

published a model 20 days ago

monsoon-nlp/tomatotomato-gLM2-150M-v0.1

Fill-Mask • 0.2B • Updated 20 days ago • 22

liked 3 models 20 days ago

liked a model 22 days ago

Qwen/Qwen3Guard-Gen-8B

Text Generation • 8B • Updated 22 days ago • 5.35k • 52

liked a dataset 22 days ago

futurehouse/BixBench

Viewer • Updated 15 days ago • 205 • 1.14k • 21

upvoted a collection 22 days ago

Qwen3Guard

Collection

7 items • Updated 16 days ago • 51

liked a dataset 29 days ago

CohereLabsCommunity/afri-aya

Viewer • Updated 29 days ago • 2.47k • 987 • 8

reacted to lysandre's post with 🚀 about 1 month ago

Post

6023

We're kick-starting the process of Transformers v5, with @ArthurZ and @cyrilvallez !

v5 should be significant: we're using it as a milestone for performance optimizations, saner defaults, and a much cleaner code base worthy of 2025.

Fun fact: v4.0.0-rc-1 came out on Nov 19, 2020, nearly five years ago!

6 replies

liked a model about 1 month ago

Qwen/Qwen3-Next-80B-A3B-Thinking

Text Generation • 81B • Updated Sep 15 • 1.44M • • 427

liked a dataset about 1 month ago

manufernandezbur/MoBiPlant

Viewer • Updated Jun 10 • 1.64k • 45 • 4

liked a model about 1 month ago

jhu-clsp/mmBERT-base

Fill-Mask • Updated 9 days ago • 63k • • 145

upvoted an article about 1 month ago

Article

mmBERT: ModernBERT goes Multilingual

Sep 9

• 112

reacted to tomaarsen's post with ❤️ about 1 month ago

Post

5420

ModernBERT goes MULTILINGUAL! One of the most requested models I've seen, The Johns Hopkins University's CLSP has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT.

Model details:
- 2 model sizes:
- jhu-clsp/mmBERT-small
- jhu-clsp/mmBERT-base
- Uses the ModernBERT architecture, but with the Gemma2 multilingual tokenizer (so: flash attention, alternating global/local attention, unpadding/sequence packing, etc.)
- Maximum sequence length of 8192 tokens, on the high end for encoders
- Trained on 1833 languages using DCLM, FineWeb2, and many more sources
- 3 training phases: 2.3T tokens pretraining on 60 languages, 600B tokens mid-training on 110 languages, and 100B tokens decay training on all 1833 languages.
- Both models are MIT Licensed, and the full datasets and intermediary checkpoints are also publicly released

Evaluation details:
- Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)
- Consistently outperforms equivalently sized models on all Multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning)
- In short: beats commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc.
- Additionally: the ModernBERT-based mmBERT is much faster than the alternatives due to its architectural benefits. Easily up to 2x throughput in common scenarios.

Check out the full blogpost with more details. It's super dense & gets straight to the point: https://huggingface.co/blog/mmbert

Based on these results, mmBERT should be the new go-to multilingual encoder base models at 300M and below. Do note that the mmBERT models are "base" models, i.e. they're currently only trained to perform Mask Filling. They'll need to be finetuned for downstream tasks like semantic search, classification, clustering, etc.

upvoted a collection 2 months ago

Deep Ignorance

Collection

This collection contains the model and data artifacts from O'Brien et al. (2025). https://deepignorance.ai • 33 items • Updated 13 days ago • 6

reacted to meg's post with 👍 2 months ago

Post

2924

New work from my socially-minded colleagues at Hugging Face, creating some foundations for AI companionship behavior evaluation.
Evaluation Dataset: AI-companionship/INTIMA
Paper: AI-companionship/INTIMA
Work from @giadap , @frimelle , @yjernite .

2 replies

liked a dataset 2 months ago

AI-companionship/INTIMA

Viewer • Updated Aug 29 • 380 • 185 • 21

Nick Doiron

AI & ML interests

Recent Activity

Organizations

monsoon-nlp's activity

mmBERT: ModernBERT goes Multilingual