We're thrilled to release Darwin-9B-NEG, a 9B-parameter reasoning model that embeds an architecturally internalised sense of self-confidence directly into the transformer via our proprietary Native Entropy Gating (NEG) technology.
With only 9 billion parameters and 1× inference cost, Pure NEG jumps +12.63 %p over the same model without NEG. Going all-in with ensemble refinement pushes it to 84.34 %, surpassing the published Qwen3.5-9B leaderboard score (81.7 %) by +2.64 %p.
What makes NEG different from Multi-Turn Iteration (MTI)?
Classical MTI needs 3-8× extra inference passes. NEG instead lives INSIDE the single decoding loop. Two tiny modules ride with the transformer: NEG-Head predicts per-token entropy from the last hidden state, and NEG-Gate conditionally restricts the top-k choice when confidence is low. The gate activates in only 4.36 % of tokens, so it's essentially free at inference time.
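The NEG modules themselves are proprietary, but the gating idea can be sketched generically: measure the entropy of the next-token distribution and, when it crosses a threshold (low confidence), truncate to the top-k tokens before sampling. The function names and the `tau`/`k` values below are illustrative assumptions, not Darwin-9B-NEG's actual implementation.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def gated_next_token_dist(logits, tau=2.0, k=5):
    """Hypothetical entropy-gated decoding step (sketch, not NEG itself).

    If the entropy of the softmax distribution exceeds `tau`, the gate
    fires: sampling is restricted to the top-k tokens and renormalised.
    Otherwise the distribution is left untouched.
    """
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    if entropy(p) > tau:              # gate fires: model is uncertain
        top = np.argsort(p)[-k:]      # indices of the k likeliest tokens
        masked = np.zeros_like(p)
        masked[top] = p[top]
        p = masked / masked.sum()     # renormalise over the top-k
    return p
```

In a real decoder this check would run once per generated token, which matches the post's claim that the gate only changes behaviour on the small fraction of tokens where confidence is low.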
Key differentiators
• Architecturally internalised: the model file *is* the feature
• 1× inference cost (vs. 3-8× for MTI)
• Drop-in with vLLM / SGLang / TGI / transformers, no extra engine
• +12.63 %p reasoning gain at zero latency overhead
• Single-file deployment, Apache 2.0 licensed
Earlier this month, Apple introduced Simple Self-Distillation: a fine-tuning method that improves models on coding tasks just by sampling from the model and training on its own outputs with plain cross-entropy.
And… it's already supported in TRL, built by Kashif Rasul. You can really feel the pace of development in the team.
Paper by Ruixiang ZHANG, He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, and Yizhe Zhang at Apple.
How it works: the model generates completions at a training-time temperature (T_train) with top_k/top_p truncation, then fine-tunes on them with plain cross-entropy. No labels or verifier needed.
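The loop can be illustrated on a toy categorical "model" instead of an LLM: sample at T_train with top-k truncation, then "fine-tune" with cross-entropy on those samples, which for a categorical distribution simply means matching the empirical sample frequencies. All names and hyperparameter values here are illustrative, not the paper's.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def self_distill_step(logits, t_train=0.8, top_k=3, n_samples=10_000, seed=0):
    """Toy sketch of one self-distillation round (illustrative only).

    1. Sample completions from the model at temperature t_train,
       truncated to the top-k tokens.
    2. Fit the model to its own samples with plain cross-entropy;
       for a categorical, the optimum is the empirical frequencies.
    No labels or verifier are involved, as in the paper.
    """
    rng = np.random.default_rng(seed)
    p = softmax(logits / t_train)
    keep = np.argsort(p)[-top_k:]          # top-k truncation
    q = np.zeros_like(p)
    q[keep] = p[keep]
    q /= q.sum()
    samples = rng.choice(len(p), size=n_samples, p=q)
    counts = np.bincount(samples, minlength=len(p))
    return counts / counts.sum()           # cross-entropy-optimal refit
```

With a real model, step 2 would be a standard supervised fine-tuning pass (e.g. via TRL) over the sampled completions; the toy keeps only the structure of the loop.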
One neat insight from the paper: T_train and T_eval compose into an effective T_eff = T_train × T_eval, so a broad band of configs works well. Even very noisy samples still help.
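Why the temperatures multiply: if fine-tuning converges exactly to the sampling distribution softmax(z / T_train), then decoding from that learned distribution at T_eval rescales its log-probs by 1/T_eval, and since softmax is shift-invariant this equals softmax(z / (T_train × T_eval)). A quick numerical check of that identity (ignoring top-k/top-p truncation; the logits and temperatures are made-up values):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

z = np.array([3.0, 1.0, 0.2, -0.5])   # arbitrary toy logits
t_train, t_eval = 0.7, 1.3

# Distribution learned by training on samples drawn at T_train ...
p_train = softmax(z / t_train)
# ... then decoded at T_eval: rescale its log-probs by 1/T_eval.
p_two_step = softmax(np.log(p_train) / t_eval)
# Identical to sampling the original logits at T_eff = T_train * T_eval.
p_direct = softmax(z / (t_train * t_eval))
```

The two distributions match exactly, which is why a broad band of (T_train, T_eval) pairs with the same product behaves similarly.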