FINAL-Bench/Darwin-28B-Opus · 3-stage adaptive evaluation comparison with Qwen3.6-27B?

by SkyMind - opened 5 days ago

It would be useful to compare the 3-stage adaptive evaluation results with those for Qwen3.6-27B under the same protocol.

SeaWolf-AI

FINAL_Bench org 5 days ago

Thank you for the suggestion. We'd like to note two points.
First, Darwin V7 transparently reports both the standard greedy result (74.7%) and the 2-Pass result (86.9%) side by side, so the community can assess each stage independently.
Second, a fair comparison requires equal disclosure from all sides. To our knowledge, Qwen3.6-27B's reported GPQA Diamond score (87.8%) does not come with detailed evaluation conditions such as temperature, sampling strategy, or number of attempts. We believe standardized and transparent evaluation protocols would benefit everyone, and we are happy to participate in any such effort.
