3-stage adaptive evaluation comparison with Qwen3.6-27B?
It would be useful to compare the 3-stage adaptive evaluation results with those for Qwen3.6-27B under the same protocol.
Thank you for the suggestion. We'd like to note two points.
First, Darwin V7 transparently reports both the standard greedy result (74.7%) and the 2-Pass result (86.9%) side by side, so the community can assess each stage independently.
Second, a fair comparison requires equal disclosure from all sides. To our knowledge, Qwen3.6-27B's reported GPQA Diamond score (87.8%) does not come with detailed evaluation conditions such as temperature, sampling strategy, or number of attempts. We believe standardized and transparent evaluation protocols would benefit everyone, and we are happy to participate in any such effort.
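To make "equal disclosure" concrete, here is a minimal sketch of how the conditions behind a reported score could be recorded and checked before two numbers are compared. The `EvalConditions` class, its field names, and the `comparable` helper are hypothetical illustrations, not part of any existing evaluation harness, and the example values are placeholders rather than the actual settings behind any reported score.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class EvalConditions:
    """Hypothetical record of the conditions behind one reported score.

    A field left as None means the condition was not disclosed.
    """
    temperature: Optional[float] = None
    sampling_strategy: Optional[str] = None
    num_attempts: Optional[int] = None

    def undisclosed(self) -> List[str]:
        # Names of every condition that was not reported.
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def comparable(a: EvalConditions, b: EvalConditions) -> bool:
    # Two scores are treated as directly comparable only when both sides
    # disclose every condition and the disclosed conditions match.
    return not a.undisclosed() and not b.undisclosed() and a == b


# Illustrative values only (assumed, not taken from any model card).
fully_disclosed = EvalConditions(
    temperature=0.0, sampling_strategy="greedy", num_attempts=1
)
nothing_disclosed = EvalConditions()

print(nothing_disclosed.undisclosed())
print(comparable(fully_disclosed, nothing_disclosed))
```

With a record like this attached to each score, a missing field shows up explicitly instead of being silently assumed to match.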