Which GPQA subset is used?
#10, opened by maybe10086
Hello! While reading the Qwen2 technical report, I noticed that Qwen2 achieved excellent results on the GPQA benchmark. I'm wondering which specific subset of GPQA was used in the evaluation? Was it diamond, main, extended, or experts? Since the GPQA dataset has different subsets, knowing exactly which one was used would help us better understand the model's capabilities and make fair comparisons.