Which GPQA subset is used?

#10
by maybe10086 - opened

Hello! While reading the Qwen2 technical report, I noticed that Qwen2 achieved excellent results on the GPQA benchmark. Which specific subset of GPQA was used in the evaluation: diamond, main, extended, or experts? Since the GPQA dataset has several subsets, knowing exactly which one was used would help us better understand the model's capabilities and make fair comparisons.
