Confused about the eval score

#15
by Denisssy - opened

I was curious about whether this model or 235b-2507 is better in agentic tool use, and found that 235b-2507 got a higher score in BFCL-v3 from its model card.
But I found that some data are inconsistent between 235b-2507's model card and this one.

image.png
image.png

seems deepseek-v3-0324's acc should be 64.7 from BFCL leaderboard.

Qwen org

Sorry, that was our mistake β€” the correct results have now been updated. @Denisssy

What are the best settings for coding, in my tests I found temperature 0.7 to be too high and cause it to not respect the requirements very well. 0.3 is better for coding in my tests..

huybery changed discussion status to closed

Sign up or log in to comment