Confused about the eval score

#15

by Denisssy - opened Jul 24

Jul 24

I was curious whether this model or 235b-2507 is better at agentic tool use, and found that 235b-2507 has the higher BFCL-v3 score according to its model card.
However, some of the numbers are inconsistent between 235b-2507's model card and this one.

Denisssy

Jul 24

It seems deepseek-v3-0324's accuracy should be 64.7, according to the BFCL leaderboard.

huybery

Qwen org Jul 24

Sorry, that was our mistake — the correct results have now been updated. @Denisssy

ciprianv

Jul 24 • edited Jul 24

What are the best settings for coding? In my tests, temperature 0.7 was too high and caused the model to not follow the requirements very well. 0.3 worked better for coding.
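
As a rough illustration of why a lower temperature might make the output stick closer to the requirements: temperature divides the logits before the softmax, so a smaller value concentrates probability on the top-ranked tokens. Below is a minimal sketch in plain Python. The function name and the logit values are hypothetical, chosen only for illustration, and are not taken from this model or any library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens (illustrative values only).
example_logits = [2.0, 1.0, 0.5]

probs_at_07 = softmax_with_temperature(example_logits, 0.7)
probs_at_03 = softmax_with_temperature(example_logits, 0.3)

# Compare how much probability the top-ranked token receives at each setting.
print("T=0.7:", [round(p, 3) for p in probs_at_07])
print("T=0.3:", [round(p, 3) for p in probs_at_03])
```

At 0.3 the top-ranked token receives a larger share of the probability than at 0.7, which would be consistent with less varied output. Whether 0.3 is the best setting for coding with this model is still an empirical question.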

huybery changed discussion status to closed Jul 28
