Unordered questions and remarks
This sentence is repeated (in near-identical form) in the abstract and in the first paragraph; maybe we can remove it from one of them?
LLMs are increasingly used to assist with scientific research
Large language models are increasingly used for scientific research
In our prompt, the model is free to reason, but it is asked to output a structured response containing:
The output format asked of the LLM is quite strict; how did you converge on that? Is it to be able to extract the relevant information for the rest of the analysis?
I'd be curious to know the results for small/old models; maybe they'd be surprisingly high?
Just to confirm my intuition: it would be super hard to determine the best achievable score for this benchmark, right? This would require hand-crafting the best possible strategy, and even in the simplified case of "Static set property" rules only, I have no idea how one would do it.
To estimate excessive caution, we can compute how many turns it waits while having the correct tentative rule, before actually guessing it. This relates to how many points a model loses by waiting too long to commit.
I’m not fully convinced that “having the correct tentative rule but not guessing it yet” should always count as excessive caution. Being the most likely rule doesn’t necessarily mean it’s likely enough to commit to, especially early on when evidence is scarce.
For example, if the true rule is “only red cards” and the model plays one red card, “only red cards” might become its top hypothesis immediately — but it would still be rational to test a few more cards before guessing, because many other rules would remain almost as plausible from just one observation.
So I agree this metric can be useful for comparing models, but I don’t think the ideal value is 0: even an optimal model should sometimes “wait” while the correct rule is already its best current guess.
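Side note, just to check that I read the metric right: here is a minimal sketch of the computation as I understand it, over a made-up per-round log (the field names tentative_rule and guessed are hypothetical, not the paper's):

```python
# Minimal sketch of the "excessive caution" metric discussed above.
# The per-turn log format is invented for illustration.

def turns_waited_with_correct_rule(turns, true_rule):
    """Count the turns where the model's tentative rule was already
    correct but it chose not to commit to a guess."""
    waited = 0
    for turn in turns:
        if turn["tentative_rule"] == true_rule:
            if turn["guessed"]:
                break  # the model finally committed
            waited += 1
    return waited

# The model holds the correct rule from turn 2 but only guesses at turn 4,
# so it "wasted" 2 turns (and hence 2 points).
log = [
    {"tentative_rule": "only even cards", "guessed": False},
    {"tentative_rule": "only red cards", "guessed": False},
    {"tentative_rule": "only red cards", "guessed": False},
    {"tentative_rule": "only red cards", "guessed": True},
]
print(turns_waited_with_correct_rule(log, "only red cards"))  # -> 2
```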
How should we interpret these values? A failed guess costs 2 points, while each turn of delay costs 1 point, so the optimal number of failed guesses per round should be around 0.5 (one failed guess every two rounds) to balance both sources of loss.
Maybe I'm not thinking about this correctly, but it sounds more complicated than that. I feel like it has to depend on the expected quantity of information gained by waiting vs. guessing. I think we can imagine a simplified version of this game where the optimal strategy leads to a different number of expected failed guesses per round, though I don't have a concrete example right now.
Every turn, even when they don’t choose to guess
Am I right to say that a rational model would never evaluate its confidence above 67% without guessing? In other words, if the model thinks the most likely rule is correct with more than 67% confidence, it should guess immediately, because the expected gain from guessing would be higher than that from waiting.
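For reference, the back-of-the-envelope arithmetic behind my 67%, under two assumptions of mine (a wrong guess costs the 2-point penalty on top of the 1-point turn that elapses anyway, and the information gained by waiting is negligible):

```python
# Toy comparison of expected costs; the scoring assumptions above are mine,
# not necessarily the exact rules of the benchmark.

def expected_cost_of_guessing(p):
    # Fail with probability 1 - p: pay the 2-point penalty
    # plus the 1-point turn that is consumed anyway.
    return (1 - p) * (2 + 1)

COST_OF_WAITING = 1  # one turn of delay

for p in [0.5, 0.6, 2 / 3, 0.7, 0.8]:
    guess = expected_cost_of_guessing(p)
    better = "guess" if guess < COST_OF_WAITING else "wait"
    print(f"confidence {p:.2f}: guessing costs {guess:.2f} vs waiting 1.00 -> {better}")
# Break-even is at p = 2/3: above ~67% confidence, guessing dominates.
```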
while Claude Haiku 4.5 often guesses even at confidence level 7.
So technically it's Claude Haiku 4.5 which is the most rational model on this aspect, right?
Figure 8
Pretty cool interactive figure, but it's hard to interpret the results. I don't have any ideas for making it more interpretable :(
"the output asked to the llm is quite strict, how did you converge to that? Is it for being able to extract the relevant information for the rest of the analysis?"
Yes, that's right: I wanted to be able to do extended analyses (calibration, confidence, etc.). For a strict eval, it could just return the card to play and, optionally, a rule if it wants to guess.
"Just to confirm my intution: it would be super hard to have the best possible acheivable score for this benchmark right? This would requires hand crafted the best possible strategy, which, even in the simplified case of "Static set property" only i've no idea how one would do it."
Yes, that's right: the maximum score is impossible to reach! The model would have to guess every rule on the first turn 😀 Some rules are also more or less hard depending on the shuffling of the deck, so it is hard to say what the true maximum is. But with the confidence analysis and guessing strategy, we can at least say something like "with identical reasoning abilities, you could have played this better".
[about caution] "So I agree this metric can be useful for comparing models, but I don’t think the ideal value is 0:"
You're right: the ideal value depends on the tradeoff between the risk of losing points because of a wrong guess and the "lost turn" from not guessing. In the "no-stakes" case, when guessing is free, the ideal value is indeed 0, but as soon as there is a penalty for a wrong guess, the ideal value is not zero.
"Maybe I'm thinking not correctly, but it sounds more complicated than that. I feel like it has to depend on the expeced quantity of information gained by waiting vs guessing. I think we can imagine a simplified version of this game where the optimal strategy leads to a different number of expected failed guesses per round. Though I don't have a concrete example right now."
Yes, I believe this reasoning assumes that the expected amount of extra information is small relative to the total information required for a good guess (which is equivalent to saying "it typically takes a large number of turns to win"). For a game where you could get all the necessary information in one turn, that would be wrong. Take a game like "you have 5 boxes, one contains a diamond, you win if you guess which one, and you may open one box every turn": there you would receive all the relevant information at once (see the quick simulation below).
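To make that concrete, here is a quick simulation of the boxes game under a toy scoring I made up for this reply (1 point per elapsed turn, 2 points per failed guess, a failed guess also reveals that the guessed box is empty, and the winning guess is free):

```python
import random

# Toy simulation of the "5 boxes, one diamond" game, under the assumed
# scoring described above. Both strategies explore boxes in a random order.

def simulate(strategy, n_boxes=5, trials=200_000):
    total_cost = total_failed = 0
    for _ in range(trials):
        # Position of the diamond in the (random) order we explore boxes.
        pos = random.randrange(n_boxes)
        if strategy == "open_then_guess":
            # Open boxes until the diamond appears (the last box can be
            # inferred), then guess with certainty: no failed guesses.
            turns = min(pos + 1, n_boxes - 1)
            fails = 0
        else:  # "guess_every_turn": guess an unexplored box each turn
            fails = pos  # every box before the diamond is a failed guess
            turns = pos  # each failure also consumes a turn
        total_cost += turns + 2 * fails
        total_failed += fails
    return total_cost / trials, total_failed / trials

for s in ("open_then_guess", "guess_every_turn"):
    cost, fails = simulate(s)
    print(f"{s}: avg cost {cost:.2f}, avg failed guesses {fails:.2f}")
# open_then_guess wins (~2.8 vs ~6.0) with 0 failed guesses per round,
# so the optimal failure rate in this game is 0, not 0.5.
```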
"Am I right to say that a rational model would never evaluate its confidence above 67% without guessing? In other words, if the model thinks the most likely rule is correct with more than 67% confidence, it should guess immediately, because the expected gain from guessing would be higher than waiting."
Yes, a rational model should do this if it is convinced that it is well calibrated. If you somehow know that you are over-confident, then it makes sense to compensate and require a higher confidence threshold before guessing (which is what Gemini seems to be doing, for instance).
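As a sketch of what "compensating" means here (the calibration function below is entirely made up; with real data you would estimate it from the confidence-vs-accuracy curve):

```python
# Adjusting the guessing threshold for a known over-confidence.
# calibrated() is an invented example mapping stated confidence
# to the model's actual accuracy.

def calibrated(stated):
    return 0.8 * stated  # e.g. "90% confident" is only right 72% of the time

THRESHOLD = 2 / 3  # rational threshold on the *true* probability (see above)

for stated in [0.7, 0.8, 0.9, 0.95]:
    decision = "guess" if calibrated(stated) > THRESHOLD else "wait"
    print(f"stated {stated:.2f} -> calibrated {calibrated(stated):.2f}: {decision}")
# An over-confident model should wait until a *stated* confidence of about
# 0.83 (= (2/3) / 0.8), which is exactly what a higher threshold amounts to.
```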
"Pretty cool intereactive figure, but it's hard to interpret the results. I don't have any idea to make it more interpretable :("
Me neither! I thought it was nice to see the distribution of results, but I don't necessarily have a clearer message than "harder rules are harder" 😃