The LAMBADA dataset: Word prediction requiring a broad discourse context Paper • 1606.06031 • Published Jun 20, 2016
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks Paper • 2406.18403 • Published Jun 26, 2024
From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions Paper • 2502.13791 • Published Feb 19, 2025
GROOViST: A Metric for Grounding Objects in Visual Storytelling Paper • 2310.17770 • Published Oct 26, 2023
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition Paper • 2407.04559 • Published Jul 5, 2024