Collections including paper arxiv:2409.16191

- CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation
  Paper • 2401.01275 • Published • 1
- Introducing v0.5 of the AI Safety Benchmark from MLCommons
  Paper • 2404.12241 • Published • 12
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
  Paper • 2405.01535 • Published • 124
- Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
  Paper • 2406.12624 • Published • 38

- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983 • Published • 226
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  Paper • 2311.16502 • Published • 35
- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 27
- RULER: What's the Real Context Size of Your Long-Context Language Models?
  Paper • 2404.06654 • Published • 39

- Compression Represents Intelligence Linearly
  Paper • 2404.09937 • Published • 29
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
  Paper • 2404.06395 • Published • 23
- Long-context LLMs Struggle with Long In-context Learning
  Paper • 2404.02060 • Published • 38
- Are large language models superhuman chemists?
  Paper • 2404.01475 • Published • 19