Intelligence Per Watt: A Study of Local Intelligence Efficiency

Community Article Published November 12, 2025

By: Jon Saad-Falcon*, Avanika Narayan*, Azalia Mirhoseini, Chris Ré (Stanford University)


AI demand is growing exponentially, creating unprecedented pressure on data center infrastructure. While data centers dominate AI workloads due to superior compute density and efficiency, they face scaling constraints: years-long construction timelines, massive capital requirements, and energy grid limitations.

History suggests an alternative path forward. From 1946 to 2009, computing efficiency (performance per watt) doubled every 1.5 years, enabling a redistribution of computing workloads from data center mainframes to personal computers (PCs). Critically, this transition did not occur because PCs surpassed mainframes in raw performance; rather, efficiency improvements made computing capable of meeting end-user needs within the power constraints of personal devices.

We are at a similar inflection point. Local language models (LMs), with ≤20B active parameters, are surprisingly capable, and local accelerators (e.g., the M4 Max with 128GB unified memory) run these LMs at interactive latencies. Just as compute efficiency defined the transition to personal computing, we propose that intelligence efficiency defines the transition to local inference. To this end, we introduce intelligence per watt (IPW): task accuracy per unit of power. IPW is a unified metric for intelligence efficiency, capturing both the intelligence delivered (capability) and the power required (efficiency).
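As a concrete illustration, here is a minimal sketch of how IPW could be computed from benchmark results and power measurements. The function name and power-sampling interface are hypothetical placeholders, not the released profiling harness; the calculation simply follows the framing above of task accuracy per unit of power.

```python
from statistics import mean

def intelligence_per_watt(correct: list[bool], power_samples_w: list[float]) -> float:
    """Compute IPW = task accuracy / average power draw (watts).

    correct         -- per-query correctness from an accuracy benchmark
    power_samples_w -- accelerator power readings (watts) sampled during inference
    """
    accuracy = sum(correct) / len(correct)  # fraction of queries answered correctly
    avg_power_w = mean(power_samples_w)     # average power over the run
    return accuracy / avg_power_w           # intelligence delivered per watt

# Hypothetical example: 88.7% accuracy at an average draw of 60 W vs. 700 W
print(intelligence_per_watt([True] * 887 + [False] * 113, [60.0]))   # ~0.0148
print(intelligence_per_watt([True] * 887 + [False] * 113, [700.0]))  # ~0.0013
```

Holding accuracy fixed, a lower-power accelerator yields a higher IPW; holding power fixed, a more capable model does the same.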

We evaluate the current state and trajectory of local inference efficiency through two questions: (1) Are local LMs capable of accurately servicing a meaningful portion of today’s workloads? and (2) Given local power budgets, how efficiently do local accelerators convert watts into intelligence? We conduct a large-scale study across real-world single-turn chat and reasoning tasks, revealing:

  • Local LMs are capable and improving rapidly: Local LMs can accurately respond to 88.7% of single-turn chat and reasoning queries, with accuracy improving 3.1× from 2023 to 2025.
  • Local accelerator efficiency has room for improvement: Running Qwen3-32B on an M4 Max yields 1.5× lower intelligence per watt than running the same local LM on an enterprise-grade accelerator (an NVIDIA B200).
  • Local intelligence efficiency improved 5.3× from 2023 to 2025: 3.1× from model improvements (advances in model architectures, pretraining, post-training, and distillation) and 1.7× from accelerator improvements (see the worked example below).
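
Because IPW is accuracy divided by power, the model and accelerator gains compose multiplicatively. A quick check of the figures above (treating the two factors as independent, which is how the decomposition is presented):

```python
model_gain = 3.1        # accuracy improvement of local LMs, 2023 -> 2025
accelerator_gain = 1.7  # efficiency improvement of local accelerators, 2023 -> 2025

# IPW = accuracy / power, so the two gains multiply:
print(model_gain * accelerator_gain)  # 5.27, consistent with the reported ~5.3x
```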

Call to Action: As AI demand grows exponentially, we must find better approaches to turn energy into intelligence. We envision a world where intelligence is everywhere and in every device: in earbuds providing instant translation, in glasses offering real-time visual assistance, and in pocket assistants solving graduate-level problems. By prioritizing intelligence per watt in both model design and hardware acceleration, we can bring powerful AI to billions of edge devices and make intelligence truly ubiquitous. We release a profiling harness to enable systematic benchmarking of intelligence per watt as local LMs and accelerators evolve.
