arXiv:2506.10911

NoLoCo: No-all-reduce Low Communication Training Method for Large Models

Published on Jun 12 · Submitted by benfielding on Jun 13

Abstract

NoLoCo is a novel optimization method that eliminates explicit parameter synchronization and reduces communication overhead during the training of large language models, achieving faster convergence rates and reduced idling time compared to existing methods.

AI-generated summary

Training large language models is generally done via optimization methods on clusters containing tens of thousands of accelerators, communicating over a high-bandwidth interconnect. Scaling up these clusters is expensive and can become impractical, imposing limits on the size of models that can be trained. Several recent studies have proposed training methods that are less communication intensive, avoiding the need for a highly connected compute cluster. These state-of-the-art low-communication training methods still employ a synchronization step for model parameters, which, when performed over all model replicas, can become costly on a low-bandwidth network. In this work, we propose a novel optimization method, NoLoCo, that does not explicitly synchronize all model parameters during training and, as a result, does not require any collective communication. NoLoCo implicitly synchronizes model weights via a novel variant of the Nesterov momentum optimizer, partially averaging each replica's weights with those of a randomly selected other replica. We provide both a theoretical convergence analysis for our proposed optimizer and empirical results from language model training. We benchmark NoLoCo on a wide range of accelerator counts and model sizes, from 125M to 6.8B parameters. Our method requires significantly less communication overhead than fully sharded data parallel training or even the widely used low-communication training method DiLoCo. The synchronization step itself is estimated to be an order of magnitude faster than the all-reduce used in DiLoCo when a few hundred accelerators train over the internet. We also have no global blocking communication, which reduces accelerator idling time. Compared to DiLoCo, we also observe up to a 4% faster convergence rate across a wide range of model sizes and accelerator counts.
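
To make the core update concrete, here is a minimal sketch of one worker's outer synchronization step, written in Python with NumPy. The function name `outer_step`, the mixing coefficient `alpha`, the momentum coefficient, and the exact extrapolation formula are illustrative assumptions, not the paper's update rule; the sketch only shows the general shape of "partially average with one random peer, then apply a Nesterov-style momentum step".

```python
import numpy as np

def outer_step(w, w_prev, peer_w, alpha=0.5, momentum=0.9):
    # Gossip-style partial average with a single randomly selected peer:
    # a point-to-point exchange, not a collective over all replicas.
    mixed = (1.0 - alpha) * w + alpha * peer_w
    # Nesterov-momentum-flavored extrapolation over the mixed weights.
    # Schematic stand-in for the paper's modified Nesterov optimizer.
    new_w = mixed + momentum * (mixed - w_prev)
    return new_w, mixed  # `mixed` becomes w_prev for the next outer step

# Toy usage on 3-dimensional weight vectors (hypothetical values):
w_prev = np.zeros(3)
w = np.array([1.0, 2.0, 3.0])
peer_w = np.array([3.0, 2.0, 1.0])
w, w_prev = outer_step(w, w_prev, peer_w)
print(w)
```

Because each outer step only exchanges data with one peer, the per-step communication cost does not grow with the total number of replicas, unlike an all-reduce.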

Community

Paper submitter

NoLoCo trains large models over heterogeneous gossip networks rather than in high-bandwidth datacentres. It reduces synchronisation latency by 10x compared to state-of-the-art methods while converging up to 4% faster to the same validation loss.

Why it matters
Current methods require heavy inter-node communication. This demands high-bandwidth datacentre interconnects; otherwise, nodes sit idle. Prior work has reduced communication but still requires all nodes to synchronize.

How it Works
NoLoCo uses a gossip approach that achieves faster convergence while eliminating the all-reduce step. It uses a novel variant of Nesterov momentum that partially averages model weights in random pairs, and it dynamically routes pipeline shards to improve convergence. A sketch of the resulting communication pattern follows below.
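
To illustrate that communication pattern, here is a small toy simulation in Python. The quadratic loss, the fixed mixing coefficient `alpha`, the inner learning rate, and the random disjoint pairing each round are assumptions made for illustration only. The point it demonstrates is that every outer round exchanges weights only within pairs of workers, so no step requires a global collective such as all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, rounds, alpha = 8, 4, 50, 0.5
target = rng.normal(size=dim)                  # shared optimum of a toy quadratic loss
weights = rng.normal(size=(num_workers, dim))  # each worker starts from different weights

for _ in range(rounds):
    # Inner steps: each worker trains locally (gradient of 0.5 * ||w - target||^2).
    weights -= 0.2 * (weights - target)

    # Outer step: workers are matched into random disjoint pairs and each pair
    # partially averages its weights. Only point-to-point exchanges, no all-reduce.
    order = rng.permutation(num_workers)
    for a, b in zip(order[0::2], order[1::2]):
        wa, wb = weights[a].copy(), weights[b].copy()
        weights[a] = (1 - alpha) * wa + alpha * wb
        weights[b] = (1 - alpha) * wb + alpha * wa

# Replicas stay close to one another without any global synchronization step.
print("max spread across workers:", np.std(weights, axis=0).max())
```

In a real deployment each worker would hold a pipeline shard and exchange weights over the network; the toy keeps all "workers" in one process purely to show the pairing scheme.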

You can read more about it in our article and the arXiv paper, and re-run our experiments with the open-source code.
