MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts • Paper • 2407.21770 • Published Jul 31, 2024
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel • Paper • 2304.11277 • Published Apr 21, 2023
Wukong: Towards a Scaling Law for Large-Scale Recommendation • Paper • 2403.02545 • Published Mar 4, 2024
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large-Scale Recommendation • Paper • 2403.00877 • Published Mar 1, 2024