arXiv:2510.10068

Probabilistic Hyper-Graphs using Multiple Randomly Masked Autoencoders for Semi-supervised Multi-modal Multi-task Learning

Published on Oct 11, 2025

AI-generated summary

A novel model, PHG-MAE, combines masked autoencoders with probabilistic hyper-graphs to enhance multi-modal computer vision tasks through unified pre-training and fine-tuning, enabling ensembles and knowledge distillation.

Abstract

The computer vision domain has benefited greatly from an abundance of data across many modalities for improving various visual tasks. Recently, much attention has focused on self-supervised pre-training via Masked Autoencoders (MAE) [he2022masked, bachmann2022multimae], usually applied as a first step before optimizing for a downstream task such as classification or regression. This is particularly useful because it requires no manually labeled data. In this work, we introduce Probabilistic Hyper-Graphs using Masked Autoencoders (PHG-MAE): a novel model that unifies the classical work on neural graphs [leordeanu2021semi] with the modern approach of masked autoencoders under a common theoretical framework. By randomly masking entire modalities, not just patches, the model samples from the distribution of hyper-edges on each forward pass. Additionally, the model adapts the standard MAE algorithm by combining pre-training and fine-tuning into a single training loop. Moreover, our approach enables the creation of inference-time ensembles which, through aggregation, boost the final prediction performance and consistency. Lastly, we show that knowledge distillation can be applied on top of these ensembles with little loss in performance, even with models that have fewer than 1M parameters. While our work focuses mostly on outdoor UAV scenes containing multiple world interpretations and modalities, the same steps can be followed in similar domains such as autonomous driving or indoor robotics. To streamline the integration of external pre-trained experts into multi-modal multi-task learning (MTL) scenarios for computer vision, we developed a data-pipeline software tool. Using this tool, we created and released a fully automated extension of the Dronescapes dataset. All technical details, code, and reproduction steps are publicly released.
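
To make the two core mechanisms from the abstract concrete, here is a minimal PyTorch sketch of (a) masking entire modalities on each forward pass, which corresponds to sampling one hyper-edge, and (b) aggregating an inference-time ensemble over several sampled masks. All class names, dimensions, the linear encoders, and the mean-aggregation choice are illustrative assumptions, not the authors' released implementation; see the paper's public code for the real one.

```python
import torch
import torch.nn as nn


class ModalityMaskedAutoencoder(nn.Module):
    """Toy multi-modal model that randomly masks *entire modalities*.

    Each forward pass samples a binary mask over the input modalities
    (one hyper-edge from the distribution described in the abstract).
    Only the visible modalities are encoded and fused before decoding.
    """

    def __init__(self, modality_dims: dict, hidden: int = 64, out_dim: int = 1):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(dim, hidden) for name, dim in modality_dims.items()}
        )
        self.decoder = nn.Linear(hidden, out_dim)

    def forward(self, inputs: dict, keep_prob: float = 0.5):
        fused, kept = None, 0
        for name, x in inputs.items():
            # Drop or keep the whole modality, not individual patches.
            if torch.rand(()).item() < keep_prob:
                code = self.encoders[name](x)
                fused = code if fused is None else fused + code
                kept += 1
        if fused is None:  # ensure at least one modality stays visible
            name = next(iter(inputs))
            fused, kept = self.encoders[name](inputs[name]), 1
        return self.decoder(fused / kept)


@torch.no_grad()
def ensemble_predict(model: nn.Module, inputs: dict, n_samples: int = 8) -> torch.Tensor:
    """Inference-time ensemble: average predictions over sampled masks."""
    preds = torch.stack([model(inputs) for _ in range(n_samples)])
    return preds.mean(dim=0)  # mean aggregation; other reductions also work


# Hypothetical usage with two per-pixel feature modalities, flattened.
model = ModalityMaskedAutoencoder({"rgb": 3, "depth": 1})
batch = {"rgb": torch.randn(4, 3), "depth": torch.randn(4, 1)}
prediction = ensemble_predict(model, batch)  # shape: (4, 1)
```

An ensemble teacher like `ensemble_predict` could then serve as the regression target for a small (<1M parameter) student, in the spirit of the distillation step the abstract mentions; the exact distillation recipe is detailed in the paper.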
