Papers
arxiv:2507.08333

Token-based Audio Inpainting via Discrete Diffusion

Published on Jul 11
Submitted by TaliDror on Jul 16

Abstract

A discrete diffusion model for audio inpainting using tokenized audio representations achieves competitive performance for reconstructing long gaps in corrupted audio recordings.

AI-generated summary

Audio inpainting refers to the task of reconstructing missing segments in corrupted audio recordings. While prior approaches, including waveform- and spectrogram-based diffusion models, have shown promising results for short gaps, they often degrade in quality when gaps exceed 100 milliseconds (ms). In this work, we introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations produced by a pre-trained audio tokenizer. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio. We evaluate the method on the MusicNet dataset using both objective and perceptual metrics across gap durations of up to 300 ms, and further evaluate it on the MTG dataset with gaps extended to 500 ms. Experimental results demonstrate that our method achieves performance competitive with or superior to existing baselines, particularly for longer gaps, offering a robust solution for restoring degraded musical recordings. Audio examples of our proposed method can be found at https://iftach21.github.io/
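
The page does not include code, so the following is only a minimal sketch of what discrete-diffusion inpainting over audio tokens could look like, assuming a mask-based (absorbing-state) formulation with confidence-based unmasking. The codebook size, the `denoiser` interface, and the unmasking schedule are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: absorbing-state discrete diffusion inpainting over audio tokens.
# All names and hyperparameters here are hypothetical placeholders.
import torch

VOCAB_SIZE = 1024      # assumed codebook size of the pre-trained audio tokenizer
MASK_ID = VOCAB_SIZE   # extra "absorbing" token id used only during diffusion
NUM_STEPS = 16         # number of reverse-diffusion steps


def inpaint_tokens(denoiser, tokens, gap_mask, num_steps=NUM_STEPS):
    """Fill the token positions marked by gap_mask.

    denoiser: model mapping (B, T) token ids -> (B, T, VOCAB_SIZE) logits
    tokens:   (B, T) token ids from the audio tokenizer (gap positions arbitrary)
    gap_mask: (B, T) bool, True where the audio is missing
    """
    x = tokens.clone()
    x[gap_mask] = MASK_ID          # start with the gap fully masked
    unknown = gap_mask.clone()     # positions still to be filled

    for step in range(num_steps, 0, -1):
        logits = denoiser(x)                       # predict all tokens jointly
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # per-position confidence and guess

        # Context tokens stay fixed; push their confidence below any gap position.
        conf = conf.masked_fill(~unknown, -1.0)

        # Confidence-based unmasking: commit the most confident fraction this step
        # and re-mask the rest (one common sampler choice; the paper may differ).
        frac_still_masked = (step - 1) / num_steps
        num_unknown = unknown.sum(dim=-1, keepdim=True)
        keep_masked = (num_unknown.float() * frac_still_masked).long()

        for b in range(x.size(0)):
            order = conf[b].argsort(descending=True)
            commit = order[: int(num_unknown[b] - keep_masked[b])]
            commit = commit[unknown[b, commit]]    # commit only still-unknown positions
            x[b, commit] = pred[b, commit]
            unknown[b, commit] = False

    return x
```

In a full pipeline, `tokens` would come from encoding the corrupted recording with the pre-trained audio tokenizer, and the completed token sequence would be decoded back to a waveform with the same tokenizer's decoder; the sketch above only covers the sampling loop in the discrete latent space.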

