SAO fine-tuning for modern beat generation

As a music and AI lover, I wanted to dive into music generation technologies.


First, I explored existing models for music generation such as Suno and Stable Audio 2.0, but I couldn't find any that generated modern trap/rap/R&B beats well. So I got this idea: fine-tune an open-source model on a large collection of trap beats. I chose Stable Audio Open 1.0, as I found it to be the most suitable open-source foundation for this kind of task.

Results

Here is the GitHub repository for model inference.
All the following results were generated with 200 steps, a CFG scale of 7, seconds start set to 0 s, and duration set to 47 s.
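For reference, a minimal inference sketch using the standard stable-audio-tools API with these settings (the checkpoint ID below points to the base Stable Audio Open 1.0 model; swap in the fine-tuned StableBeaT weights from the repository above):

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the base model; replace with the fine-tuned StableBeaT checkpoint to reproduce the results below
model, model_config = get_pretrained_model("stabilityai/stable-audio-open-1.0")
model = model.to(device)

# Text prompt plus timing conditioning: start at 0 s, generate 47 s
conditioning = [{
    "prompt": "A dark and melancholic cloud trap beat, with nostalgic piano, plucked bass and synth bells, at 110 BPM.",
    "seconds_start": 0,
    "seconds_total": 47,
}]

output = generate_diffusion_cond(
    model,
    steps=200,               # 200 diffusion steps
    cfg_scale=7,             # CFG scale of 7
    conditioning=conditioning,
    sample_size=model_config["sample_size"],
    sampler_type="dpmpp-3m-sde",
    device=device,
)

# Collapse the batch, peak-normalize, and save as 16-bit WAV
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, model_config["sample_rate"])
```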


Prompt 1

A dark and melancholic cloud trap beat, with nostalgic piano, plucked bass and synth bells, at 110 BPM.

[Audio previews: Stable Audio Open 1.0 | StableBeaT]

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
| --- | --- | --- | --- | --- | --- |
| 106.13 | 1159.43 | 0.000091 | 0.460 | 0.000073 | 0.489 |

Prompt 2

A laid back lo-fi jazz rap at 85 BPM, featuring deep sub, plucked bass, and vocal chop, with chill and jazzy relaxed moods.

[Audio previews: Stable Audio Open 1.0 | StableBeaT]

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
| --- | --- | --- | --- | --- | --- |
| 82.72 | 784.82 | 0.000030 | 0.457 | 0.000015 | 0.429 |

Prompt 3

Melancholic trap beat at 105 BPM with shimmering synth bells and deep sub bass, minor chord progressions on piano, and airy vocal pads, evoking a cinematic and emotional atmosphere.

[Audio previews: Stable Audio Open 1.0 | StableBeaT]

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
| --- | --- | --- | --- | --- | --- |
| 100.45 | 2540.28 | 0.000284 | 1.412 | 0.0000585 | 0.523 |

Prompt 4

A jazzy chillhop beat at 101 BPM featuring synth bells, vocal pad, and movie sample, evoking trap nostalgic and chill moods.

[Audio previews: Stable Audio Open 1.0 | StableBeaT]

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
| --- | --- | --- | --- | --- | --- |
| 148.02 | 4287.26 | 0.00179 | 2.963 | 0.000195 | 0.552 |

Prompt 5

Smooth and seductive at 115 BPM trap beat with electric guitar riffs, plucked bass, vocal adlibs, and warm synth pads. Relaxed, romantic, and sexy mood.

[Audio previews: Stable Audio Open 1.0 | StableBeaT]

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
| --- | --- | --- | --- | --- | --- |
| 82.72 | 1056.42 | 0.000046 | 0.645 | 0.000089 | 0.478 |

Prompt 6

A moody cloud trap beat, boomy bass, synth bells and melodic piano, evoking an ethereal mood at 100 BPM.

[Audio previews: Stable Audio Open 1.0 | StableBeaT]

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
| --- | --- | --- | --- | --- | --- |
| 144.2 | 2458.5 | 0.000356 | 0.738 | 0.00206 | 0.363 |

Prompt 7

A smooth neo-soul R&B instrumental at 90 BPM in D major, featuring live bass, soft Rhodes keys, and warm analog drum grooves.

[Audio previews: Stable Audio Open 1.0 | StableBeaT]

| BPM | Spectral Centroid | Spectral Flatness | Harmonic/Percussive Ratio | Transient Sharpness | CLAP Prompt Score |
| --- | --- | --- | --- | --- | --- |
| 130.81 | 1000.87 | 0.000166 | 0.679 | 0.000007288 | 0.250 |

Dataset

I used 20,000 trap/rap beats spanning various subgenres such as cloud, trap, R&B, EDM, industrial hip-hop, and jazzy chillhop. For each instrumental, I extracted two segments of 20 to 35 seconds, ending up with a 40k-segment dataset (about 277 hours of audio), while keeping track of each segment's starting timestamp. This allowed the model not only to learn the content of the beats but also to capture the temporal structure inherent to the musical phrases.
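A minimal sketch of this segmentation step, assuming librosa/soundfile and random offsets (extract_segments is an illustrative helper, not the actual pipeline code):

```python
import random
import librosa
import soundfile as sf

def extract_segments(path, out_prefix, n_segments=2, min_len=20, max_len=35):
    """Cut a few 20-35 s segments from a beat and keep their start offsets."""
    audio, sr = librosa.load(path, sr=44100, mono=False)
    total_seconds = audio.shape[-1] / sr
    entries = []
    for i in range(n_segments):
        dur = random.randint(min_len, max_len)
        start = random.randint(0, max(0, int(total_seconds - dur)))
        seg = audio[..., start * sr:(start + dur) * sr]
        out = f"{out_prefix}_{i}.wav"
        sf.write(out, seg.T, sr)  # soundfile expects (frames, channels)
        entries.append({"filepath": out, "start": start, "duration": dur})
    return entries
```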

A key goal of this project was to enable the model to learn new instruments (synth bells, deep sub, plucked bass, snare, ...), tempos, and rhythmic patterns that are strongly associated with trap and its subgenres. To achieve this, I tagged each segment by computing its similarity with curated lists of instruments, moods, and genres using a LAION CLAP model.
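A minimal sketch of this tagging step with the laion_clap package (the candidate list below is illustrative; the real lists were curated per category):

```python
import numpy as np
import laion_clap

# Load the pretrained LAION-CLAP checkpoint
clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()

# Illustrative candidate tags; separate curated lists existed for instruments, moods, and genres
INSTRUMENT_TAGS = ["synth bells", "deep sub", "plucked bass", "nostalgic piano", "electric guitar", "vocal chop"]

audio_emb = clap.get_audio_embedding_from_filelist(x=["39118.wav"], use_tensor=False)
text_emb = clap.get_text_embedding(INSTRUMENT_TAGS, use_tensor=False)

# Cosine similarity between the segment and every candidate tag, keep the top 3
a = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
sims = (a @ t.T)[0]
top_tags = [INSTRUMENT_TAGS[i] for i in np.argsort(-sims)[:3]]
print(top_tags)
```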

Additionally, I used the Essentia library to extract the BPM (with the deeptemp-k16-3 model) and the key/scale of each audio segment, keeping only predictions with confidence above 70%.
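A sketch of this feature extraction with Essentia, assuming the deeptemp-k16-3 TempoCNN graph and KeyExtractor's strength output as the confidence value:

```python
from essentia.standard import MonoLoader, TempoCNN, KeyExtractor

# TempoCNN models such as deeptemp-k16-3 expect 11025 Hz mono audio
audio_11k = MonoLoader(filename="39118.wav", sampleRate=11025)()
bpm, local_bpm, local_probs = TempoCNN(graphFilename="deeptemp-k16-3.pb")(audio_11k)

# KeyExtractor returns (key, scale, strength); strength acts as a confidence score
audio_44k = MonoLoader(filename="39118.wav", sampleRate=44100)()
key, scale, strength = KeyExtractor()(audio_44k)

# Keep the key/scale only when the detector is confident enough (70% threshold)
if strength < 0.7:
    key, scale = None, None
```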

{
  "39118.wav": {
    "instruments_tags": [
      "plucked guitar",
      "synth bells",
      "movie sample"
    ],
    "genres_tags": [
      "rap with soul"
    ],
    "moods_tags": [
      "trap melancholic",
      "love"
    ],
    "key": "G",
    "scale": "minor",
    "tempo": 109.0,
    "start": 63,
    "duration": 26
  }
}

I also generated some synonyms to improve the model's language variety. This combination of features (instrumentation, tempo, key, mood, and genre) provided a rich set of musical metadata.
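A tiny illustration of the kind of synonym substitution used to vary the tag wording (this mapping is made up for the example):

```python
import random

# Illustrative synonym map; the real lists were hand-picked per tag
SYNONYMS = {
    "deep sub": ["sub bass", "808 sub"],
    "melancholic": ["nostalgic", "wistful"],
    "chill": ["laid back", "relaxed"],
}

def vary(tag: str) -> str:
    """Randomly keep a tag or swap it for one of its synonyms."""
    return random.choice([tag] + SYNONYMS.get(tag, []))
```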

[Figure: mood tag frequencies]

We can observe how T5-Base encodes all of my tags, resulting in five distinct groups:
  • Emotion (e.g., cheerful, joyful, dreamy)

  • Groove (e.g., swing groove, nylon guitar, movie sample)

  • Genre (e.g., g-funk, chill rap beat, jazzy chillhop)

  • Sonority (e.g., trap vocal, trap guitar)

The clusters are very close to each other (silhouette score: 0.095), which is expected given that the model is fine-tuned on a specific musical subgenre. This proximity reflects the semantic density of the dataset: many tags are naturally related and differ only subtly.
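A minimal sketch of this analysis, assuming mean-pooled T5-Base encoder embeddings clustered with k-means (the tag list here only contains the examples above; the real run used the full tag vocabulary):

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

tags = ["cheerful", "joyful", "dreamy", "swing groove", "nylon guitar", "movie sample",
        "g-funk", "chill rap beat", "jazzy chillhop", "trap vocal", "trap guitar"]

tokenizer = T5Tokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base")

with torch.no_grad():
    batch = tokenizer(tags, return_tensors="pt", padding=True)
    hidden = encoder(**batch).last_hidden_state                   # (n_tags, seq_len, 768)
    mask = batch.attention_mask.unsqueeze(-1)
    embeddings = ((hidden * mask).sum(1) / mask.sum(1)).numpy()   # mean-pool over tokens

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
print(silhouette_score(embeddings, labels))
```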

Using this metadata, I was able to generate more human-readable prompts for the model via Llama 3.1 3B running locally, allowing the fine-tuned model to produce beats that better reflect the stylistic and structural characteristics of trap music.

{"filepath": "39118.wav", "start": 63, "duration": 26, "prompt": "A melancholic and love-inspired rap with soul beat at 109 BPM in G minor, using plucked guitar, synth bells, and movie sample."}

Training

The model was trained on an NVIDIA A100 GPU on Google Colab for about 42 hours, with a total of 40k audio segments (~277 h) over 14 epochs. I used a batch size of 16, resulting in approximately 2.5k steps per epoch, i.e. 35k steps in total.
Inference takes ~0.37 s per step on an NVIDIA RTX 4050 Laptop GPU, so about 1 min 15 s for a 200-step generation.

Results Analysis

The model performs particularly well on melodic beats with a smooth and floating atmosphere. It captures harmonic structures effectively and keeps a strong sense of coherence between instruments, mood, and tempo, which makes the generated beats sound natural, balanced, and musically pleasing. The model is able to generate interesting beats that reflect the given prompt quite well.

However, the model tends to underperform on styles that were underrepresented in the training dataset, such as boom bap or high-energy beats with dense percussive layers.

[Figure: mood tag frequencies]

This limitation mainly stems from the uneven tag distribution within the dataset: certain instruments and genres are simply less present. In addition, the tagging tool (CLAP), trained on general-purpose music datasets like LAION-Audio-630K, is not specialized in genres such as trap or hip-hop, which leads to imprecise tagging of elements like snares, hi-hats, or 808 bass. As a result, these styles are harder for the model to reproduce accurately. I also noticed that the generated melodic elements, like piano or synths, often sound much quieter than the drums, since their frequencies are more subtle.

Perspectives

I'd like to fine-tune for only 2-3 more epochs on a smaller dataset that better represents the underrepresented styles. It would also be interesting to start over with a CLAP model specialized in trap/rap genres. I'm also interested in noise input conditioning, such as SpecGrad.

I’m open to any feedback or suggestions on my work.

Sources

Contact - Gabriel Guiet-Dupré
