---
license: mit
tags:
  - small audio-language model
  - ALM
  - audio
  - music
  - sound events
  - audio reasoning
  - audio captioning
  - audio question answering
  - zero-shot
  - audio-text
---
# Mellow: a small audio language model for reasoning
[[`Paper`](https://arxiv.org/abs/2503.08540)] [[`GitHub`](https://github.com/soham97/Mellow)] [[`Checkpoint`](https://huggingface.co/soham97/Mellow)] [[`Zenodo`](https://zenodo.org/records/15002886)]

Mellow is a small Audio-Language Model that takes two audio clips and a text prompt as input and produces free-form text as output. The 167M-parameter model is trained on ~155 hours of audio (AudioCaps and Clotho) and achieves SoTA performance on multiple tasks with 50x fewer parameters than comparable models.

![Mellow overview](resource/image.png)

## Index
* [Setup](#setup)
* [Usage](#usage)
* [Example](#example)
* [Limitation](#limitation)
* [Citation](#citation)

## Setup
1. Install the required dependencies: `pip install -r requirements.txt`. If you use [conda](https://www.anaconda.com), run the following instead:

```shell
cd Mellow && \
conda create -n mellow python=3.10.14 && \
conda activate mellow && \
pip install -r requirements.txt
```

2. To verify that the setup is complete, run:
```shell
python example.py
```

## Usage
The `MellowWrapper` class allows easy interaction with the model. The wrapper requires the following inputs:
- `config`: the only supported option is `"v0"`
- `model`: the only supported option is `"v0"`
- `examples`: list of examples, where each example is a list of three entries: audiopath1, audiopath2, prompt

Supported functions:
- `generate`: produces a text response for the given audio inputs and text prompt; a minimal call is sketched below
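
The sketch below shows the expected call pattern. The audio paths here are placeholders (not files shipped with the repo); a full runnable script follows in the [Example](#example) section.

```python
from mellow import MellowWrapper

# load the model (CPU here; pass device=0 and use_cuda=True for GPU)
mellow = MellowWrapper(config="v0", model="v0", device="cpu", use_cuda=False)

# each example is [audiopath1, audiopath2, prompt];
# "a.wav" and "b.wav" are placeholder paths for illustration
examples = [["a.wav", "b.wav", "caption the audio."]]

response = mellow.generate(examples=examples, max_len=300, top_p=0.8, temperature=1.0)
print(response)
```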

## Example
Mellow supports open-ended question answering and produces responses based on the user's prompt. Below are some example prompts for testing Mellow on different tasks.

```python
import torch
from pathlib import Path
import os
from mellow import MellowWrapper

# setup cuda and device
cuda = torch.cuda.is_available()
device = 0 if cuda else "cpu"

# setup mellow
mellow = MellowWrapper(
    config="v0",
    model="v0",
    device=device,
    use_cuda=cuda,
)

# pick up audio file paths
parent_path = Path(os.path.realpath(__file__)).parent
path1 = os.path.join(parent_path, "resource", "1.wav")
path2 = os.path.join(parent_path, "resource", "2.wav")

# list of filepaths and prompts
examples = [
    [path1, path2, "what can you infer about the surroundings from the audio?"],
    [path1, path2, "is there a cat in the audio? answer yes or no"],
    [path1, path2, "caption the audio."],
    [path1, path2, "Based on the audio, what can be said about the hypothesis - \"A farmer is giving a tour of his ranch while chickens roam nearby\"? a) It is definitely true b) It is definitely false c) It is plausible d) I cannot determine"],
    [path1, path2, "explain the difference between the two audios in detail."],
    [path1, path2, "what is the primary sound event present in the clip? a) dog barking b) chirping birds c) car engine d) clapping"],
]

# generate response
response = mellow.generate(examples=examples, max_len=300, top_p=0.8, temperature=1.0)
print(f"\noutput: {response}")
```
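
Note that `generate` takes the whole list of examples in one call and returns the corresponding responses. `max_len` caps the length of each generated answer, while `top_p` and `temperature` control sampling randomness; lowering the temperature should make the answers more deterministic.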

## Limitation
With Mellow, we aim to showcase that small audio-language models can engage in reasoning. As a research prototype, Mellow has not been trained at scale on publicly available audio datasets, resulting in a limited understanding of audio concepts. Therefore, we advise caution when considering its use in production settings. Ultimately, we hope this work inspires researchers to explore small audio-language models for multitask capabilities, complementing ongoing research on general-purpose audio assistants.

## Citation
```bibtex
@misc{mellow,
      title={Mellow: a small audio language model for reasoning}, 
      author={Soham Deshmukh and Satvik Dixit and Rita Singh and Bhiksha Raj},
      year={2025},
      eprint={2503.08540},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2503.08540}, 
}
```