update reasonaqa links
README.md

tags:
- audio-text
---
# Mellow: a small audio language model for reasoning
[[`Paper`](https://arxiv.org/abs/2503.08540)] [[`GitHub`](https://github.com/soham97/Mellow)] [[`Checkpoint`](https://huggingface.co/soham97/Mellow)] [[`Zenodo`](https://zenodo.org/records/15036628)] [[`Demo`](https://tinyurl.com/mellowredirect)]

Mellow is a small audio-language model that takes two audio clips and a text prompt as input and produces free-form text as output. It has 167M parameters, was trained on ~155 hours of audio (AudioCaps and Clotho), and achieves SoTA performance on a range of tasks with 50x fewer parameters than comparable models.
```python
# ... (model loading and example construction elided in this excerpt)
response = mellow.generate(examples=examples, max_len=300, top_p=0.8, temperature=1.0)
print(f"\noutput: {response}")
```
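For reference, `examples` bundles the audio inputs and prompt for a single `generate` call. A minimal sketch of how it might be constructed is below; the exact layout (here, `[first_audio_path, second_audio_path, prompt]` triples) is an assumption inferred from Mellow's two-audio-plus-prompt interface, so defer to the GitHub usage example for the canonical format.

```python
# Hypothetical sketch only: the exact structure of `examples` is an assumption;
# see the GitHub repo for the canonical usage example.
examples = [
    # [first audio path, second audio path, text prompt]
    ["resource/audio1.wav", "resource/audio2.wav", "explain the difference in few words"],
]
```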
## ReasonAQA

The composition of the ReasonAQA dataset is shown in the table below. The training set is restricted to AudioCaps and Clotho audio files, and testing is performed on the following tasks: Audio Entailment, Audio Difference, ClothoAQA, Clotho MCQ, Clotho Detail, AudioCaps MCQ, and AudioCaps Detail.

![reasonaqa](reasonaqa.png)

- The ReasonAQA JSONs can be downloaded from [Zenodo](https://zenodo.org/records/15036628). The zip file contains three files: train.json, val.json, and test.json (see the loading sketch below)
- The audio files can be downloaded from their respective hosting websites: [Clotho](https://zenodo.org/records/4783391) and [AudioCaps](https://github.com/cdjkim/audiocaps)
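Once the Zenodo zip is extracted, the three splits can be read with the standard library. A minimal sketch, assuming the archive was unzipped into a `reasonaqa/` directory (the directory name is an assumption; adjust to your setup):

```python
import json
from pathlib import Path

# Assumed location of the unzipped Zenodo archive; adjust to your setup.
data_dir = Path("reasonaqa")

splits = {}
for split in ("train", "val", "test"):
    with open(data_dir / f"{split}.json") as f:
        splits[split] = json.load(f)  # each split is a list of dicts (format below)

print({name: len(items) for name, items in splits.items()})
```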
---
Each split is a JSON file containing a list of dicts, in the following format:

```json
[
  {
    "taskname": "audiocaps",
    "filepath1": "AudioCapsLarger/test/Y6BJ455B1aAs.wav",
    "filepath2": "AudioCapsLarger/test/YZsf2YvJfCKw.wav",
    "caption1": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle",
    "caption2": "Water trickling followed by a toilet flushing then liquid draining through a pipe",
    "input": "explain the difference in few words",
    "answer": "Audio 1 features a sudden, intense sonic event (rocket explosion) with high-frequency crackling (fire) and a steady, low-frequency hum (truck engine), whereas Audio 2 consists of gentle, mid-frequency water sounds (trickling, flushing, and draining).",
    "subtype": "ACD-1.json"
  },
  ...
]
```

The fields of each JSON dict are:

- `taskname`: indicates the source dataset. The two options are "audiocaps" and "clothov21"
- `filepath1`: the first audio file path
- `filepath2`: the second audio file path. This is empty for all tasks except the audio difference explanation task
- `caption1`: the ground-truth caption for the first audio
- `caption2`: the ground-truth caption for the second audio. This is empty for all tasks except the audio difference explanation task
- `input`: the input question or prompt to the model
- `answer`: the answer or response for the given input
- `subtype`: the type of question or prompt, matching the first column of the ReasonAQA table above. The options are "ACD-1.json", "CLE.json", "AudioCaps.json", among others (see the filtering sketch below)
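To illustrate these fields, a small sketch (continuing from the loading snippet above) that separates audio-difference entries from single-audio ones via the empty-`filepath2` convention, and tallies test entries per `subtype`:

```python
from collections import Counter

test_set = splits["test"]  # from the loading sketch above

# Audio-difference entries carry a second audio; all other tasks leave
# `filepath2` empty, per the field descriptions above.
two_audio = [ex for ex in test_set if ex.get("filepath2")]
single_audio = [ex for ex in test_set if not ex.get("filepath2")]

# Tally entries per question/prompt subtype (e.g., "ACD-1.json").
by_subtype = Counter(ex["subtype"] for ex in test_set)

print(f"two-audio: {len(two_audio)}, single-audio: {len(single_audio)}")
print(by_subtype.most_common(5))
```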
## Limitation

With Mellow, we aim to showcase that small audio-language models can engage in reasoning. As a research prototype, Mellow has not been trained at scale on publicly available audio datasets, resulting in a limited understanding of audio concepts. Therefore, we advise caution when considering its use in production settings. Ultimately, we hope this work inspires researchers to explore small audio-language models for multitask capabilities, complementing ongoing research on general-purpose audio assistants.