```py
model = CoNeTTEModel.from_pretrained("Labbeti/conette", config=config)

path = "/my/path/to/audio.wav"
outputs = model(path)
candidate = outputs["cands"][0]
print(candidate)
```
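Before running inference, it can help to put the model in evaluation mode and move it to the right device. A minimal sketch, assuming `CoNeTTEModel` behaves like a standard PyTorch `nn.Module` (this is not stated in the card itself):

```py
# Hedged sketch (assumption): CoNeTTEModel is treated as a regular torch
# nn.Module; .eval() disables training-time behavior such as dropout, and
# .to(device) moves the weights.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
```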
The model can also accept several audio files at the same time (`list[str]`), or a list of pre-loaded audio files (`list[Tensor]`). In the second case, you also need to provide the sampling rate of these files:
```py
import torchaudio

path_1 = "/my/path/to/audio_1.wav"
path_2 = "/my/path/to/audio_2.wav"

audio_1, sr_1 = torchaudio.load(path_1)
audio_2, sr_2 = torchaudio.load(path_2)

outputs = model([audio_1, audio_2], sr=[sr_1, sr_2])
candidates = outputs["cands"]
print(candidates)
```
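For the first case above, a `list[str]` of file paths can be passed directly; a minimal sketch, assuming the model reads the sampling rates from the files themselves (hypothetical paths):

```py
# Minimal sketch: several files by path (list[str]); no sampling rates are
# passed, since the model is assumed to load the files itself.
paths = ["/my/path/to/audio_1.wav", "/my/path/to/audio_2.wav"]
outputs = model(paths)
print(outputs["cands"])
```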
The model can also produce different captions using a Task Embedding input, which indicates the dataset caption style. The default task is "clotho".
```py
outputs = model(path, task="clotho")
candidate = outputs["cands"][0]
print(candidate)

outputs = model(path, task="audiocaps")
candidate = outputs["cands"][0]
print(candidate)
```
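Whether the `task` argument composes with batched input is not stated in this card; as a hedged sketch under that assumption:

```py
# Hedged sketch (assumption): combining batched tensor input with a task
# embedding; the two arguments are assumed to compose.
outputs = model([audio_1, audio_2], sr=[sr_1, sr_2], task="audiocaps")
print(outputs["cands"])
```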

## Performance

| Dataset   | SPIDEr (%) | SPIDEr-FL (%) | FENSE (%) |
| --------- | ---------- | ------------- | --------- |
| AudioCaps | 44.14      | 43.98         | 60.81     |
| Clotho    | 30.97      | 30.87         | 51.72     |

This model checkpoint was trained on the Clotho dataset, but it can also reach good performance on AudioCaps with the "audiocaps" task.
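
The scores above are corpus-level captioning metrics. As a hedged sketch of how such scores can be computed, the `aac-metrics` package by the same author exposes an `evaluate` entry point; the exact API and return format here are assumptions, not part of this card:

```py
# Hedged sketch (assumption): scoring candidate captions against references
# with the aac-metrics package (pip install aac-metrics); evaluate() is
# assumed to return corpus-level and sentence-level score dicts.
from aac_metrics import evaluate

candidates = ["a dog is barking in the distance"]
mult_references = [["a dog barks while birds sing", "a dog is barking outdoors"]]

corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)  # dict of metric name -> score (assumed format)
```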
## Citation

The preprint version of the paper describing CoNeTTE is available on arXiv: https://arxiv.org/pdf/2309.00454.pdf

## Additional information

The encoder part of the architecture is based on a ConvNeXt model for audio classification, available here: https://huggingface.co/topel/ConvNeXt-Tiny-AT.
More precisely, the encoder weights used are named "convnext_tiny_465mAP_BL_AC_70kit.pth", available on Zenodo: https://zenodo.org/record/8020843.
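
As a small, hedged sketch for inspecting those weights after downloading them from the Zenodo record (the internal structure of the checkpoint file is an assumption, not documented here):

```py
# Hedged sketch: peek inside the downloaded ConvNeXt encoder checkpoint;
# whether it is a plain state dict or a wrapper dict is not documented here.
import torch

ckpt = torch.load("convnext_tiny_465mAP_BL_AC_70kit.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt)[:10])  # first few top-level keys
```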

It was created by [@Labbeti](https://hf.co/Labbeti).