Commit 4a83ae2
Parent(s): 4b1ba05

Update README (#6)

Update README with language/task/timestamp info (7121cfc0bf874953f9a47845177517abf1e0940d)

Co-authored-by: Sanchit Gandhi <[email protected]>
README.md
CHANGED
@@ -172,10 +172,11 @@ pip install --upgrade pip
 pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
 ```
 
-### Short-Form Transcription
-
 The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
-class to transcribe
+class to transcribe audio files of arbitrary length. Transformers uses a chunked algorithm to transcribe
+long-form audio files, which in practice is 9x faster than the sequential algorithm proposed by OpenAI
+(see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)). The batch size should
+be set based on the specifications of your device:
 
 ```python
 import torch
@@ -201,11 +202,14 @@ pipe = pipeline(
     tokenizer=processor.tokenizer,
     feature_extractor=processor.feature_extractor,
     max_new_tokens=128,
+    chunk_length_s=30,
+    batch_size=16,
+    return_timestamps=True,
     torch_dtype=torch_dtype,
     device=device,
 )
 
-dataset = load_dataset("
+dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
 sample = dataset[0]["audio"]
 
 result = pipe(sample)
@@ -218,59 +222,43 @@ To transcribe a local audio file, simply pass the path to your audio file when y
 result = pipe("audio.mp3")
 ```
 
-
-
-Through Transformers Whisper uses a chunked algorithm to transcribe long-form audio files (> 30-seconds). In practice, this chunked long-form algorithm
-is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).
-
-To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. To activate batching, pass the argument `batch_size`:
+Whisper predicts the language of the source audio automatically. If the source audio language is known *a priori*, it
+can be passed as an argument to the pipeline:
 
 ```python
-
-
-from datasets import load_dataset
-
-
-device = "cuda:0" if torch.cuda.is_available() else "cpu"
-torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+result = pipe(sample, generate_kwargs={"language": "english"})
+```
 
-
+By default, Whisper performs the task of *speech transcription*, where the source audio language is the same as the target
+text language. To perform *speech translation*, where the target text is in English, set the task to `"translate"`:
 
-
-
-
-model.to(device)
+```python
+result = pipe(sample, generate_kwargs={"task": "translate"})
+```
 
-
+Finally, the model can be made to predict timestamps. For sentence-level timestamps, pass the `return_timestamps` argument:
 
-
-
-
-
-    feature_extractor=processor.feature_extractor,
-    max_new_tokens=128,
-    chunk_length_s=15,
-    batch_size=16,
-    torch_dtype=torch_dtype,
-    device=device,
-)
+```python
+result = pipe(sample, return_timestamps=True)
+print(result["chunks"])
+```
 
-
-sample = dataset[0]["audio"]
+And for word-level timestamps:
 
-
-
+```python
+result = pipe(sample, return_timestamps="word")
+print(result["chunks"])
 ```
 
-
-
+The above arguments can be used in isolation or in combination. For example, to perform the task of speech translation
+where the source audio is in French, and we want to return sentence-level timestamps, the following can be used:
 
 ```python
-result = pipe("
+result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "french", "task": "translate"})
+print(result["chunks"])
 ```
--->
 
-
+## Speculative Decoding
 
 Whisper `tiny` can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
 ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
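The hunks above only show fragments of the updated usage example, because the model-loading lines fall outside the changed context. As a rough guide, the assembled long-form snippet after this commit looks approximately like the sketch below; the `AutoModelForSpeechSeq2Seq`/`AutoProcessor` loading step and the `model_id` placeholder are assumptions based on standard Transformers Whisper usage, not lines taken from this diff:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Placeholder checkpoint id: substitute this repository's checkpoint.
model_id = "openai/whisper-small"

# Model loading is outside the diff context; this follows the usual Transformers pattern.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Pipeline arguments as introduced by this commit: chunked long-form decoding
# with batching and sentence-level timestamps.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

# The language, task and timestamp arguments documented above can be combined here.
result = pipe(sample, generate_kwargs={"language": "english"})
print(result["text"])
```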
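The speculative decoding section introduced at the end of the diff is cut off at the hunk boundary, so none of its code is visible in this commit. As a minimal sketch of the pattern it describes, the snippet below passes Whisper `tiny` as an assistant model through the Transformers ASR pipeline's `generate_kwargs`; the main `model_id` is a placeholder (this repository's checkpoint is not named in the visible context), and the full README's own example may differ:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Placeholder for this repository's checkpoint; replace with the actual model id.
model_id = "openai/whisper-small"
# Assistant model named in the README text above.
assistant_model_id = "openai/whisper-tiny"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# The assistant drafts tokens that the main model then verifies, so the output matches
# running the main model alone, but with fewer sequential forward passes.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("audio.mp3")
print(result["text"])
```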