ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2,2) and requested shape (3,2)
Hello! When I try to run the code I get the following stack trace:
Traceback (most recent call last):
File "/root/whisper-intel-optimized/run.py", line 30, in <module>
input_features = processor(
^^^^^^^^^^
File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/transformers/models/whisper/processing_whisper.py", line 69, in __call__
inputs = self.feature_extractor(audio, *args, sampling_rate=sampling_rate, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/transformers/models/whisper/feature_extraction_whisper.py", line 282, in __call__
padded_inputs = self.pad(
^^^^^^^^^
File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/transformers/feature_extraction_sequence_utils.py", line 210, in pad
outputs = self._pad(
^^^^^^^^^^
File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/transformers/feature_extraction_sequence_utils.py", line 282, in _pad
processed_features[self.model_input_names[0]] = np.pad(
^^^^^^^
File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/numpy/lib/arraypad.py", line 748, in pad
pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/numpy/lib/arraypad.py", line 522, in _as_pairs
return np.broadcast_to(x, (ndim, 2)).tolist()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/numpy/lib/stride_tricks.py", line 413, in broadcast_to
return _broadcast_to(array, shape, subok=subok, readonly=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/numpy/lib/stride_tricks.py", line 349, in _broadcast_to
it = np.nditer(
^^^^^^^^^^
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2,2) and requested shape (3,2)
Input tensor is torch.Size([1, 246936]), sr=16000
The code to reproduce: https://github.com/egorsmkv/whisper-intel-optimized
Any suggestions on what could be wrong here?
Fixed
Hi, I have kind of the same error when I try to process a batch of audio inputs using the processor. Could you please share your solution to this error?
Check out my repository: https://github.com/egorsmkv/optimized-whisper-intel
I have managed to get it to work.
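If it helps, the key change is to pass a 1D waveform (e.g. waveform[0]) to the processor instead of the [1, N] tensor that torchaudio returns. A rough sketch, using a dummy waveform and the openai/whisper-tiny checkpoint as placeholders for the real setup:

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# Dummy mono waveform with the same [1, num_samples] shape as in the report
waveform = torch.randn(1, 246936)

# The feature extractor expects a 1D waveform, so drop the channel dimension
input_features = processor(
    waveform[0],              # 1D tensor instead of [1, N]
    sampling_rate=16000,
    return_tensors="pt",
).input_features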
Thank you for your prompt response. I understand that the error was caused by the input waveform having shape torch.Size([1, 246936]), and that changing it to waveform[0] resolves the issue for you. However, I would like to pass a batch of audio inputs. Does the Whisper processor not support batch inputs? I noticed that the code in the feature extraction module seems to support it, yet I still cannot resolve this issue.
It does support batches. Is your audio mono or stereo?
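For batched calls, I believe the processor expects a list of 1D waveforms (one per clip) rather than a single padded 2D tensor; the feature extractor then pads or truncates each clip to 30 seconds on its own. A rough sketch with dummy audio and the openai/whisper-tiny checkpoint as placeholders:

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# Two dummy mono clips of different lengths at 16 kHz
wave1 = torch.randn(1, 160000)   # ~10 s
wave2 = torch.randn(1, 80000)    # ~5 s

# Pass a list of 1D waveforms; no manual pad_sequence needed
batch = [wave1[0].numpy(), wave2[0].numpy()]

inputs = processor(batch, sampling_rate=16000, return_tensors="pt")
print(inputs.input_features.shape)   # e.g. torch.Size([2, 80, 3000])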
Mono. Here is a quick reproduction:
import torch
import torchaudio
from torch.nn.utils.rnn import pad_sequence
from transformers import WhisperModel, WhisperForConditionalGeneration, AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "whisper-tiny"
processor = AutoProcessor.from_pretrained(model_id)
device = "cpu"  # or "cuda"

# Load the first clip, downmix to mono, and resample to 16 kHz
# (audio_path_1 / audio_path_2 point to two audio files)
audio1, sample_rate = torchaudio.load(audio_path_1)
if audio1.shape[0] > 1:
    audio1 = torch.mean(audio1, dim=0, keepdim=True)
if sample_rate != 16000:
    audio1 = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio1)

# Same for the second clip
audio2, sample_rate = torchaudio.load(audio_path_2)
if audio2.shape[0] > 1:
    audio2 = torch.mean(audio2, dim=0, keepdim=True)
if sample_rate != 16000:
    audio2 = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio2)

# Pad both clips to the same length -> 2D tensor [batch, num_samples]
batch_audios = [audio1, audio2]
audio_inputs = pad_sequence(
    [a.squeeze(0) for a in batch_audios],
    batch_first=True,
    padding_value=0.0,
)

outputs = processor(
    audio_inputs,
    sampling_rate=16000,
    return_tensors="pt",
).to(device).to(torch.bfloat16)
Input tensor is torch.Size([2, 160085]), sr=16000