ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2,2) and requested shape (3,2)

#3
by Yehor - opened

Hello! When I try to run the code I get the following stack trace:

Traceback (most recent call last):
  File "/root/whisper-intel-optimized/run.py", line 30, in <module>
    input_features = processor(
                     ^^^^^^^^^^
  File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/transformers/models/whisper/processing_whisper.py", line 69, in __call__
    inputs = self.feature_extractor(audio, *args, sampling_rate=sampling_rate, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/transformers/models/whisper/feature_extraction_whisper.py", line 282, in __call__
    padded_inputs = self.pad(
                    ^^^^^^^^^
  File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/transformers/feature_extraction_sequence_utils.py", line 210, in pad
    outputs = self._pad(
              ^^^^^^^^^^
  File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/transformers/feature_extraction_sequence_utils.py", line 282, in _pad
    processed_features[self.model_input_names[0]] = np.pad(
                                                    ^^^^^^^
  File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/numpy/lib/arraypad.py", line 748, in pad
    pad_width = _as_pairs(pad_width, array.ndim, as_index=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/numpy/lib/arraypad.py", line 522, in _as_pairs
    return np.broadcast_to(x, (ndim, 2)).tolist()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/numpy/lib/stride_tricks.py", line 413, in broadcast_to
    return _broadcast_to(array, shape, subok=subok, readonly=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/whisper-intel-optimized/.venv/lib/python3.12/site-packages/numpy/lib/stride_tricks.py", line 349, in _broadcast_to
    it = np.nditer(
         ^^^^^^^^^^
ValueError: operands could not be broadcast together with remapped shapes [original->remapped]: (2,2)  and requested shape (3,2)

Input tensor is torch.Size([1, 246936]), sr=16000

The code to reproduce: https://github.com/egorsmkv/whisper-intel-optimized

Any suggestions on what could be wrong here?
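
For context, the broadcast failure can be reproduced in plain NumPy; a minimal sketch (my reading of the traceback, so the exact shapes are an assumption): np.pad broadcasts pad_width to shape (ndim, 2), and two (before, after) pad pairs computed for a 2-D input cannot broadcast against a 3-D array.

import numpy as np

# Assumed mechanism: pad_width of shape (2, 2) cannot broadcast to
# (ndim, 2) when the array carries an extra (e.g. batch) dimension.
arr = np.zeros((1, 80, 3000))    # 3-D array instead of the expected 2-D
np.pad(arr, [(0, 0), (0, 100)])  # raises the same ValueError as above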

Yehor changed discussion status to closed

Hi, I get a similar error when I try to process a batch of audio inputs with the processor. Could you please share your solution to this error?

Check out my repository - https://github.com/egorsmkv/optimized-whisper-intel

I have managed to get it working.
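
For reference, the change was to hand the processor a 1-D waveform instead of the (1, num_samples) tensor that torchaudio.load returns. A minimal sketch (the model ID and file name are placeholders):

import torchaudio
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

waveform, sr = torchaudio.load("audio.wav")  # shape: (1, num_samples), 16 kHz

# Index out the channel dimension so the processor sees a 1-D array
input_features = processor(
    waveform[0],
    sampling_rate=sr,
    return_tensors="pt",
).input_features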

Thank you for the prompt response. I understand that your error was caused by the input waveform having shape torch.Size([1, 246936]), and that switching to waveform[0] resolved it for you. However, I would like to pass a batch of audio inputs. Does the Whisper processor not support batched inputs? The feature extraction code seems to support it, yet I still cannot resolve this issue.

It does support batched inputs. Is your audio mono or stereo?

Mono. Here is a quick reproduction:

import torch
import torchaudio
from torch.nn.utils.rnn import pad_sequence
from transformers import AutoProcessor

model_id = "openai/whisper-tiny"
processor = AutoProcessor.from_pretrained(model_id)

device = "cpu"  # set to "cuda" if available
audio_path_1 = "audio1.wav"  # placeholder paths to two audio files
audio_path_2 = "audio2.wav"

# Load the first clip, downmix to mono, resample to 16 kHz
audio1, sample_rate = torchaudio.load(audio_path_1)
if audio1.shape[0] > 1:
    audio1 = torch.mean(audio1, dim=0, keepdim=True)
if sample_rate != 16000:
    audio1 = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio1)

# Same preprocessing for the second clip
audio2, sample_rate = torchaudio.load(audio_path_2)
if audio2.shape[0] > 1:
    audio2 = torch.mean(audio2, dim=0, keepdim=True)
if sample_rate != 16000:
    audio2 = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)(audio2)

# Pad both clips to the same length and stack into a (batch, samples) tensor
batch_audios = [audio1, audio2]
audio_inputs = pad_sequence(
    [a.squeeze(0) for a in batch_audios],
    batch_first=True,
    padding_value=0.0,
)

# Feeding this 2-D tensor to the processor raises the broadcast ValueError
outputs = processor(
    audio_inputs,
    sampling_rate=16000,
    return_tensors="pt",
).to(device).to(torch.bfloat16)

Input tensor is torch.Size([2, 160085]), sr=16000
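
If it helps: as far as I know, the Whisper feature extractor accepts a list of 1-D waveforms and pads or truncates each clip to 30 seconds itself, so the manual pad_sequence and the stacked 2-D tensor may be what triggers the error. A sketch of the list-based call (untested against your exact setup):

# Pass a list of 1-D arrays instead of one (batch, samples) tensor;
# the feature extractor handles padding to 30 s internally.
batch = [a.squeeze(0).numpy() for a in batch_audios]
outputs = processor(
    batch,
    sampling_rate=16000,
    return_tensors="pt",
)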
