ms180 committed
Commit cd5c512 · verified · 1 Parent(s): 00de4b3

Update app.py

Updated description from v3.1 to v4

Files changed (1):
  1. app.py +25 -3
app.py CHANGED
@@ -7,7 +7,7 @@ from espnet2.bin.s2t_inference import Speech2Text as ARSpeech2Text
 from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch as CTCSpeech2Text
 
 
-TITLE="Open Whisper-style Speech Model from CMU WAVLab"
+TITLE="Open Whisper-style Speech Model V4 from CMU WAVLab"
 
 DESCRIPTION='''
 OWSM (pronounced as "awesome") is a series of Open Whisper-style Speech Models from [CMU WAVLab](https://www.wavlab.org/).
@@ -16,14 +16,18 @@ For more details, please check our [website](https://www.wavlab.org/activities/2
 '''
 
 ARTICLE = '''
-The latest demo uses OWSM v3.1 based on [E-Branchformer](https://arxiv.org/abs/2210.00077).
-OWSM v3.1 has 1.02B parameters and is trained on 180k hours of labelled data. It supports various speech-to-text tasks:
+The latest demo uses OWSM v4 based on [E-Branchformer](https://arxiv.org/abs/2210.00077).
+The OWSM v4 medium model has 1.02B parameters and is trained on 320k hours of labelled data (290k for ASR, 30k for ST).
+The OWSM v4 CTC model has 1.01B parameters and is trained on the same dataset as the medium model.
+They support various speech-to-text tasks:
 - Speech recognition in 151 languages
 - Any-to-any language speech translation
 - Utterance-level timestamp prediction
 - Long-form transcription
 - Language identification
 
+Additionally, OWSM v4 applies 8 times subsampling (instead of 4 times in OWSM v3.1) to the log Mel features, leading to a final resolution of 80 ms in the encoder. When running inference, we recommend setting maxlenratio=1.0 (default) instead of smaller values.
+
 As a demo, the input speech should not exceed 2 minutes. We also limit the maximum number of tokens to be generated.
 Please try our [Colab demo](https://colab.research.google.com/drive/1zKI3ZY_OtZd6YmVeED6Cxy1QwT1mqv9O?usp=sharing) if you want to explore more features.
 
@@ -32,6 +36,12 @@ Please try our [Colab demo](https://colab.research.google.com/drive/1zKI3ZY_OtZd
 Please consider citing the following papers if you find our work helpful.
 
 ```
+@inproceedings{owsm-v4,
+title={{OWSM} v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning},
+author={Yifan Peng and Shakeel Muhammad and Yui Sudo and William Chen and Jinchuan Tian and Chyi-Jiunn Lin and Shinji Watanabe},
+booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
+year={2025},
+}
 @inproceedings{peng2024owsm31,
 title={OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer},
 author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
@@ -44,6 +54,18 @@ Please consider citing the following papers if you find our work helpful.
 booktitle={Proc. ASRU},
 year={2023}
 }
+@inproceedings{owsm-ctc,
+title = "{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification",
+author = "Peng, Yifan and
+Sudo, Yui and
+Shakeel, Muhammad and
+Watanabe, Shinji",
+booktitle = "Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)",
+year = "2024",
+month= {8},
+url = "https://aclanthology.org/2024.acl-long.549",
+}
+
 ```
 '''
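For context, here is a minimal sketch of how a demo like this one might load an OWSM checkpoint through the `Speech2Text` class imported in app.py and follow the `maxlenratio=1.0` recommendation above. The model tag `espnet/owsm_v4_medium_1B` and the exact hypothesis tuple layout are assumptions, not taken from this commit; consult the released model cards for the actual identifiers.

```python
# Minimal sketch, not the demo's actual app.py; names flagged below are assumptions.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text as ARSpeech2Text

s2t = ARSpeech2Text.from_pretrained(
    "espnet/owsm_v4_medium_1B",  # hypothetical model tag; check the official model card
    device="cpu",
    beam_size=5,
    lang_sym="<eng>",   # language token (English here)
    task_sym="<asr>",   # task token: speech recognition
    maxlenratio=1.0,    # keep the default, as the updated description recommends for v4
)

speech, rate = sf.read("example.wav")  # 16 kHz mono audio, under 2 minutes for the demo
results = s2t(speech)                  # list of hypotheses, best first
print(results[0][0])                   # first field of the top hypothesis is the decoded text
```

The 80 ms encoder resolution quoted in the description follows from the usual 10 ms log Mel frame shift times the 8x subsampling (10 ms × 8 = 80 ms). Since maxlenratio caps the output length relative to the encoder output, and the v4 encoder emits half as many frames as v3.1's, a small maxlenratio would truncate generation too aggressively, hence the recommendation to keep the default of 1.0.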