BOOM: Beyond Only One Modality – KIT's Multimodal Multilingual Lecture Companion
Abstract
BOOM is a multimodal, multilingual lecture companion that translates lecture audio and slides into synchronized outputs across text, images, and speech, improving the accessibility and preservation of educational content.
The globalization of education and the rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present BOOM, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our slide translation code at https://github.com/saikoneru/image-translator and integrate it into the Lecture Translator at https://gitlab.kit.edu/kit/isl-ai4lt/lt-middleware/ltpipeline. All released code and models are licensed under the MIT License.
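As a rough illustration of the cascaded setup the abstract describes (speech translation, slide localization, and speech synthesis), the sketch below shows how such a pipeline could be wired together. This is not the released BOOM implementation: the model choices (Whisper for ASR, an OPUS-MT model for translation, Tesseract for slide OCR), the file names, and the `synthesize_speech` helper are all illustrative assumptions.

```python
# Minimal sketch of a cascaded lecture-translation pipeline (illustrative only;
# not the released BOOM code). Assumes openai-whisper, transformers, pytesseract, Pillow.
import whisper
import pytesseract
from PIL import Image
from transformers import pipeline

# 1) Speech -> source-language transcript (Whisper used here as an example ASR model).
asr_model = whisper.load_model("small")
transcript = asr_model.transcribe("lecture_audio.wav")["text"]

# 2) Text translation (example: English -> German with an OPUS-MT model).
#    A real lecture would be translated segment by segment rather than as one long string.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
translated_transcript = translator(transcript, max_length=512)[0]["translation_text"]

# 3) Slide localization: extract slide text with OCR, then translate it.
#    A full system would also re-render the translated text onto the slide image.
slide_text = pytesseract.image_to_string(Image.open("slide_01.png"))
translated_slide_text = translator(slide_text, max_length=512)[0]["translation_text"]

# 4) Speech synthesis of the translated transcript (hypothetical placeholder;
#    plug in any TTS engine here).
def synthesize_speech(text: str, out_path: str) -> None:
    """Hypothetical helper: replace with a real TTS system."""
    raise NotImplementedError("Attach a TTS engine here.")

print(translated_transcript)
print(translated_slide_text)
```

In a production system the three output streams would additionally need to be time-aligned so that translated text, localized slides, and synthesized speech stay synchronized, which is the part the paper's joint approach addresses.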
Community
The following related papers were recommended by the Semantic Scholar API:
- Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs (2025)
- GlobalizeEd: A Multimodal Translation System that Preserves Speaker Identity in Academic Lectures (2025)
- When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? (2025)
- Do Slides Help? Multi-modal Context for Automatic Transcription of Conference Talks (2025)
- From Slides to Chatbots: Enhancing Large Language Models with University Course Materials (2025)
- MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation (2025)