
Slot 3 Tutorials (15:30-18:30)

  • Option 1: "Interpretability Techniques for Speech Models", by Charlotte Pouw (University of Amsterdam), Gaofei Shen (Tilburg University), Martijn Bentum (Radboud University), Marianne de Heer Kloots (University of Amsterdam), Tomas Lentz (Tilburg University), Hosein Mohebbi (Tilburg University), Willem Zuidema (University of Amsterdam), Grzegorz Chrupała (Tilburg University)

    • Short description: Pre-trained foundation models have revolutionized speech technology, as they have many adjacent fields. The combination of their capability and opacity has sparked interest among researchers in interpreting these models in various ways. While interpretability research in fields such as computer vision and natural language processing has made significant progress towards understanding model internals and explaining their decisions, speech technology has lagged behind despite its widespread use of complex, black-box neural models. Recent studies have begun to address this gap, marked by a growing body of literature on interpretability in the speech domain. This tutorial provides a structured overview of interpretability techniques, their applications, implications, and limitations when applied to speech models, aiming to help researchers and practitioners better understand, evaluate, debug, and optimize speech models while building trust in their predictions. In hands-on sessions, participants will explore how speech models encode distinct features (e.g., linguistic information) and use them during inference (an illustrative probing sketch follows the organizer list below). By the end, attendees will be equipped with the tools and knowledge to start analyzing and interpreting speech models in their own research, potentially inspiring new directions.

    • About tutorial organizers:

      • Charlotte Pouw is a third-year PhD candidate at the University of Amsterdam (ILLC). She is part of the InDeep consortium, funded by the Dutch Research Council (NWO). Her research focuses on analyzing the linguistic capabilities learned by speech recognition and speech synthesis systems. Her work has been published at ACL conferences and in Computational Linguistics.

      • Gaofei Shen is a third-year PhD candidate at Tilburg University. He is part of the InDeep consortium project, with a specific focus on analyzing speech models from a strong linguistic-theoretic perspective. He has published papers at Interspeech and NAACL.

      • Martijn Bentum is a postdoctoral researcher at Radboud University and part of the InDeep consortium. He is interested in human and artificial speech perception, currently studying speech representations in self-supervised speech models. His research has been published in the proceedings of Interspeech and in Brain and Language.

      • Marianne de Heer Kloots is a final-year PhD candidate at the University of Amsterdam (ILLC), who is interested in modelling human speech perception and interpreting deep learning systems for speech processing. She has published work on interpreting self-supervised speech models at Interspeech and co-organized a workshop on Using Artificial Neural Networks for Studying Human Language Learning and Processing.

      • Tomas Lentz is an Assistant Professor at Tilburg University. He is interested in the psycholinguistic and communicative workings and effects of AI, especially for speech. His research has been published in Interspeech proceedings, as well as in linguistics journals such as the Journal of Phonetics and communication science journals such as Humor and Behaviour & Information Technology.

      • Hosein Mohebbi is a final-year PhD candidate at Tilburg University. He is part of the InDeep consortium project, doing research on the interpretability of deep neural models for text and speech. His research has been published in leading NLP venues such as ACL, EACL, and EMNLP, where he also regularly serves as a reviewer. He received an Outstanding Paper Award at EMNLP 2023 for his speech interpretability work. He co-presented a tutorial on “Transformer-specific Interpretability” at EACL 2024. He is also one of the organizers of BlackboxNLP 2023-2025, a workshop focusing on analyzing and interpreting neural networks for NLP.

      • Willem Zuidema is Associate Professor of NLP, Explainable AI and Cognitive Modelling at the University of Amsterdam. He leads a group that has done pioneering and impactful research into the interpretability of deep learning models, including text-based language models and neural speech models. His work has been published in a diversity of venues across cognitive science and AI, including NeurIPS, ICLR, EACL, EMNLP, ACL, Journal of AI Research, Psychonomic Bulletin & Review, Journal of Phonetics, PNAS and Nature.

      • Grzegorz Chrupała is an Associate Professor at Tilburg University. He is interested in computation in biological and artificial systems and connections between them. His research focuses on computational models of learning spoken language in naturalistic multimodal settings, as well as analysis and interpretation of representations emerging in deep learning architectures. He was one of the creators of the popular BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. His research has been funded by the Dutch Research Council (NWO).
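As a taste of the hands-on sessions, the sketch below shows one common interpretability recipe: extracting layer-wise hidden states from a pre-trained speech model and fitting a linear probe for a linguistic label. It is an illustrative example only, not the tutorial's official material; the model checkpoint (facebook/wav2vec2-base), the probed layer, and the random waveforms with made-up labels are all placeholder assumptions.

```python
# Illustrative probing sketch (not official tutorial material).
# Assumptions: facebook/wav2vec2-base as the speech model, layer 6 as the probed
# layer, and random waveforms with made-up binary labels standing in for real
# annotated utterances.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# Placeholder data: eight 1-second waveforms and a hypothetical binary feature.
waveforms = [np.random.randn(16000).astype(np.float32) for _ in range(8)]
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

inputs = extractor(waveforms, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool one transformer layer over time and train a linear probe on it.
layer = 6  # arbitrary choice; analyses typically compare all layers
features = outputs.hidden_states[layer].mean(dim=1).numpy()
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```

With real annotated speech, the interesting signal is usually how probe accuracy varies across layers, which is the kind of analysis the hands-on sessions walk through in depth.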

  • Option 2: "Automatic Quality Assessment for Speech and Beyond", by Wen-Chin Huang (Nagoya University), Erica Cooper (NICT), Jiatong Shi (Carnegie Mellon University)

    • Short description: As generative AI has revolutionized the field of speech and audio generation, there is an increasing need for automatic quality assessment methods. In recent years, data-driven speech quality assessment (SQA) methods based on deep neural networks (DNNs) have improved greatly and have been used by both academic and industrial researchers to evaluate speech generation models. Nonetheless, there are still unsolved problems, and the application of such methods to other audio types remains underexplored. In this tutorial, we will first give an overview of SQA, with a special focus on recent developments in automatic SQA methods in the era of DNNs. We will then discuss the current challenges and future directions in SQA, and highlight what is missing for audio types beyond speech (a generic scoring sketch follows the organizer list below). Finally, we will introduce two recently developed open-source toolkits: SHEET, which provides an easy-to-use interface to train and evaluate automatic speech quality assessment models, and VERSA, a suite of evaluation metrics for speech, music, audio, and beyond.

    • About tutorial organizers:

      • Wen-Chin Huang received the B.S. degree from National Taiwan University, Taiwan, in 2018, and the M.S. and Ph.D. degrees from Nagoya University, Japan, in 2021 and 2024, respectively. He is currently an assistant professor at the Graduate School of Informatics, Nagoya University, Japan. He was a co-organizer of the Voice Conversion Challenge 2020, the Singing Voice Conversion Challenge 2023, and the VoiceMOS Challenge 2022, 2023, and 2024. His main research interest is speech processing, with a focus on speech generation fields including voice conversion and speech quality assessment. He was the recipient of the Best Student Paper Award at ISCSLP 2018, the Best Paper Award at APSIPA ASC 2021, and the 16th IEEE Signal Processing Society Japan Best Student Journal Paper Award.

      • Erica Cooper received the B.Sc. and M.Eng. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2009 and 2010, respectively, and the Ph.D. degree in computer science from Columbia University, New York, NY, USA, in 2019. She is currently a senior researcher with the National Institute of Information and Communications Technology, Japan. Her research interests include statistical machine learning and speech synthesis. She was a co-organizer of the VoiceMOS Challenge in 2022, 2023, and 2024. Dr. Cooper’s awards include the 3rd Prize in the CSAW Voice Biometrics and Speech Synthesis Competition, the Computer Science Service Award from Columbia University, and the Best Poster Award in the Speech Processing Courses in Crete.

      • Jiatong Shi received the B.S. and M.S. degrees from Renmin University of China and Johns Hopkins University, in 2019 and 2021, respectively. He is currently a Ph.D. candidate at the Language Technologies Institute, Carnegie Mellon University, PA, USA. His main research interest is speech representation learning and its applications, with a recent focus on speech and audio evaluation. He was a co-organizer of the Singing Voice Deepfake Detection (SVDD) Challenge in 2024, the ML-SUPERB 2.0 Challenge in 2025, the Discrete Speech Challenge in 2024, the ML-SUPERB Challenge in 2023, the Singing Voice Conversion Challenge in 2023, and the International Workshop on Spoken Language Translation (IWSLT) shared task (Simultaneous Track) in 2022 and 2023. He has received awards including the Best Paper Award at Interspeech 2024 and the Best Paper Award at EMNLP 2024.
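To make the evaluation setting concrete, here is a minimal sketch of how SQA predictions are typically scored against human ratings at the utterance and system level (MSE, linear correlation, and Spearman rank correlation). It is a generic illustration with made-up numbers, not code from SHEET or VERSA.

```python
# Generic SQA scoring sketch (not SHEET or VERSA code).
# Hypothetical records: (system id, human MOS, predicted MOS) per utterance.
import numpy as np
from scipy.stats import pearsonr, spearmanr

records = [
    ("sysA", 4.2, 4.0), ("sysA", 3.8, 3.9), ("sysA", 4.5, 4.3),
    ("sysB", 2.9, 3.2), ("sysB", 3.1, 3.0), ("sysB", 2.7, 2.8),
    ("sysC", 3.6, 3.4), ("sysC", 3.9, 3.7), ("sysC", 3.5, 3.6),
]
true = np.array([r[1] for r in records])
pred = np.array([r[2] for r in records])

# Utterance-level agreement between predictions and human scores.
print("utterance MSE :", float(np.mean((true - pred) ** 2)))
print("utterance LCC :", pearsonr(true, pred)[0])
print("utterance SRCC:", spearmanr(true, pred)[0])

# System-level agreement: average over each system's utterances first.
systems = sorted({r[0] for r in records})
sys_true = [np.mean([r[1] for r in records if r[0] == s]) for s in systems]
sys_pred = [np.mean([r[2] for r in records if r[0] == s]) for s in systems]
print("system SRCC   :", spearmanr(sys_true, sys_pred)[0])
```

System-level correlations are often the headline numbers when ranking generation systems, which is why the per-system averaging step matters even in this toy example.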

  • Option 3: "Beyond End-to-End ASR: Integrating Long-Context Acoustic and Linguistic Insights", by Taejin Park (NVIDIA), Huck Yang (NVIDIA), Kyu Han (Amazon Web Services), Shinji Watanabe (Carnegie Mellon University)

    • Short description: This tutorial explores the advancements and challenges in long-context automatic speech recognition (ASR), emphasizing its critical role in improving fairness, inclusivity, and performance across diverse linguistic groups. While modern ASR systems excel in data-rich languages such as English, they often underperform for low-resource languages, accented speech, and underrepresented communities due to limited context modeling. This tutorial introduces robust evaluation pipelines for long-form ASR, leveraging datasets such as CHiME and LibriHeavy, and discusses evaluation approaches such as multi-talker diarization, semantic evaluation, and retrieval-augmented generation (RAG) techniques (a minimal long-form evaluation sketch follows the organizer list below). We delve into acoustic and semantic context modeling, highlighting innovations in multi-speaker processing, speech-LLM integration, and RAG-based error correction. The tutorial also examines benchmarks such as SLUE, Speech QA, and Dynamic SUPERB, which assess ASR performance in noisy environments, spoken language understanding, and long-context comprehension. By integrating long-context processing and large language models (LLMs), ASR systems can better handle disfluencies, code-switching, and speaker variability, reducing biases and improving accessibility for marginalized communities. This tutorial underscores the importance of long-context ASR in advancing speech technology toward fairer, more inclusive systems that serve all users equitably.

    • About tutorial organizers:

      • Taejin Park is a Senior Research Scientist at NVIDIA, NeMo Speech AI. His research focuses on deep learning for speech processing, including context-aware speaker diarization and multi-speaker automatic speech recognition (ASR). He received his Ph.D. in Electrical and Computer Engineering and M.S. in Computer Science from the University of Southern California (USC) in 2021, where he was part of the Signal Analysis and Interpretation Laboratory (SAIL). Before that, he earned his B.S. and M.S. in Electrical Engineering and Computer Science from Seoul National University (SNU), South Korea. Prior to joining NVIDIA, he worked as a researcher at the Electronics and Telecommunications Research Institute (ETRI) and held internships at Microsoft, Amazon Alexa Speech, and Capio Inc., where he contributed to advancements in federated continual learning, ASR, and speaker diarization. He has published extensively in signal processing conferences and journals such as ICASSP, Interspeech, and IEEE Signal Processing Letters.

      • Huck Yang is a Senior Research Scientist at NVIDIA Research, based in Taipei, Taiwan. He received his Ph.D. and M.Sc. from the Georgia Institute of Technology, USA, and his B.Sc. from National Taiwan University. His primary research lies in the areas of speech-language modeling, robust speech recognition, and multi-modal post-training alignment. He has served as an area chair and committee member for IEEE ICASSP 2022 to 2025, EMNLP 2024, SLT 2024, and NAACL 2025. He has served on the IEEE SPS Applied Signal Processing Systems (ASPS) technical committee and the Data Collection Committee (DCC) since 2022.

      • Kyu Jeong Han received his Ph.D. from USC in 2009. He is currently a Senior Science Manager at Amazon Web Services (AWS), where he leads R&D innovations for Amazon Transcribe. Dr. Han has held prominent research positions at IBM, Ford, Capio.ai (acquired by Twilio), JD.com, and ASAPP, contributing to advancements in speech and language technologies. Dr. Han is an active member of the speech and language processing community. He serves as a reviewer for prestigious journals and conferences, including those organized by IEEE, ISCA, and ACL. Since 2019, he has been a Technical Committee Member of the Speech and Language Processing Committee under the IEEE Signal Processing Society (SPS). In recognition of his contributions to the field, Dr. Han was awarded the ISCA Best Paper Award in 2018 for the best paper published in Computer Speech & Language between 2013 and 2017. His work continues to drive innovation in speech recognition and natural language processing, with a focus on real-world applications and scalable solutions.

      • Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published over 500 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from ISCA Interspeech in 2024. He is an IEEE and ISCA Fellow.
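To ground the long-form evaluation discussion, the sketch below chunks a long recording into overlapping windows (so acoustic context is shared across chunk boundaries during decoding) and scores a hypothesis transcript with word error rate via the jiwer package. The 30-second window, 5-second overlap, synthetic audio, and toy transcripts are illustrative assumptions, not settings from the tutorial or from CHiME/LibriHeavy.

```python
# Illustrative long-form ASR evaluation sketch (assumed settings, not from the
# tutorial): overlapping chunking for long recordings plus WER scoring.
import numpy as np
import jiwer

def chunk_waveform(waveform, sr=16000, window_s=30.0, overlap_s=5.0):
    """Split a long waveform into overlapping windows so that acoustic context
    is shared across chunk boundaries."""
    window = int(window_s * sr)
    hop = window - int(overlap_s * sr)
    chunks = []
    for start in range(0, len(waveform), hop):
        chunks.append(waveform[start:start + window])
        if start + window >= len(waveform):
            break
    return chunks

# A synthetic 2-minute waveform stands in for a long recording.
audio = np.random.randn(120 * 16000).astype(np.float32)
chunks = chunk_waveform(audio)
print(f"{len(chunks)} overlapping chunks from a 120 s recording")

# Word error rate on a toy merged long-form transcript.
reference = "the committee agreed to postpone the vote until next week"
hypothesis = "the committee agreed to post pone the vote until next week"
print("WER:", jiwer.wer(reference, hypothesis))
```

In practice, merging hypotheses from the overlapped regions and adding semantic or LLM-based scoring on top of WER is where the long-context evaluation questions covered in the tutorial come in.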
