Slot 2 Tutorials (12:00-15:00)
Option 1: "Invited tutorial: Praat", by Paul Boersma (University of Amsterdam)
Details coming soon!
Option 2: "A Journey through Emerging Speech Research with NVIDIA NeMo", by Piotr Zelasko (NVIDIA), Nithin Rao Koluguri (NVIDIA), Ante Jukic (NVIDIA), Subhankar Ghosh (NVIDIA), Travis Bartley (NVIDIA), Elena Rastorgueva (NVIDIA)
Short description: This tutorial presents a comprehensive overview of recent developments in NVIDIA NeMo, a popular open-source framework recognized for its state-of-the-art performance in speech processing tasks. The tutorial is structured in two segments that bridge classical and emerging speech technologies. The first segment introduces significant architectural improvements in speech recognition through the FastConformer architecture with label looping and CUDA graph acceleration, alongside novel approaches to speech enhancement using score-based diffusion models, flow-matching, and Schrödinger bridge techniques. It also presents pioneering work in high-quality neural audio codecs. The second segment explores developments in multi-task speech modeling through Canary-1B, end-to-end speaker diarization with Sortformer Diarizer, and the integration of speech capabilities with Large Language Models via SALM and BESTOW frameworks. A particular emphasis is placed on training efficiency innovations, including dynamic bucketing and batch size optimization techniques that achieve up to 2x faster training speeds. By combining theoretical foundations with practical implementation guidance, this tutorial serves as a valuable resource for both researchers and practitioners in speech technology, offering hands-on experience with state-of-the-art tools and techniques.
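The dynamic bucketing mentioned above groups utterances of similar duration into the same batch so that little compute is wasted on padding. The sketch below illustrates the core idea only; the function name `bucket_batches` and the fixed batch size are illustrative assumptions, and NeMo's actual Lhotse-based samplers are more sophisticated (e.g. they adapt batch size per bucket and work over streaming data).

```python
def bucket_batches(durations, boundaries, batch_size):
    """Toy duration bucketing: assign each utterance index to a bucket
    defined by the duration boundaries, then batch within each bucket.

    Batching same-length utterances reduces padding waste, which is the
    intuition behind the training speedups described in the tutorial.
    """
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for idx, dur in enumerate(durations):
        # Count how many boundaries the duration exceeds to pick a bucket.
        b = sum(dur > edge for edge in boundaries)
        buckets[b].append(idx)
    batches = []
    for bucket in buckets:
        for i in range(0, len(bucket), batch_size):
            batches.append(bucket[i:i + batch_size])
    return batches

# Example: utterance durations in seconds, bucket edges at 5 s and 10 s.
durations = [3.2, 9.7, 4.1, 12.0, 5.5, 2.8]
batches = bucket_batches(durations, boundaries=[5.0, 10.0], batch_size=2)
```

Every batch now contains utterances from a single duration bucket, so the padded length of a batch is bounded by that bucket's upper edge rather than by the longest utterance in the whole dataset.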
About tutorial organizers:
Piotr Zelasko received the B.S. and M.Sc. degrees in Acoustic Engineering, and Ph.D. in Electronic Engineering (2019) at AGH-UST in Poland. He worked with several companies and held a research scientist position at JHU’s CLSP. At present, he is a research scientist at NVIDIA NeMo building multitask and multimodal models and efficient training infrastructure. Piotr is a co-author of the next-generation Kaldi toolkit (k2).
Nithin Rao Koluguri received his MS in Electrical Engineering from USC, Los Angeles. He worked as a researcher at USC SAIL and IISc SPIRE Labs. Currently, he holds a position as a research scientist on the NVIDIA Conversational AI team, focusing on advancing speech and speaker recognition models. As a key contributor to the NVIDIA NeMo toolkit, he plays a vital role in enhancing features for conversational AI model development.
Ante Jukić received his Dipl.-Ing degree in Electrical Engineering from the University of Zagreb, Croatia, and his Ph.D. degree in Engineering from the University of Oldenburg, Germany. Currently, he’s with NVIDIA’s Conversational AI team, working on generative models for speech and audio.
Subhankar Ghosh received his M.S. in Statistics from the University of Illinois at Urbana-Champaign. Subhankar also has a Bachelor's degree in Computer Science from NIT Rourkela, India. He has previously worked at Microsoft and Google. Currently, he is a Research Scientist on the NVIDIA Conversational AI team, working on speech synthesis, LLMs for speech, and speech-to-speech technology.
Travis M. Bartley is a PhD Candidate at the City University of New York’s Graduate Center. He received his B.A. in English and Linguistics at the University of California at Berkeley. He has held teaching positions at Medgar Evers and Baruch Colleges. He is currently a Deep Learning Engineer with NVIDIA Riva Speech and Translation AI team.
Elena Rastorgueva received her B.A. and M.Eng. degrees in Engineering from the University of Cambridge. She is currently an applied research scientist on the NVIDIA Conversational AI team, focusing on speech-to-text models and improving their performance in streaming scenarios.
Option 3: "Tutorial on Speech Watermarking", by Patrick O'Reilly (Northwestern University), Bryan Pardo (Northwestern University)
Short description: While speech synthesis systems have numerous benefits, they can also be used to impersonate the voices of individuals, serving as tools for blackmail, fraud, and misinformation. As a result, developers and providers of speech synthesis systems need reliable methods for identifying the synthetic audio these systems produce. One promising and widely adopted method is watermarking, which hides an imperceptible identifying signal in audio produced by a speech synthesis system to facilitate the detection process. In this tutorial, we provide an overview of the speech watermarking literature and walk attendees through step-by-step implementations of both traditional signal-processing and recent neural network-based watermarks. We explore methods for improving the perceptual transparency and robustness of watermarks, evaluate their effectiveness under "attacks" that attempt to remove watermarks, and discuss future directions for improving watermark performance. Participants will leave this tutorial with a strong knowledge of speech watermarking methods including the current state-of-the-art. Through hands-on experience implementing and evaluating speech watermarks, participants will gain a practical understanding of important techniques in the literature and an appreciation for the challenges faced in the development and deployment of speech watermarks.
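The embed-and-detect workflow described above can be illustrated with a toy spread-spectrum watermark: a keyed, low-amplitude ±1 carrier is added to the signal, and detection correlates the signal with the same keyed carrier. This is a minimal sketch of the classical signal-processing family covered in the tutorial, not any specific published scheme; the function names, the `alpha` amplitude, and the detection threshold are all illustrative assumptions.

```python
import random

def _carrier(key, n):
    # Keyed pseudo-random +/-1 sequence; the key acts as the shared secret.
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed_watermark(audio, key, alpha=0.01):
    """Add a low-amplitude keyed carrier to the signal.

    alpha trades off imperceptibility (smaller) against robustness (larger).
    """
    carrier = _carrier(key, len(audio))
    return [s + alpha * c for s, c in zip(audio, carrier)]

def detect_watermark(audio, key, threshold=0.005):
    """Correlate with the keyed carrier; the watermark term survives
    averaging while the host signal averages toward zero."""
    carrier = _carrier(key, len(audio))
    score = sum(s * c for s, c in zip(audio, carrier)) / len(audio)
    return score > threshold, score

# Example: watermark one second of synthetic noise at 16 kHz.
rng = random.Random(0)
audio = [rng.gauss(0.0, 0.1) for _ in range(16000)]
marked = embed_watermark(audio, key=42)
```

Detection on `marked` with the correct key yields a correlation near `alpha`, while unmarked audio (or the wrong key) yields a correlation near zero, which is exactly the transparency-versus-detectability trade-off the tutorial examines under removal attacks.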
About tutorial organizers:
Patrick O'Reilly (https://oreillyp.github.io/) is a fifth-year PhD student at Northwestern University. Patrick is the lead author of "Maskmark: Robust Neural Watermarking for Real and Synthetic Speech", which proposed a neural network-based speech watermark with state-of-the-art robustness and was selected for oral presentation at ICASSP 2024. Patrick is currently conducting research in collaboration with Adobe to develop novel watermarking methods for speech and general audio, which will serve as the focus of his dissertation.
Bryan Pardo (https://bryan-pardo.github.io/) is Professor of Computer Science at Northwestern University, where he leads the Interactive Audio Lab. Bryan has given over 70 invited talks at universities and conferences (e.g. Princeton, University of Michigan, UC Berkeley, the Audio Engineering Society) and served in editorial roles at journals such as IEEE Transactions on Audio, Speech, and Language Processing and the International Society of Music Information Retrieval, along with committee and conference chair positions.
Interspeech 2025
PCO: TU Delft Events
Delft University of Technology
Communication Department
Prometheusplein 1
2628 ZC Delft
The Netherlands
Email: pco@interspeech2025.org
X (formerly Twitter): @ISCAInterspeech
Bluesky: @interspeech.bsky.social
Interspeech 2025 operates under the privacy policy of TU Delft.