
Interspeech 2025 Special Sessions

Attention, out-of-the-box thinkers: the Interspeech 2025 BLUE SKY track welcomes your ideas!

Inaugurated at Interspeech 2024, the BLUE SKY track will again be open for submissions this year. The Technical Program Chairs encourage authors to submit to this track highly innovative papers with strong theoretical or conceptual justification, in fields or directions that have not yet been explored. Large-scale experimental evaluation will not be required for papers in this track; incremental work will not be accepted. If you are an 'out-of-the-box' thinker who draws inspiration from high-risk, strange, unusual, or unexpected ideas and directions that purposefully go against mainstream topics and established research paradigms, please consider submitting a paper to this challenging and competitive track! Who knows, you might launch the next scientific revolution in the speech field. Please note that, to achieve the objectives of the BLUE SKY track, we will ask our most experienced reviewers (mainly ISCA Fellows) to assess the submissions.

For more information, please contact the Technical Program Chairs: TPC-chairs@interspeech2025.org

Biosignal-enabled Spoken Communication

Summary: The main topic of this special session is speech-related biosignals, such as signals of articulatory or neurological activity during speech production or perception. By analyzing these biosignals, researchers can gain insights into the mechanisms underlying speech processes and can also explore individual differences in how speech processing tasks are performed. Moreover, biosignals can serve as alternative modalities to the acoustic signal for speech-driven systems, enabling novel speech communication devices such as voice prostheses or hearing aids. With this special session, we aim to bring together researchers working on biosignals and speech processing to exchange ideas on interdisciplinary topics.

Topics:

  • Processing of speech-related biosignals, such as signals of articulatory activity captured by, e.g., Electromagnetic Articulography (EMA), Electromyography (EMG), or Ultrasound Tongue Imaging (UTI), or signals of neural activity measured by, e.g., Electroencephalography (EEG) or Electrocorticography (ECoG). Further biosignals can stem from respiratory, laryngeal, or other speech-related activities.

  • Analysis of biosignals for evaluating and/or explaining individual differences in human speech production and/or perception.

  • Usage of biosignals in speech processing tasks, e.g., speech recognition, synthesis, enhancement, voice conversion, and auditory attention detection.

  • Integrating biosignals as an additional modality to acoustic speech processing systems for increasing their performance, user adaptability, or explainability.

  • Development of machine learning algorithms, feature representations, model architectures, or training and evaluation strategies for biosignal processing.

  • Applications of speech-related biosignal processing, such as speech restoration, training, therapy, or health assessments. Further applications include brain-computer interfaces, voice prostheses, communication in noisy environments, and privacy-preserving communication.

Organizers: Kevin Scheck, Siqi Cai, Tanja Schultz, Satoshi Nakamura, Haizhou Li

Contact: Kevin Scheck (scheck@uni-bremen.de)

Website: https://www.uni-bremen.de/csl/interspeech-2025-biosignals

Challenges in Speech Data Collection, Curation, and Annotation

Summary: The quality and availability of datasets are crucial for advancing the field of speech research. Reliable models rely heavily on high-quality data, making data collection and annotation pivotal. However, the challenges associated with collecting, curating, annotating, and sharing large datasets are numerous, especially in domains such as child speech, health-related speech, and low-resource languages. These challenges include privacy concerns, standardization issues, and the significant cost and time involved in manual annotation. Even when datasets cannot be shared, the methods and guidelines used to collect and annotate them are often as valuable as the data itself. Sharing these processes can enhance reproducibility and foster the creation of similar high-quality datasets. This exchange of knowledge can help avoid common pitfalls in data collection and annotation while promoting innovation in automated annotation workflows. This special session thus aims to bring together researchers to share their experiences, methodologies, and insights related to speech data collection and annotation.

Topics:

Key topics to be covered in the session include, but are not limited to:

  • Selection of cohort and prompts for data collection

  • Methods for evaluating the quality of collected speech data

  • Automating the annotation process and integrating human annotators

  • Guidelines for manual annotation and automatic workflows

  • Evaluating the reliability and consistency of human annotations

  • Standardizing data collected with different protocols, data formats and annotations

  • Negotiating privacy, confidentiality, and legal constraints

  • Sharing processes and lessons learned in lieu of sharing private datasets

  • Using synthetic data to augment the dataset

Organizers: Beena Ahmed, Mostafa Shahin, Tan Lee, Mark Liberman, Mengyue Wu, Thomas Schaaf, Ahmed Ali, Carlos Busso

Contact: Mostafa Shahin (m.shahin@unsw.edu.au)

Website: https://sites.google.com/view/speech-data-cca-is25/

Connecting Speech Science and Speech Technology for Children’s Speech

Summary: Speech technology is increasingly embedded in everyday life, with applications spanning critical domains such as medicine, psychiatry, and education, as well as more commercial settings. This rapid growth is largely due to the successful use of deep learning in modelling large amounts of speech data. However, the performance of speech technology applications varies considerably depending on the demographics of the population the technology has been trained on and is applied to. That is, inequity in speech technology appears across age and gender, for people with vocal disorders or from atypical populations, and for people with non-native accents.

This interdisciplinary session will bring together researchers working on child speech from both speech science and speech technology. In line with the theme of Interspeech 2025, Fair and Inclusive Speech Science and Technology, the session will address the limitations of and advances in speech technology and speech science, with a focus on child speech. Furthermore, the session will support the mutual development of speech technology and speech science for child speech, bringing both communities together to the benefit of each.

Topics:

We encourage, and hope to receive, contributions on topics such as (but not limited to):

  • Differences between children’s and adults’ speech

  • Differences between children’s typical and atypical speech, including speech reflecting developmental disorders

  • Age-conditioned variation in children’s speech

  • Differences between non-native and native child speech, covering different speech technology applications

  • Applications and demos for children’s speech technology

  • Computational modelling of child speech

Organizers: Nina R. Benway, Odette Scharenborg, Sneha Das, Tanvina Patel, Zhengjun Yue

Contact: sciencetech4childspeech@gmail.com (all organizers)

Website: https://sites.google.com/view/sciencetech4childspeech-is25/home

Interpretability in Audio and Speech Technology

Summary: Audio and speech technology has recently achieved unprecedented success in real-world applications, driven primarily by self-supervised pre-training of large neural networks on massive datasets. While state-of-the-art models show remarkable performance across a wide range of tasks, their growing complexity raises fundamental questions about interpretability, reliability, and trustworthiness. Despite increasing interest from our community, we lack a deep understanding of what information is encoded in speech representations and what abstractions are learned during model training. This special session brings together researchers working on making audio and speech models more interpretable and explainable, drawing insights from machine learning, cognitive science, linguistics, speech science, and neuroscience.

Topics:

Topics include, but are not limited to:

  • Applying analytic techniques from neuroscience to understand speech models

  • Probing intermediate representations to decode linguistic and paralinguistic information

  • Linguistically-informed analysis of speech representations at the levels of phonetics, phonology, and prosody

  • Developing novel interpretable architectures for speech and audio processing

  • Understanding latent representations using speech synthesis

  • Analyzing model robustness to speaker variations, accents, and acoustic conditions

  • Cognitively motivated analysis of speech models and their representations

  • Adapting visualization and interpretability techniques from other modalities to audio signals

  • Bias mitigation, causal inference, post-hoc explanation and intervention analysis

  • Model safety and adversarial robustness

  • Analysis of multimodal and multilingual speech models

  • Extending interpretability methods to new tasks such as voice conversion, speech translation, and text-to-speech synthesis

  • Developing new methods and benchmarks for interpretability in the audio modality

We encourage submissions that demonstrate the value of interpretability research by highlighting actionable insights and their impact on model robustness, safety, and bias mitigation.

Organizers: Aravind Krishnan, Francesco Paissan, Cem Subakan, Mirco Ravanelli, Badr M. Abdullah, Dietrich Klakow

Contact: Aravind Krishnan (akrishnan@lsv.uni-saarland.de)

Website: https://sites.google.com/view/interspeech2025-interpret/home

Queer and Trans Speech Science and Technology

Summary: Trans and queer people have often been excluded from the very spaces that perform research or provide services for them (Zimman 2021, Huff 2022). At Interspeech 2024, the special session on Speech and Gender saw much discussion around trans and queer speech, and we aim to expand on those conversations and perspectives with this special session. By incorporating a broad range of perspectives across formal and informal communities interested in trans and queer voice, we aim to offer a space to build community and lay a foundation for future trans- and queer-driven research on trans and queer speech science and technology.

Topics:

  • Trans and Queer Linguistics

  • Trans and Queer Speech Processing

  • Empirical investigations of algorithmic bias

  • Ethics of Queer and Trans Research

  • Datasets of Queer and Trans Voices

  • Ensuring Privacy for Trans and Queer Speakers

  • Applications related to Trans and Queer Voice Training and Modification

The special session will consist of talks by invited speakers from the formal and informal trans and queer voice communities (1 hour), followed by a poster session of accepted papers (1 hour). The invited talks will focus on the ethical and practical challenges of developing and deploying “transgender voice technologies”, including technologically supported voice training and inclusive speech datasets.

Organizers: Robin Netzorg, Juliana Alison Francis, Nina Markl, Francisca Pessanha, Cliodhna Hughes

Contact: Robin Netzorg (robert_netzorg@berkeley.edu)

Website: https://sites.google.com/view/is2025-queer-trans/home

Responsible Speech Foundation Models

Summary: Speech foundation models are emerging as a universal solution to various speech tasks, and their superior performance has extended well beyond ASR. However, the limitations and risks associated with speech foundation models have not been thoroughly studied. For example, they have been found to exhibit biases with respect to paralinguistic features, emotions, and accents, as well as sensitivity to noise. In addition, foundation models present ethical challenges, including privacy, sustainability, fairness, safety, and social bias. The responsible use of speech foundation models has attracted increasing attention not only in the speech community but also in the language community, including organizations like OpenAI. It is therefore necessary to investigate speech foundation models with respect to de-biasing (e.g., consistent accuracy across languages, genders, and ages), enhancing factuality (not making mistakes in critical applications), and preventing malicious use (e.g., using TTS to attack speaker verification systems). The goals of this special session are to look beyond the regular sessions, which lack a particular focus on foundation models; to address the risks of foundation models (including speech LLMs) that have recently emerged globally; and to catch up with the NLP/ML community by addressing responsible foundation models within the speech community.

Topics:

  • Fairness of speech foundation models for understudied tasks

  • Limitations of foundation models and/or their solutions

  • Potential risks and security concerns of foundation models

  • Interpretability and generalizability of foundation models

  • Multimodal speech foundation models

  • Joint training of diverse tasks using foundation models

  • Adaptation methods for low-resource/out-of-domain speech

  • Robustness of speech foundation models

  • Integrating non-tech elements to ensure speech responsibility

Organizers: Yuanchao Li, Jennifer Williams, Tiantian Feng, Vikramjit Mitra, Yuan Gong, Bowen Shi, Catherine Lai, Peter Bell

Contact: Yuanchao Li (yuanchao.li@ed.ac.uk)

Website: https://sites.google.com/view/responsiblespeech

Source tracing: The origins of synthetic or manipulated speech

Summary: The rapid growth of audio generation technologies, including speech synthesis and manipulation, has made it easier to create deceptive audio/video content to spread misinformation. While these innovations hold incredible potential in areas like accessibility, entertainment, and language education, they also raise urgent challenges in authenticity, security, and trust. As synthetic or manipulated speech becomes more prevalent and harder to distinguish from real human voices, the need for effective methods to trace its origins and verify its authenticity is critical. We propose this special session to foster development in the identification and tracing of synthetic or manipulated speech through pattern recognition, signal processing, and deep learning techniques. Contributions may also include multi- and trans-disciplinary approaches, new resources, and metrics, as well as work addressing the ethical, legal, and regulatory aspects of the responsible use and development of this technology.

Topics:

We welcome submissions on various topics related to audio source-tracing, including, but not limited to:

  • Source tracing of manipulated audio: in- and out-of-domain source/model attribution for: text-to-speech and voice cloning generators, traditional and neural vocoders, speaker anonymisation models, audio enhancement, acoustic environment simulators, audio slicing and any kind of voice editing.

  • Benchmarking and evaluation: contributions of new datasets, metrics, and reproducible, open-source science for source tracing;

  • Explainable source tracing: how do the decisions of detection systems translate to the audio signal?

  • Cybersecurity and forensic perspectives: traceability and accountability of manipulated audio, expert-in-the-loop approaches, synthetic media countermeasures;

  • Ethical and regulatory aspects of speech manipulation: audio watermarking, bias and fairness of speech manipulation systems, socio-economic and political implications.

To sum up, if your research attempts to answer the question “Is this speech sample an original recording, and if not, what are the implications?”, then this special session is for you.

Organizers: Hemlata Tak, Nicolas Müller, Adriana Stan, Piotr Kawa, Nicholas Klein, Jennifer Williams, Xin Wang, Jagabandhu Mishra

Contact: Hemlata Tak (Hemlata.tak@pindrop.com), Nicolas Müller (nicolas.mueller@aisec.fraunhofer.de)

Website: https://deepfake-total.com/sourcetracing
