Éva Székely
Assistant Professor, Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
ABSTRACT
Deep-learning-based speech synthesis now allows us to generate voices that are not only natural-sounding but also highly realistic and expressive. This capability presents a paradox for conversational AI: it opens up new possibilities for more fluid, humanlike interaction, yet it also exposes a gap in our understanding of how such expressive features shape communication. Can synthetic speech, which poses these challenges, also help us solve them? In this talk, I explore the fundamental challenges in modelling the spontaneous phenomena that characterise spoken interaction: the timing of breaths, shifts in speech rate, laughter, hesitations, tongue clicks, creaky voice and breathy voice. In striving to make synthetic speech sound realistic, we inevitably generate communicative signals that convey stance, emotion, and identity. Modelling voice as a social signal raises important questions: How does gender presentation in synthetic speech influence perception? How do prosodic patterns affect trust, compliance, or perceived politeness?
To address such questions, I will present a methodology that uses controllable conversational TTS not only as a target for optimisation but also as a research tool. By precisely manipulating prosody and vocal identity in synthetic voices, we can isolate their effects on listener judgments and experimentally test sociopragmatic hypotheses. This dual role of TTS – as both the object of improvement and the instrument of inquiry – requires us to rethink evaluation beyond mean opinion scores, towards context-driven and interaction-aware metrics. I will conclude by situating these ideas within the recent paradigm shift toward large-scale multilingual TTS models and Speech LLMs, outlining research directions that help us both understand and design for the communicative power of the human voice.
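As an illustrative aside (not part of the abstract), the idea of using controllable TTS as an experimental instrument could be sketched roughly as follows: the text is held constant while one prosodic parameter is varied at a time, so that differences in listener judgments can be attributed to that parameter. The synthesize function and its parameters below are hypothetical placeholders, not a real API.

# Illustrative sketch only: isolating the effect of prosodic parameters
# on listener judgments with a controllable TTS system.
# `synthesize` is a hypothetical stand-in for any controllable TTS interface.
from itertools import product
from statistics import mean

def synthesize(text, speaking_rate=1.0, breathiness=0.0):
    """Placeholder for a controllable TTS call; returns a path to an audio file."""
    return f"stimuli/rate{speaking_rate}_breath{breathiness}.wav"

# Hold the text constant and vary one vocal dimension at a time, so any
# difference in listener ratings can be attributed to that dimension.
prompt = "Could you send me the report by Friday?"
rates = [0.9, 1.0, 1.1]
breathiness_levels = [0.0, 0.5]
stimuli = {
    (r, b): synthesize(prompt, speaking_rate=r, breathiness=b)
    for r, b in product(rates, breathiness_levels)
}

def compare_conditions(ratings_a, ratings_b):
    """Mean difference in listener ratings between two stimulus conditions."""
    return mean(ratings_a) - mean(ratings_b)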
Assistant Professor, Department of Information and Computing Sciences, Utrecht University, The Netherlands
ABSTRACT
Mental disorders, especially major depression and bipolar mania, are among the leading causes of disability worldwide. In clinical practice, mood disorders are diagnosed by medical experts through repeated observations and questionnaires. This process is, however, subjective and costly, and it cannot keep pace with the growing demand for diagnosis, leaving a large population of patients with insufficient care. Over the last decade, a growing number of Artificial Intelligence (AI) and, in particular, Machine Learning (ML) based solutions have been proposed to meet the urgent need for objective, efficient, and effective mental healthcare decision-support systems that assist medical experts and reduce their workload. However, many of these systems lack the properties required to qualify as "responsible AI", namely interpretability/explainability, algorithmic fairness, and privacy considerations (in both their design and their outputs), which renders them unusable in real life, especially in the light of recent legal developments. This talk will therefore provide an overview of the motivations, recent efforts, and potential future directions for responsible multimodal modeling in mental healthcare.
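As an illustrative aside (not drawn from the talk itself), algorithmic fairness, one of the responsible-AI properties mentioned above, is often quantified with simple group-level statistics such as the demographic parity difference. The sketch below, with entirely hypothetical predictions and group labels, shows one way such a check might look.

# Illustrative sketch: a demographic parity check, one common and simple
# fairness statistic. Predictions and group labels below are hypothetical.
from statistics import mean

def demographic_parity_difference(predictions, groups):
    """Absolute difference in positive-prediction rates between two groups."""
    rates = {}
    for g in set(groups):
        rates[g] = mean(p for p, grp in zip(predictions, groups) if grp == g)
    values = list(rates.values())
    return abs(values[0] - values[1])

# Example: binary "at-risk" predictions for two hypothetical patient groups.
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_difference(preds, groups))  # |0.75 - 0.25| = 0.5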