Tutorial at Interspeech 2009
"Emerging Technologies for Silent Speech Interfaces"
Tanja Schultz (Cognitive Systems Lab, University of Karlsruhe)
Bruce Denby (Université Pierre et Marie Curie / ESPCI-ParisTech (CNRS))
In the past decade, the performance of automatic speech processing systems, including speech recognition, text and speech translation, and speech synthesis, has improved dramatically. This has resulted in increasingly widespread use of speech and language technologies in a wide variety of applications, such as commercial information retrieval systems, call center services, voice-operated cell phones or car navigation systems, personal dictation and translation assistance, as well as applications in military and security domains. However, speech-driven interfaces based on conventional acoustic speech signals still suffer from several limitations. Firstly, the acoustic signals are transmitted through the air and are thus prone to ambient noise. Despite tremendous efforts, robust speech processing systems that perform reliably in crowded restaurants, airports, or other public places are still not in sight. Secondly, conventional interfaces rely on audibly uttered speech, which has two major drawbacks: it jeopardizes confidential communications in public and it disturbs bystanders. Services that require the access, retrieval, and transmission of private or confidential information, such as PINs, passwords, and security or safety information, are particularly vulnerable.
Recently, Silent Speech Interfaces (SSIs) have been proposed, which allow their users to communicate by speaking silently, i.e. without producing any sound. This is realized by capturing the speech signal at an early stage of human articulation, namely before the signal becomes airborne, and then transferring these articulation-related signals for further processing and interpretation. Owing to this novel approach, Silent Speech Interfaces have the potential to overcome the major limitations of today's traditional speech interfaces, i.e. (a) limited robustness in the presence of ambient noise; (b) lack of secure transmission of private and confidential information; and (c) disturbance of bystanders created by audibly spoken speech in quiet environments; while at the same time retaining speech as the most natural human communication modality. SSIs could furthermore provide an alternative for persons with speech disabilities, for example following a laryngectomy, as well as for the elderly or weak who may not be healthy or strong enough to speak aloud effectively.
Silent Speech Interfaces have a very recent history. Chan et al. (2001, 2002) showed that the myoelectric signal from articulatory face muscles contains sufficient information to discriminate a small set of words accurately. This holds even when words are spoken non-audibly, i.e. when no acoustic signal is produced (Jorgensen et al. 2003, Bradley et al. 2006), suggesting that this technology could be used to communicate silently. Recent work demonstrated how to model phoneme-based acoustic units for electromyography (EMG)-based speech recognition (Jou et al. 2006, Walliczek et al. 2006), paving the way for large vocabulary speech recognition. Another system, which uses ultrasound and optical images of the tongue and lips as the basis for a Silent Speech Interface (Denby and Stone 2004, Denby et al. 2006, Hueber et al. 2007), is similarly recent. Predominantly in Japan, SSI-like systems are being developed in which an acoustic “murmur” is processed into a speech-like signal. In the United States, DARPA has funded research on glottal activity sensors for use in noisy environments.
The tutorial aims to provide a general overview of research challenges in Silent Speech Interfaces, with a particular emphasis on electromyography (EMG) and ultrasound (US) technologies. Furthermore, the tutorial aims to raise awareness of alternative speech-based interaction systems.
The tutorial speakers will give an introduction to these emerging technologies for Silent Speech Interfaces. They will first briefly outline existing approaches, focusing on non-acoustic articulatory signals. After describing the nature of these signals, they will discuss the major challenges of recording and processing the data. Furthermore, they will present state-of-the-art Silent Speech Interface solutions, primarily based on EMG and ultrasound, and discuss possible applications of Silent Speech Interfaces. The tutorial will conclude with demonstrations of Silent Speech Interfaces. In detail, the tutorial will include the following topics:
- Biological nature of non-acoustic articulatory signals
- Existing Databases and Benchmarks
- Signal Preprocessing
- Practical Issues
- Electrode Positioning
- Ultrasound probe orientation
- Environmental conditions
- Tissue properties
- Speaker dependencies: speaking style, speaking rate, idiosyncrasies
- Differences between silent and audible speech
- Articulation differences
- Bootstrapping and ground truth
- Signal delay
- State-of-the-art systems
- EMG-based Speech Recognition
- US-based Speech Prostheses
- Possible applications of Silent Speech Interfaces:
- Robust, private, non-distracting speech recognition for human-machine interfaces, for example, silently speaking text messages rather than typing them;
- Recognition plus speech synthesis for quietly accessing remote applications, such as speech or text-based information systems;
- Transmitting articulation parameters for silent human to human communication;
- Speech prostheses.
- Demonstrations of Silent Speech Interfaces
Tanja Schultz is a Full Professor at the Computer Science Department of Karlsruhe University in Germany and an Assistant Research Professor at the Language Technologies Institute at Carnegie Mellon University. She is the director of the Cognitive Systems Lab and director of the Center for Visually Impaired Students, both at Karlsruhe University. Her research activities focus on human-human communication and human-machine interfaces, with a particular area of expertise in rapid adaptation of speech processing systems to new domains and languages. She co-edited a book on this subject and received several awards for this work. In 2001 she received the FZI prize for her outstanding Ph.D. thesis on language-independent and language-adaptive speech recognition. In 2002 she received the Allen Newell Medal for Research Excellence from Carnegie Mellon for her contribution to Speech-to-Speech Translation and the ISCA best paper award for her publication on language-independent acoustic modeling. In 2005 she was awarded the Carnegie Mellon Language Technologies Institute Junior Faculty Chair.
Her recent research focuses on the development of human-centered technologies and intuitive human-machine interfaces based on biosignals, by capturing, processing, and interpreting signals such as muscle and brain activities. The development of the silent speech interface based on myoelectric signals received the Interspeech 2006 Demo award. Together with Prof. Denby she is a guest editor of the Speech Communication Special Issue on Silent Speech Interfaces to be published in 2009. Tanja Schultz is the author of more than 150 articles published in books, journals, and proceedings. She is a member of the IEEE Computer Society, the International Speech Communication Association (ISCA), the European Language Resource Association, and the Society of Computer Science (GI) in Germany, and currently serves as an elected ISCA Board member and on several program committees and review panels.
Bruce Denby is Full Professor of Electronics and Signal Processing at the Université Pierre et Marie Curie (Paris-VI), and Research Scientist at the Laboratoire d’Electronique ESPCI-ParisTech (CNRS) in Paris, France. He holds a BS degree from the California Institute of Technology (Caltech), an MS from Rutgers University, and a PhD from the University of California at Santa Barbara, all in physics. During post-doctoral studies in Switzerland, France, and the UK in the late 1980s, he developed the Denby-Peterson contour extraction algorithm, and became well known for introducing statistical learning techniques to the experimental physics community. In 1995 he was named professor at the University of Versailles, France, where he created, and for 10 years directed, an innovative Master’s degree program in cellular telephone technology. Since transferring to Paris-VI in 2004, he has been a leader in the area of applications of statistical learning techniques to real-time systems. Professor Denby is an Associate Editor of the journal Pattern Recognition, and has authored over 180 publications in international journals and peer-reviewed international conferences. He is a member of the International Speech Communication Association (ISCA) and the Association for Computing Machinery (ACM), a Senior Member of IEEE, and a member of the IEEE Computer, Communications, Consumer Electronics, and Instrumentation and Measurement Societies. He is one of the originators of the “Silent Speech Interface” concept, having authored in 2004, with Prof. Maureen Stone, a pioneering article on speech synthesis from ultrasound imagery of the tongue, and is coordinator of the OUISPER (Oral Ultrasound SynthetIc SPEech SouRce) Project, funded by the French Department of Defense (DGA) and the French Agence Nationale de la Recherche (ANR). He is also the primary guest editor of the Speech Communication Special Issue on Silent Speech Interfaces to be published in 2009. Prof. Denby’s current research interests include speech and audio signal processing, telecommunications, and radio engineering.