
Perspectives on the Evolution of Speech Technology

By James L. Flanagan

(Contribution to Panel on the History of Speech Technology,
Session 35D1b; Janet Baker, Chair
Interspeech 2005, Lisbon, September 6, 2005)

Introduction:

Madame Chair has invited 7-minute perspectives on the evolution of our field - Speech Technology. I'd like to explicate my route to this fascinating activity and to evidence, in part, the genesis of my perspective by using one minute of the 7 to establish a personal context.

Personal Context:

Since my days of secondary school I have harbored a burning interest in communication - especially communication by voice over distances. I don't know how to properly account for this predilection except to point out that I grew up on a farm in the Mississippi delta - quite a few miles distant from the closest town. Early on, I became aware of how physics could assist life in rural environments, and as a student built and operated radio transmitters that helped me surmount geographical isolation and brought the distant world closer.

Military experience with 10cm and 3cm radar for aircraft navigation solidified my decision to study electrical engineering, and to understand better how things really work. Key teachers at Mississippi State guided me on to graduate school at MIT. But, I had to finance my study. Happily, Leo Beranek offered me a Graduate Assistantship in the MIT Acoustics Laboratory. And, fortuitously, my assignment was a project on speech communication and hearing.

In the course of study and in later thesis research, I learned about work at Bell Labs - led by pioneers such as Fletcher, Dudley, Dunn, Shannon, Nyquist, and others. This settled me on Speech. So, when Ed David approached me with a recruitment offer, I discarded proffered opportunities in California and went to Murray Hill. It was not my original intent to remain there - but I did, spending 33 years in Telecom research until I reached the age where AT&T required Officers and Directors to leave their job. I went on to Rutgers University for 15 more years serving as Professor, Center Director, and Vice President for Research.

I make this summary to explain that my technical perspective is almost exclusively molded by the Telecom culture. As I enter retirement (for a second time), I remain enamored of human communication - an activity totally vital to human well-being. The business of Telecom as I first knew it has undergone vast change, and is still in the throes of transition from circuit-switched networks to packet-switched global architectures. This fact greatly impacts the ways we bring new technology to the benefit of society, but much less influences the basic scientific understanding of speech and hearing.

Events:

Madame Chair asks us to identify career-forming events that shaped the field as we know it. Once I became committed to Speech, there was only one such for me: It was the punctuate confluence of three major advances. They came together as I was completing thesis research.

* The first was the understanding of sampled-data theory, opening the door to the benefits of digital communication - and ultimately evolving techniques for sampling, quantizing, conversion, filtering, storage, and signal processing.
* The second was the evolution to practicality of binary-based computation (benefiting from pulsed-circuit hardware devised earlier for radar systems), and new thinking in discrete mathematics (that transformed many of the traditional continuum tools to the discrete domain).
* And the third was the invention of the transistor, a reliable, power-efficient solid-state device for exquisite control of electronic current, which promised direct scaling to microelectronic fabrications of enormous size.

While this confluence did not directly provide insights in speech science, it enabled new methods of experimentation on critical questions that otherwise could not be addressed. It allowed computer simulation of exceedingly complex systems and concepts to be undertaken on a time scale such that experimental results could be immediately ploughed back to hasten solutions. Over three decades, these emerging capabilities enabled me and others to consider issues in both speech communication and hearing, working completely from conception to application without the burden of creating special purpose hardware. Along this path, new analytical tools were developed that now have expanded into whole subfields of signal processing. I am thinking particularly of digital filtering, spectral analysis, transform methods, and dedicated single-chip processors that have displaced myriad traditional instruments (such as wave analyzers, sound level meters, oscilloscopes, multi-meters, recorders, and others).

In my narrow world, research results were expected to relate to telecom needs, but the latitude was wide, as long as relevance to commercial telecom could be evidenced. This type of research, and the tools employed, consequently led to new understanding in basic speech science. Without exhaustively cataloging individual topics, generic areas can be mentioned.

* An important driver was parsimonious, compact representation of speech and auditory information - described to accuracies implied by auditory acuity. Transmission economies and signal quality were important factors. Concrete results of this impetus in low bit-rate coding are currently visible in long distance communication, cell phones, voice mail systems, and voice over Internet protocol (VoIP).
* Another was automation in call routing where vast labor and cost savings are being won by automatic speech recognition and speech synthesis.
* In another, high-quality digital music recording and playback builds upon human perceptual criteria to achieve efficiencies for low cost storage and portability.
* And, individuals deprived of communicative or motor abilities have variously found value in aids such as the artificial larynx, computer-designed hearing aids, talking books with speed control, text to speech synthesis, and voice command of mobility systems.
* Further and increasingly, refined electro-acoustic techniques for high-quality sound capture and projection are displacing expensive, time-consuming business travel by providing convenient teleconferencing services.

Prospective:

An invited comment probably would be incomplete without a speculation about the future. One can sometimes hear the remark that "All the important problems in Speech have been solved." This is reminiscent of the US Commissioner of Patents in 1899 reputedly stating that "Everything that can be invented has been invented!" Well, not quite. Certainly we've solved some important problems. We've identified others. And, we have not yet scratched the surface of some frontiers.

Humans will eternally need to communicate. More and more of this communication will be mediated and augmented by machines. And, whether the communication is with other humans or with machines, speech will carry the principal burden in information transfer. Speech is, and will be, supplemented by other sensory modes - particularly by sight and touch. The ideal is to achieve the naturalness of face-to-face interaction - with environments for participants that exhibit spatial realism in all sensory dimensions. We are a very long way from this realization. Central issues involve not only information capture and display, but methods for fusing simultaneous sensory information in a way that is context-aware and permits reliable anticipation of user intent and generation of an intelligent response. Such mediating systems of course must accommodate multi-linguality - moving toward the dream of unconstrained translating telephony, with synthesis exhibiting the voice characteristics of the originator. These are heady concepts now, but so were text-to-speech and speech-to-text in 1960! I believe the research frontier is even more challenging now!
