Major progress is being recorded regularly on both the technology and exploitation of automatic speech recognition (ASR) and spoken language systems. However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances. This is related to several factors, such as the sensitivity to the environment (background noise), or the weak representation of grammatical and semantic knowledge. Current research is also emphasizing deficiencies in dealing with variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes the use by specific populations. Also, some applications, like directory assistance, particularly stress the core recognition technology due to the very high active vocabulary (application perplexity). There are actually many factors affecting the speech realization: regional, sociolinguistic, or related to the environment or the speaker herself. These create a wide range of variations that may not be modeled correctly (speaker, gender, speaking rate, vocal effort, regional accent, speaking style, non-stationarity, etc.), especially when resources for system training are scarce. This paper outlines current advances related to these topics.
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate on advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with a discussion of future directions.
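As a rough illustration of the step from vectors to supervectors mentioned above, the following sketch adapts the means of a diagonal-covariance Gaussian mixture background model to one utterance and stacks them into a single long vector. The abstract does not specify a construction; mean-only MAP adaptation with a relevance factor is one commonly used recipe and is assumed here. The function name `map_adapted_supervector`, the parameter `relevance`, and all data are hypothetical.

```python
import numpy as np


def map_adapted_supervector(features, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal GMM, stacked into a supervector.

    features : (T, D) feature vectors from one utterance
    weights  : (K,)   mixture weights of the background model
    means    : (K, D) component means of the background model
    variances: (K, D) diagonal covariances of the background model
    relevance: assumed relevance factor; larger values keep the adapted
               means closer to the background model
    """
    # Log-density of every frame under every Gaussian component: (T, K).
    diff = features[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )

    # Posterior probability of each component for each frame.
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Zeroth- and first-order statistics per component.
    counts = post.sum(axis=0)                                   # (K,)
    data_means = (post.T @ features) / np.maximum(counts, 1e-10)[:, None]

    # Interpolate between the data means and the background means.
    alpha = (counts / (counts + relevance))[:, None]
    adapted = alpha * data_means + (1.0 - alpha) * means

    # Stack the K adapted means into one vector of length K * D.
    return adapted.reshape(-1)


# Toy usage with random data standing in for real acoustic features.
rng = np.random.default_rng(0)
K, D, T = 8, 3, 200
ubm_weights = np.full(K, 1.0 / K)
ubm_means = rng.normal(size=(K, D))
ubm_vars = np.ones((K, D))
utterance = rng.normal(size=(T, D))
supervector = map_adapted_supervector(utterance, ubm_weights, ubm_means, ubm_vars)
print(supervector.shape)
```

Each utterance, whatever its length, maps to a vector of the same dimension, which is what makes such representations convenient inputs for fixed-dimensional classifiers and for session-variability compensation.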
Purpose: Previous articles in this supplement described rationale for and development of the pause marker (PM), a diagnostic marker of childhood apraxia of speech (CAS), and studies supporting its validity and reliability. The present article assesses the theoretical coherence of the PM with speech processing deficits in CAS. Method: PM and other scores were obtained for 264 participants in 6 groups: CAS in idiopathic, neurogenetic, and complex neurodevelopmental disorders; adult-onset apraxia of speech (AAS) consequent to stroke and primary progressive apraxia of speech; and idiopathic speech delay. Results: Participants with CAS and AAS had significantly lower scores than typically speaking reference participants and speech delay controls on measures posited to assess representational and transcoding processes. Representational deficits differed between CAS and AAS groups, with support for both underspecified linguistic representations and memory/access deficits in CAS, but for only the latter in AAS. CAS-AAS similarities in the age-sex standardized percentages of occurrence of the most frequent type of inappropriate pauses (abrupt) and significant differences in the standardized occurrence of appropriate pauses were consistent with speech processing findings. Conclusions: Results support the hypotheses of core representational and transcoding speech processing deficits in CAS and theoretical coherence of the PM's pause-speech elements with these deficits.
People naturally move their heads when they speak, and our study shows that this rhythmic head motion conveys linguistic information. Three-dimensional head and face motion and the acoustics of a talker producing Japanese sentences were recorded and analyzed. The head movement correlated strongly with the pitch (fundamental frequency) and amplitude of the talker's voice. In a perception study, Japanese subjects viewed realistic talking-head animations based on these movement recordings in a speech-in-noise task. The animations allowed the head motion to be manipulated without changing other characteristics of the visual or acoustic speech. Subjects correctly identified more syllables when natural head motion was present in the animation than when it was eliminated or distorted. These results suggest that nonverbal gestures such as head movements play a more direct role in the perception of speech than previously known.
• The role of the motor system in speech perception is reviewed.
• Distributed production regions/networks ubiquitously participate in perception.
• Regions/networks are specific to production and vary dynamically with context.
• Data consistent with sensorimotor/complex network models of speech perception.
• Existing models of the organization of language and the brain fail to explain results.
Does “the motor system” play “a role” in speech perception? If so, where, how, and when? We conducted a systematic review that addresses these questions using both qualitative and quantitative methods. The qualitative review of behavioural, computational modelling, non-human animal, brain damage/disorder, electrical stimulation/recording, and neuroimaging research suggests that distributed brain regions involved in producing speech play specific, dynamic, and contextually determined roles in speech perception. The quantitative review employed region and network based neuroimaging meta-analyses and a novel text mining method to describe relative contributions of nodes in distributed brain networks. Supporting the qualitative review, results show a specific functional correspondence between regions involved in non-linguistic movement of the articulators, covertly and overtly producing speech, and the perception of both nonword and word sounds. This distributed set of cortical and subcortical speech production regions is ubiquitously active and forms multiple networks whose topologies dynamically change with listening context. Results are inconsistent with motor-only and acoustic-only models of speech perception and with classical and contemporary dual-stream models of the organization of language and the brain. Instead, results are more consistent with complex network models in which multiple speech production related networks and subnetworks dynamically self-organize to constrain interpretation of indeterminate acoustic patterns as listening context requires.
The anatomy of language has been investigated with PET or fMRI for more than 20 years. Here I attempt to provide an overview of the brain areas associated with heard speech, speech production and reading. The conclusions of many hundreds of studies were considered, grouped according to the type of processing, and reported in the order that they were published. Many findings have been replicated time and time again, leading to some consistent and undisputable conclusions. These are summarised in an anatomical model that indicates the location of the language areas and the most consistent functions that have been assigned to them. The implications for cognitive models of language processing are also considered. In particular, a distinction can be made between processes that are localized to specific structures (e.g. sensory and motor processing) and processes where specialisation arises in the distributed pattern of activation over many different areas that each participate in multiple functions. For example, phonological processing of heard speech is supported by the functional integration of auditory processing and articulation; and orthographic processing is supported by the functional integration of visual processing, articulation and semantics. Future studies will undoubtedly be able to improve the spatial precision with which functional regions can be dissociated but the greatest challenge will be to understand how different brain regions interact with one another in their attempts to comprehend and produce language.
This review gives a general overview of techniques used in statistical parametric speech synthesis. One instance of these techniques, called hidden Markov model (HMM)-based speech synthesis, has recently been demonstrated to be very effective in synthesizing acceptable speech. This review also contrasts these techniques with the more conventional technique of unit-selection synthesis that has dominated speech synthesis over the last decade. The advantages and drawbacks of statistical parametric synthesis are highlighted and we identify where we expect key developments to appear in the immediate future.
Deriving a good model for multitalker babble noise can facilitate different speech processing algorithms, e.g., noise reduction, to reduce the so-called cocktail party difficulty. In the available systems, the fact that the babble waveform is generated as a sum of N different speech waveforms is not exploited explicitly. In this paper, first we develop a gamma hidden Markov model for power spectra of the speech signal, and then formulate it as a sparse nonnegative matrix factorization (NMF). Second, the sparse NMF is extended by relaxing the sparsity constraint, and a novel model for babble noise (gamma nonnegative HMM) is proposed in which the babble basis matrix is the same as the speech basis matrix, and only the activation factors (weights) of the basis vectors are different for the two signals over time. Finally, a noise reduction algorithm is proposed using the derived speech and babble models. All of the stationary model parameters are estimated using the expectation-maximization (EM) algorithm, whereas the time-varying parameters, i.e., the gain parameters of speech and babble signals, are estimated using a recursive EM algorithm. The objective and subjective listening evaluations show that the proposed babble model and the final noise reduction algorithm significantly outperform the conventional methods.
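The idea that babble shares its basis matrix with speech, and differs only in the activations, can be sketched with ordinary nonnegative matrix factorization. The code below is not the gamma nonnegative HMM or the recursive EM estimator described in the abstract; it substitutes plain NMF with multiplicative updates for the generalized Kullback-Leibler divergence, and all function names, sizes, and the synthetic "power spectra" are assumptions made for illustration.

```python
import numpy as np

EPS = 1e-12  # small constant to avoid division by zero and log(0)


def kl_divergence(V, V_hat):
    """Generalized Kullback-Leibler divergence between nonnegative matrices."""
    V_hat = V_hat + EPS
    return float(np.sum(V * np.log((V + EPS) / V_hat) - V + V_hat))


def update_activations(V, W, H):
    """One multiplicative update of the activations H with the basis W fixed."""
    WH = W @ H + EPS
    return H * (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + EPS)


def update_basis(V, W, H):
    """One multiplicative update of the basis W with the activations H fixed."""
    WH = W @ H + EPS
    return W * ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + EPS)


def learn_basis(V, rank, n_iter=200, seed=0):
    """Factorize V (frequency x time) as W @ H, learning both factors."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + EPS
    H = rng.random((rank, V.shape[1])) + EPS
    for _ in range(n_iter):
        H = update_activations(V, W, H)
        W = update_basis(V, W, H)
    return W, H


def fit_activations(V, W, n_iter=200, seed=0):
    """Explain V with a fixed, shared basis W by estimating activations only."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        H = update_activations(V, W, H)
    return H


# Synthetic stand-ins for power spectra: 20 frequency bins, 60 frames.
rng = np.random.default_rng(42)
true_basis = rng.random((20, 4))
speech = true_basis @ rng.random((4, 60))

# Babble approximated as the sum of N talkers' power spectra, each talker
# drawing on the same spectral shapes with its own activations over time.
n_talkers = 6
babble = sum(true_basis @ rng.random((4, 60)) for _ in range(n_talkers))

# Learn the basis from speech, then reuse it unchanged for babble.
speech_basis, _ = learn_basis(speech, rank=4)
babble_activations = fit_activations(babble, speech_basis)
print(kl_divergence(babble, speech_basis @ babble_activations))
```

Reusing `speech_basis` for the babble signal mirrors the modeling choice in the abstract: the spectral building blocks are the same, and only the weights over time distinguish a single talker from a crowd. Summing power spectra across talkers assumes the talkers are uncorrelated, which is a simplification.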
Speech production is a left-lateralized brain function, which could arise from a left dominance either in speech executive or sensory processes or both. Using functional magnetic resonance imaging in healthy subjects, we show that sensory cortices already lateralize when speaking is intended, while the frontal cortex only lateralizes when speech is acted out. The sequence of lateralization, first temporal then frontal lateralization, suggests that the functional lateralization of the auditory cortex could drive hemispheric specialization for speech production.
The motor theory of speech perception assumes that activation of the motor system is essential in the perception of speech. However, deficits in speech perception and comprehension do not arise from damage that is restricted to the motor cortex, few functional imaging studies reveal activity in the motor cortex during speech perception, and the motor cortex is strongly activated by many different sound categories. Here, we evaluate alternative roles for the motor cortex in spoken communication and suggest a specific role in sensorimotor processing in conversation. We argue that motor cortex activation is essential in joint speech, particularly for the timing of turn taking.