Superior Temporal Sulcus Auditory, Visual and Speech Processing

The superior temporal sulcus (STS) is activated during a variety of perceptual tasks including audiovisual integration (Beauchamp et al., 2004; Amedi et al., 2005), speech perception (Binder et al., 2000, 2008; Hickok and Poeppel, 2004, 2007; Price, 2010), and biological motion perception (Allison et al., 2000; Grossman et al., 2000, 2005; Grossman and Blake, 2002; Beauchamp et al., 2003; Puce and Perrett, 2003).

It has been widely established that auditory speech perception is influenced by visual speech information (Sumby and Pollack, 1954; McGurk and MacDonald, 1976; Dodd, 1977; Reisberg et al., 1987; Callan et al., 2003), which is represented in part within biological motion circuits that specify the shape and position of vocal tract articulators. This high-level visual information is hypothesized to interact with auditory speech representations in the STS (Callan et al., 2003).

Indeed, the STS is well-positioned to integrate auditory and visual inputs as it lies between visual association cortex in the posterior lateral temporal region (Beauchamp et al., 2002) and auditory association cortex in the superior temporal gyrus (Rauschecker et al., 1995; Kaas and Hackett, 2000; Wessinger et al., 2001). In nonhuman primates, polysensory fields in STS have been shown to receive convergent input from unimodal auditory and visual cortical regions (Seltzer and Pandya, 1978, 1994; Lewis and Van Essen, 2000) and these fields contain auditory, visual and bimodal neurons (Benevento et al., 1977; Bruce et al., 1981; Dahl et al., 2009). Furthermore, human functional neuroimaging evidence supports the notion that the STS is a multisensory convergence zone for speech (Calvert et al., 2000; Wright et al., 2003; Beauchamp et al., 2004, 2010; Szycik et al., 2008; Stevenson and James, 2009; Stevenson et al., 2010, 2011; Nath and Beauchamp, 2011, 2012).

However, it remains unclear what role, if any, biological-motion-sensitive regions of the STS play in multimodal speech perception. By and large, facial motion yields activation quite posteriorly in the STS, a location that is potentially distinct from auditory and visual speech-related activations. This holds for natural facial motion (Puce et al., 1998), movements of facial line drawings (Puce et al., 2003), and point-light facial motion (Bernstein et al., 2011).

Superior Temporal Sulcus Speech Representations

Greg Hickok and colleagues at the University of California, Irvine investigated the functional organization of the superior temporal sulcus with respect to modality-specific and multimodal speech representations. Twenty young adult participants performed an oddball detection task while being presented with auditory, visual, and audiovisual speech stimuli, as well as auditory and visual nonspeech control stimuli, in a block-design fMRI experiment.

A posterior to anterior gradient of effects

A posterior to anterior gradient of effects related to visual and auditory speech information in bilateral superior temporal sulcus (STS).
Three separate meta-analyses were performed using NeuroSynth to identify studies that only included healthy participants and reported effects in STS. Two custom meta-analyses (dynamic facial expressions, audiovisual speech) and one term-based meta-analysis (speech sounds) were performed (see color key for details). The FDR-corrected (p < 0.01) reverse inference Z-statistic maps for each meta-analysis were downloaded from NeuroSynth for plotting. Results from dynamic facial expressions (blue), audiovisual speech (green), and speech sounds (red) meta-analyses are plotted on the study-specific template in MNI space (see “Study-Specific Anatomical Template” Section) and restricted to an STS region of interest to highlight the spatial distribution of effects within the STS.
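
The restriction of each map to an STS region of interest, and the resulting posterior-to-anterior ordering, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy Z-values, the y-coordinates, and the mask below are all hypothetical, chosen only to show the idea. It relies on the MNI convention that more negative y-coordinates are more posterior.

```python
from typing import List, Sequence


def restrict_to_roi(z_values: Sequence[float], roi_mask: Sequence[bool]) -> List[float]:
    """Zero out every voxel that falls outside the region of interest."""
    if len(z_values) != len(roi_mask):
        raise ValueError("map and mask must have the same number of voxels")
    return [z if inside else 0.0 for z, inside in zip(z_values, roi_mask)]


def weighted_center(y_coords: Sequence[float], z_values: Sequence[float]) -> float:
    """Z-weighted mean position along the posterior-anterior (y) axis."""
    total = sum(z for z in z_values if z > 0)
    if total == 0:
        raise ValueError("no suprathreshold voxels inside the ROI")
    return sum(y * z for y, z in zip(y_coords, z_values) if z > 0) / total


# Toy data (hypothetical, not taken from the study): six voxels along
# the sulcus, ordered from posterior (y = -60 mm) to anterior (y = -10 mm).
y_mm = [-60, -50, -40, -30, -20, -10]
sts_roi = [True, True, True, True, True, False]  # last voxel lies outside the ROI

# Already-thresholded Z-maps, as they would arrive after FDR correction.
facial_motion_z = [4.0, 3.0, 0.0, 0.0, 0.0, 5.0]
speech_sounds_z = [0.0, 0.0, 0.0, 3.0, 4.0, 5.0]

facial_in_roi = restrict_to_roi(facial_motion_z, sts_roi)
speech_in_roi = restrict_to_roi(speech_sounds_z, sts_roi)

facial_center = weighted_center(y_mm, facial_in_roi)
speech_center = weighted_center(y_mm, speech_in_roi)

# A more negative center means a more posterior distribution of effects.
print(f"facial motion center: {facial_center:.1f} mm")
print(f"speech sounds center: {speech_center:.1f} mm")
```

In this toy example the facial-motion map centers more posteriorly than the speech-sounds map, which is the kind of spatial ordering the figure is meant to highlight. Real maps are three-dimensional volumes, so an actual analysis would use a neuroimaging library rather than flat lists.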

The results demonstrated the following:

(1) activation in the STS follows a posterior-to-anterior functional gradient from facial motion processing, to multisensory processing, to auditory processing;
(2) speech-specific activations arise in multisensory regions of the middle STS;
(3) abstract representations of visible facial gestures emerge in visual regions of the pSTS that immediately border the multisensory regions.

The authors therefore suggest a functional-anatomic workflow for speech processing in the STS — namely, lower-level aspects of facial motion are processed in the posterior-most visual STS subregions; high-level/abstract aspects of facial motion are extracted in the pSTS immediately bordering mSTS; visual and auditory speech representations are integrated in mSTS; and integrated percepts feed into speech processing streams (Hickok and Poeppel, 2007; Rauschecker and Scott, 2009), potentially including auditory-phonological systems for speech sound categorization in more-anterior regions of the STS (Specht et al., 2009; Liebenthal et al., 2010; Bernstein and Liebenthal, 2014).

Jonathan H. Venezia, Kenneth I. Vaden Jr, Feng Rong, Dale Maddox, Kourosh Saberi and Gregory Hickok
Auditory, Visual and Audiovisual Speech Processing Streams in Superior Temporal Sulcus
Front. Hum. Neurosci., 07 April 2017

Copyright © 2017 Venezia, Vaden Jr., Rong, Maddox, Saberi and Hickok. Republished under the terms of the Creative Commons Attribution License (CC BY). Top Image: Jonathan H. Venezia, et al. Front. Hum. Neurosci