Introduction
Music performance takes many forms in our society, but usually involves trained and practiced musicians playing for an audience. In some musical traditions, conventions dictate silent and motionless listening behaviour from the audience; in other musical traditions, the audience is encouraged to move to or sing along with the music. Occasionally, music performance takes the form of a participatory activity that people do together as a group.
Music performance thus provides a venue for interaction between people – indeed, some scientists hypothesize that social bonding effects, in part, encouraged the widespread evolution of music-making abilities in early humans (Fitch, 2005; Huron, 2001; Tarr, Launay, & Dunbar, 2014). In the performance science literature, however, musical communication has often been conceptualized as a one-way trajectory from (active) performer to (passive) listener. Such a perspective fails to acknowledge that music-making is an inherently social process in which (live or prospective) audiences can influence performers’ behaviour.
Music production and perception are active processes that engage overlapping perceptual-motor mechanisms, and movement forms a critical and inseparable part of both forms of musical experience. Our aim in this paper is to highlight the role of the sensorimotor system in music perception and the role of the audience in musical communication. We situate our discussion in the theoretical context of embodiment, which defines cognition as encompassing both internal processing and observable interactions with the world (Leman, 2012; Leman & Maes, 2014).
Following a discussion of the embodied music cognition paradigm, we consider the “musical product” (i.e., the presented performance) in terms of the movements – sound-producing, communicative, and expressive – that go into it. We explore the idea that movement not only underpins music production, but is also a part of the musical product itself (e.g., when it carries expressive and/or communicative information that supplements the audio signal). In the subsequent section, we discuss how movement underlies the audiovisual perception of performed music. We consider how audience members’ prior experiences shape their perception of sound-producing movements, and how sounded music activates sensations of motion in listeners and, in some cases, encourages overt movement. We close with a discussion of how recent technological developments have simultaneously given us improved means to study musical communication and raised new questions for researchers to address.
Embodied Music Cognition and Perceptual-Motor Coupling
Embodied music cognition considers musical communication to be a nonlinear process characterized by dynamic interactions between performers, listeners, and their shared environment. This paradigm contrasts with the “individualist” approach, which treats performers and listeners as separable from each other and from the musical stimulus, potentially understating the extent to which these three components interact (Moran, 2014). Some authors have argued that a focus on the Western art music tradition has reinforced the individualist perspective, with its seemingly linear process of a composer writing a score, performers playing their interpretation of the score, and listeners interpreting the sounded music (Maes et al., 2014; Moran, 2014). In other musical traditions, the process of creating music can be more overtly collaborative and emergent (i.e., shaped dynamically in real-time) – for example, group improvisation requires clear interaction between performers, and mother-infant lullaby singing involves clear bidirectional interaction between performer and listener, as the mother’s performance is shaped in real-time by the responses of her infant (see below; Trevarthen, 2012).
Within the embodiment paradigm, there are diverging perspectives regarding the possible role of representational cognition in musical communication. Drawing on dynamical systems theory, some argue against the use of mental representations, and instead propose a framework in which musical interaction is dynamic, emergent, and autonomous (Schiavio & Høffding, 2015). Cognitive processes are said to be distributed across the collaborating musicians and their environment, rather than constrained within individuals, so “hidden” intentions – mental representations that are known only to individual performers – should not play a role. Those who take a less radical approach consider mental representations to be grounded in specific actions and bodily states. In this way, Leman and Maes (2014) describe the human body as a mediator linking people’s subjective experiences with their environment. The body mediates in both directions, helping to encode expressive ideas into sound as well as to decode expression from sound.
The research presented in this chapter is largely in line with this more moderate approach to embodiment. From this perspective, interaction with the environment occurs by way of action-perception loops. As also described in the literature on perceptual-motor coupling, actions are coded in the brain in terms of the consequences they have on the environment (Hommel et al., 2001; Maes et al., 2014; Prinz, 1990). Activation runs in both directions: executing an overt action activates expectations for the associated effect, while perceiving the effect primes the motor commands needed to produce it. Multiple levels of action-perception loops may run in parallel (Leman, 2012). Low-level loops, in which sensory input drives motor activity, may allow regulation of some aspects of performance technique (e.g., posture, breathing, finger position), while high-level loops, which draw on a repertoire of “gestures” (i.e., meaning-carrying movements), may facilitate planning of expressive contours.
The embodiment paradigm has been criticized for being overly broad and poorly defined – while its arguments are broadly consistent with findings in the literature, it does not establish hypotheses specific enough to be empirically tested against the alternative (disembodied) explanation (Matyja, 2016). However, it is a useful perspective to take in the context of the current paper, as it encourages us to reconsider the usual conceptual division between perception and performance and to examine what the audience contributes to the process of music creation.
Movement as a Part of Music Production
Traditionally, body movement has been necessary for musical sound production, but musicians’ gestures serve other functions too: they enable changes in tone quality, facilitate coordination between ensemble members, convey expressive information to the audience, and support the performer’s own affective experience (Bishop & Goebl, 2018; Dahl et al., 2010). All of these categories of gesture form part of the musical product presented to an audience, provided the performance can be seen as well as heard. In the following sections, we discuss three categories of movements: those involved in sound production, those that support ensemble coordination, and those that communicate expression visually.
Sound Production
A central aim of the research on instrumental playing technique is to determine how variations in acoustic parameters are controlled by the performer. At a cognitive level, some musicians report focusing on the image of a desired sound as they play, which they say helps to guide their performance (Trusheim, 1991). The results of empirical studies support this claim. As explained above, perception of a sound is presumed to facilitate activation of the motor commands necessary to produce that sound. Imagining the sound has been shown to have a similar facilitatory effect on body movements (Bishop, Bailes, & Dean, 2013; Keller, Dalla Bella, & Koch, 2010).
The motor commands used to externalize these images and deliver musical output have been studied in increasing depth as technology for capturing fine movements has improved (Goebl, 2017). How skilled musicians maintain rhythmic consistency has been a primary focus in this line of research. In a study of piano performance, Tominaga et al. (2016) used a glove fitted with sensors to show how rotational velocity at finger joints and movement independence between fingers underlie rhythmic consistency. In saxophone performance, maintaining rhythmic consistency involves achieving precise coordination of finger and tongue movements. Using newly-developed sensor-equipped reeds, Hofmann and Goebl (2014) found higher consistency for coupled finger-tongue movements than for tongue-only (isotonal) movements at a slow tempo, but higher consistency for tongue-only movements at a fast tempo.
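A simple way to make the notion of rhythmic consistency concrete is to quantify the evenness of note onsets, for example as the coefficient of variation of inter-onset intervals. The sketch below illustrates this measure on simulated onset times; it is a minimal, generic example of our own, not the specific analysis used in the studies cited above, and the function name and jitter values are assumptions.

```python
import numpy as np

def rhythmic_consistency(onset_times):
    """Coefficient of variation (CV) of inter-onset intervals (IOIs):
    lower values indicate more even, consistent timing."""
    onsets = np.asarray(onset_times, dtype=float)
    iois = np.diff(onsets)                # intervals between successive notes
    return np.std(iois) / np.mean(iois)   # CV of the IOIs

# Hypothetical example: sixteenth notes at 120 bpm (nominal IOI = 0.125 s)
# played with a small amount of timing jitter.
rng = np.random.default_rng(0)
onsets = np.cumsum(rng.normal(loc=0.125, scale=0.005, size=32))
print(f"IOI coefficient of variation: {rhythmic_consistency(onsets):.3f}")
```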
How skilled musicians manipulate timbre and dynamics has also been a focus of study. In an investigation of piano timbre, Goebl, Bresin, and Fujinaga (2014) used accelerometers to capture the kinematics of piano keys and hammers as they were played with “struck” and “pressed” finger movements. Pianists claim that these movements produce different tone qualities, and, indeed, listeners were able to discriminate between them even when key velocity was held constant. The points at which the striking finger hit the key and key bottom, identified using the accelerometer data, were found to provide acoustic cues that facilitated discrimination between timbres. In violin playing, timbre seems to be largely controlled through manipulations of bow force. This was determined by Schoonderwaldt (2009a), who used a bowing machine to measure the influence of bow force, velocity, and bow-bridge distance on intonation and spectral centroid, an acoustic correlate of perceived timbre. To achieve different dynamic levels, violinists manipulate bow force, bow-bridge distance and bow angle simultaneously (Schoonderwaldt, 2009b).
Interperformer Coordination
Ensemble musicians aim primarily to coordinate their sound. Sometimes they deliberately coordinate their sound-producing gestures as well – for example, orchestral string musicians typically use coordinated bowing patterns. More often, though, the sounds that must be coordinated are the result of different types of sound-producing gestures, which require different attack durations. Some coordination of expressive gestures (e.g., body sway) also occurs (Keller, Dalla Bella, & Koch, 2010; Moran et al., 2015), though it is unclear whether this is intended (or noticed) by performers, or simply a byproduct of performers sharing an interpretation of the music. Also unclear is the extent to which coordination of ancillary movements shapes the audience’s perception of ensemble coherence.
To coordinate their sound, ensemble musicians do not generally need to be able to see each other’s movements. Both trained and novice musicians synchronize with sounded rhythms in the absence of visual cues, even when the sounded rhythm contains irregularities and error-correction is needed to maintain synchronization (Hove et al., 2013; Konvalinka et al., 2010; van der Steen et al., 2015). Nevertheless, musicians do look towards each other when coordinating a performance, particularly at moments where their interpretations are likely to diverge, and uncertainty about each other’s intentions is high – for instance, at the start of a piece, following a long pause, or at sudden tempo or metrical changes (Bishop & Goebl, 2015; Kawase, 2014).
Some recent studies of ours have considered the visual signals that are exchanged at piece onset (Bishop & Goebl, 2017a, 2017b). In one study, we presented audiovisual point-light recordings of pianists’ cueing-in gestures to musician viewers, who were instructed to tap in synchrony with the beat of the music, aligning their first taps with the pianists’ first onsets (Bishop & Goebl, 2017b). Viewers synchronized most successfully with gestures that were low in jerk and large in magnitude. They also synchronized more successfully with gestures given by experienced ensemble pianists than with gestures given by pianists lacking ensemble experience.
Conductors’ gestures have been subjected to similar analysis. A study by Wöllner et al. (2012) showed that observers synchronize more successfully with gestures that are low in jerk and high in prototypicality than with gestures that are high in jerk and low in prototypicality. In conductors’ as well as instrumentalists’ gestures, beats seem to correspond to peaks in acceleration, rather than specific points in the spatial trajectory (Bishop & Goebl, 2017a, 2017b; Luck & Sloboda, 2009; Wöllner et al., 2012).
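To make these kinematic quantities concrete, the sketch below derives acceleration and jerk from a sampled one-dimensional movement trace by finite differences, estimates candidate beat locations as peaks in the magnitude of acceleration, and summarizes smoothness as mean squared jerk (higher values indicating jerkier movement). This is a simplified illustration under our own assumptions (function name, a minimum beat spacing of 250 ms), not the analysis pipeline of the studies cited above.

```python
import numpy as np
from scipy.signal import find_peaks

def beat_candidates_from_motion(position, fs):
    """Estimate beat locations from a 1-D position trace (e.g., vertical
    hand position in metres) sampled at fs Hz, as local peaks in the
    magnitude of acceleration; also return mean squared jerk as a
    rough smoothness index."""
    velocity = np.gradient(position) * fs        # m/s
    acceleration = np.gradient(velocity) * fs    # m/s^2
    jerk = np.gradient(acceleration) * fs        # m/s^3
    peaks, _ = find_peaks(np.abs(acceleration),
                          distance=int(0.25 * fs))  # assume beats >= 250 ms apart
    return peaks / fs, np.mean(jerk ** 2)

# Hypothetical example: a 2 Hz up-down, conducting-like motion sampled at 200 Hz
fs = 200
t = np.arange(0, 4, 1 / fs)
beat_times, mean_sq_jerk = beat_candidates_from_motion(0.1 * np.sin(2 * np.pi * 2 * t), fs)
print(beat_times, mean_sq_jerk)
```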
Of course, musicians make such explicit use of visual signalling relatively infrequently. It has been proposed that ensemble coordination is largely supported by an exchange of low-level sensory information (e.g., psychoacoustic sound parameters and movement kinematics) that induces entrainment between performers (Pachet, Roy, & Foulon, 2017). Musicians presumably draw more or less on these higher- and lower-level processes at any given moment, depending on the performance conditions (MacRitchie, Varlet, & Keller, 2017).
Visual Expressivity
Musicians’ body movements are of substantial communicative value to an observing audience. The question of how visual cues in the form of observed body gestures contribute to our perception of music performance has generated some debate among researchers and, of course, among musicians, whose primary focus is the sound their movements produce, and who are not always pleased to think that the visual modality could have a prevailing impact on their audience’s experience.
One line of research in this area tests the hypothesis that auditory and visual contributions to the perception of music expression are integrated, rather than additive. In a study by Vuoskoski et al. (2014), recordings of pianists performing with deadpan, normal, or exaggerated levels of expression were used. Audio and visual tracks were recombined (e.g., normal audio paired with deadpan video) and presented to trained and novice musicians, who provided ratings of auditory expressivity. Visual stimuli affected ratings of auditory expressivity even when matched with incongruous audio, though the magnitude of the effect differed between performing pianists. These results suggest that the relative weighting of auditory and visual cues is malleable, so the influence that one modality exerts over the other can vary. Similar results were subsequently obtained from an experiment testing emotional impact instead of perceived auditory expressivity (Vuoskoski et al., 2016).
The magnitude of the effect that visual cues can have on even a highly-trained audience’s perception of expressivity was demonstrated by Behne and Wöllner (2011). Musician participants were presented with audiovisual recordings of different pianists’ performances of the same two pieces and asked to provide expressivity ratings. All recordings shared the same audio track but differed in video: the pianists had actually performed in sync with the same pre-recorded audio. Ratings nevertheless differed between the recordings, despite the audio being identical. High interrater variability was also observed, suggesting that the integration of auditory and visual cues could be an idiosyncratic process.
Tsay (2013) tested professional musicians and untrained listeners’ abilities to identify the winners of piano competitions, given audio, visual, or audiovisual presentation of performance excerpts. Both groups of participants made the most accurate judgments when presented with silent video excerpts, suggesting that visual cues might be even more informative than audio cues for discriminating competition winners. On the other hand, the apparent visual advantage could be attributable to a biased selection of stimulus excerpts (Platz et al., 2016). Further study using more systematically selected stimuli would be needed to confirm this surprising finding.
The research we describe here shows that the audience’s perception of a musical performance derives from an integration of auditory and visual kinematic cues, and it highlights the audience’s role in assigning meaning to the performance. In the next section, we consider how movement can be perceived through musical sound as well as visually.
Movement as a Part of Music Perception
Musical communication is a creative, dynamic process comprising performance and perception components. In the previous section, we considered how body movements underlie performers’ contributions to musical communication, and we must acknowledge that the audience’s feedback, whether real or imagined, helps to shape those movements. When engaging with a musical performance, audience members draw on their own abilities, constraints, and experiences to construct some meaning from the performers’ audiovisual signals. As such, they can be considered active contributors to the creative process of musical communication.
The term “communicative musicality” has been used to describe how coordinated companionship arises from the temporal and expressive attributes of social behaviour (Trevarthen, 2012). Three attributes are defined: pulse, the regular patterning of a performer’s output through time that allows the audience to anticipate what might follow; quality, the expressive contours of sound and body gestures; and narrative, the linking of pulse and quality into units that allow the performer and audience to share a sense of passing time. The application of this paradigm to the study of mother-infant musical interaction has shown how infants respond behaviourally to the timing and affective quality of their mothers’ singing (e.g., with body rocking, timed verbal utterances, and facial expressions), and how the mothers’ behaviour is shaped by evidence of their infants’ engagement. The communicative behaviour observed between mothers and infants provides clear examples of how listeners’ body movements can both facilitate their own understanding of a musical performance and provide feedback to the performer.
The embodied music cognition paradigm posits that movement underlies music perception just as it underlies performance. In this section, we discuss 1) how audience members’ perceptions of sound-producing movement are shaped by their prior experience, 2) how movement is “heard” in sounded music through associations of motion and acoustic parameters, and 3) why music sometimes prompts listeners to move.
Audiovisual Perception of Human Motion: Effects of Experience
Audience members become active participants in a music performance the moment the sounded and/or observed music enters their perceptual systems. If the performance can be seen and heard, a critical part of the perception process is the binding of auditory and visual signals into distinct perceptual events that correspond to sound-producing gestures and their acoustic effects. This process of audiovisual integration draws on action-perception loops that strengthen with exposure to different gestures and their associated effects. The more experience a person has with a repertoire of gesture-sound pairs, the more precise their gesture-sound associations become. Strengthened associations have been observed for pitch (Keller & Koch, 2008), dynamics and articulation (Bishop, Bailes, & Dean, 2013), and timing. In particular, tolerance for audiovisual asynchrony in musical stimuli has been shown to decrease with increasing musical expertise (Petrini et al., 2009).
Tolerance for asynchrony or incongruency in perceived gesture-sound pairs also depends on the type of motion observed and the type of sound produced. Less asynchrony is tolerated for piano playing than for (bowed) violin playing (Bishop & Goebl, 2014). Piano playing involves percussive gestures and yields sounds for which the physical onset (i.e., the point where the sound begins) and the perceptual onset (i.e., the point where the sound becomes perceptible to the listener) are virtually simultaneous. In contrast, violin playing involves continuous gestures and yields sounds for which a small interval between the physical and perceptual onsets can exist, and a range of perceptual onsets might be tolerated (Vos & Rasch, 1981).
Strengthening of action-perception loops occurs with both perceptual and motor experience. For example, among pianists, listening practice (without overt movement) results in better recall (i.e., performance) of simple melodies than does motor practice without sound (Brown & Palmer, 2013). Such an effect shows how the action-perception loops drawn upon during performance are strengthened by listening experience.
On the other hand, the learning benefits of combined perceptual-motor experience have been shown to outweigh the learning benefits of perceptual experience without overt movement. Aglioti et al. (2008) found that skilled basketball players predicted the success of free throws faster and more accurately than did either coaches or sports journalists, who had extensive visual but limited motor experience. In the music domain, Wöllner and Cañal-Bruland (2010) found that skilled string musicians predict note onset times from violinists’ cueing gestures more precisely than do skilled non-string musicians.
These findings suggest that audience members draw on movement in the form of action-perception loops during the early stages of interpreting perceived music, when the binding of audio and visual signals occurs. As discussed above, the way audio and visual signals combine has a potentially strong influence over audience members’ perceptions of expression. The learning that occurs with observation of others’ performances, even if it occurs to a lesser extent than with overt practice, shows how the perceptual-motor system changes with experience in ways that sharpen prediction abilities and support multisensory associations.
Hearing Movement in Sound
Cross-modal correspondences are symmetric associations that people make between parameters in different sensory modalities. Associations between pitch height and spatial height, for example, are widespread and seemingly independent of musical training and linguistic background (Eitan, 2017). Cross-modal correspondences can modulate overt movement, speeding motor responses when the sounded stimulus and target movement are “correctly” matched (Keller & Koch, 2008).
Eitan and Granot (2006) tested for associations between music and motion. Listeners were presented with auditory sequences in which only one parameter was varied (e.g., rising or falling pitch, increasing or decreasing loudness), and imagined a human figure moving with the music. Their descriptions of the imagined figures’ movements suggested a number of correspondences, including an association between increasing/decreasing loudness and decreasing/increasing distance change, and an association between pitch contour and vertical position change.
Some of the features that people associate with acoustic parameters they also associate with emotional constructs. For example, some positively-valenced words (e.g., happy) are associated with a high spatial position, while their antonyms (e.g., sad) are associated with a low spatial position (Eitan & Timmers, 2010; Gozli et al., 2013). A study by Weger et al. (2007) showed how associations between emotional valence, spatial height, and acoustic pitch can interrelate: positively- and negatively-valenced words (e.g., kiss, dead) were found to prime judgements of the pitch height of sounded tones. Eitan (2017) suggests that emotion may mediate some acoustic-visuospatial associations, such as the association of spatial height with pitch.
Taking a different perspective, the “FEELA” (Force-Effort-Energy-Loudness-Arousal) hypothesis relates affective parameters of music (e.g., arousal) to the corresponding acoustic parameters (e.g., acoustic intensity) and parameters of the movement needed to produce the underlying sound (Olsen & Dean, 2016). In a recent study, groups of listeners were presented with passages of classical and electroacoustic music and gave continuous judgements of perceived physical exertion, arousal, and valence. For passages with music that could be readily attributed to human movement (the classical and some of the electroacoustic pieces), perceived exertion was a significant predictor of perceived arousal and valence. For the passages of electroacoustic music that could not be attributed to human movement, exertion judgements did not seem to influence the perception of arousal. Acoustic intensity, in turn, was a significant predictor of perceived exertion. Thus, listeners seem to hear physical exertion in music that they associate with human sound-producing movements, and this supports the perception of arousal.
Studies of cross-modal correspondences suggest that part of the meaning an audience gets from perceived music can come from the associations they make between acoustic and motion parameters. While hearing movement in music is itself potentially meaningful, associations with emotional constructs may additionally contribute. Such findings are in line with the proposed role of the human body as a mediator between subjective experience and the environment that is involved in the construction of musical meaning (Leman & Maes, 2014).
Inducing Movement Through Sounded Music
Some music induces a sense of movement in listeners, inciting in both trained and novice musicians an urge to synchronize with the beat (Janata, Tomic, & Haberman, 2012). This psychological phenomenon is referred to as groove. Not all music creates a sense of groove: ratings of groove strength are consistently lower for some genres of music (e.g., folk) than others (e.g., soul/R&B; Janata et al., 2012). The perception of groove is thought to relate to microtiming, the small deviations from metronomic timing that occur even in metrical, beat-based music; however, this notion has proven difficult to validate experimentally. Several studies have shown that quantized music receives high ratings of groove and that these ratings decline as the magnitude of microtiming deviations increases (Hofmann, Wesolowski, & Goebl, 2017; Senn et al., 2017).
Other parameters have been shown to contribute to the perception of groove for both trained and novice musicians. Witek et al. (2014) showed that a moderate degree of syncopation maximizes both perceived groove and reported enjoyment of groove-based music. Other parameters that encourage perceptions of groove include a tempo near to the average frequency of human locomotion, the use of low frequencies on bass instruments, and spectral flux (a measure of rate of change in the frequency spectrum that relates to perceived activity level) in the low frequencies (Stupacher, Hove, & Janata, 2016). Ratings of groove have also been found to increase when the entrances of different voices are staggered rather than simultaneous, and when the music contains more rather than fewer instruments (Hurley, Martens, & Janata, 2014).
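As an illustration of how such an audio feature might be operationalized, the sketch below computes a simple low-frequency spectral flux: the half-wave rectified, frame-to-frame change in spectral magnitude, summed over frequency bins below a cutoff. The cutoff frequency, frame parameters, and function name are our own assumptions and will not exactly match the feature extraction used in the studies cited above.

```python
import numpy as np

def low_frequency_spectral_flux(signal, sr, cutoff_hz=100.0,
                                frame_len=2048, hop=512):
    """Half-wave rectified frame-to-frame spectral change, summed over
    bins below cutoff_hz; returns one value per frame transition."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    low = mags[:, freqs <= cutoff_hz]             # keep only low-frequency bins
    diff = np.diff(low, axis=0)                   # change between successive frames
    return np.sum(np.maximum(diff, 0.0), axis=1)  # half-wave rectification + sum
```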
The results of the study by Hurley et al. (2014) show that some of the acoustic features likely to grab listeners’ attention (e.g., staggered entrances) also increase their tendency for stimulus-coupled movement. Increased attention facilitates processing of auditory signals, and could potentially increase activity of action-perception loops, encouraging overt movement. Burger et al. (2013) found that some characteristics of musical timing and timbre relate consistently to characteristics of music-induced movement. For example, increased pulse clarity encouraged a wide variety of whole body movements, and low frequency spectral flux correlated positively with head velocity. Rhythmic information in beat-based music is primarily communicated through low frequency voices, so the correlation with head velocity might reflect a tendency to pair head movements with beats. Synchronization with a sounded beat has been shown to improve timing perception (Manning & Schutz, 2013), and as an obvious signal of engagement, it could also have social bonding effects if observed by the performers or other audience members.
Future Directions
In recent years, we have seen the development of sensor and camera systems that measure musicians’ movements and audience perceptions of them in great detail. Simultaneously, with the advent of technology-mediated performance, computers have been playing an increasing role in human musical performance, disrupting the traditional relationship between musical sound and movement (Emerson & Egermann, 2018). Today, we can listen to instrumental music that has been digitally enhanced, resulting in sound that is not entirely attributable to observable gestures. We can also see performances on digital musical interfaces for which the gesture-sound mapping is unfamiliar to us or even invisible (e.g., if sound output is generated via laptop controls). In this section, we discuss how these technological developments have the potential to guide our future research endeavours, both by introducing new methods for the study of human interaction and by raising new questions about how music is treated by the perceptual system.
Integrating Capture Techniques to Quantify Visual Interaction
The techniques available for studying music-related movements include sensors capable of making fine-grained measurements of movement parameters that are not readily apparent to an external viewer, such as finger forces in pianists (Kinoshita et al., 2007) or tonguing patterns in saxophonists (Hofmann & Goebl, 2014). Larger-scale movements, from finger trajectories to body sway, can be assessed using inertial or optical motion capture. A number of techniques are also available for assessing perception of musicians’ gestures, including eye tracking, brain imaging, and EMG (for measuring covert muscle activation). In this section, we focus on two categories of techniques – motion capture and eye tracking – that have particular potential for research on the role of movement in musical communication.
Inertial and optical motion capture systems are widely used in the study of performance gestures (Goebl, Dixon, & Schubert, 2014). Inertial sensors, including accelerometers and gyroscopes, are typically affixed to a musician’s body or instrument. Accelerometers track 3D acceleration and gyroscopes track orientation and angular velocity at the measured point. Optical motion capture uses cameras to triangulate the position of markers that are affixed to musicians’ bodies or instruments, and has been widely used in studies investigating the expressive gestures of performing musicians (Nusseck, Wanderley, & Spahn, 2017; Vuoskoski et al., 2014), communicative gestures in music ensembles (Bishop & Goebl, 2017b), and motor responses to perceived music (Toiviainen, Luck, & Thompson, 2010). Motion capture recordings are also commonly used as visual stimuli in experiments investigating viewers’ perceptions of human movement (e.g., Bishop & Goebl, 2017b; Moran et al., 2015; Petrini et al., 2009; Wöllner et al., 2012).
Most eye tracking systems use infrared cameras to detect corneal reflections around the pupil and measure its movements. Eye tracking systems can be remote or mobile: remote systems are typically mounted to a computer screen and suitable for monitoring gaze directed towards a stationary stimulus (e.g., text or images), while mobile systems are mounted on the subject’s head and can be used to study gaze behaviour in a 3D environment (e.g., a performance space). Both remote and mobile eye trackers calculate measures such as pupil position, point of regard, and pupil dilation. Mobile eye tracking is a useful technique for studying gaze behaviour in performing musicians, who typically require more freedom of movement than remote eye trackers can cope with. Bigand et al. (2010) used mobile eye tracking to investigate the gaze behaviour of an orchestral conductor, and in our lab, we use mobile eye trackers to examine ensemble musicians’ attention towards each other (Bishop, Cancino-Chacón, & Goebl, 2017).
There are ongoing efforts by several research groups, including ours, to integrate mobile eye tracking and optical motion capture (Burger, Puupponen, & Jantunen, 2017). One of the critical steps involved in integrating these systems is establishing a method for synchronizing the different data streams. In our lab, we use a synchronization device (issued by the makers of our motion capture system) to send TTL triggers to the eye tracking software at the start and stop of each motion capture recording. This method ensures precise and reliable synchronization and allows us to check for drift between data streams. However, it does introduce some constraints to the recording set-up, as the glasses have to be connected via (customized) cable to a computer capable of receiving TTL triggers, somewhat restricting performers’ freedom of movement.
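As a simplified illustration of this alignment step, the sketch below fits a linear mapping between the two clocks from paired trigger timestamps, capturing both a constant offset and linear drift, so that eye-tracking samples can be expressed on the motion-capture timeline. The timestamps and function name are hypothetical; the sketch stands in for whatever alignment routine a given recording set-up provides.

```python
import numpy as np

def align_clocks(eye_trigger_times, mocap_trigger_times):
    """Fit mocap_time = slope * eye_time + offset from paired TTL trigger
    timestamps (e.g., the start and stop of each recording); the slope
    captures clock drift, the offset the starting misalignment."""
    slope, offset = np.polyfit(np.asarray(eye_trigger_times, dtype=float),
                               np.asarray(mocap_trigger_times, dtype=float),
                               deg=1)
    return lambda t: slope * np.asarray(t, dtype=float) + offset

# Hypothetical trigger timestamps (in seconds) from the two systems
to_mocap = align_clocks([10.002, 310.004], [0.000, 300.020])
print(to_mocap(150.0))  # an eye-tracker timestamp remapped onto the mocap timeline
```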
An integrated motion capture-eye tracking system greatly simplifies the analysis of eye gaze data. With mobile eye tracking, extensive manual coding of video data is typically required, since in contrast to remote eye tracking, the visual scene is constantly changing and areas of interest are not static. If mobile eye trackers are used in combination with motion capture, however, detection of subjects’ gaze targets can be automatized by remapping gaze coordinates into the motion capture coordinate system. Moments where the gaze target is the musical score or another performer (i.e., an object or person defined with markers) are then readily identifiable.
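The remapping step might look something like the sketch below, in which a gaze ray expressed in the motion-capture coordinate frame is compared against marker-defined target positions, and the target whose direction falls within a small angular threshold of the ray is labelled as the current gaze target. The function name, the 5° threshold, and the example coordinates are hypothetical; in practice, head pose from head-mounted markers is needed to express the gaze direction in the motion-capture frame in the first place.

```python
import numpy as np

def gaze_target(eye_pos, gaze_dir, targets, max_angle_deg=5.0):
    """Return the label of the marker-defined target (e.g., 'score',
    'co-performer') whose direction from the eye lies closest to the gaze
    ray, if within max_angle_deg; otherwise None. All positions and the
    gaze direction are 3-D vectors in the motion-capture frame."""
    gaze_dir = np.asarray(gaze_dir, dtype=float)
    gaze_dir = gaze_dir / np.linalg.norm(gaze_dir)
    best_label, best_angle = None, max_angle_deg
    for label, pos in targets.items():
        to_target = np.asarray(pos, dtype=float) - np.asarray(eye_pos, dtype=float)
        to_target = to_target / np.linalg.norm(to_target)
        angle = np.degrees(np.arccos(np.clip(np.dot(gaze_dir, to_target), -1.0, 1.0)))
        if angle < best_angle:
            best_label, best_angle = label, angle
    return best_label

# Hypothetical frame: the performer looks toward the score
print(gaze_target(eye_pos=[0.0, 0.0, 1.6], gaze_dir=[0.0, 0.98, -0.2],
                  targets={"score": [0.0, 1.0, 1.4],
                           "co-performer": [1.5, 2.0, 1.6]}))
```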
Complementing such an integrated system, and further facilitating the study of body movements and gaze behaviour, would be the development of advanced data analysis tools. Currently, MATLAB users can access the MoCap Toolbox developed by Burger and Toiviainen (2013), which provides functions for analyzing and visualizing movement data. It would be widely beneficial to extend such tools with functions for the analysis of gaze data captured using an integrated motion capture-eye tracking system. When applied to the study of human interaction, an integrated motion capture-eye tracking system allows precise identification of which body movements people attend to and the extent to which gaze itself acts as a visual cue. With the visual attention and body movements of multiple people (e.g., performer and listener, or performer and co-performer) captured in parallel, musical communication can be studied from both ends of the loop simultaneously.
Challenging the Inseparability of Movement and Music
Ongoing changes in the way music is created, distributed, and experienced by audiences challenge the understanding we have of embodied music perception. For example, while people do attend live concerts on occasion, most of the music they encounter on a daily basis enters the perceptual system unimodally, as an auditory signal without corresponding visual cues. The pervasiveness of this unimodal auditory presentation raises some questions: when people hear music without knowledge of the movements needed to produce it, is their perception less embodied than it would have been, had they some familiarity with those movements? Earlier, we discussed some perceptual sub-processes that draw on listeners’ motor resources. Some of these sub-processes, such as the ability to predict the sounded outcomes of performer gestures and the integration of audio and visual signals, are enhanced by visual experience. On the other hand, cross-modal correspondences are thought to reflect learned associations between commonly co-occurring events and sounds that develop through a combination of general (not music-specific) statistical learning and familiarity with cultural/linguistic conventions (Eitan, 2017). Thus, visual exposure to performance may be unnecessary for listeners to “hear” movement in sounded music.
Music listening today is often a solitary activity, done alone, over headphones, with no possibility of observing the performers. Launay (2015) questioned how we should reconcile this with the hypothesized origins of music-making as a means of social bonding. He suggested that the ability to hear movement in sounded music, which is prompted by the presence of either rhythm or familiar instrumental sounds (associated in memory with certain gestures), allows listeners to infer the existence of a performer, making music listening a social activity even in minimally interactive conditions (e.g., solo listening to an audio track).
Changes to the way music is created include the development of algorithms capable of performing music and the introduction of digital musical interfaces (DMIs) that human performers can use to create and manipulate digital music in new and creative ways. Some of the sounds produced via these methods, including machine-like or environmental sounds, are not likely to be attributed to human movements by our perceptual systems. Research has already shown that audience members’ aesthetic judgments of DMI performance suffer when they are unable to figure out the gesture-sound mappings of the interface (Emerson & Egermann, 2018). Audiences also show an inability to pair audio and visual signals when gesture-sound causality is not perceptible, giving equivalent ratings of performance quality for correctly paired audio-video excerpts and incorrectly paired excerpts that combine audio and video from different parts of the performance.
An outstanding question is whether music made up of such sounds is, like music comprising sounds that result from human movements, experienced in an embodied way. The results of the study by Olsen and Dean (2016), discussed above, suggest that features of electroacoustic music not attributable to human movement can be associated with movement parameters, although in contrast to music that is attributable to human movement, the association may not influence listeners’ perception of arousal. In contrast to this finding, electronic dance music and hip-hop turntable performances tend to encourage both overt movement and high arousal in their audiences, showing that music comprising digital sounds can engage listeners motorically and emotionally. Presumably, the organization of digital sounds into distinct beat-based rhythms induces motor activation in listeners, regardless of whether they associate the sounds with causal gestures.
Conclusions
The aim of this paper was to show that musical communication is a dynamic and collaborative process involving performers and an active audience, for whom music perception is a motoric process. Traditionally, overt movement has been critical for performed music, necessary for sound production as well as a part of the musical product presented to audiences. Today, this is no longer entirely the case, since 1) the audience does not always see the performance and 2) sounded music can be generated without sound-producing gestures. Research findings are in line with the prediction that music perception is embodied: motor resources are drawn upon throughout the perception process and while constructing meaning out of the musical signal. At present, however, the literature still lacks strong tests of the embodiment paradigm, as well as an indication of how robust current findings are to a broader definition of “music” that includes genres outside the Western art music tradition, including genres in which the potential for interpersonal interaction is higher (e.g., group improvisation) and genres in which it is lower (e.g., electroacoustic music).