Time as the Ink That Music Is Written With: A Review of Internal Clock Models and Their Explanatory Power in Audiovisual Perception

The current review addresses two internal clock models that have dominated discussions in timing research for the last decades. More specifically, it discusses whether the central or the intrinsic clock model better describes the fluctuations in subjective time. Identifying the timing mechanism is critical to explain and predict timing behaviours in various audiovisual contexts. Music stands out for its prominence in real life scenarios along with its great potential to alter subjective time. An emphasis on how music as a complex dynamic auditory signal affects timing accuracy led us to examine the behavioural and neuropsychological evidence that supports either clock model. In addition to the timing mechanisms, an overview of internal and external variables, such as attention and emotions as well as the classic experimental paradigms is provided, in order to examine how the mechanisms function in response to changes occurring particularly during music experiences. Neither model can explain the effects of music on subjective timing entirely: The intrinsic model applies primarily to subsecond timing, whereas the central model applies to the suprasecond range. In order to explain time experiences in music, one has to consider the target intervals as well as the contextual factors mentioned above. Further research is needed to reconcile the gap between theories, and suggestions for future empirical studies are outlined.

In relation to duration estimation, it is equally important to underline the role of beat perception, or inner timing, that is, the ability to perceive and predict the temporal location of events. As a supporter of the cerebral clock model, Pöppel (1989) hypothesized that maintaining a constant tempo in music production, especially in classical music, relies on a timekeeping mechanism that functions mainly by tracking temporal order, such as synchrony and succession, in addition to measuring durations. In the same vein, a "3-second window of temporal integration" (Pöppel, 1989, p. 86) was assumed to constitute the psychological present. This has consequences for perceiving musical tempo and the integration of beats, and hence for subjective experiences of time. Similar attempts at developing clock models were based on beat perception (Langner, 2002; Schulze, 1978). Schulze, in particular, anticipated the Dynamic Attending Theory (Jones & Boltz, 1989), which emerged later, by emphasizing the variability of internal clock speed under the influence of environmental cues (accelerating and decelerating beat patterns).
While some studies investigated musical tempo in order to formulate hypotheses about clock models, other research found that the variables embedded in tempo, such as isochrony, salience, or complexity, directly evoked changes in the functioning of the internal clock. Povel and Essens (1985) observed in their experiments that different groupings of rhythmic beats led to different temporal reproductions, giving rise to a best fit for the internal clock. The explanation lies in the coupling of beat accents and the clock 'tick': the stronger the beat pattern (in this case, the higher the metrical level) and the less complex the metrical structure, the more likely the beat is to activate the internal time recording system and be represented in temporal processing. Recognizing the role of beat perception not only helps to understand human timing in general, but also timing in music listening and performing in particular, which can be understood in terms of proactive and active timing. In fact, frequent exposure to musical beat production appears to enhance one's temporal sensitivity, and this effect may transfer to other sensory modalities (from audition to vision; Cicchini et al., 2012). Apart from external training, the stability of intrinsic rhythm was also positively correlated with tempo reproduction performance (McPherson et al., 2018).
It is therefore essential to look into the complexity in musical tempo itself.
Musical tempo is subject to ambiguity. The complexity of tempo structures in music has long been recognized (e.g., Pressnitzer et al., 2011). It is not only marked by the number of note events in a melody (Behne, 1976), nor only by the patterns of percussion instruments, but rather by changes in pitch, timbre, or loudness (Brochard et al., 2003), as well as phrasing and articulation (Auhagen & Busch, 1998). Multiple sound sources of the instruments in a symphony orchestra vary tremendously across different sections and therefore constitute auditory streams that are hard to disentangle (Shamma & Micheyl, 2010), especially for non-musicians.
Note that the difficulty of correctly identifying temporal structures in music is not equal to that of correctly identifying the tempo of music, considering the latter has more to do with detecting the absolute 'speed' and tempo changes. Attempts have been made to examine the thresholds of detecting musical tempo acceleration and deceleration, for instance, among musically trained and untrained groups (e.g., Ellis, 1991). There are several assumptions of how we cope with "noisy" auditory signals in terms of time and tempo perception. Some argued that the process of tempo extraction depends mainly on periodic regularities (McDermott et al., 2011), while others emphasized the importance of learning, regardless of tempo structure complexities (Agus et al., 2010).
A small number of studies aimed to disentangle auditory rhythmic features and revealed how tempo salience affects perceptual time. One study investigated different metrical levels and found effects on listeners' sense of time (Hammerschmidt & Wöllner, 2020). More specifically, the lower the metrical level individuals attended to by tapping (e.g., eighth notes versus half notes), the longer a music excerpt was perceived to be, providing some evidence for the impact of event density (cf. Behne, 1976). In this case, the count of time was affected by the number of beats registered in memory.
Apart from music, inputs from other sensory modalities may also affect temporal processing. Indeed, psychological research has often used visual stimuli such as flashes or flickering lights to investigate time. For instance, studies of the entrainment effect for independent modalities showed that the presence of either visual flickers or pure tones led to higher entrainment (e.g., Ortega & López, 2008; Treisman & Brogan, 1992; Treisman et al., 1990). The effect, nevertheless, is not limited to one modality. Past research suggests that auditory signals of various complexities could enhance the entrainment effect for visual sequences and that, in some cases, the effect transferred to the attentional acuity of the other modality (Bolger et al., 2013; Escoffier et al., 2010). In Bolger et al.'s study, participants were able to perform equally well in a target detection task regardless of the target modality (auditory or visual) when entrained with tone sequences. Another case in point is the cross-modal transfer in tempo discrimination between the auditory and tactile domains, where training with rhythmic sounds led to enhanced performance in the tactile domain (Nagarajan et al., 1998). These studies provide evidence that the cognitive processes involved in timing and time perception should function at a domain-general level.
In this review, an overview is provided of internal clock models that were established or further developed in recent years. In particular, the aim is to show how each model accounts for the experience of musical time in auditory and audiovisual contexts. We will tackle questions such as: How does music facilitate temporal processing? What are the timing mechanisms and models, and how does each of them explain the interaction between music and perceptual time? What are the implications of studying music and time perception?

The Internal Clock
Comparable to an actual clock, the internal clock has been an analogy for the timing mechanism in humans and animals (Eagleman et al., 2005; Ivry & Schlerf, 2008). The temporal order of events is recorded by multiple sensory modalities and processed, according to different theories, in a variety of pathways before becoming representations of time, that is, the occurrence of "clock ticks". Early in the discussion, hypotheses stated that time perception was a form of information processing that highly depended on the recording capacity (Ornstein, 1969). Barry (1990) and Schulze (1978) both emphasized the importance of music as an environmental construct of attention that shaped both the perceived time (in terms of its duration) and the passage of time (the perceived speed).
In the past years, two major theories have been at the forefront of discussions as to how the temporal units are recorded, differing in whether they postulate a specific cognitive module dedicated to timing. The 'no clock' hypothesis, or state-dependent network, and the 'central clock' hypothesis have both received increasing attention in research (Grondin, 2010b). The latter, in particular, encompasses two theories: the Dynamic Attending Theory, based on a non-linear accumulation of temporal units, and the Scalar Expectancy Theory, which assumes that the emission of temporal pulses follows a linear approach (Ivry & Schlerf, 2008). As Stern (1897) had already pointed out for "Präsenzzeit" (the experienced present moment) and the time range for other cognitive processes, it appears that different theories function best at specific time ranges (Figure 1).
Figure 1. An overview of the internal clock models specified by interval ranges (subsecond, suprasecond, seconds to minutes, and minutes to hours) as well as by the division of central vs. intrinsic model.
Note. The most important features described in the overview are discussed in detail in the following sections. For a review of the research methods adopted in timing studies, see Grondin (2010b).

The Intrinsic Clock Model
In contrast to the traditional view of a clock, some researchers believe that there might be no clock at all. Such a 'no-clock' model is known as a state-dependent timing system or intrinsic model (Ivry & Schlerf, 2008). The 'state' here describes the specific circumstances generated by a neural network in response to external changes.
Timing is seen as an implicit function of each neural network that is activated for a given sensory modality and is sensitive to pre- and post-interval changes. It is postulated that activity-elicited changes in neural networks directly reflect the inherent temporal structures and therefore serve as references for timing in sub-second intervals (e.g., Karmarkar & Buonomano, 2007). The process is also referred to as "a temporal-to-spatial transformation" (Karmarkar & Buonomano, 2007, p. 3), or the "intrinsic model", as researchers hypothesize that timing is an integral function of neural activity (Ivry & Schlerf, 2008).
The model essentially suggests that timing as a function is distributed to a variety of neural structures in which oscillatory patterns stay consistent, known as the recurrent neural network (Buonomano & Laje, 2011).
Ramping, or climbing, activity in neural oscillations, ranging from primarily the low-frequency delta band to higher-frequency bands such as alpha and beta (e.g., Wittmann, 2013), has been revealed as the physiological basis for the model, in addition to neural spikes across a wide range of brain regions such as the striatum (Gu et al., 2015). The time stamps, or accumulated states, are hypothesized to be expressed at both the micro level (individual neurons) and the macro level (excitation/inhibition of neuron populations) (Buonomano & Laje, 2011).

The intrinsic model proposes the possibility that timing is an inherent function of multiple dynamic neural networks. This flexibility allows the network to take into account calibrations towards previous durations and to judge the duration of the current event on this basis. Furthermore, because timing is hypothesized to be implicit, there is no need for external triggers even when an event is absent. However, the state-dependent network also has its shortcomings. First, studies suggested that the model is only applicable to the subsecond range. That is to say, the cumulative effects of previous events, which either enhance or reduce one's temporal sensitivity, diminished within up to 300 milliseconds (Buonomano et al., 2009). Second, the model does not offer a clear explanation of cross-modal temporal information integration. This is where the central clock model provides a useful alternative perspective.

The Central Clock Model
The central timing mechanism, also known as the dedicated clock model (e.g., Allman et al., 2014), stemmed from Treisman's (1963) work. Decades of research into the human timing mechanism were based on this model and have assumed that timing is a specific cognitive module, hypothetically located across the global neural network (e.g., Allman & Meck, 2012).
Where is the 'clock' in our brain? Neurological studies supporting the timing mechanism as an independent cognitive module that is dedicated solely to this function, however, do not necessarily assume one single structure in the brain for it. Research rather supports the roles of a wide range of brain regions working collaboratively in order to process time (e.g., Buhusi & Meck, 2005). The cerebellum, for example, is involved in short duration judgments, arguably from a few hundred milliseconds to 30 seconds (Allman & Meck, 2012).
Disruptions to other cortical and subcortical structures, including the basal ganglia, can also lead to timing deficits in larger time frames. In Schwartze and colleagues' (2011) study, participants with basal ganglia lesions failed to detect tempo acceleration and deceleration of tones and could not entrain tapping movements with the signals. The evidence implicates a global network in which multiple brain regions are involved; impairments at any part of the chain could possibly lead to malfunctions of the timing mechanism. There are two prominent theories with consequences for time processing in music that will be explained below.

The Dynamic Attending Theory (DAT)
DAT, also known as the oscillator model, hypothesizes that the ability to estimate the duration of past events depends on the coupling between attentional pulses and the occurrences of external events (Jones & Boltz, 1989). The theory supports the presence of a central clock, in the sense that the allocation of attention as a limited resource is based on the expectation of the next event on the timeline (Large & Jones, 1999). An exogenous stimulus, when aligned with the peak of attention, is best retained in working memory and transformed into representations of time (Barnes & Jones, 2000). Essential to this theory is that, like the unstable periodicities of external events, the emission of attentional pulses or oscillations is a non-linear process (Large, 2008).
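The coupling between attentional pulses and external events can be illustrated with a minimal sketch. It is not the formulation of Large and Jones (1999), but a linearized simplification assumed here purely for illustration: an internal oscillator predicts the next event onset and corrects its phase and period in proportion to the prediction error. The function name `entrain` and the parameters `phase_gain` and `period_gain` are hypothetical.

```python
def entrain(onsets, period, phase_gain=0.5, period_gain=0.2):
    """Track event onsets (in seconds) with a self-correcting internal period.

    Returns the absolute prediction error for every onset after the first.
    The linear correction rule is an illustrative assumption.
    """
    expected = onsets[0] + period  # first prediction: one period after the first event
    errors = []
    for onset in onsets[1:]:
        error = onset - expected          # positive: the event came later than expected
        errors.append(abs(error))
        period += period_gain * error     # adapt the internal period
        expected += period + phase_gain * error  # shift the phase, then predict the next onset
    return errors


# Isochronous sequence (0.5 s inter-onset interval) versus a jittered one.
regular = [0.5 * i for i in range(30)]
jittered = [0.5 * i + (0.05 if i % 2 else -0.05) for i in range(30)]

# Mean prediction error over the last ten events of each sequence.
late_regular = sum(entrain(regular, period=0.6)[-10:]) / 10
late_jittered = sum(entrain(jittered, period=0.6)[-10:]) / 10
print(late_regular, late_jittered)
```

Under these assumptions, the oscillator settles onto the isochronous sequence, whereas the jittered sequence leaves a persistent prediction error. This is in line with the idea that more predictable patterns couple better with attentional pulses.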
Holding the central clock premise, DAT suggests that attention plays a critical role in regulating the frequency of the pacemaker pulses according to the Attentional Gate Model (Block & Gruber, 2014; Zakay & Block, 1995).
More specifically, the "gate" through which temporal units pass before registering with the counter device opens wider when more attention is assigned to the specific time point. When one's attention is shifted elsewhere, away from temporal cues, fewer pulses are recorded, leading to duration underestimation. Some argue that DAT applies exclusively to prospective timing, whereas retrospective timing is subject to contextual influences and memory retrieval (Block & Gruber, 2014; Gu et al., 2015).
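The gating idea described above can be sketched in a few lines. This is a toy illustration rather than a fitted model: the function name `estimated_duration`, the pulse rate of 10 Hz, and the treatment of attention as a simple proportion of pulses passing the gate are all assumptions made here.

```python
def estimated_duration(duration_s, attention_to_time=1.0,
                       pulse_rate_hz=10.0, reference_rate_hz=10.0):
    """Duration estimate (s) from pulses that pass an attentional gate.

    attention_to_time: share of pacemaker pulses let through the gate (0 to 1).
    reference_rate_hz: pulses per second assumed when the count is read out.
    All parameter values are illustrative assumptions.
    """
    if not 0.0 <= attention_to_time <= 1.0:
        raise ValueError("attention_to_time must lie between 0 and 1")
    emitted = pulse_rate_hz * duration_s    # pulses emitted by the pacemaker
    counted = emitted * attention_to_time   # pulses registered by the counter
    return counted / reference_rate_hz


full_attention = estimated_duration(10.0, attention_to_time=1.0)
divided_attention = estimated_duration(10.0, attention_to_time=0.6)
print(full_attention, divided_attention)
```

With attention diverted from time, fewer pulses reach the counter and the same 10 s interval is judged to be shorter, which corresponds to the underestimation described in the text.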
Note that DAT is hypothesized to function mostly within the suprasecond range, because prospective timing recedes with time due to the limited capacity of working memory (WM) (for a review, see Gu et al., 2015). Concurrent tasks that require extra attentional resources could reduce timing accuracy (Brown & Boltz, 2002). Polti et al. (2018) attempted to explore the interval boundary of attention in prospective timing and found that the magnitude of WM interference on time estimation tasks increased proportionally with interval length (30 to 90 s). In a more naturalistic setting, gamers were asked to estimate the elapsed time (12, 35, or 58 minutes) either knowing in advance that they would be asked (prospective timing) or not (retrospective timing; Tobin et al., 2010). The 12-minute session was estimated as significantly longer in the prospective than in the retrospective paradigm, while estimation differences were less pronounced in the 35- and 58-minute conditions, suggesting that DAT's predictive power may be reduced for longer intervals. However, no evidence so far has clearly delineated the interval ranges in which each model fits best. This question clearly deserves further exploration and will be discussed in the conclusion.

The Scalar Expectancy Theory (SET)
Another well-known model for the internal timing mechanism argues that the perceived amount of time is composed of pulses that are regularly emitted by a pacemaker, as an analogy of an internal clock, and accumulated by a counter device; it is therefore also known as the pacemaker-counter model (Gibbon, 1977; Gibbon et al., 1984; Treisman, 1963). This model specifies that temporal processing is accomplished in roughly three stages: the clock, the memory, and the judgment of time. Accordingly, in order to explain the temporal flow, SET proposes that subjective time is composed of (a) the representation of objective durations, and (b) the estimation variance or error rate, expressed by Weber's fraction (Allman & Meck, 2012; Grondin, 2010b).
The variance is hypothesized to stem from transferring the clock readings to working memory (Meck, 1984; Treisman, 1963). Inaccuracy in duration estimation, according to SET, is also subject to the influence of attention, clock speed error, task switching, decision error and other factors (Allman & Meck, 2012). The longer the duration, the larger the variance. The applicability of SET across the subsecond to suprasecond range is, however, controversial. Grondin (2010a) focused on violations of the scalar property when examining participants' timing performance in the subsecond range, and found a tendency for Weber's fraction to increase as the interval approached 1 s. This suggests that SET does not provide a powerful explanation of timing behaviour in the millisecond range. The applicability of SET in the suprasecond range, in contrast, is further supported by the audiovisual evidence for this theory. Evidence is still needed for a clear boundary of 'time ranges of the best fit' for SET.
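The scalar property can be made concrete with a small simulation. The sketch below assumes, for illustration only, that each estimate equals the target duration multiplied by a normally distributed memory-transfer factor. The function names `reproduce` and `weber_fraction` and the noise level of 0.15 are hypothetical and not taken from the cited studies.

```python
import random
import statistics


def reproduce(target_s, rng, n_trials=5000, memory_noise=0.15):
    """Simulate duration estimates whose error scales with the target.

    Each estimate is the target multiplied by a noisy memory-transfer factor
    drawn from a normal distribution around 1 (an illustrative assumption).
    """
    return [target_s * rng.gauss(1.0, memory_noise) for _ in range(n_trials)]


def weber_fraction(estimates):
    """Standard deviation of the estimates divided by their mean."""
    return statistics.stdev(estimates) / statistics.mean(estimates)


rng = random.Random(42)  # fixed seed so that the sketch is reproducible
short_estimates = reproduce(2.0, rng)
long_estimates = reproduce(8.0, rng)
print(statistics.stdev(short_estimates), weber_fraction(short_estimates))
print(statistics.stdev(long_estimates), weber_fraction(long_estimates))
```

In this toy setting the standard deviation grows with the target interval while Weber's fraction stays roughly constant. A Weber's fraction that changes with interval length, as reported for the subsecond range, would count as a violation of the scalar property.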
From a neurobiological perspective, the striatal beat frequency (SBF) theory offers an explanation for the original pacemaker-counter model (Miall, 1989; van Rijn et al., 2014). Unlike the latter, SBF theory instantiates the biological structure of the clock pulses as the oscillations of striatal medium spiny neurons (MSNs), located in the striatum of the basal ganglia. Researchers proposed that the neurons at different oscillatory frequencies reset when timing begins, receiving inputs from cortical neurons that fire as a consequence of dopamine release (Merchant et al., 2013). Detection of the synchronous neural oscillations is known as coincidence detection (Buhusi & Meck, 2005). The MSNs are capable of detecting coincident oscillations from the cortical neurons that fire at similar frequencies (the input) and then translate them into temporal units (the output). To justify the scalar property, that is, the variance in cumulative timing, it has been proposed that after the initial alignment the neural oscillations drift out of phase as each neuron returns to its inherent frequency. As a result, the discrepancy increases proportionally until the neurons that were firing together eventually desynchronize completely. The MSNs, however, retain a robust ability to detect temporal patterns of up to minutes despite the complexity of the inputs, thanks, ironically, to the large number of cortical neurons (e.g., Matell et al., 2003), making the theory viable for a wider range of durations.
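The principle of coincidence detection can be sketched in a strongly simplified form. The example assumes a handful of noise-free oscillators with arbitrary frequencies that are reset together at time zero, and a detector that compares the current phase pattern with the pattern stored at a target duration. The frequencies, the function names `oscillator_states` and `coincidence`, and the similarity measure are illustrative assumptions, not part of the SBF literature cited above.

```python
import cmath
import math

# Hypothetical oscillator frequencies (Hz); the values are illustrative only.
FREQS_HZ = [5.0, 6.1, 7.3, 8.9, 10.2, 11.7, 13.1]


def oscillator_states(t):
    """Phase of each oscillator t seconds after a common reset, as unit complex numbers."""
    return [cmath.exp(2j * math.pi * f * t) for f in FREQS_HZ]


def coincidence(t, stored):
    """Similarity (0 to 1) between the current phase pattern and a stored one."""
    current = oscillator_states(t)
    overlap = sum(c * s.conjugate() for c, s in zip(current, stored))
    return abs(overlap) / len(stored)


target_s = 1.5
stored_pattern = oscillator_states(target_s)   # pattern learned at the target time
times = [i * 0.01 for i in range(301)]         # 0 to 3 s in 10 ms steps
best_time = max(times, key=lambda t: coincidence(t, stored_pattern))
print(best_time)
```

Because the oscillators run at different frequencies, their joint phase pattern matches the stored one only around the target time within this window. The sketch omits the noise and gradual desynchronization that the theory uses to account for the scalar property.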

Factors Overarching Both Clock Models

Sensory Modalities
Multisensory inputs often interact with one another in our daily life. A vase dropping to the ground is usually followed by a shattering sound. A knock on the door leads to a knocking sound. To a broader extent, signals from vision, hearing, touch, smell and taste constitute the intangible framework of timing references together.
Hence it is critical to understand the specificity of each sensory modality and their joint effects in temporal perception.
The dominant role of audition in temporal processing has been evidenced by a series of studies (e.g., Boltz, 2017; Chen et al., 2018; Repp & Penel, 2002). A number of studies supported higher precision in temporal discrimination in audition compared to vision (e.g., Large, 2008; Phillips & Hall, 2002). Furthermore, auditory temporal processing is capable of interfering with visual timing. In one study, participants' performance in identifying the correct rhythmic visual patterns was most heavily compromised when the task was accompanied by a new string of isochronous sounds rather than by a visual display (Guttman et al., 2005). One may assume that temporal information derived from auditory events weighs more than that derived from visual inputs. The auditory dominance view is, however, not without dispute. Van Wassenhove and colleagues (2008) found that incongruent visual displays could distort temporal perception of auditory information in both directions. A recent finding, in addition, suggests that temporal perception was biased towards the visually perceived tempo of natural human movements rather than that of the drumbeats when the two sensory modalities were incongruent.
Research showed that auditory stimuli can effectively distort visual perception (e.g., Burr et al., 2013). The auditory driving effect refers to the perceived coupling of fluttering sounds to visual flicker rates, provided that the temporal gap between the flutter and the flicker does not exceed a certain range (Shipley, 1964). In other words, perceptual integration is accomplished by averaging auditory and visual input frequencies while assigning more weight to the former. A robust auditory driving effect could be observed when the sounds were presented as a brief distractor (Burr et al., 2013). Furthermore, Chen and colleagues' (2018) study suggested that, in addition to traditional regular flutters, irregular auditory inputs accompanying the visual flickers could also lead to distortions in perceiving the latter. Similar observations of the audiovisual bias were reported as fission and fusion illusions (Shams et al., 2002). In the former, one flash is perceived as two when accompanied by two beeps, while in the latter, two visual events are perceived as one when presented simultaneously with a single beep.
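The weighted averaging that is assumed to underlie the auditory driving effect can be written out as a one-line computation. The weight of 0.7 for audition and the function name `perceived_rate` are assumptions chosen for illustration; the studies cited above do not prescribe a specific value.

```python
def perceived_rate(auditory_hz, visual_hz, auditory_weight=0.7):
    """Weighted average of an auditory flutter rate and a visual flicker rate.

    auditory_weight above 0.5 expresses the assumed dominance of audition.
    """
    if not 0.0 <= auditory_weight <= 1.0:
        raise ValueError("auditory_weight must lie between 0 and 1")
    return auditory_weight * auditory_hz + (1.0 - auditory_weight) * visual_hz


flutter_hz, flicker_hz = 8.0, 10.0
rate = perceived_rate(flutter_hz, flicker_hz)
print(rate)
```

With these hypothetical values, the perceived rate falls between the two inputs but closer to the auditory flutter rate than to the visual flicker rate.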
Therefore, it is essential to take into account the arguably dominant position of the auditory modality when exploring the role of music in temporal processing. It should be noted that music encompasses not only complex acoustic signals, but also a rich source of emotions that alter subjective time. Films are an example of how music shapes experiences of time. In a study investigating slow motion film scenes as compared to the same scenes played back in real time, participants were significantly influenced in their temporal judgments of the scenes' duration when music was present (Wöllner et al., 2018). While slow motion scenes led to an underestimation of time, the same scenes in real time seemed to last relatively longer, and music yielded more accurate time estimations. Furthermore, music led to higher physiological arousal and larger pupil diameters in observers, suggesting that music modulates emotional responses and experiences of time in audiovisual scenes.

Working Memory and Attention
Central to SET are the memory stage, in which the clock reading is retained in working memory, and the judgment stage, in which the current count of temporal units is compared to references retrieved from long-term memory (Gibbon et al., 1984; van Rijn et al., 2014). Individual differences in short-term memory capacity and discrepancies in timing performance draw attention to the role of working memory in temporal processing (e.g., Broadway & Engle, 2011). More specifically, a higher working memory capacity implies a higher potential to hold more time units at the second and third stages of timing, thus leading to higher precision.
Working memory is positively related to other executive functions such as selective and divided attention (Colflesh & Conway, 2007), for both auditory and visual modalities (Wöllner & Halpern, 2016). The weights of both shift across timing scenarios. This is particularly relevant for understanding the different mechanisms behind prospective and retrospective timing, as mentioned before (Block & Zakay, 1997). In the oscillator model (DAT), attentional pulses are emitted in order to track external beats. These pulses are recorded and transferred to working memory before entering the stage of comparison with a reference duration in long-term memory (Block & Zakay, 1997; Gibbon et al., 1984). Attention diverted from the timing task results in fewer temporal units taken into the count and consequently in underestimations of time, while attention directed to timing leads to overestimations regardless of test durations (Polti et al., 2018). Despite a lack of evidence, we hypothesize a similar result for music listening. When instructed to time a piece of music before it commences, a listener processes the passage of time differently than when asked to estimate the time elapsed at the end of the excerpt.
The interpretation of the roles of WM and attention also depends on the theories. DAT, compared to SET, highlights the role of attention rather than working memory (e.g., Jones, 2010; Jones & Boltz, 1989). It postulates that attention, when quantified as regularly emitted pulses, could synchronize with external periodicity and therefore serve as a reference for time. The periodicities of external events, that is, regular or irregular patterns, affect the strength of their synchronisation with attentional pulses. The more predictable an exogenous pattern is, the better the effect, known widely as the temporal entrainment effect (Barnes & Jones, 2000; Schroeder & Lakatos, 2009). This has been evidenced by a number of visual (Cravo et al., 2013), auditory (Barnes & Jones, 2000; Jones, 2010), and movement (Burger et al., 2018) studies. Jones (1981, 1990) proposed that the characteristics of the information, in this case musical expressions, could distort the perception of time. Empirical studies supporting her claims found that, for instance, music was perceived to be slower when there were more pitch variations and inconsistent metrical accents (Boltz, 1998). We may predict that music genres with more predictable rhythms such as pop and rock, compared to those with less predictability such as jazz, are associated with higher duration sensitivity and better timing accuracy.

Emotions
"Time flies when you are having fun". Understanding the nature of emotions in time perception is important to comprehend how music distorts subjective time, as it essentially conveys a wide spectrum of emotions. The relatively small number of studies that have directly looked into the effects of musical emotions on subjective time show that information with strongly emotional content was more engaging and was subsequently better processed and stored in WM, leading to time overestimation (for a review, see Schäfer et al., 2013). Music as a powerful tool to induce emotions was found to induce a sense of timelessness (duration overestimation) as well as a faster passage of time when an individual is completely submerged in the experience (Herbert, 2012).
Apart from aesthetic pleasure, other types and intensities of emotions may also have an impact on how music could distort the perception of time. The reasons may lie in psycho-physiological arousal levels. A higher arousal level is believed to cause time overestimation (e.g., Droit-Volet et al., 2013). A group of participants, for instance, were presented with emotional film excerpts to induce corresponding emotions in them (Droit-Volet et al., 2011). Results indicate that, compared to baseline temporal judgments, participants tended to overestimate the durations after watching scary films. There are nevertheless findings implying the opposite, that is, higher emotional arousal leads to duration underestimation, especially from a retrospective point of view (Herbert, 2012).
Another line of studies investigates the impact of emotional valence on temporal processing. Positive emotions, induced by happy music, led to duration underestimation, while negative emotions, induced by sad music, led to duration overestimation in retrospective paradigms (Bisson et al., 2008). It was speculated that the positive emotions gave rise to fewer contextual changes than did the negative ones, therefore registering fewer events in memory. Some evidence, on the contrary, implies that valence does not matter. Further investigations showed that highly arousing emotional pictures accelerated the internal clock speed and caused a leftward shift in the reaction time compared with pictures of low emotional arousal, regardless of their valence (Droit-Volet & Berthon, 2017).
These seemingly puzzling observations may be explained by the mechanism by which emotions take effect on time perception. One approach is rooted in the emission rate of attentional pulses, which can be moderated by affective states, especially the arousal level. According to the pacemaker-counter model (Treisman, 1963), more attentional pulses are emitted when the arousal level is high, and these are subsequently recorded as the sum of clock ticks, that is, the perceived duration. Attention could either facilitate or hinder the interaction between emotions and temporal processing. More specifically, when attention is allocated to sustaining the temporal units, the effect would lead to duration overestimation. In contrast, when attention is shifted from temporal information to the emotionally charged event, fewer ticks are accumulated, resulting in duration underestimation.

Modality-Specific Evidence for the Internal Clock Models

Audiovisual Evidence for the Intrinsic Clock Model
Time-dependent neural oscillations are specific to sensory modalities. Studies have revealed that neuron excitation and inhibition could be elicited according to a specific type of sensory input, such as sound (Schnupp et al., 2006) and visual flicker (Burr et al., 2007). Researchers found that the time-dependent decodability of visual objects with MEG in a window of 1000 ms varied significantly, suggesting that time might be an inherent feature in the local visual network (Carlson et al., 2013). Furthermore, transcranial magnetic stimulation studies revealed that auditory timing could be dissociated from that in other sensory modalities (Bueti et al., 2008), as participants performed worse in a duration discrimination task (pure tones, 10 to 40 ms) when receiving disruptions in the auditory cortex. We might as well propose that, when listening to complex auditory signals such as music, particular groups of neurons in the human auditory cortex generate time-dependent responses, which simultaneously serve as time codes. However, relatively few studies with humans have directly confirmed the time-dependent variability of the local auditory network (Toiviainen et al., 2019).
The dissociation in timing abilities among different sensory modalities also showed that time is processed as a local flow of information. Early findings show significantly higher timing precision in hearing than in vision (e.g., Penney et al., 2000), indicating a superiority of audition over vision in providing temporal cues. Timing is a highly selective, localized process even within one modality. Burr and colleagues (2007) successfully modulated the perceived durations of the target visual stimuli by manipulating the apparent rate of flickers in a confined retinal region. Their finding is among the first to empirically support (a) the spatial-temporal connection in neural representations, and (b) the modality specificity in temporal processing, particularly the superiority of audition (e.g., Repp & Penel, 2002). In Lustig and Meck's (2011) study, the modality effect was stronger for participants at both ends of the age spectrum. One potential cause was that older adults were more susceptible to varying allocation of attention under different experimental conditions, whereas children might be influenced by developing sensory functions. That is not to say that SDN is a 'one modality, one clock' system, but rather a large network that also covers the interactions between multiple networks.
Taken together, from the perspective of an intrinsic model, time is a consequence of cumulative states in a recurrent neural network that represent the amount of change induced by external stimuli. In this sense, after listening to a piece of music repeatedly, the perceived duration of both the music and a video (as a further stimulus) will be altered when they are presented again later on.

Audiovisual Evidence for the Dynamic Attending Theory
DAT places particular emphasis on attention, given that the count of temporal units depends on how well attentional pulses synchronize with the external event, also known as the temporal entrainment effect. The term specifies the coupling between the tempo of extrinsic temporal cues and that of pacemaker pulses (Jones, 2010). The emphasis on entrainment to external stimuli such as music began in the early days of the formulation of the clock model (Barry, 1990; Pöppel, 1989). Neurobiological evidence suggests that the just noticeable differences for auditory gaps can be modulated when neural activity is entrained to specific frequency bands and amplitudes (Henry et al., 2014). Regarding music, the synchronization between neural oscillations and musical beats was substantiated by the steady-state evoked potential (SS-EP) elicited by the periodicity of musical beats (Nozaradan, 2014).
Behavioural evidence provided similar findings. Fast tempo was found to lead to overestimation, or "time dilation", and slow tempo to underestimation, or "time contraction", in both auditory and visual perception (Ortega & López, 2008). In addition, behavioural entrainment to external beats was found across age ranges and stimulus types, including auditory sequences and music excerpts (for a review, see Repp & Su, 2013). The experimental paradigms usually provide participants with a rhythmic beat that ceases (or not) after a short period of entrainment and require them to continue tapping or moving along with the beats. Boasson and Granot (2012) adopted a paradigm of tapping to pitch rises and drops in multiple melodic sequences, in order to examine the entrainment effect. In their study, however, musicians and non-musicians uniformly exhibited faster-paced tapping behaviour with rising pitch. This is consistent with other findings which revealed no difference in predictive timing between musically trained and untrained groups (e.g., Repp, 2010), whereas other studies indicated that musicians (percussionists) exhibited better entrainment performance when exposed to intense beat production activities (Cicchini et al., 2012). These studies suggest that individuals actively entrain to external rhythms and perceive past durations accordingly, and may provide evidence of the wide applicability of DAT.
Building upon simple click paradigms as previously discussed (Treisman et al., 1990), research in recent years has used naturalistic stimuli, since DAT is most applicable to music and speech. Periodic tone entrainment studies yielded new results: Wearden et al. (2017) found a residual effect of the classic click train paradigm, that is, the higher the preceding click frequency, the longer the following duration was perceived to be. They also observed similar effects with irregular tones as well as white noise. This study revealed multiple approaches to activate and to speed up the internal clock. Periodic and aperiodic clicks, as well as rhythmic visual flickers and even white noise, influenced results. In addition, the entrainment effect was found to persist for as long as 8 s after hearing high-frequency clicks, indicating that the emission of attentional pulses has a latency between activation and cessation.
More complex stimuli such as music are processed similarly. Fast music, compared to slow music, was perceived to be longer due to the accumulation of more temporal units. A study adopted Mozart's Sonata for two pianos (K.448), where participants tended to overestimate the duration when the excerpt was at the "fast" (120 BPM) end of the spectrum. The effect, nevertheless, is subject to the allocation of attention. Keller and Burnham (2005) emphasized the flexibility of attention when listening to musical meter, which could be composed of multiple metrical layers. Therefore, tracking high and low metrical structures is expected to have corresponding effects on psychological time (cf. Hammerschmidt & Wöllner, 2020), as the former should hypothetically lead to fewer mental counts and thus time compression. Neurological evidence also indicated that focusing on different temporal structures led to alignments in steady-state evoked potential (SS-EP) frequencies, deciphered from EEG recordings (Nozaradan et al., 2012). In this case, neural entrainment reflects that attending to local features in complex auditory signals could form mental representations of time by modulating the original neural oscillations.
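The reasoning about tempo and metrical level can be made concrete with a short arithmetic sketch. It assumes, purely for illustration, that one temporal unit is accumulated per attended metrical pulse and that a higher metrical level groups four beats (as in a 4/4 bar); the function name and the 30 s excerpt length are hypothetical.

```python
def attended_pulse_count(duration_s, tempo_bpm, beats_per_attended_unit=1):
    """Number of metrical pulses a listener would track during an excerpt.

    beats_per_attended_unit = 1 tracks every beat (low metrical level);
    beats_per_attended_unit = 4 tracks every fourth beat, e.g. each bar
    in 4/4 (high metrical level). Both are illustrative assumptions.
    """
    beats = duration_s * tempo_bpm / 60.0
    return beats / beats_per_attended_unit


print(attended_pulse_count(30, 120))     # fast excerpt, beat level -> 60.0
print(attended_pulse_count(30, 60))      # slow excerpt, beat level -> 30.0
print(attended_pulse_count(30, 120, 4))  # fast excerpt, bar level  -> 15.0
```

On this simplified account, a fast excerpt tracked at the bar level would yield fewer counts than a slow excerpt tracked at the beat level, which is why the attended metrical level has to be considered alongside tempo.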
Cocenas-Silva et al. (2011) observed a time dilation effect when more attention was allocated to the temporal features of music. When participants were asked to group excerpts of various arousal levels based on their estimated lengths, those which were highly arousing tended to be overestimated. The finding is consistent with Droit-Volet et al.'s (2013) observation that faster music, which was thought to be more arousing, was judged to be longer than slow, less arousing music. We might reason that, when individuals attend to temporal features of the auditory signals, the temporal entrainment effect is stronger compared to situations when they attend to other features such as key chords and pleasantness.

Audiovisual Evidence for the Scalar Expectancy Theory
The following examines the evidence from multiple sensory modalities that either supports or contradicts SET.
To establish a solid ground for SET, researchers tried to find evidence for a constant Weber fraction, that is, timing variability that remains a constant proportion of the timed interval, across different sensory modalities, durations, populations, and other conditions. Wearden and Jones (2007) probed the scalar property of subjective timing using two variations of the duration comparison task with auditory tones ranging from 600 ms to 10 s. They found a linear increase in subjective timing that conforms to Weber's law. This effect is also consistent in the visual domain. In a duration discrimination study, Grondin (2001) found that participants exhibited similar sensitivity towards intervals marked by visual flickers between 600 and 900 ms, in accordance with Weber's law. However, the ratio changed when the inter-stimulus interval went beyond 900 ms. The violation of Weber's law might be due to explicit counting.
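What a constant Weber fraction means in practice can be sketched as follows. The estimates below are fabricated for illustration only and do not reproduce data from any of the cited studies; the function names and the error pattern are assumptions. The Weber fraction is computed here as the standard deviation of duration estimates divided by their mean.

```python
from statistics import mean, stdev

# Hypothetical relative errors of five duration estimates.
RELATIVE_ERRORS = [-0.10, -0.05, 0.0, 0.05, 0.10]


def weber_fraction(estimates):
    """Variability of the estimates relative to their mean."""
    return stdev(estimates) / mean(estimates)


def scalar_estimates(target_s):
    """Errors grow in proportion to the interval (scalar property holds)."""
    return [target_s * (1 + e) for e in RELATIVE_ERRORS]


def violating_estimates(target_s):
    """Errors grow faster than the interval (hypothetical violation)."""
    return [target_s * (1 + e * target_s) for e in RELATIVE_ERRORS]


for target in (0.6, 0.9, 1.9):
    print(target,
          round(weber_fraction(scalar_estimates(target)), 3),
          round(weber_fraction(violating_estimates(target)), 3))
```

In the first case the fraction stays the same at every interval, which is the pattern SET predicts. In the second it rises with the interval, which is the kind of deviation that would count as evidence against the scalar property.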
Similarly, mixed findings have been reported in multimodal studies. Hypothetically, if the scalar property holds across modalities, one should expect a consistent linear increase in different modalities. This was indeed the case when participants performed predictive saccades, or eye-movement timing, when intervals from 500 to 1000 ms were presented either as visual flashes or auditory tone flutters (Joiner et al., 2007). However, comparing Weber's ratios between the two modalities revealed that auditory timing had greater variability than visual timing, as shown in participants' reactive eye movements when tracking the periodic cues. Hence, one might deduce that the scalar property holds but is also subject to stimulus modality. Block and Gruber (2014) argued that the difficulty of finding a cross-modal transfer effect was restricted to intervals below the 3 to 5 s window, beyond which automatic processing should diminish due to the limited capacity of working memory.
On the other hand, evidence against the scalar property has been presented in auditory studies. Grondin (2012) adopted three approaches to measure Weber's ratio: duration discrimination, reproduction and categorization tasks on a spectrum from 1 to 1.9 seconds using pure tones. In all three tasks, Weber's ratio appeared to be higher when the intervals were longer regardless of the number of interval repetitions, in this case either 1, 3, or 5 times. These results indicate an inconsistency in Weber's ratio, or temporal sensitivity, despite the different emphases of each paradigm on the timing process. Grondin (2010a) pointed out that the data failed to conform to the pacemaker-counter model, on which SET is built, because this model no longer applied to this duration range (see Figure 1). More specifically, a cut-off point at 1.2 to 1.3 s was observed. This aligns with observations from other studies (for a review, see Matthews & Meck, 2014). The question is, how is time processed beyond that point? Some researchers proposed that a learning effect might have altered the variance, as the brain was influenced by multiple exposures to the same interval (Matthews & Grondin, 2012).
Findings across timing tasks and sensory modalities, nevertheless, support the presence of a unitary clock system.
Despite the controversial evidence, reports investigating timing precision in multiple sensory modalities align with what the striatal beat frequency theory proposed: a familiarity effect that is reflected by enhanced synaptic communication between neurons. This might lead to higher processing efficiency and smaller variability compared to unfamiliar intervals. Grondin's (2012) experiments revealed that participants performed better in 3- and 5-interval discrimination than when only one interval was presented. Frequent exposure to timing tasks, as a part of music training, may likewise yield the benefits of enhanced neural connections. In Rammsayer and Altenmüller's (2006) study, musicians outperformed non-musicians in a perceptual timing task in terms of showing less variance and thus higher temporal sensitivity, for instance in duration discrimination tasks.
Musicians, however, did not exhibit significant superiority in a temporal generalization task, where participants compared the duration of an excerpt to the reference at the beginning, hypothetically stored in one's working memory. The authors believed that this was because the intervals exceeded working memory capacities. This explanation is equally applicable to Grondin and Killeen's (2009) results, where participants in a reproduction-by-tapping task performed significantly better if they adopted counting or singing, compared to doing nothing. Thus it might be concluded that SET indeed predicts timing performance only within short intervals of no more than 2 s (for a review, see Ivry & Schlerf, 2008). Nevertheless, it is equally important to understand timing within a few notes as well as in larger musical structures such as phrases.

Conclusions
This review has discussed two internal clock models: the intrinsic and the central clock models. The intrinsic model emphasizes automatic processing of temporal information in the subsecond range, while the central clock model explains the suprasecond (seconds to minutes) range of timing, which demands higher levels of cognitive control. Controversially, the Scalar Expectancy Theory, which can be seen as a specific account of the central clock model, applies to timing in the seconds range only, while the Dynamic Attending Theory works for timing intervals from seconds to minutes. According to SET and DAT, short intervals are represented linearly through the accumulation of pacemaker pulses, while longer intervals are represented nonlinearly, as pulse emission is calibrated to align with external periodicity. As for intervals of hours and longer, the timing process is subject to contextual changes and memory segmentation, and relevant research is scarce.
Audition, among all modalities, shows superiority in temporal processing, with higher sensitivity in detecting changes and estimating interval lengths compared to vision and other sensory modalities. In this sense, the modality specificity supports a distributed timing mechanism. Yet more evidence is needed to explain the cross-modal transfer of training effects in, for example, duration discrimination. Despite years of debate on the superiority of one clock model, there is no conclusive evidence to the best of our knowledge. We observe that each model fits best at a different duration scale, and that the fit also depends on whether discrete events (SET) or complex streams such as music (DAT) are at the core of the investigation.
Regarding the explanatory power of the internal clock models for the perception of musical time, it is therefore necessary to consider an interval-specific approach. Short interval timing within the milliseconds range plays a crucial role in music production such as expressive microtiming, whereas long interval timing is more strongly modified by attention, emotion, and working memory, consequently adding more variables to the equation. In this regard, the timing paradigm adopted in an ecologically plausible environment such as music concerts, movies, or sports should receive more attention. Ways of applying clock models to longer-interval timing and time estimation are yet to be investigated.

Funding
This research was supported by a grant from the European Research Council to the second author (grant agreement: 725319) for the project "Slow motion: Transformations of musical time in perception and performance" (SloMo).