Tonality is reflected in the varying prominence given to the tones of a scale (Piston, 1978; von Helmholtz, 1896), which arises from differences in their frequency of occurrence, metrical position, and duration (Prince & Schmuckler, 2014; Verosky, 2021). In Western musical thought, tonality is commonly defined as:
…one of the main conceptual categories of Western musical thought, [referring] to the orientation of melodies and harmonies towards a referential (or tonic) pitch class. In the broadest possible sense, however, it refers to systematic arrangements of pitch phenomena and relations between them. (Hyer, 2021, p. 1)
Several music psychology models suggest that tonality perception is linked to the varying stabilities perceived in individual tones (Schmuckler, 2016). Krumhansl and Shepard (1979) demonstrate that tonal hierarchy is reflected in individuals’ stability ratings of tones, depending on the hierarchical level in the preceding tonal context. This hierarchy in classical/romantic music includes four levels: tonic, tonic triad, other diatonic tones, and non-diatonic tones (Krumhansl & Kessler, 1982). Krumhansl and Keil (1982) conclude that tonal hierarchy is stored in long-term memory as internal representations of tonal hierarchies (IRTH). In this paper, the ability to perceive the different stabilities of tones is referred to as sensitivity to IRTH.
Based on the above, our analyses focus on the IRTH acquisition processes (i.e., how IRTH evolves with increasing age). The scientific debate about the timing and mechanisms of IRTH development is ongoing (Patel, 2021), with narrative summaries (e.g., Corrigall & Schellenberg, 2016; Gembris, 2017a) and theoretical models offering varied and sometimes contradictory conclusions. Earlier theories, such as Brehmer's concept of “tonal giftedness” (German: “Begabung”), suggest that this ability is fully developed by age 7 (Kühn, n.d., cited in Brehmer, 1925, p. 172). Gordon (2012) argues that musical aptitude and audiation show significant development before stabilizing around age 9, with limited improvement thereafter. This extended developmental trajectory contrasts with Brehmer's earlier views.
Gardner’s (1973) theory includes two stages of artistic development concluding around age 7, whereas Swanwick and Tillman (1986) propose four stages extending into adolescence. Hargreaves (1996) supports the latter view, emphasizing the importance of ages 8 to 15 for mastering musical rules. Conversely, Serafine (1988) suggests a more restricted sensitive period for tonal learning from ages 8 to 10. In stark contrast, recent research highlights statistical learning as a key factor in lifelong IRTH development (Jonaitis & Saffran, 2009; Vuvan, 2013), with evidence showing that this process begins as early as infancy (Saffran et al., 1999). Thus, whether IRTH has already developed fully by one’s school years or continues to develop during that time remains an open question.
Regarding primary studies, Krumhansl and Keil (1982) identify a clear age-related progression in IRTH: first- and second-graders prefer diatonic tones over non-diatonic tones, third- and fourth-graders prefer the tonic triad over other diatonic tones, and fifth-graders as well as adults prefer the tonic overall. Maier-Karius and Schwarzer (2011), Paananen (2007, 2009), and Schwarzer et al. (1993) replicate this trajectory; however, Matsunaga et al. (2020), Schellenberg et al. (2005), Speer and Meeks (1985), and Wilson and Wales (1995) find either earlier or later IRTH acquisition.
The failure to replicate previous findings may be related to several factors, such as participant characteristics or methodological differences. One factor might be formal musical training, defined as systematic instruction in music theory or practice (Hanna-Pladdy & MacKay, 2011). Although some studies suggest that formal training enhances IRTH acquisition (Corrigall et al., 2022; Mandikal Vasuki et al., 2016), others find no such effect, primarily attributing acquisition to implicit learning processes (Cui, 2019; Müllensiefen et al., 2014). Moreover, the influence of formal training may depend on factors such as age, ongoing training, and varying types of operationalization (Asztalos & Csapo, 2017; Müllensiefen et al., 2022; Zhang et al., 2020). Cultural background (Matsunaga et al., 2020), gender (Lin, 2023), and socioeconomic status (Miles et al., 2016) may also modulate IRTH acquisition.
Studies may also suffer from a lack of statistical power, which increases the likelihood of failing to detect a true population effect (Ellis, 2010). Finally, the specific method used to measure IRTH may affect the results. In contrast to the predictions of earlier theories on the development of tonal hierarchies, which rely on explicit measures with (partially) time-independent response formats, several studies using implicit measures report detecting tonal knowledge even in early childhood, indicating an earlier onset of this ability than previously suggested (Jentschke et al., 2005; Politimou et al., 2021; Trehub et al., 1999). Corrigall et al. (2022) find that task characteristics influence IRTH acquisition in school-aged children, with significant differences only emerging in explicit tasks.
In summary, despite 40 years of vibrant research, we face a highly heterogeneous body of literature, making it difficult to draw clear conclusions about the shape of IRTH skill acquisition. This difficulty pertains not only to identifying the general developmental trajectory but also to distinguishing between specific phases of skill acquisition, such as the phase of initial acquisition and the potential onset of a saturation effect (a plateau in skill improvement), as well as to the conditions (e.g., formal instruction) under which differences in the shape of skill acquisition occur.
Aims and Objectives
We conducted a critical review of four decades of research on the development of IRTH sensitivity. Despite a sizeable volume of studies, the findings remain heterogeneous and partially contradictory, particularly regarding developmental trajectories and the influence of musical training. Building on Gordon’s (2012) theory of music learning, which posits that core aspects of tonal sensitivity are largely developed by age 9, we formulated four exploratory research questions. By examining these questions, this study aimed to (1) assess the magnitude of age-related changes in IRTH sensitivity, (2) examine whether sensitivity increases significantly beyond age 9, (3) investigate the role of musical training, and (4) explore the effects of different operationalizations of IRTH measurement.
Research Questions
Research Question 1 (RQ1). How substantial is the average age-related development of IRTH sensitivity?
Research Question 2 (RQ2). Does IRTH sensitivity continue to increase beyond the age of 9, or does it level off, as predicted by Gordon’s learning theory?
Research Question 3 (RQ3). Is IRTH sensitivity significantly higher in musically trained individuals than in those without musical training?
Research Question 4 (RQ4). Does the implicit measurement of IRTH sensitivity via response time yield lower estimates than more explicit operationalizations?
Method
Procedure
We conducted a systematic and comprehensive literature search (What Works Clearinghouse, 2020) between January and April 2022 to identify eligible studies. The study selection followed a predefined process outlined in the review protocol (see Supplemental Material S1), which was not published before the review was conducted. In the final sample, we included only primary research studies from the preliminary literature corpus that used a hypothesis-testing approach, followed either a cross-sectional or longitudinal design, were published between 1982 and April 2022, investigated a sample of healthy elementary and secondary school students, and defined the measurement of IRTH sensitivity as a dependent variable based on either a listening task (probe tone paradigm, response time measures, and goodness-of-fit ratings in syntax-violation paradigms) or a creative production task (harmonization of a given melody, tonal composition, and improvisation to a given chord sequence). These operationalizations are further specified below.
First, the probe tone paradigm (Krumhansl & Shepard, 1979) involves presenting one or two of the 12 chromatic tones (the "probe") after establishing the tonal context (e.g., an ascending major scale), followed by a brief silence. Participants rate how well the probe tones fit the tonal context. This process continues until all 12 chromatic tones are rated. The perceived stability of each tone is measured based on participants' goodness-of-fit ratings. The correlation of these ratings with the prototypical tonal hierarchy (see above) defines participants' sensitivity to IRTH. Although no standardized protocols or psychometric criteria have been established, the probe tone paradigm has been consistently replicated across various contexts, demonstrating its robustness (Morgan et al., 2019; Sauvé et al., 2021).
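To make the scoring concrete, a participant's sensitivity can be expressed as the correlation between their 12 ratings and the prototypical major-key profile. The following minimal Python sketch is illustrative only: the profile values are the published Krumhansl and Kessler (1982) ratings, and the participant ratings are hypothetical.

```python
import numpy as np

# Krumhansl & Kessler (1982) major-key profile: average stability ratings
# for the 12 chromatic tones relative to the tonic (index 0 = tonic).
KK_MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                     2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def irth_sensitivity(ratings):
    """Correlate a participant's 12 goodness-of-fit ratings with the
    prototypical tonal hierarchy; higher r = higher IRTH sensitivity."""
    ratings = np.asarray(ratings, dtype=float)
    if ratings.shape != (12,):
        raise ValueError("expected one rating per chromatic tone")
    return float(np.corrcoef(ratings, KK_MAJOR)[0, 1])

# A hypothetical participant who rates diatonic tones higher than
# non-diatonic ones, with the tonic rated highest:
ratings = [7, 1, 4, 1, 5, 5, 1, 6, 1, 4, 1, 3]
print(round(irth_sensitivity(ratings), 2))
```

A participant whose ratings perfectly reproduce the profile would score r = 1; ratings unrelated to the tonal hierarchy would yield r near 0.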
Second, some studies use goodness-of-fit ratings in a syntax-violation paradigm (e.g., Corrigall et al., 2022). Participants judge how well the final tone of short melodies fits within the preceding harmonic context with varying degrees of tonal congruence (e.g., a tonic note such as "C" in C major versus a non-diatonic note such as "C#"). A greater difference in ratings between congruent and incongruent endings indicates IRTH acquisition, with higher ratings for congruent endings reflecting higher IRTH sensitivity.
Third, Schellenberg et al. (2005) and Corrigall et al. (2022) focus on response times (as introduced by Janata & Reisberg, 1988). Rather than explicitly evaluating the tonal congruence of melody endings, participants rate other musical features, such as the timbre of the final tone, while hearing a priming sequence with either a tonally congruent (expected) or incongruent (unexpected) final tone. Faster and more accurate responses are expected when the target tone or chord is more tonally congruent with the preceding context, indicating stronger IRTH sensitivity.
Fourth, other methods involve creative production tasks, such as composing (Wilson & Wales, 1995), improvising while listening to a pre-recorded tonal chord sequence (Paananen, 2003), or harmonizing a melody (Paananen, 2009). The outcome variables are defined by the degree of tonal fit and the timing of the tones or chords used by the participants. Wilson and Wales (1995, p. 102) report substantial to almost perfect inter-rater agreement (cf. Landis & Koch, 1977) for expert assessments of compositions.
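As an illustration of how such inter-rater agreement can be quantified and mapped onto the Landis and Koch (1977) verbal benchmarks, the following Python sketch computes Cohen's kappa for two hypothetical raters. The data and the binary tonal/atonal coding are invented for illustration; the original studies may have used other agreement statistics.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical judgments."""
    if len(rater_a) != len(rater_b):
        raise ValueError("raters must judge the same items")
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_exp = sum((ca[c] / n) * (cb[c] / n) for c in set(ca) | set(cb))
    return (p_obs - p_exp) / (1 - p_exp)

def landis_koch(kappa):
    """Verbal benchmarks following Landis and Koch (1977)."""
    for cutoff, label in [(0.81, "almost perfect"), (0.61, "substantial"),
                          (0.41, "moderate"), (0.21, "fair"), (0.0, "slight")]:
        if kappa >= cutoff:
            return label
    return "poor"

# Two hypothetical experts rating 10 compositions as tonal (1) / atonal (0):
a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
b = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
k = cohens_kappa(a, b)
print(round(k, 2), landis_koch(k))
```

Chance-corrected agreement (kappa) is typically lower than raw percentage agreement, which is why benchmark labels are applied to kappa rather than to simple agreement rates.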
In the next step, three independent and trained coders (including the first author) coded the included studies according to a previously developed protocol (see Supplemental Material S2). After the initial inter-rater reliability calculation showed almost perfect agreement, coding discrepancies were resolved by consensus. The studies vary in whether and how they report participants' formal musical training; therefore, we calculated the percentage of participants with training for each age group. Sufficient data are available in 11 studies; otherwise, this variable is coded as not applicable ("n/a").
Data Analysis
Data were analyzed using two approaches. First, we conducted a Bayesian three-level meta-analysis to provide a quantitative summary of the systematic review, offer an overview, identify notable studies, and investigate differences in IRTH measurements. Second, we conducted a model comparison analysis of the cross-sectional data of a single study (Krumhansl & Keil, 1982) investigating the growth trajectory of IRTH acquisition.
In the meta-analytical approach, effect sizes were estimated from primary studies using the metafor package (Viechtbauer, 2010) in R (R Core Team, 2022) and, if needed, transformed into Cohen's d following Borenstein and Hedges (2019, pp. 214–234). The procedure for effect size calculations, raw data, and a markdown script for the Bayesian three-level meta-analysis are shown in Supplemental Materials S3, S4, and S5.
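For readers unfamiliar with the transformation, a standardized mean difference and its sampling variance can be computed from group summary statistics. The following Python sketch is illustrative only: the group values are hypothetical, and the formulas are the standard pooled-SD definitions consistent with Borenstein and Hedges (2019), not a reproduction of the exact steps in our scripts.

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference with pooled standard deviation."""
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2)
                          / (n1 + n2 - 2))
    return (m1 - m2) / sd_pooled

def var_d(d, n1, n2):
    """Approximate sampling variance of d."""
    return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

# Hypothetical older (treatment) vs. younger (control) group of n = 14 each:
d = cohens_d(0.72, 0.25, 14, 0.55, 0.30, 14)
v = var_d(d, 14, 14)
print(round(d, 2), round(v, 2))
```

The sampling variance v is what enters the meta-analytic weighting: smaller studies produce larger v and therefore receive less weight in the pooled estimate.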
In the subsequent analysis step, the effect sizes were aggregated and weighted using a Bayesian three-level meta-analysis with the R package brms (Bürkner, 2017), following the procedure of Harrer et al. (2021). Unlike a one-level fixed-effect model, which attributes variance only to sampling error, a two-level random-effects model separates variance into sampling error and between-study heterogeneity (τ). This model captures inherent differences between studies, such as measurement methods.
Our model further includes a third level of variance reflecting the nested structure of effect sizes within studies and accounting for the variance within single studies, such as different age groups. This multilevel approach (for a formal model description, see Supplemental Material S6) better reflects the nested structure of our data and avoids the statistical issues of traditional univariate meta-analysis, which can misestimate heterogeneity and increase the risk of false positives (Harrer et al., 2021; Hedges, 2019). We compared models with differing levels of complexity by computing marginal likelihoods using the bridge sampling method implemented in the brms package.
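To convey the intuition behind random-effects pooling at the study level, the following Python sketch implements the classical DerSimonian–Laird estimator, a simplified frequentist analogue of level 2 of the Bayesian model. The effect sizes and variances are invented; the actual analysis used brms, not this estimator.

```python
import math

def dersimonian_laird(effects, variances):
    """Classical random-effects pooling: estimate between-study variance
    (tau^2) by the method of moments, then re-weight each study."""
    k = len(effects)
    w = [1 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * di for wi, di in zip(w, effects)) / sum(w)
    q = sum(wi * (di - fixed) ** 2 for wi, di in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                  # between-study variance
    w_re = [1 / (v + tau2) for v in variances]          # random-effects weights
    pooled = sum(wi * di for wi, di in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return pooled, tau2, se

# Five hypothetical studies (Cohen's d and its sampling variance):
effects = [1.5, 0.2, 0.8, 1.3, 0.3]
variances = [0.20, 0.15, 0.25, 0.30, 0.18]
pooled, tau2, se = dersimonian_laird(effects, variances)
print(round(pooled, 2), round(tau2, 2))
```

Adding tau^2 to every study's variance flattens the weights, so heterogeneous corpora pull the pooled estimate toward an unweighted mean; the Bayesian three-level model generalizes this by also modeling heterogeneity within studies.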
We chose Bayesian meta-analysis for two key reasons. First, it allows for the explicit modeling of uncertainty in heterogeneity (τ), which can yield more stable estimates in small-sample contexts, especially when applying weakly or moderately informative priors (Harrer et al., 2021). Notably, although Bayesian methods are not inherently robust to small samples or poor study quality, they provide a principled framework for incorporating prior knowledge and quantifying uncertainty. Second, prior information can improve the posterior estimation of key parameters, such as µ (intercept) and τ1 and τ2 (standard deviations; Röver, 2020).
To address the context dependence of prior parameter settings (Gelman et al., 2015), we initially used two separate, weakly informative, zero-centered priors for µ and τ, as recommended by Williams et al. (2018) and Harrer et al. (2021, Setting Prior Distributions, para. 5):
Prior 1: µ ~ N(0, 1)
Prior 2: τ ~ HalfCauchy(0, 0.5)
Subsequently, we performed sensitivity analyses to estimate the impact of prior parameter choices on the posterior density distributions of the parameter estimates. Following Röver (2020) and Turner and Higgins (2019), we used two alternative weakly informative priors for µ and τ, while τ was modeled as a standard deviation parameter constrained to be positive, implying half-distributions for the respective priors:
Prior 3: µ ~ N(0, 0.5)
Prior 4: τ ~ t(3, 0, 2.5)
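To illustrate how weakly informative these heterogeneity priors are, the following Python sketch draws Monte Carlo samples from half-Cauchy(0, 0.5) and half-t(3, 0, 2.5) distributions. This is a rough numerical illustration only, not part of the analysis pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Prior 2: tau ~ HalfCauchy(0, 0.5)  -> |standard Cauchy| scaled by 0.5
half_cauchy = np.abs(rng.standard_cauchy(n)) * 0.5
# Prior 4: tau ~ half-t(3, 0, 2.5)   -> |t with 3 df| scaled by 2.5
half_t = np.abs(rng.standard_t(3, size=n)) * 2.5

# Both place most prior mass on plausible heterogeneity values while
# keeping heavy tails; the half-t(3, 0, 2.5) is noticeably more diffuse.
print(round(float(np.median(half_cauchy)), 2),
      round(float(np.median(half_t)), 2))
```

The half-Cauchy median sits near its scale (0.5), whereas the half-t prior's median is roughly four times larger, which is one way to see why Prior 4 allows somewhat wider posterior intervals for τ (cf. Table 3).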
We did not conduct a meta-regression using age as a moderator because of the limited number of studies, heterogeneity in age reporting, and insufficient information on age distribution in several samples.
We examined the magnitude of age-related development in IRTH sensitivity (RQ1) by fitting a Bayesian three-level meta-analytic model using the brms package in R. Small effects (d = 0.20) are often considered the minimum threshold for practical relevance in developmental research (e.g., Gignac & Szodorai, 2016); we therefore assessed whether the average difference in IRTH sensitivity between older and younger participants meets or exceeds this benchmark. Insufficient information reported in the primary studies prevented us from conducting a meta-regression with musical training as a moderator (RQ3). Therefore, the impact of musical training on IRTH development remains an open research question.
To address RQ4, we estimated the impact of operationalization on effect size using a Bayesian three-level meta-regression in brms. The model included operationalization as a moderator, accounted for the nesting of effect sizes within studies, and incorporated their standard errors. Random effects were specified at the study and within-study levels.
We examined RQ2 by comparing learning-theory-informed growth models using data from Krumhansl and Keil (1982), who report notable within-study heterogeneity. The dataset comprised four age groups assessed across five probe tone conditions. Based on age-specific difference scores, we fitted polynomial (linear, quadratic, cubic) and non-linear (sigmoidal, logistic, mixed, saturated) functions to model the development of tone judgment stability. Analyses were conducted using the R packages lme4 (Bates et al., 2015) and nlme (Pinheiro & Bates, 2024; see S7 for data and S8 for code).
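The logic of this model comparison can be sketched as follows: fit a straight line and a saturating curve to age-group means and ask which fits better. The Python sketch below uses invented scores for four age groups and a coarse grid search; it does not reproduce the lme4/nlme machinery of the actual analysis.

```python
import numpy as np

# Hypothetical mean IRTH scores for four age groups (the four-group design
# mirrors Krumhansl & Keil, 1982, but the values here are illustrative only).
age = np.array([6.5, 8.5, 10.5, 20.0])
score = np.array([0.35, 0.62, 0.74, 0.78])

# Linear model: score = a + b * age (ordinary least squares).
b_lin, a_lin = np.polyfit(age, score, 1)
sse_lin = float(np.sum((a_lin + b_lin * age - score) ** 2))

# Saturating model: score = c * (1 - exp(-r * age)), fitted by a coarse
# grid search to avoid non-linear solver dependencies.
best = (np.inf, None, None)
for c in np.linspace(0.5, 1.0, 101):
    for r in np.linspace(0.01, 1.0, 199):
        sse = float(np.sum((c * (1 - np.exp(-r * age)) - score) ** 2))
        if sse < best[0]:
            best = (sse, c, r)
sse_sat, c_hat, r_hat = best

print(sse_sat < sse_lin)  # does the saturating curve fit better?
```

With data that rise steeply and then plateau, the saturating function captures the leveling-off that a straight line cannot; in practice the competing fits would be compared with likelihood-based criteria rather than raw SSE.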
Results
Our search process yielded h = 3,635 hits. We excluded articles based on a range of criteria (see Figure 1 for details). Ultimately, we included i = 13 articles reporting j = 16 studies with y = 60 effect sizes for N = 1,287 participants.
Figure 1
Study Selection Flowchart
Table 1 presents the descriptive statistics for the effect sizes of the included studies. All studies were cross-sectional, with no longitudinal studies fulfilling the eligibility criteria. The sample sizes of the included studies ranged from n = 24 (Paananen, 2009; Speer & Meeks, 1985) to n = 285 (Lamont & Cross, 1994). The studies’ contributions to the pooled effect size estimation differed, with weights ranging from approximately 2% (Paananen, 2009; Speer & Meeks, 1985) to 20% (Lamont & Cross, 1994). Effect sizes varied widely, 0 ≤ d ≤ 3.9 with 0.14 ≤ υ ≤ 0.41, indicating substantial variance.
Table 1
Descriptive Statistics of Effect Sizes in Primary Studies and the Multi-Level Data Structure
The studies included participants from various age groups, ranging from age 6 (e.g., Krumhansl & Keil, 1982) to age 15 (Paananen, 2009). Adults were used as the treatment group in five studies (e.g., Matsunaga et al., 2020) because they were assumed to have fully developed IRTH, making them a suitable reference population. However, as suggested by our subsequent model comparison analyses, this assumption appears to be only partially accurate, as adults do not consistently demonstrate superior performance in tasks with higher cognitive demands. For a visualization of the correlation between the age of the treatment group, the age of the control group, and the effect sizes, see the scatter plot in Supplemental Material S9. Most studies used the probe tone technique for the dependent variable, either with a rating scale (j = 5) or by asking participants to produce the most appropriate probe tone (j = 1). The remaining studies used goodness-of-fit ratings for syntax violations, response time measures, or, in three cases, creative tasks such as composition, improvisation, or harmonization (Table 2).
Table 2
Characteristics of Primary Studies
| ID | Author (Year) | DV | n (Treat) | n (Cont) | Mean Age (Treat) | Mean Age (Cont) | Trained % (Treat) | Trained % (Cont) |
|---|---|---|---|---|---|---|---|---|
| 1 | Krumhansl and Keil (1982), comp. 1 | PT | 14 | 14 | 8.5 | 6.5 | 71 | 43 |
| 1 | Krumhansl and Keil (1982), comp. 2 | PT | 14 | 14 | 10.5 | 8.5 | 79 | 71 |
| 1 | Krumhansl and Keil (1982), comp. 3 | PT | 14 | 14 | 20 | 10.5 | 86 | 79 |
| 2 | Speer and Meeks (1985)a | PT | 12 | 12 | 10 | 7 | 42 | 33 |
| 3 | Cuddy and Badertscher (1987), comp. 1 | PT | 21 | 20 | 8.5 | 6.5 | 43 | 45 |
| 3 | Cuddy and Badertscher (1987), comp. 2 | PT | 12 | 21 | 10.5 | 8.5 | 42 | 43 |
| 4 | Schwarzer et al. (1993)a | PTP | 20 | 26 | 20 | 9 | 25 | 23 |
| 5 | Lamont and Cross (1994)b | PT | 285 | — | 6-9 | — | n/a | n/a |
| 6 | Wilson and Wales (1995) | Comp | 36 | 37 | 9 | 7 | 42 | 62 |
| 7 | Paananen (2003), comp. 1 | Imp | 12 | 12 | 8.5 | 6.5 | n/a | n/a |
| 7 | Paananen (2003), comp. 2 | Imp | 12 | 12 | 10.5 | 8.5 | n/a | n/a |
| 8 | Schellenberg et al. (2005) I | RT | 13 | 10 | 10.5 | 6.5 | 100 | 0 |
| 9 | Schellenberg et al. (2005) II | RT | 19 | 17 | 10.5 | 7.5 | 47 | 41 |
| 10 | Schellenberg et al. (2005) III | RT | 22 | 22 | 10.5 | 7.5 | 59 | 50 |
| 11 | Paananen (2009), comp. 1 | Harm | 12 | 11 | 8.5 | 6.5 | n/a | n/a |
| 11 | Paananen (2009), comp. 2 | Harm | 12 | 11 | 10.5 | 8.5 | n/a | n/a |
| 11 | Paananen (2009), comp. 3 | Harm | 8 | 12 | 14.5 | 10.5 | n/a | n/a |
| 12 | Maier-Karius and Schwarzer (2011), comp. 1 | PT | 32 | 40 | 9.5 | 6.5 | 25 | 18 |
| 12 | Maier-Karius and Schwarzer (2011)a, comp. 2 | PT | 10 | 32 | 20 | 9.5 | 100 | 25 |
| 13 | James et al. (2012)b | GoF | 112 | — | 6-10 | — | n/a | n/a |
| 14 | Matsunaga et al. (2020) I, comp. 1 | GoF | 24 | 24 | 9 | 7 | 0 | 0 |
| 14 | Matsunaga et al. (2020) I, comp. 2 | GoF | 26 | 24 | 10.5 | 9 | 0 | 0 |
| 14 | Matsunaga et al. (2020) I, comp. 3 | GoF | 28 | 26 | 12.5 | 10.5 | 0 | 0 |
| 14 | Matsunaga et al. (2020) I, comp. 4 | GoF | 20 | 28 | 14 | 13.5 | 0 | 0 |
| 14 | Matsunaga et al. (2020) I, comp. 5 | GoF | 28 | 28 | 20 | 14 | 0 | 0 |
| 15 | Matsunaga et al. (2020) II, comp. 1 | GoF | 25 | 26 | 9 | 7 | 0 | 0 |
| 15 | Matsunaga et al. (2020) II, comp. 2 | GoF | 19 | 25 | 11 | 9 | 0 | 0 |
| 15 | Matsunaga et al. (2020), II, comp. 3 | GoF | 23 | 19 | 13 | 11 | 0 | 0 |
| 15 | Matsunaga et al. (2020) II, comp. 4 | GoF | 26 | 23 | 15 | 13 | 0 | 0 |
| 15 | Matsunaga et al. (2020) II, comp. 5 | GoF | 26 | 26 | 20 | 15 | 0 | 0 |
| 16 | Corrigall et al. (2022) | RT, GoF | 49 | 48 | 6.5 | 10.5 | 55 | 45 |
Note. comp. = age group comparison in case of multi-arm studies; 1 = youngest group; 2 = next oldest group, etc., up to comp. 5; DV = dependent variable; PT = probe tone rating; PTP = probe tone production (participants played the probe tone on a keyboard); Imp = improvisation; Comp = composition; Harm = harmonization; RT = response time; GoF = goodness-of-fit ratings; Treat = treatment group (older participants); Cont = control group (younger children); n = number of participants in each group; n/a in the musical training columns indicates that the available information is insufficient to reproduce the number of musically trained participants in each subgroup. Formal musical training is operationalized differently in each study.
a Indicates studies in which the mean age for adults was set to 20 years owing to insufficient information: Krumhansl and Keil (1982, p. 246) reported only undergraduate or graduate students as adults, Maier-Karius and Schwarzer (2011, p. 172) reported graduate students, and Schwarzer et al. (1993, p. 77) provided an age range of 19–45 years. b Indicates studies that provided only a global effect size for primary school students; therefore, the ages of the youngest and oldest participants were coded.
The control variables differed significantly across studies. Some researchers, such as Krumhansl and Keil (1982), did not explicitly control for prior formal musical training, whereas others, such as Corrigall et al. (2022), treated it as a key variable. The proportion of participants with formal musical training in each sample ranged from none (coded as 0; Matsunaga et al., 2020) to all (coded as 1; Maier-Karius & Schwarzer, 2011). Three studies reported the percentage of musically trained participants only across the whole sample: Paananen (2003), 13.89%; Paananen (2009), 13.63%; and James et al. (2012), 16.96%. Formal musical training could not be included as a moderator in the statistical model owing to limited data and the absence of consistent sampling and reporting practices for these variables across studies (Harrer et al., 2021).
Study quality was evaluated using various criteria (see Supplemental Material S2), with an average score of M = 19.3 (SD = 2.3) out of 43. Overall, study quality was relatively homogeneous; however, only one study reported psychometric quality criteria (Wilson & Wales, 1995).
Bayesian Three-Level Meta-Analysis
We initially estimated a four-level meta-analysis to aggregate effect sizes, as two articles (Matsunaga et al., 2020; Schellenberg et al., 2005) reported independent studies, suggesting potential within-article heterogeneity. A comparison between the initial four-level model and a three-level model yielded a Bayes factor of BF10 = 1.92, which provided insufficient evidence to decisively favor either model (for a general discussion, see van Doorn et al., 2021). We ultimately chose the three-level model because of its slightly better data fit and the principle of parsimony (Vandekerckhove et al., 2015). The model comparison did not support further simplifying the model to a conventional random-effects structure.
Before interpreting the results, we confirmed model convergence for our three-level structure by conducting posterior predictive checks and examining the R̂ values of the parameter estimates (Harrer et al., 2021). All R̂ values indicated successful convergence (Bürkner, 2017). Additionally, visual inspection confirmed that the posterior distributions aligned with the initial unimodal normal distribution, supporting our assumption of normality (see Appendix Figure A2-1). Sensitivity analyses using different weakly informative priors further support the robustness of our results (Table 3).
Table 3
Prior and Posterior Model Parameters of Three-Level Meta-Analysis Model and Sensitivity Analyses
| Model Parameter | Prior 1: N(0, 1) | Prior 2: HalfCauchy(0, 0.5) | Prior 3 (alternative): N(0, 0.5) | Prior 4 (alternative): t(3, 0, 2.5) | Conclusions |
|---|---|---|---|---|---|
| Intercept (µ) | Mdn = 0.57 (0.37, 0.77) | — | Mdn = 0.56 (0.36, 0.75) | Mdn = 0.58 (0.37, 0.78) | Small, practically negligible differences between estimates; Prior 3 goes along with a slightly different but negligible shrinkage of the point estimate towards the center of the prior. |
| Between-study SD (τ1) | — | Mdn = 0.20 (0.00, 0.40) | Mdn = 0.20 (0.04, 0.40) | Mdn = 0.21 (0.04, 0.42) | Minor, practically negligible differences between parameter estimates; Prior 4 produced a slightly wider 95% credible interval, reflecting marginally more uncertainty. |
| Within-study SD (τ2) | — | Mdn = 0.43 (0.27, 0.60) | Mdn = 0.43 (0.28, 0.60) | Mdn = 0.45 (0.27, 0.61) | Minor, practically negligible variation across priors. |
Note. Posterior medians (Mdn) are used as point estimates; 95% credible intervals are presented in parentheses; for Priors 3 and 4, half-distributions were used for modeling τ1 and τ2.
We assessed publication bias using a Bayesian Egger test, which indicated a significant intercept (β1 = 7.46, 95% CI [3.12, 11.67]), suggesting potential publication bias. However, previous studies have recommended conducting this approach on a corpus of at least 30 studies because asymmetry tests for standardized mean differences are prone to inflated Type I error rates (Pustejovsky & Rodgers, 2019; Renkewitz & Keiner, 2019). A visual inspection of the funnel plot (see Appendix Figure A2-2), showing the standard error of the reported effect sizes as a function of their magnitude, suggests the potential for slight asymmetry; however, this is primarily driven by Krumhansl and Keil (1982; Study ID 1). Thus, we found no significant evidence of general publication bias caused by selective reporting.
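For reference, the classical (frequentist) Egger test regresses the standardized effect on precision; an intercept far from zero signals funnel-plot asymmetry. The following Python sketch demonstrates this on invented data in which small studies (large standard errors) report larger effects. Our analysis used a Bayesian variant, which this sketch does not reproduce.

```python
import numpy as np

def egger_intercept(d, se):
    """Classical Egger regression: standardized effect (d/se) on
    precision (1/se); an intercept far from zero suggests small-study
    effects, i.e., funnel-plot asymmetry."""
    d, se = np.asarray(d, float), np.asarray(se, float)
    y, x = d / se, 1 / se
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

# Hypothetical corpus: the imprecise studies report the largest effects,
# producing an asymmetric funnel.
d = [1.2, 1.0, 0.8, 0.5, 0.45, 0.4]
se = [0.45, 0.40, 0.30, 0.15, 0.12, 0.10]
intercept, slope = egger_intercept(d, se)
print(round(intercept, 2))
```

In this regression the slope approximates the underlying effect, while the intercept captures the excess effect contributed by imprecise studies, which is why asymmetry tests become unreliable in small corpora.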
We further explored potential sources of bias by examining the model-based relationships between μ and τ1 (Figure 2a), and μ and τ2 (Figure 2b). Although visual inspection suggested only marginal associations, Pearson correlations revealed statistically significant but negligible effects in both cases: μ and τ1, r(11998) = -.09, p < .001, 95% CI (-.11, -.07); μ and τ2, r(11998) = .08, p < .001, 95% CI (.06, .10). These results indicated that any potential relationship between effect sizes and heterogeneity is minimal, thus providing little evidence of systematic publication bias.
Figure 2
Heatmap of the Joint Posterior Distribution of Model Parameters μ (x-axis) and Standard Deviations τ (y-axis)
Note. Figure 2a depicts the relationship between μ and τ1 (between-study standard deviation). Figure 2b shows the relationship between μ and τ2 (within-study standard deviation). The color gradient reflects the density scale, with higher values (lighter regions) indicating higher density and lower values (darker regions) indicating lower density. Red lines in each plot represent the maximum likelihood point estimates of both parameters.
Based on the observed data, the hierarchical model, and the prior choices, we obtained a medium-sized point estimate for the pooled effect (Mdn = 0.57). As indicated by the credible intervals of the pooled effect size (Table 3), we can further conclude that the general, so-called "true" effect size, expressed as Cohen's d, reflects a medium to nearly large difference in IRTH sensitivity between younger and older participants. Moreover, the 95% credible interval for the effect lies entirely above zero (0.37, 0.77), along with decisive evidence against a point-null hypothesis (BF10 = 11999). We further assessed whether the age-related increase in IRTH sensitivity exceeds a negligible magnitude, that is, whether it is neither trivially different from zero nor within the range of a small effect (d ≤ 0.20). To this end, we used a Bayesian model comparison approach and Bayes factors to quantify the evidence that the result exceeds the small-effect threshold. Our analyses revealed a Bayes factor providing decisive support for the alternative assumption (BF10 = 1332.33), indicating that the age-related increase in IRTH sensitivity is unlikely to be marginal. Instead, it is best interpreted as a substantively meaningful skill-development effect with a high probability of at least medium-to-strong magnitude. We further illustrate the cumulative posterior evidence by plotting the empirical cumulative distribution functions for µ, τ1, and τ2 (see Appendix Figure A2-3) to facilitate the interpretation of the probability that these parameters exceed meaningful thresholds.
Furthermore, based on the hierarchical model and its estimated parameters, at least one effect size estimate from each study falls within the 95% credible interval of the pooled effect size (Figure 3). At first glance, Schwarzer et al. (1993) appear to be an exception; however, although the point estimates of the effect sizes in their study fall well outside the 95% credible interval of the pooled effect size, their respective 95% credible intervals still intersect with this range.
Figure 3
Forest Plot of the Bayesian Three-Level Model With Estimated Effect Sizes of Individual Studies and the Pooled Effect Size
Note. The densities represent their respective posterior distributions. Medians serve as point estimates for the effect sizes. Blue shading highlights estimated effect sizes and parts of the posterior distributions that fall within the 95% credible interval of the pooled effect size.
Although the vast majority of the effect sizes reported in the investigated studies fall within the expected 50–80% credible intervals according to our hierarchical model (Figure 4), some observed effect sizes reported by Krumhansl and Keil (1982) deviate from this pattern. Specifically, two effect sizes (ID 11 and ID 12) are exceptionally high. Although our model considers these values possible, they are highly unlikely from a statistical standpoint. For example, according to our model, the probability that the model-implied effect size for ID 12 in Krumhansl and Keil (1982) is greater than or equal to the observed effect size is relatively low, at 2.1%. The observed effect size for ID 11 has an even lower probability of occurring, at less than 1%.
Figure 4
Model-Based Probability Predictions for the Occurrence of Each Effect Size
Note. Whereas the vast majority of effect sizes have a high probability of occurrence based on the model-based expectations, some values have extremely low statistical probability, such as ID 11 and ID 12.
The multilevel model revealed significant heterogeneity, which can be divided into between- and within-study components. Whereas the between-study variance (based on the posterior median τ1 = 0.20) accounted for only 18% of the total heterogeneity, the majority (82%) was attributable to within-study variance (τ2 = 0.43).
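This split can be reproduced from the posterior medians reported in Table 3, assuming the variance shares are computed as proportions of the summed squared standard deviations:

```python
tau1 = 0.20  # between-study SD (posterior median, Table 3)
tau2 = 0.43  # within-study SD (posterior median, Table 3)

# Variance shares: each squared SD divided by the total variance.
total = tau1**2 + tau2**2
between = tau1**2 / total
within = tau2**2 / total
print(round(between * 100), round(within * 100))
```

Note that the shares are computed on the variance (squared-SD) scale; comparing the raw standard deviations 0.20 and 0.43 would understate how strongly within-study differences dominate.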
However, differences in age-group compositions across studies prevented us from answering RQ2 using this analytical approach, which is discussed below. Furthermore, RQ3 could only be addressed in an exploratory manner, as the limited number of studies and inconsistent reporting of relevant sample characteristics hindered formal statistical testing. Table 2 summarizes the proportions of musically trained and untrained individuals across the included samples.
We addressed RQ4 by examining whether differences in IRTH measurement type could explain the variability in effect sizes. The central goal of a meta-analysis is to account for heterogeneity in the data (Borenstein, 2019). Therefore, we followed Corrigall et al. (2022) and introduced a categorical moderator representing six measurement types: probe tone, response time, goodness-of-fit rating, composition, improvisation, and harmonization. The analysis revealed largely overlapping 95% credible intervals and only small differences in posterior means, suggesting that the measurement type had no systematic effect (Figure 5). However, comparing model fit using R² values (Hayes, 2022) showed a slight improvement in the moderated model, R² = .51, 95% CrI (.30, .72), compared to the model without a moderator, R² = .48, 95% CrI (.27, .70). Although the Bayes factor of BF10 = 38.80 indicated a strong statistical preference for the moderated model (van Doorn et al., 2021), the practical significance of this improvement remained uncertain, given the modest differences in parameter estimates. Therefore, we directly addressed RQ4 by comparing the response time measures with all other operationalizations. These comparisons yielded weak-to-moderate evidence against a systematic difference between response time measures and goodness-of-fit ratings (BF10 = 0.48), harmonization (BF10 = 1.00), and improvisation (BF10 = 0.84); only the comparison with probe tone ratings indicated lower estimates for response time measures. Overall, the results indicated smaller effect sizes for response time measurements relative to probe tone ratings only, with no consistent reduction in effect sizes across the other operationalizations.
Figure 5
Posterior Distributions of the Intercept and Moderator Levels of the Bayesian Three-Level Moderated Meta-Analysis
Note. Effect sizes are standardized mean differences (d). Thin error bars are 50% CrI of d. Thick error bars indicate 95% CrI of d.
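The verbal evidence labels we use for the Bayes factors above follow conventional Jeffreys-style thresholds (cf. van Doorn et al., 2021). A small sketch makes the mapping explicit; the cut-offs are the conventional ones, not values derived from our data:

```python
def bf10_label(bf10: float) -> str:
    """Map a Bayes factor BF10 to a conventional Jeffreys-style evidence
    category (thresholds are standard conventions, assumed here)."""
    if bf10 < 1 / 10:
        return "strong evidence for H0"
    if bf10 < 1 / 3:
        return "moderate evidence for H0"
    if bf10 < 1:
        return "anecdotal evidence for H0"
    if bf10 == 1:
        return "no evidence either way"
    if bf10 < 3:
        return "anecdotal evidence for H1"
    if bf10 < 10:
        return "moderate evidence for H1"
    return "strong evidence for H1"

# The Bayes factors reported in the text:
for bf in (38.80, 0.48, 1.00, 0.84):
    print(f"BF10 = {bf:5.2f} -> {bf10_label(bf)}")
```

Under this convention, BF10 = 38.80 counts as strong evidence for the moderated model, whereas BF10 = 0.48 and 0.84 constitute only anecdotal (weak) evidence against a difference.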
Taken together, these results suggest that response time measurements do not consistently yield lower IRTH sensitivity estimates compared with more explicit operationalizations. Therefore, the current evidence does not support the assumption underlying RQ4, except for the notable difference from probe tone ratings.
The observed within- and between-study heterogeneity was likely caused by specific artifacts, such as sample and task characteristics. We cannot pinpoint these factors more precisely because of the limited information in the primary studies; however, certain possibilities can be excluded based on the model comparison analysis presented below.
Thus far, we have reported evidence of a medium-to-strong age-related increase in IRTH sensitivity, although the high variance somewhat obscures this trend. However, these analyses do not clarify whether age-specific learning gains are present, particularly (1) whether acquisition continues beyond age 9 and (2) whether there are sensitive periods during school years that would be reflected in a non-linear learning trajectory. Thus, we conducted further analyses to address these uncertainties.
Model Comparison Analysis
Based on the AIC, BIC, and RMSEA model comparison criteria, the non-linear mixed model best fits Krumhansl and Keil's (1982) data (see Appendix Table A1-1), suggesting that the increase in IRTH sensitivity between ages 6 and 20 is neither linear nor sigmoidal. Figure 6 shows the age-related development of various IRTH sensitivity outcomes according to the non-linear mixed model. Because the model still indicates growth in IRTH sensitivity at age 20 (k = 0.08), we concluded that IRTH continues to develop into adulthood and is not fully developed by age 9.
Figure 6
Development of IRTH Modelled as a Non-Linear Function of Age
Note. Standardized mean differences (d) as an indicator of tonal recognition stability, based on reanalyzed cross-sectional data of Krumhansl and Keil (1982, pp. 247–248), as measured with the two-sample-tone probe tone paradigm (see the Procedure section for an explanation).
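The fitted equation of the non-linear mixed model is not reproduced here; as a minimal illustration of how a saturating growth trajectory can be checked for continued development at a given age, consider a saturating-exponential curve. The functional form and all parameter values below are our own illustrative assumptions, not the model fitted to Krumhansl and Keil's (1982) data.

```python
import math

def irth_sensitivity(age: float, d_max: float = 2.0,
                     rate: float = 0.12, onset: float = 4.0) -> float:
    """Illustrative saturating growth curve:
    d(age) = d_max * (1 - exp(-rate * (age - onset))).
    All parameters are assumed values for demonstration only."""
    return d_max * (1.0 - math.exp(-rate * (age - onset)))

def growth_rate(age: float, d_max: float = 2.0,
                rate: float = 0.12, onset: float = 4.0) -> float:
    """Derivative of the curve: the remaining growth at a given age.
    A positive value indicates that development is still ongoing."""
    return d_max * rate * math.exp(-rate * (age - onset))

# A positive slope at age 20 would indicate continued development,
# analogous to the k = 0.08 reported for the fitted model.
print(f"d(9)  = {irth_sensitivity(9):.2f}")
print(f"d(20) = {irth_sensitivity(20):.2f}")
print(f"slope at age 20 = {growth_rate(20):.3f}")
```

Under this parameterization, sensitivity keeps rising between ages 9 and 20 and the slope at age 20 remains positive, which is the pattern the model comparison analysis identified in the reanalyzed data.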
Furthermore, the distinct trajectories shown in Figure 6 suggest that outcomes may strongly depend on specific task characteristics, even within a single measurement paradigm. Although the differentiation between pairs of diatonic and non-diatonic tones, as well as between the tonic triad and other diatonic tones, shows a steep increase, more subtle distinctions, such as the preference for the tonic over the other tonic triad tones, show little to no increase, even in adults.
Discussion
Our study aimed to synthesize 40 years of research examining the impact of age-related experience with Western tonal music on IRTH acquisition in school-aged children. Additionally, we explored sources of variance both within and between studies to better understand the heterogeneity in study outcomes.
As a first approach, we aggregated eligible studies in a Bayesian three-level meta-analysis to precisely quantify the mean increase in IRTH sensitivity between younger and older participants. On average, the results indicated a moderate increase in IRTH sensitivity during the school years, depending on the cognitive demands of the tasks measuring it. This finding has significant educational implications. The observed medium effect size supports the implementation of targeted programs (e.g., Government of the UK, Department of Education, 2021) to enhance tonal sensitivity in school-aged children, as an effect of this magnitude is practically relevant for educational initiatives (Hattie, 2012). Future research should further investigate these acquisitional trajectories and explore potential critical periods beyond the age range covered in our study (Hargreaves & Lamont, 2017).
Subsequently, we conducted a model comparison of single-study results to explore the timing and potential shapes of learning trajectories within IRTH acquisition as a function of task characteristics. The findings provide converging evidence against early closure models that assume a fixed endpoint of acquisition by age 7 or 9 (Gembris, 2017b; Gordon, 2012). Our findings indicate that IRTH sensitivity instead continues to develop through adolescence and into adulthood, supporting gradual or open-ended models of acquisition (e.g., Hargreaves, 1996) and aligning with neurocognitive and educational frameworks of lifelong learning (Altenmüller, 2022; Mack et al., 2025). We also observed improvements among participants without formal musical training, underscoring the role of enculturation and mere exposure (Demorest & Morrison, 2016). Furthermore, data from participants with ongoing formal training—such as the adult volunteers in Krumhansl and Keil (1982), who had 7.8 years of musical training on average—indicate that IRTH performance remains trainable beyond early childhood. These patterns suggest a skill development process shaped not only by exposure but also by the cognitive demands of the task. Higher task complexity may require a level of expertise that emerges only through extended, potentially formalized practice. Consequently, the observed plateau effects likely reflect task-specific thresholds rather than universal developmental limits.
A more detailed analysis using a model comparison approach revealed that a task-specific non-linear model best captured the developmental trajectory, a pattern also observed in other musical contexts, such as the development of melody perception (Lin, 2023). Notably, our analysis revealed substantial differences in IRTH developmental trajectories, which varied not only as a non-linear function of age but also as a function of specific task characteristics. Matsunaga et al. (2020), who found that Japanese children recognized Western diatonic tones at age 7 but identified the tonic only at age 13, likewise reported variations across tasks. In contrast, implicit measures of IRTH demonstrated sensitivity to the tonic at ages 6 to 7 (Corrigall et al., 2022; Schellenberg et al., 2005), suggesting that implicit processes may precede explicit ones in tonal development (Corrigall et al., 2022). Neuroscientific (Corrigall & Trainor, 2014) and behavioral studies using implicit measures (Trainor & Trehub, 1992) further support the notion that some tonal abilities might develop earlier than others. Our meta-analysis, which focused primarily on explicit tasks, aligns with research showing later acquisition in explicit tonal processing.
However, the data from the reviewed studies currently provide insufficient evidence to determine how the various operationalizations correlate or whether they can be attributed to a single latent variable (IRTH) to which all prior operationalizations conceptually refer. To date, no statistical examination has tested whether all operationalizations capture the same latent variable, as a general factor model of IRTH would assume. Alternatively, these operationalizations may represent dimensions of a composite factor model or discrete, partially uncorrelated constructs associated with different latent variables. A test-theoretical approach would be valuable for addressing this question across age groups (e.g., Bond & Fox, 2015), enabling the subsequent examination of developmental trajectories of IRTH in a longitudinal study. Owing to insufficient statistical information in the primary studies, RQ3, regarding the effects of formal musical training on IRTH acquisition, could not be answered fully based on the available evidence. Nevertheless, our findings suggest that although IRTH sensitivity develops in children without formal musical training (e.g., Matsunaga et al., 2020), formal musical training may enhance explicit judgments (e.g., for the musical experts in Maier-Karius & Schwarzer, 2011) while having minimal impact on implicit response times (e.g., falling within the 95% CrI of the modeled mean in Schellenberg et al., 2005). This pattern is consistent with Corrigall et al.'s (2022) findings and is further supported by studies employing productive IRTH operationalizations (e.g., Guilbault, 2009; Wilson & Wales, 1995).
However, several studies have reported no advantage of formal musical training for goodness-of-fit ratings (James et al., 2015; Schellenberg et al., 2005; Stalinski & Schellenberg, 2010), or have observed improvements in implicit measures, such as brain responses, in musically trained participants (Koelsch et al., 2005; Magne et al., 2006; Putkinen et al., 2014; Wehrum et al., 2011).
Although the limited data prevented us from disentangling the individual contributions of age, informal and formal musical training, and their interactions, our analyses underscore the need for further research in this area. The recurrent absence of data emphasizes the importance of transparent and sustainable research data management (Eerola, 2025), including practices such as verbatim reporting of instructions, providing original stimuli, reporting means and standard deviations for each condition and treatment group (or supplying raw data), and specifying test quality criteria (e.g., test-retest correlation for repeated measurements). By addressing these aspects, future studies will enable meta-analyses such as ours to be conducted more broadly and provide more precise estimates.
The following limitations should be considered when interpreting the results. Generally, meta-analyses inherit the methodological limitations of primary studies. For example, all included studies used cross-sectional quasi-randomized or observational study designs. Although common in this field (Boutron et al., 2022), these methods may introduce additional variance (Murad et al., 2016). Another example is the predominantly low statistical power of the primary studies (Ellis, 2010). Although meta-analytic weighting procedures account for sampling errors by assigning lower weights to less precise estimates, studies with small samples remain more susceptible to extreme or unstable effect size estimates owing to higher random variability (Harrer et al., 2021). Therefore, a high proportion of such studies may increase heterogeneity, particularly if their estimates are systematically biased or selectively reported. This highlights the importance of adequately designed and reported primary studies and calls for caution when interpreting highly variable or inconsistent findings stemming from small samples. From a conceptual standpoint, a limitation of our study lies in treating age as an independent variable simply because it was frequently reported in the primary studies. Age influences various cognitive factors, such as maturation, enculturation, exposure, and musical skill acquisition, that are likely to affect IRTH acquisition (Demorest & Morrison, 2016; Halford, 2014; Hannon & Trainor, 2007; Zajonc, 2001). However, the specific contributions of these factors remain unclear in both the primary studies and our own analyses. For example, formal musical training is known to enhance IRTH (Corrigall & Trainor, 2009; Kraus & Chandrasekaran, 2010; Müllensiefen, 2022; Virtala et al., 2012) but likely interacts with age and informal musical activities (Lamont, 1998).
Future research should aim to disentangle the contributions of these factors and clarify their interactions to better understand IRTH acquisition. Another limitation is the substantial heterogeneity observed in the Bayesian three-level meta-analysis, which may have obscured the overall effect. This heterogeneity can be primarily attributed to Krumhansl and Keil (1982).
In summary, although our research question regarding smaller effect sizes for response time measurements compared with other, more explicit measurements remains only partially answered, with probe tone ratings being the exception, the model comparison analysis of Krumhansl and Keil (1982) indicates task-specific differences. Our findings underscore the pivotal influence of task characteristics on IRTH measurements reported over the past 40 years. This aligns with prior research highlighting performance variations stemming from differences in probe tone instructions (Kristop et al., 2020) and supports insights gained from re-analyses of measurement instruments (Platz et al., 2022). Reviewing the knowledge gained from 40 years of research in a structured manner is worthwhile because it provides specific insights for the productive continuation of this research tradition and general guidance for handling data in ways that enable fruitful re-analyses. This discussion highlights key challenges in IRTH research and underscores the importance of employing validated methodological approaches to accurately model and interpret data. Future studies should address these limitations and focus on developing more precise and comprehensive models based on valid measures to better capture the complex dynamics of IRTH development.