
What tests should be used to assess functional performance in youth and young adults following anterior cruciate ligament or meniscal injury? A systematic review of measurement properties for the OPTIKNEE consensus
  1. Bjørnar Berg1,2,
  2. Anouk P Urhausen3,
  3. Britt Elin Øiestad4,
  4. Jackie L Whittaker5,6,
  5. Adam G Culvenor7,
  6. Ewa M Roos8,
  7. Kay M Crossley7,
  8. Carsten B Juhl8,9,
  9. May Arna Risberg1,3
  1. 1 Division of Orthopaedic Surgery, Oslo University Hospital, Oslo, Norway
  2. 2 Centre for Intelligent Musculoskeletal Health, Faculty of Health Sciences, Oslo Metropolitan University, Oslo, Norway
  3. 3 Department of Sports Medicine, Norwegian School of Sport Sciences, Oslo, Norway
  4. 4 Department of Physiotherapy, Oslo Metropolitan University, Oslo, Norway
  5. 5 Department of Physical Therapy, Faculty of Medicine, The University of British Columbia, Vancouver, British Columbia, Canada
  6. 6 Arthritis Research Centre, Vancouver, Vancouver, Canada
  7. 7 La Trobe Sport and Exercise Medicine Research Centre, La Trobe University School of Allied Health Human Services and Sport, Bundoora, Victoria, Australia
  8. 8 Department of Sports Science and Clinical Biomechanics, University of Southern Denmark, Odense, Denmark
  9. 9 Department of Physiotherapy and Occupational Therapy, Copenhagen University Hospital, Herlev and Gentofte, Kobenhavn, Denmark
  1. Correspondence to Dr Bjørnar Berg, Division of Orthopaedic Surgery, Oslo University Hospital, Oslo 0424, Norway; bjornar.berg{at}


Objectives To critically appraise and summarise measurement properties of functional performance tests in individuals following anterior cruciate ligament (ACL) or meniscal injury.

Design Systematic review.

Data sources Systematic searches were performed in Medline (Ovid), Embase (Ovid), CINAHL (EBSCO) and SPORTSDiscus (EBSCO) on 7 July 2021.

Eligibility criteria for selecting studies Studies evaluating at least one measurement property of a functional performance test including individuals following an ACL tear or meniscal injury with a mean injury age of ≤30 years. The COnsensus-based Standards for the selection of health Measurement INstruments Risk of Bias checklist was used to assess methodological quality. A modified Grading of Recommendations Assessment, Development and Evaluation assessed evidence quality.

Results Thirty studies evaluating 26 functional performance tests following ACL injury were included. No studies were found in individuals with an isolated meniscal injury. Included studies evaluated reliability (n=5), measurement error (n=3), construct validity (n=26), structural validity (n=1) and responsiveness (n=1). The Single Leg Hop and Crossover Hop tests showed sufficient intrarater reliability (high and moderate quality evidence, respectively), construct validity (low-quality and moderate-quality evidence, respectively) and responsiveness (low-quality evidence).

Conclusion Frequently used functional performance tests for individuals with ACL or meniscal injury lack evidence supporting their measurement properties. The Single Leg Hop and Crossover Hop are currently the most promising tests following ACL injury. High-quality studies are required to facilitate stronger recommendations of performance-based outcomes following ACL or meniscal injury.

  • knee
  • anterior cruciate ligament



Sport-related traumatic knee injuries in adolescents and young adults are widespread.1 2 Assessing physical function following knee injury is a key component of evaluating rehabilitation progression, treatment success and return to sport readiness.3 Objective performance tests are recommended to comprehensively evaluate physical function as they provide complementary information to patient-reported outcomes.4–6 Functional performance may also inform early identification of individuals at risk of post-traumatic knee osteoarthritis and facilitate development of appropriate secondary prevention strategies.7 8 While a range of functional performance tests are available, we lack information on those with the best measurement properties (such as reliability, validity and responsiveness).9

Knowledge of measurement properties is essential to guide practitioners and researchers in choosing the most appropriate functional performance test for clinical practice or research.10 To inform rehabilitation progression and evaluation of treatment success a test should be free from measurement error, measure the intended construct and be able to detect changes over time.11 Using tests with poor or unknown measurement properties introduces the risk of imprecise or biased results, potentially leading to suboptimal treatment and outcomes for individual patients.10 12 Furthermore, it constitutes a waste of resources if applied for research purposes.13

Measurement properties of functional performance tests have rarely been evaluated in systematic reviews for individuals with knee disorders. Past reviews have either: (1) assessed only a single measurement property14; (2) been limited to individuals with knee osteoarthritis15 16 or (3) lacked a structured approach to evaluate measurement properties.17 18 There is now a need to appraise the measurement properties of the large number of functional performance tests used in adolescents and young adults following knee injury.

This systematic review aimed to critically appraise and summarise the measurement properties of functional performance tests in young adults following anterior cruciate ligament (ACL) or meniscal injury, irrespective of management approach. This systematic review is one of several contributing to the development of evidence-based consensus recommendations for rehabilitation to optimise musculoskeletal health and prevent post-traumatic osteoarthritis following knee trauma (OPTIKNEE;

Materials and methods

The systematic review was conducted using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guideline19 20 and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement.21

Eligibility criteria for selecting studies

Studies were eligible for inclusion if they met the following criteria:

  1. Population: the study sample included individuals with an isolated ACL tear, ACL tear with concomitant meniscal injury, or isolated meniscal injury, with a mean age of injury ≤30 years.

  2. Construct: the study included a test that measured physical function, defined as Activities according to the International Classification of Functioning, Disability and Health (ICF) framework.22

  3. Instrument: the instrument was a quantitative functional performance test.

  4. Study type: the study evaluated at least one measurement property of a functional performance test (eg, reliability, validity, measurement error, responsiveness).

Studies of individuals with or without surgical intervention were both considered for inclusion. A mean sample age of injury ≤30 years was chosen to limit the inclusion of degenerative knee joint conditions. Studies were excluded if: evaluating a measurement property of a functional performance test was not the primary aim; the functional performance test was used to validate another instrument (eg, patient-reported outcomes); the study was a randomised controlled trial, systematic review or literature review; the study was not published in full text (eg, abstracts or conference proceedings); or the study was published prior to 2000. We restricted the review to studies published from 2000 onward because the quality of reporting has increased substantially since 2000 and to ensure that the included tests are likely to be used in current clinical practice.

Search strategy and study selection

The search strategy was developed in collaboration with a senior librarian (Medical library, University of Oslo) who also performed the searches. The following databases were searched from earliest available to 16 June 2020, and repeated on 7 July 2021: Medline (Ovid), Embase (Ovid), CINAHL (EBSCO) and SPORTSDiscus (EBSCO). In addition to comprehensive search terms for population and measurement instrument, we used a highly sensitive validated search filter to identify studies evaluating measurement properties (online supplemental appendix 1).23 No constraints were set on language or date of publication (studies published prior to 2000 were excluded after the search strategy); however, systematic reviews and randomised controlled trials were excluded (using an exclusion filter). The reference lists of all included studies were hand-searched for additional relevant studies by two authors independently (BB and APU).

Identified publications from all databases were imported to EndNote (V.X9.3.3, Clarivate Analytics) and duplicates removed. Titles and abstracts were screened using the Rayyan Qatar Computing Research Institute application24 by two independent authors (BB and APU), who also independently reviewed the full-text studies for eligibility. Two senior authors (BEØ and MAR) were consulted to resolve discrepancies.

Risk of bias assessment

Two authors (BB and APU) independently assessed included studies using the COSMIN Risk of Bias checklist25 and the extension for studies on reliability and measurement error.20 This recently published extension of the COSMIN standards enables transparent and systematic evaluation of the methodological quality of studies investigating the reliability and measurement error of performance-based tests.20 26 Each study evaluating a measurement property was rated on a four-point scale (very good, adequate, doubtful or inadequate) based on the standards specific to that measurement property. The lowest rating of any standard was used as the overall rating (‘worst score counts’ principle).20 25 For studies evaluating construct validity using several comparator instruments, we considered each result as a study comparison and only combined the methodological quality ratings if each standard was rated the same across all ‘study comparisons’.27
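
The ‘worst score counts’ principle can be illustrated with a minimal Python sketch. The function name and the example ratings are illustrative assumptions and are not taken from the COSMIN checklist or from any included study.

```python
# Ratings ordered from worst to best, following the four-point scale above.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]


def overall_rating(standard_ratings):
    """Return the lowest rating among the standards ('worst score counts')."""
    if not standard_ratings:
        raise ValueError("at least one standard rating is required")
    # min() with the position in RATING_ORDER as key picks the worst rating.
    return min(standard_ratings, key=RATING_ORDER.index)


# Hypothetical ratings for the standards of one reliability study.
example = ["very good", "adequate", "doubtful", "very good"]
print(overall_rating(example))
```

A single doubtful standard therefore determines the overall rating, even when every other standard is rated very good.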

Data extraction

Data were extracted from the included studies by two authors independently (BB and APU). The data extracted included the following items: (1) study and patient sample characteristics; (2) description of the functional performance test, including the procedure and equipment; (3) measurement properties evaluated and (4) results on the measurement properties.

Data synthesis and analysis

The result for the measurement property of each single study was rated as either sufficient (+), insufficient (-) or indeterminate (?) according to the updated criteria for good measurement properties (online supplemental appendix 2).19 Quantitative pooling was performed if possible, or we qualitatively summarised the results of single studies to obtain an overall result for each functional performance test by measurement property. If the results were inconsistent, we pooled the results but downgraded quality of evidence due to inconsistency.27 The pooled or summarised results were rated against the same quality criteria,19 resulting in a rating of either sufficient (+), insufficient (-), inconsistent (±) or indeterminate (?). At least 75% of the results needed to be sufficient (or insufficient) to rate the summarised results as sufficient (or insufficient).19
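
As a rough illustration of the 75% rule, the following Python sketch combines single-study ratings into an overall rating. How indeterminate results enter the calculation is not specified in the text above; here they are simply set aside, which is an assumption, as is the function name.

```python
def summarise_results(ratings):
    """Combine single-study ratings ('+', '-', '?') into one overall rating."""
    # Assumption: indeterminate ('?') results are set aside before the 75% rule.
    rated = [r for r in ratings if r in ("+", "-")]
    if not rated:
        return "?"  # nothing but indeterminate results
    share_sufficient = rated.count("+") / len(rated)
    if share_sufficient >= 0.75:
        return "+"  # at least 75% of results sufficient
    if share_sufficient <= 0.25:
        return "-"  # at least 75% of results insufficient
    return "±"      # neither threshold reached: inconsistent


# Hypothetical set of four single-study results.
print(summarise_results(["+", "+", "+", "-"]))
```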

For construct validity, we used quadriceps and hamstring muscle strength assessed by dynamometry and four patient-reported outcomes as comparator instruments (Knee Documentation Committee Subjective Knee Form,28 Knee injury and Osteoarthritis Outcome Score sport and recreation subscale (sport/rec) and knee-related quality of life subscale,29 and ACL Return to Sport after Injury Scale).30 To interpret the results, we formulated a priori hypotheses about the expected relationship (direction and magnitude) between the functional performance tests and comparator instruments based on the generic hypotheses proposed by the COSMIN initiative,10 27 current literature and the review team’s clinical experiences (online supplemental appendix 3).27 The following reasoning was made specifically for functional performance tests involving hopping or jumping:

  1. Muscle strength, particularly quadriceps strength, is an important aspect of activities requiring high force generation.31

  2. Patient-reported outcomes and functional performance tests measure related but distinct components of physical function.6

  3. We expected correlations ≥0.50 (quadriceps strength) and ≥0.40 (hamstrings strength), and correlations between 0.30 and 0.50 with patient-reported outcomes.

When pooling or summarising construct validity results, we considered correlations with quadriceps and hamstrings strength as stronger evidence than patient-reported outcomes as they are physical function based.27 Studies reporting only p values were not included in construct validity data synthesis.27 For responsiveness, we hypothesised that changes on the functional performance test would correlate (0.30–0.50) with changes in region-specific patient-reported measures of physical function (related but dissimilar construct).

Quantitative synthesis

Quantitative pooling (ie, performing meta-analysis) was conducted on functional performance tests with two or more available results (for the same comparator) for construct validity. If a study presented results from multiple time points, we considered the shortest time point (from injury or surgery) to be most clinically relevant and analysed data only at this time.32 We calculated weighted mean Pearson correlation coefficients and 95% CIs. Effect sizes were obtained through Fisher’s z-transformation and the pooled effect sizes were back-transformed (z to r).33 A random-effects model was applied considering the heterogeneity of studies in terms of timing of functional performance testing. The I² statistic was calculated to quantify the proportion of variation in the pooled estimates due to between-study heterogeneity.34
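
The pooling steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the text does not name the between-study variance estimator, so the DerSimonian-Laird estimator is used here as a common choice, and the correlations and sample sizes in the example are hypothetical.

```python
import math


def pool_correlations(rs, ns):
    """Random-effects pooling of Pearson correlations via Fisher's z.

    rs: correlation coefficients; ns: matching sample sizes (each > 3).
    Assumption: DerSimonian-Laird estimate of the between-study variance.
    """
    zs = [math.atanh(r) for r in rs]          # Fisher's z-transformation
    vs = [1.0 / (n - 3) for n in ns]          # sampling variance of each z
    w = [1.0 / v for v in vs]                 # fixed-effect weights
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))  # Cochran's Q
    df = len(rs) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in vs]     # random-effects weights
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # I² in percent
    return {
        "r": math.tanh(z_re),                 # back-transform z to r
        "ci": (math.tanh(z_re - 1.96 * se), math.tanh(z_re + 1.96 * se)),
        "i2": i2,
    }


# Hypothetical correlations between a hop test and quadriceps strength.
result = pool_correlations([0.45, 0.60, 0.52], [40, 65, 30])
print(result)
```

Pooling on the z scale and back-transforming keeps the pooled estimate and its interval within the valid range of a correlation coefficient.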

Qualitative synthesis

A qualitative summary of the results was performed when quantitative pooling was not possible. The number of confirmed hypotheses was counted across studies on construct validity and responsiveness. For reliability, the point estimate of the intraclass correlation coefficient (ICC) or Cohen’s kappa (k) was used to conclude whether a measurement instrument had sufficient reliability.26 Comprehensive research questions were also deduced from the design of studies on reliability and measurement error to determine how the results inform the quality of the functional performance test.20 For reliability studies describing different operationalisations of the measurement protocol for the same functional performance test, the studies were combined if results were consistent (ie, either sufficient or insufficient).26

Grading the quality of evidence

The modified Grading of Recommendations Assessment, Development and Evaluation19 approach was applied. The quality of evidence was graded separately for each measurement property for each functional performance test, resulting in a grading of either high, moderate, low or very low quality of evidence. Grading was based on four factors: (1) risk of bias (ie, methodological quality of studies using the COSMIN checklist); (2) inconsistency (ie, unexplained inconsistency of results across studies); (3) imprecision (ie, total sample size of available studies) and (4) indirectness (ie, evidence from different populations).19 No rating of quality of evidence was given for inconsistent results with no explanation for inconsistency or when the overall rating was indeterminate.19
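
The grading logic can be sketched in Python as a simple downgrading scheme. The sketch assumes that grading starts at high quality and moves down one category per downgrade level, which is consistent with the results reported in this review (a one-level downgrade yields moderate quality, a two-level downgrade yields low quality). The function name and example inputs are hypothetical.

```python
GRADES = ["high", "moderate", "low", "very low"]
FACTORS = ("risk_of_bias", "inconsistency", "imprecision", "indirectness")


def grade_evidence(downgrades):
    """Return the quality of evidence after applying downgrade levels.

    downgrades: dict mapping a factor name to the number of levels (>= 0)
    by which the evidence is downgraded for that factor.
    Assumption: grading starts at 'high' and cannot fall below 'very low'.
    """
    for factor, levels in downgrades.items():
        if factor not in FACTORS or levels < 0:
            raise ValueError(f"invalid downgrade: {factor}={levels}")
    total = sum(downgrades.values())
    return GRADES[min(total, len(GRADES) - 1)]


# Hypothetical example: one level for inconsistency, one for imprecision.
print(grade_evidence({"inconsistency": 1, "imprecision": 1}))
```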

Finally, we examined whether excluding studies published prior to year 2000 impacted the results (measurement property rating and quality of evidence).

Patient and public involvement

Two individuals with lived experience of ACL tear (and ACL reconstruction) and four clinicians (ie, physiotherapists, orthopaedic surgeons) contributed to the priority theme setting of this review.

Results


After searches and removal of duplicates, 2521 studies underwent title and abstract review. Of these, we screened 107 full texts, and finally included 30 studies evaluating 26 unique functional performance tests (figure 1). Of the 107 full texts screened, 9 studies investigating construct validity,35–39 reliability40–42 or structural validity43 were excluded based on publication prior to 2000.

Figure 1

PRISMA flow diagram of study selection. ACL, anterior cruciate ligament; PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

Characteristics of included studies and functional performance tests

Twenty-seven of the included studies involved individuals who had an ACL reconstruction,44–70 two evaluated ACL deficient individuals71 72 and one included both ACL reconstructed and ACL deficient individuals (table 1).73 No studies evaluating individuals with an isolated meniscal injury were identified.

Table 1

Characteristics of included studies

Twenty-five of the 26 tests were single-activity tests,44–72 with the remaining test a multi-activity functional performance test or battery (table 2).73 Tests involved a range of tasks, including: (1) hopping,44–46 48–55 57–59 61–65 67–73 (2) jumping,57 (3) stepping,66 (4) running55 56 or (5) dynamic balance.47 50 51 60 66 Required equipment varied from common equipment such as measuring tape and stopwatch to more advanced equipment such as a force platform, accelerometer or timing gates. Test administration (eg, best vs mean score) and task constraints (eg, landing requirements or arm placement) also varied between studies for the same functional performance test.

Table 2

Brief description of included functional performance tests

Data synthesis

Included studies investigated five measurement properties: reliability,52 58 63 64 67 measurement error,63 64 67 construct validity,44–51 53–62 65–72 structural validity73 and responsiveness.64

Reliability and measurement error

Four studies52 58 64 67 evaluated intrarater reliability in individuals who had an ACL reconstruction. The Limb Symmetry Index (LSI) score was the outcome of interest in three studies,58 64 67 defined as the ratio of the involved limb score and the uninvolved limb score (involved/uninvolved ×100 = LSI). One study52 included both the LSI score and absolute values. Test scoring varied slightly: the best score of three trials was used in two studies,52 58 the mean of two trials in one study64 and the mean of three trials in one study.67 No study specified rater experience (online supplemental appendix 4). Sufficient intrarater reliability was found for the Single Leg Hop (ICCs from 0.92 to 0.94),58 64 67 6 m Timed Hop (ICCs 0.82 and 0.93),52 64 Crossover Hop (ICCs 0.84 and 0.94),52 64 Triple Hop (ICC 0.88),64 Vertical Hop (ICCs 0.81 and 0.98)52 58 and Stair Hop (ICC 0.90)52 tests in studies using LSI as the outcome score (table 3). In one study using absolute values as the outcome score, sufficient intrarater reliability was found for the 6 m Timed Hop (ICC 0.96), Crossover Hop (ICC 0.98), Vertical Hop (ICC 0.98) and Stair Hop (ICC 0.94) tests.52
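
To illustrate how the LSI is calculated and how the scoring method (best vs mean of trials) can change it, consider the following Python sketch. The hop distances are hypothetical and the function name is an assumption made for illustration.

```python
from statistics import mean


def limb_symmetry_index(involved, uninvolved):
    """LSI in percent: involved limb score / uninvolved limb score x 100."""
    if uninvolved <= 0:
        raise ValueError("uninvolved limb score must be positive")
    return 100.0 * involved / uninvolved


# Hypothetical single leg hop distances (cm) from three trials per limb.
involved_trials = [128.0, 135.0, 131.0]
uninvolved_trials = [150.0, 147.0, 144.0]

# Scoring on the best trial of each limb.
lsi_best = limb_symmetry_index(max(involved_trials), max(uninvolved_trials))
# Scoring on the mean of the trials of each limb.
lsi_mean = limb_symmetry_index(mean(involved_trials), mean(uninvolved_trials))

print(round(lsi_best, 1), round(lsi_mean, 1))
```

With these hypothetical distances the two scoring methods give slightly different LSI values from the same trials, which is why the scoring method needs to be reported alongside the result.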

Table 3

COSMIN methodological quality ratings and results for reliability and measurement error for each functional performance test

One study63 evaluated inter-rater reliability of the 6 m Timed Hop Test in individuals who had an ACL reconstruction. Raters included an experienced rehabilitation specialist and a strength and conditioning coach with substantial experience (>5 years) in the test procedures (online supplemental appendix 4). The outcome of interest was the LSI score, dichotomised into pass (≥90%) or fail (<90%). No inter-rater agreement was found for rating a participant as fail (k −0.11) (table 3).

Three studies63 64 67 evaluated measurement error in individuals who had an ACL reconstruction. The standard error of measurement (SEM) was 2.41% and 3.49% for the Single Leg Hop,64 67 5.59% for the 6 m Timed Hop,64 5.28% for the Crossover Hop64 and 4.32% for the Triple Hop test (table 3).64 The overall rating for all tests was indeterminate because the minimal important change (MIC) was not defined.26


Twenty-six studies evaluated construct validity (hypothesis testing).44–51 53–62 65–72 Quantitative pooling was applied to the correlations of the Single Leg Hop, 6 m Timed Hop, Crossover Hop, Triple Hop and Vertical Hop tests. Moderate to strong correlations were found with quadriceps and hamstrings strength, whereas correlations with patient-reported outcomes were weak to moderate (table 4). We based the overall rating on correlations with muscle strength outcomes and downgraded the evidence quality by one level due to inconsistency. For the Single Leg Hop Test the quality of evidence was downgraded by two levels due to a lower mean correlation coefficient with hamstrings strength (due to a negative correlation in one study).70 Individual study methodological quality and results are presented in online supplemental appendix 5.

Table 4

Meta-analyses of hop tests for construct validity

Sufficient construct validity was found for seven of 19 functional performance tests assessed and summarised in qualitative synthesis: Medial Hop,48 Lateral Hop,48 Single Leg Drop Jump,44 61 4-jump Vertical Hop,57 Vertical Jump,57 Step-Down66 and Carioca55 56 tests. In contrast, the Timed Speedy Hop Test,48 10 s Vertical Hop Test62 and Star Excursion Balance Test (anterior, posteromedial, and posterolateral)47 were rated insufficient. The following tests had inconsistent construct validity: Shuttle Run Test,55 56 Side-Step Test,55 56 Co-Contraction Test56 and Y-balance Test (anterior,60 66 posterolateral,60 posteromedial60 and composite score).60 Individual study methodological quality and results are presented in online supplemental appendix 6.

One study73 of adequate methodological quality evaluated structural validity of a test battery consisting of five hop tests. The factor analysis divided the five tests into two factors: maximal hop tests (Vertical Hop, Single Leg Hop and Drop Jump) and endurance hop tests (Square Hop and Side Hop). Structural validity was rated indeterminate due to inadequate reporting.


One study64 of very good methodological quality investigated responsiveness in individuals who had an ACL reconstruction. Correlations were evaluated between change in LSI scores and change in the Lower Extremity Functional Scale74 following a 6-week rehabilitation programme. Sufficient rating was found for the Single Leg Hop and Crossover Hop tests (r=0.37 and r=0.41, respectively). In contrast, the 6 m Timed Hop (r=0.28) and Triple Hop (r=0.26) tests were rated insufficiently responsive.64
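
The responsiveness ratings above follow from checking each reported correlation against the hypothesised range of 0.30 to 0.50. A minimal Python sketch of that check is given below. Treating the range as inclusive at both ends is an assumption, as are the function and variable names.

```python
def hypothesis_confirmed(r, lower=0.30, upper=0.50):
    """True if the correlation falls inside the hypothesised range.

    Assumption: the range is inclusive at both ends.
    """
    return lower <= r <= upper


# Correlations between change in LSI and change in the
# Lower Extremity Functional Scale, as reported above.
reported = {
    "Single Leg Hop": 0.37,
    "Crossover Hop": 0.41,
    "6 m Timed Hop": 0.28,
    "Triple Hop": 0.26,
}

ratings = {
    test: "+" if hypothesis_confirmed(r) else "-"
    for test, r in reported.items()
}
print(ratings)
```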

Summary of findings

A summary of the measurement property quality of evidence for each functional performance test is shown in table 5 and Summary of Findings Tables are presented in online supplemental appendix 7. For 21 of the 26 included functional performance tests, only 1 measurement property was evaluated. Sufficient rating for three measurement properties was only found for the Single Leg Hop and Crossover Hop tests, namely intrarater reliability (high and moderate quality of evidence, respectively), construct validity (low and moderate quality of evidence, respectively) and responsiveness (low quality of evidence).

Table 5

Quality of measurement property evidence by functional performance test

Synthesising data from studies published prior to 2000 otherwise meeting our inclusion criteria did not change our results (online supplemental appendix 8).35–43 The only exception was the quality of evidence for intrarater reliability of the Vertical Hop Test, which would not have been downgraded for imprecision due to the inclusion of one study with 15 individuals. However, this study used a vertical jump apparatus for measuring jump height,40 in contrast to the included studies which used a force plate or contact mat.52 58

Discussion


This review systematically summarised 30 studies evaluating the measurement properties of 26 functional performance tests according to recently updated COSMIN methodology. All identified studies included individuals who had torn their ACL. Our data synthesis showed a general lack of high-quality evidence for the measurement properties of functional performance tests for use in an ACL injured population. Only the Single Leg Hop and Crossover Hop tests demonstrated sufficient intrarater reliability, construct validity and responsiveness.

Reliability and measurement error

High-quality evidence for sufficient intrarater reliability was found for the Single Leg Hop Test,58 64 67 meaning we are very confident that the true measurement property lies close to the summarised estimate (ICCs 0.92–0.94). For the five other hop tests investigated, the quality of evidence was downgraded to moderate (6 m Timed Hop,52 64 Crossover Hop52 64 and Vertical Hop)52 58 or very low (Triple Hop64 and Stair Hop)52 due to only one or two small studies of adequate methodological quality evaluating the performance test. While these tests are reliable for use in groups (ICC values>0.70), ICC values in excess of 0.90 are recommended for application in individuals.10 For the Single Leg Hop Test, ICCs above the 0.90 cut-off were consistently found across three studies, indicating sufficient intrarater reliability for use in individuals. Slightly lower ICC values (from 0.82 to 0.88) were reported for the 6 m Timed Hop, Crossover Hop and Triple Hop tests in one study.64 Recommended strategies to improve these measures’ reliability include training raters in the measurement procedures, patient familiarisation, and averaging repeated measures (eg, three measurements per rater).10

Ratings for intrarater reliability were based on studies using the LSI score as the outcome.52 58 64 67 The LSI provides an important measure of performance of the involved limb in relation to the uninvolved limb and is easy to understand for clinicians and patients. LSI values may also be more consistent than absolute values as they account for the learning effect in both limbs.64 However, caution is required when interpreting changes in hop performance using LSI scores as bilateral deficits may be present.75

One study63 compared consistency of pass (≥90% LSI) and fail (<90% LSI) decisions between two raters for the 6 m Timed Hop Test. Although overall agreement was acceptable (80%), the raters never agreed on rating a participant as fail (inter-rater reliability: k −0.11). While this may call into question the commonly used criteria to complete rehabilitation and subsequently return to sport, we interpret the result with caution as only four players were rated as fail.63
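
High percentage agreement can coexist with a negative kappa when one category is rare. The Python sketch below demonstrates this with a hypothetical 2×2 table. The counts are not the study's data; they were chosen only because they produce 80% agreement, four fail ratings and a kappa of about −0.11.

```python
def cohens_kappa(both_pass, a_pass_b_fail, a_fail_b_pass, both_fail):
    """Return (observed agreement, Cohen's kappa) for two raters, pass/fail."""
    n = both_pass + a_pass_b_fail + a_fail_b_pass + both_fail
    observed = (both_pass + both_fail) / n
    a_pass = (both_pass + a_pass_b_fail) / n   # rater A's pass rate
    b_pass = (both_pass + a_fail_b_pass) / n   # rater B's pass rate
    # Agreement expected by chance from the raters' marginal rates.
    expected = a_pass * b_pass + (1 - a_pass) * (1 - b_pass)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical counts: 16 joint passes, 2 + 2 discordant ratings, 0 joint fails.
agreement, kappa = cohens_kappa(16, 2, 2, 0)
print(round(agreement, 2), round(kappa, 2))
```

Because both hypothetical raters pass 90% of participants, chance agreement alone is about 82%, so an observed agreement of 80% falls slightly below chance and kappa becomes negative.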

Measurement error was rated indeterminate for the Single Leg Hop,64 67 6 m Timed Hop,63 64 Crossover Hop64 and Triple Hop64 tests because MIC thresholds have not been established. Information on the MIC is required to apply the criteria for good measurement error,19 that is, whether the smallest detectable change is lower than changes perceived as important by patients (or clinicians).10 Although the four tests were rated indeterminate, the results provide important information to clinicians and researchers on the ability to detect true changes. For the Single Leg Hop Test, the smallest detectable change was 6.7% and 9.7% (calculated based on the SEM: 1.96×√2×SEM),64 67 indicating that changes in LSI scores above 10% in individual patients represent a real change. The smallest detectable change was slightly higher for the 6 m Timed Hop (15.5%), Crossover Hop (14.6%) and Triple Hop (12.0%) tests.
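
The smallest detectable change values can be reproduced from the SEM values reported in the Results using the formula given above. The following Python sketch does this; the function and variable names are illustrative assumptions.

```python
import math


def smallest_detectable_change(sem):
    """Smallest detectable change at the individual level: 1.96 x sqrt(2) x SEM."""
    return 1.96 * math.sqrt(2) * sem


# SEM values (% LSI) as reported in the Results section of this review.
reported_sem = {
    "Single Leg Hop (first study)": 2.41,
    "Single Leg Hop (second study)": 3.49,
    "6 m Timed Hop": 5.59,
    "Crossover Hop": 5.28,
    "Triple Hop": 4.32,
}

for test, sem in reported_sem.items():
    print(f"{test}: SDC = {smallest_detectable_change(sem):.1f}%")
```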


To provide an estimate for construct validity, we compared functional performance tests with knee muscle strength and patient-reported outcomes. In line with our expectations, tests involving hopping demonstrated moderate to strong associations with quadriceps strength. However, these tests should not be considered a measure of knee muscle strength following ACL injury as muscle compensation strategies may be adopted.76 Further, in ACL reconstructed athletes it has been demonstrated that the knee only contributes 12% to the propulsive phase of the Single Leg Hop Test, and that asymmetries in knee work and kinematics (propulsion and landing phases) can exist despite between-limb hop distance symmetry.77 Hop performance tests, therefore, measure total lower extremity function rather than solely performance of the knee joint or single muscles.77 78

Pooled effect sizes showed weak to moderate correlations with patient-reported outcomes for the Single Leg Hop, 6 m Timed Hop, Crossover Hop, Triple Hop and Vertical Hop tests. These results are consistent with a previous review14 and support a clear distinction between these two types of constructs. Functional performance tests may better distinguish between pain and function and evaluate what individuals can do rather than what they perceive they can do, as assessed by patient-reported outcomes.16 79


All ratings on responsiveness were based on one study64 assessing relationships between changes in functional performance tests following a 6-week intervention period with changes in the Lower Extremity Functional Scale,74 a region-specific self-reported functional status measure. Although the evidence quality was low, the Single Leg Hop and Crossover Hop tests appear to be able to detect changes in physical function.64 We noted somewhat stronger correlations with a generic Global Rating of Change questionnaire in the same study.64 However, responsiveness concerns the ability to measure the right amount of change in the purported construct.10


The present review used the COSMIN methodology to evaluate measurement properties of functional performance tests, including the extended version of the COSMIN Risk of Bias checklist for studies on reliability and measurement error of performance-based outcome measures.20 Although the COSMIN methodology has been used in several reviews of functional performance tests,15 16 18 80 it was originally developed for reviews of patient-reported outcome measures.19 Considering the differences in instrument characteristics between performance-based tests and patient-reported outcomes, this may have affected our conclusions. As an example, if the number of hypotheses tested and the cut-off for good measurement properties varied, the results may change. The current criterion for good measurement properties for construct validity and responsiveness (75% of hypotheses confirmed) is based on consensus but is arbitrary to a certain extent.81 Furthermore, hypotheses about correlations are to some extent a best guess. A different team of reviewers would likely construct slightly different hypotheses, possibly leading to different conclusions.82

Tests using kinematic quantification methods, for example, gait analysis, were outside of the present review scope as they were considered measures of Body function in the ICF model.22 Neither did we intend to capture instruments used for predictive purposes. The COSMIN methodology was developed for instruments used for evaluative purposes and would require adaptions for predictive applications.27 The predictive validity of functional performance tests has been summarised in a recent systematic review.14 We did not include test batteries of multitask items not solely assessing Activities but acknowledge that tests combining aspects of function with impairments (eg, muscle strength) provide complementary information regarding physical function. However, they are challenging to interpret because they measure different underlying constructs.

One small study presented results on construct validity from multiple time points,55 and only data from 6 months postsurgery were included in the data analysis. Analysing data from the same participants pre-surgery would not have changed our results. Further, we only included studies with a mean sample age of injury ≤30 years to limit the inclusion of tests evaluated in middle-aged and older people with a supposedly lower physical activity level. The age criterion precluded the inclusion of one study evaluating individuals with a meniscal tear.83 Finally, studies published before year 2000 were excluded. However, our results remained essentially unchanged when synthesising data from studies published prior to year 2000 otherwise meeting our inclusion criteria.35–43

Clinical implications and future research

With the currently available evidence, it is difficult to formulate strong recommendations on which functional performance tests to choose for clinical practice. Based on the identified measurement property evidence, we recommend the Single Leg Hop and Crossover Hop tests in individuals following ACL reconstruction, as they have excellent intrarater reliability, support for validity and are potentially responsive to change. However, recent studies have questioned whether horizontal hop distance is the best metric of knee function.77 84 Given that the knee contributes greater work during a vertical hop, and that vertical hop tests are better at identifying between-limb asymmetries than horizontal hop tests,84 it may also be important to assess vertical hop performance (eg, with the Vertical Hop Test) for rehabilitation progression and return to sport readiness. Finally, it is important to point out that the other functional performance tests identified in the current review are not necessarily of ‘poor’ quality. Most of their measurement properties are unknown, and robust studies are needed to determine their suitability for use in clinical practice and research.

This review identified important areas for future research. First, better reporting standards are needed to enhance the quality and reproducibility of future research and facilitate correct administration of these tests in clinical practice. Even though slight alterations in test administration may not influence intrarater reliability, they may lead to different scores and make comparison or synthesis of study findings difficult when tests are used as an outcome in interventional studies or as return to sport criteria.85 Second, studies evaluating reliability often lack essential information on sources of variability in the measurement setting (eg, rater characteristics) and the statistical approach. Future studies should adhere to a reporting guideline to improve interpretation of their findings, such as the Guidelines for Reporting Reliability and Agreement Studies.86 Third, more knowledge on interpretability and responsiveness in particular is needed. To establish MIC values for patient-reported outcomes, the widely accepted optimal approach is the use of an external patient-reported anchor.10 However, patient-reported outcomes and functional performance tests are distinctly different concepts, and establishing a credible anchor-based MIC value is less straightforward for performance-based outcomes. Therefore, changes that are measurable (smallest detectable change) may be the most appropriate benchmark for functional performance tests. Such information can further inform an opinion-based approach with a Delphi process to determine changes considered to be minimally important for different functional performance tests.10 87
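As a minimal sketch of how a smallest detectable change could be derived from reliability data, the example below uses two commonly applied formulas: the standard error of measurement estimated as SD × √(1 − ICC), and the smallest detectable change at the individual level as 1.96 × √2 × SEM. The function names and the input values (a between-subject SD of 10 cm and an ICC of 0.91 for a hop distance) are hypothetical and are not results from this review; other estimators of the standard error of measurement exist and may be preferable depending on the study design.

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """Estimate the SEM from a between-subject SD and a reliability coefficient."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0 <= icc <= 1:
        raise ValueError("icc must lie between 0 and 1")
    return sd * math.sqrt(1 - icc)


def smallest_detectable_change(sem: float, z: float = 1.96) -> float:
    """Smallest change exceeding measurement error in an individual (95% level by default)."""
    # The factor sqrt(2) reflects that a change score combines the error of two measurements.
    return z * math.sqrt(2) * sem


if __name__ == "__main__":
    # Hypothetical hop distance data: SD = 10 cm, ICC = 0.91.
    sem = standard_error_of_measurement(sd=10.0, icc=0.91)
    sdc = smallest_detectable_change(sem)
    print(f"SEM = {sem:.1f} cm, SDC = {sdc:.1f} cm")
```

Under these assumed inputs, an individual change smaller than roughly 8 cm could not be distinguished from measurement error with 95% confidence.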


This systematic review highlighted the paucity of evidence for measurement properties of functional performance tests following ACL or meniscal injury. The Single Leg Hop was the most studied and highest rated test, followed by the Crossover Hop Test. Still, high-quality studies on responsiveness and interpretability are needed to make strong recommendations for which functional performance tests should be used in youth and young adults following knee injury.

What is already known?

  • Functional performance tests are frequently used in research and clinical practice to assess physical function following knee injury.

  • Functional performance tests complement patient-reported outcomes, but consensus on which tests have the best measurement properties and clinical relevance in individuals who have had an anterior cruciate ligament (ACL) tear or meniscal injury is lacking.

What are the new findings?

  • A wide variety of functional performance tests has been used following ACL injury, but there is a paucity of evidence about their measurement properties.

  • The Single Leg Hop Test and Crossover Hop Test are the highest rated tests for use with individuals who have had an ACL injury and reconstruction, displaying excellent intrarater reliability and support for construct validity and responsiveness.

  • The 6 m Timed Hop Test and Triple Hop Test demonstrate good intrarater reliability and support for construct validity, but insufficient responsiveness.

  • There is no published information about the measurement properties of functional performance tests for use after isolated meniscal injury.


The authors gratefully acknowledge senior librarian Marte Ødegaard (Medical library, University of Oslo) for assistance with developing search strategies and performing the searches. We further acknowledge Pætur Holm, Marienke van Middelkoop, Stephanie Filbay, and Erin Macri for contributing valuable methodological input.



  • Twitter @AnoukUrhausen, @jwhittak_physio, @agculvenor, @ewa_roos

  • Contributors JLW, AGC, EMR, KMC, CBJ and MAR contributed to the conception of the study. BB, APU, BEØ and MAR designed the study. BB and APU screened studies for inclusion, performed risk of bias assessment and data extraction. All authors contributed to the interpretation of data. BB wrote the initial draft. All authors revised the draft critically for important intellectual content and approved the final version.

  • Funding This review is part of the OPTIKNEE consensus, which has received funding from the Canadian Institutes of Health Research (OPTIKNEE principal investigator JLW #161821).

  • Disclaimer The funders had no role in any part of the study or in any decision about publication.

  • Competing interests JLW and AGC are Associate Editors of the British Journal of Sports Medicine (BJSM). JLW is an Editor with the Journal of Orthopaedic and Sports Physical Therapy. KMC is a senior advisor of BJSM, project leader of the Good Life with Osteoarthritis from Denmark (GLA:D)-Australia, a not-for-profit initiative to implement clinical guidelines in primary care, and holds a research grant from Levin Health outside the submitted work. CBJ is an Associate Editor of Osteoarthritis and Cartilage. ER is Deputy Editor of Osteoarthritis and Cartilage, developer of the Knee injury and Osteoarthritis Outcome Score (KOOS) and several other freely available patient-reported outcome measures, and founder of GLA:D. All other authors declare no competing interests.

  • Provenance and peer review Not commissioned; externally peer reviewed.