Objective To review the quality of literature and measurement properties of physical performance tests (PPTs) of the lower extremity in athletes.
Methods Using the PICOS method we established our research question as to whether individual PPTs of the lower extremity have any relationship to injury in competitive athletes ages 12 years to adult (no limit). A search strategy was constructed by combining the terms ‘lower extremity’ and synonyms for ‘performance test’ and names of performance tests with variants of the term ‘athlete’. After examining the knee in part 1 of this 2 part series, the current report focuses on findings in the rest of the lower extremity. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed and the Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) checklist was used to critique the methodological quality of each paper. A second measure was used to analyse the quality of the measurement properties of each test.
Results Thirty-one articles examined the measurement properties of 14 PPTs pertaining to the lower extremity. The terminology used to name and describe the tests and methodology by which the tests were conducted was inconsistent.
The star excursion balance test performed in three directions (anterior, posteromedial, and posterolateral) appears to be the only test to be associated with increased injury risk. There is moderate evidence that the one leg hop for distance and the hexagon hop can distinguish between normal and unstable ankles. There is also moderate evidence that the medial hop can distinguish between painful and normal hips in dancers.
Conclusions Currently, there is relatively limited research-backed information on PPTs of the lower extremity in athletes. We would suggest convening an international consortium comprised of experts in sports to standardise the descriptions and methodologies, and to set forth a research agenda to establish definitively the measurement properties of the most common PPTs.
Measures of function, especially in an athletic population where competition demands complex movements that involve multiple systems and joints, are critical for clinical decision-making.1 To more closely approximate function in sport, physical performance tests (PPTs) were developed as measures of function. A PPT is a low technology measure that can be performed by everyone from coaches to healthcare professionals to examine components of sport (strength, power, agility) through multijoint movements.2 ,3
Clinically, PPTs are used, in the lower extremity especially, after injury or surgery to judge symmetry and readiness for return to play. PPTs are also used as preseason screening examinations to discern deficiencies that may lead to injury.
However, the use of PPTs as outcome measures and prognostic tools has at least two major issues that are debated.
The first issue is that the usefulness of PPTs is not clear. For example, some authors have reported that PPTs can distinguish a deficient lower extremity from a normal lower extremity4 ,5 whereas others dispute that claim.6 ,7 The same contradictory findings exist with regard to the ability of PPTs to predict injury.8 ,9
The second issue is that PPTs should be reliable, valid and responsive and have acceptable measurement error (agreement) if they are to be clinically useful. For example, using a PPT as a preseason screen in an attempt to predict injury is futile if that PPT lacks the necessary criterion or predictive ability. As another example, a PPT loses meaning as an outcome measure to track progress throughout rehabilitation unless the agreement or the minimally important change (MIC) and the minimal detectable change (MDC) are known. For more on critical properties of tests in sports medicine see Davidson and Keating.10
Our goal was to produce a series of manuscripts that summarised the PPTs of the lower extremities, and that examined the methodological quality of the current research and the quality of measurement properties of each PPT. In our previous paper, we focused on the knee.3 We present the findings from the remaining anatomical regions of the lower extremity.
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed.11 ,12 Our research question, ‘Do individual PPTs of the lower extremity have any relation to injury in athletes of age 12 years and older?’, was framed using the PICOS method.
In addition, we were interested in the available research reporting on the measurement qualities of PPTs. PPTs were defined as single, low technology tests that attempt to measure constructs related to sport (strength, power and agility); lower extremity was defined as the region spanning from the hip proximally to the phalanges of the foot distally; and athlete as those participating in sports at Tegner level 5 or above.13 Level 5 and above athletes include a range of competitive sports from cycling and cross-country skiing to soccer, football, and rugby. When Tegner level was not specified, we accepted studies of intramural or recreational athletes as appropriate for inclusion. Articles were excluded if authors examined the utility of a combination of PPTs; if single PPTs were used but results were measured with equipment that was either expensive or not readily and widely available to the average examiner such as force plates, motion capture cameras and timing gates; if the PPTs examined impairment-level data like pain and range of motion; if the PPTs examined tasks not related to the lower extremity; or if participants were involved in Tegner activity levels 4 or below. We also did not include studies where Tegner level 5 participants comprised less than 50% of the population. Because of our interest in the measurement properties of PPTs, we accepted studies of healthy athletes.
A search strategy (see online supplementary appendix A) was formulated using terms for sport, athletics, athletes and injuries and combining results with terms that captured tests, performance, and components of performance like strength, power, endurance, agility and function. This search was applied to three databases: PubMed, CINAHL and SportDiscus. Results of searches were limited to articles (not abstracts or posters) written in the English language about humans. In addition, the ‘Clinical Queries’ option in PubMed was used to attempt to find systematic reviews and other articles missed by our search strategy, and the personal collection of one author (EJH) was reviewed for pertinent articles. Finally, the reference lists of the systematic reviews and of all of our final articles were searched for pertinent resources.
Inclusion and exclusion criteria were applied as two authors (EJH and CB) first read all titles and abstracts. Next, all articles were read in full by one author (EJH) in combination with one of the other authors depending on their area of expertise. If the two evaluating authors were in disagreement about either inclusion or exclusion, a third author resolved the dispute.
Data extraction, summary and analysis of quality
Data were extracted by one author (EJH) with oversight by the rest of the research team. We then summarised the data by first gathering the names and methodologies of the PPTs to examine them for consistency (see online supplementary table 1), producing a summary of all studies (see online supplementary table 2), a summary of the methodological quality of the included studies (table 1), and finally, a best evidence synthesis for each PPT (table 2). We followed previously published methodology2 ,14 ,15 in producing this best evidence synthesis; the data on which tables 1 and 2 were based are presented in appendices B and C.
The methodological quality of each included article was critiqued using the 4-point (poor, fair, good and excellent) scoring system Consensus-based Standards for the selection of health Measurement Instruments (COSMIN) checklist,14 and the quality of the measurement properties was critiqued using the adapted quality tool of Terwee et al.16 In assessing the methodological quality of each accepted article, sample size was not taken into account. However, sample size did factor into the best evidence synthesis. Data were summarised in the best evidence synthesis using a scoring of ‘unknown’, ‘strong’, ‘moderate’, ‘limited’, or ‘conflicting’. The scoring method was defined as follows:16
Unknown—investigated in studies of exclusively poor methodology or not investigated in any study
Strong—multiple studies of good methodological rating or at least 1 study of excellent methodology
Moderate—multiple fair methodological studies or 1 study of good methodology
Limited—one study of fair methodological quality
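The scoring rules listed above can be sketched in code. This is only an illustration under stated assumptions: the function name `evidence_level` and the input format (a list of methodological ratings for the studies that address one measurement property of one PPT) are hypothetical, and the ‘conflicting’ level is not modelled because it depends on the direction of the findings rather than on the ratings alone.

```python
from collections import Counter


def evidence_level(ratings):
    """Map methodological ratings to a best-evidence level.

    ratings: list of 'poor', 'fair', 'good' or 'excellent', one per study
    (hypothetical input format, not taken from the review itself).
    """
    # Studies of poor methodology do not contribute to the synthesis.
    counts = Counter(r for r in ratings if r != "poor")
    if not counts:
        # No studies at all, or exclusively poor studies.
        return "unknown"
    if counts["excellent"] >= 1 or counts["good"] >= 2:
        return "strong"
    if counts["good"] == 1 or counts["fair"] >= 2:
        return "moderate"
    # What remains is a single study of fair quality.
    return "limited"


print(evidence_level(["excellent"]))     # strong
print(evidence_level(["poor", "fair"]))  # limited
```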
The quality of the measurement properties of each PPT was graded as ‘positive’, ‘indeterminate’, or ‘negative’. Owing to the volume of data accumulated from this process, we published the results in two parts: part 1 (the knee) and part 2 (the rest of the lower extremity). The results presented hereafter represent our findings for the lower extremity (knee excluded).
Included studies, tests, and testing procedures
Of the 169 articles read in full and the 60 that were appropriate for final analysis, 31 examined PPTs of the lower extremity in athletes (figure 1). The one leg hop for distance8 ,17–23 and vertical jump8 ,18 ,24–29 were most often studied. These two PPTs were followed in descending order of frequency by: the star excursion balance,4 ,5 ,9 ,30 ,31 shuttle run,6 ,24 ,32–34 6-meter timed hop,20–22 ,32 triple hop,20–22 ,35 40-yard sprint,25 ,34 ,36 triple crossover hop for distance,6 ,17 ,22 6-meter timed crossover hop,7 ,20 ,21 T-agility,36 ,37 hexagon hop,8 ,23 medial hop7 ,38 and the lateral hop.7 ,38 The PPTs and their properties were equally studied in healthy populations (47% of the studies) and injured populations (53% of studies). In the studies that focused on the injured, the area of injury was most often the ankle followed by the entire lower extremity, the hip, and the hamstrings muscle group.
Of the PPTs studied by more than one group of authors (see online supplementary table 1), the name given to the test and the methodology by which it was performed and scored varied greatly. As an example, the vertical jump is also called a vertical leap, a single leg vertical, a depth vertical, a run-up jump and a vertical jump on two legs. The warm-up ranged from non-existent to jogging to practice trials to a dynamic warm-up. The PPT itself was described as taking off two feet, running and jumping off of one foot and dropping 22 cm before a maximum vertical leap. The final scoring also varied from the best of two or three trials to the mean of three trials to as many trials as needed until a plateau in performance was reached.
Summary of the methodological quality of included studies
Of the 14 tests that were studied by more than one author and had similar methodologies, all but the T-agility test and the multistage fitness or ‘beep’ test had reported reliability (table 1). Unfortunately, the methodological quality of the studies was generally poor because only one measure of reliability was examined. Exceptions to this trend of poor methodological ratings were the medial and lateral hop tests where one group38 had good methodological quality in a healthy subject pool.
Three of the 14 PPTs were studied for agreement: the 6-meter timed crossover hop, the medial hop test, and the lateral hop test. The one study7 to examine these three PPTs was found to have poor methodological quality because only one measurement, the MDC, was reported. Reporting on the MIC in addition to the MDC would have improved the methodological rating to good.
Hypothesis testing/construct validity
The methodological quality for the construct validity of the 6-meter timed crossover hop, hexagon hop, medial hop and lateral hop was rated good in each case. Of these tests, the medial hop test demonstrated the ability to detect a difference in a painful compared to a non-painful hip, and the hexagon hop, the ability to differentiate a lax from a stable ankle in the same participant or between participants. Two other tests, the star excursion balance test (SEBT) and the single leg hop, had a range of ratings from poor to good. For the star excursion balance test, the study of good methodological quality5 showed this PPT to differentiate a chronically unstable ankle from a normal ankle, within the same participant and between participants. For the single leg hop, one study23 of good methodological quality demonstrated the ability of this test to differentiate between subjects with ankle laxity and those without. Finally, there were no studies examining the construct validity of the 40-yard sprint, the vertical jump or the T-agility test.
In contrast to the other statistical properties, criterion validity of some PPTs was examined in studies of excellent methodological quality. Poor performance (less than 94% of the opposite limb for total reach distance) on the SEBT was associated with a three-fold increase in lower extremity injury risk, and an anterior reach difference of greater than 4 cm was associated with a 2.7-fold increase in injury risk.9 The vertical jump, single leg hop and beep tests were examined in the same study18 and none was found to be a risk factor for lower extremity injury.
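As a rough illustration of how the two SEBT cut-offs reported above might be applied to an athlete's scores, the sketch below flags each criterion separately. The function name, the argument names, and the assumption that the composite score has already been expressed as a percentage are hypothetical. The thresholds (below 94%, and a difference greater than 4 cm) are taken from the sentence above, and the handling of values exactly at a threshold is an assumption.

```python
def sebt_risk_flags(composite_pct, anterior_left_cm, anterior_right_cm):
    """Flag the two SEBT findings associated with increased injury risk.

    composite_pct: composite reach score, already expressed as a percentage
    (how it is normalised is outside the scope of this sketch).
    """
    anterior_diff_cm = abs(anterior_left_cm - anterior_right_cm)
    return {
        # Composite reach below 94% (reported three-fold increase in risk).
        "low_composite": composite_pct < 94.0,
        # Side-to-side anterior reach difference above 4 cm (reported 2.7-fold).
        "anterior_asymmetry": anterior_diff_cm > 4.0,
    }


flags = sebt_risk_flags(composite_pct=92.0, anterior_left_cm=70.0, anterior_right_cm=65.0)
print(flags)
```

Such a flag only restates an association found in one study; it is not a diagnostic rule.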
With regard to the responsiveness of PPTs of the lower extremity, the vertical jump and beep tests were the only tests to be examined in studies of good or excellent methodological quality.27 ,28 In both of these studies the vertical jump improved with sport-specific training in female volleyball players and male rugby players. The beep test also improved with sport-specific training. In two studies of healthy athletes, one of good quality36 and one of fair quality,37 the T-agility test was unaffected by bracing of ankles and was significantly improved after 6 weeks of plyometric training, respectively. In two studies of fair methodological quality,19 ,20 athletes who received, in one case, a hamstring stretching programme, and in another case, a 6-week isokinetic strengthening programme, showed improvement in the single hop and triple hop.
Summary of the quality of the measurement properties
Reliability was graded positive for all 14 PPTs except for the hexagon hop test, which was found to have an intraclass correlation (ICC) of 0.64.8 The 6-meter timed hop was found reliable in two20 ,21 of three32 articles.
Of the 31 articles included in this review, only one group of authors7 investigated the measurement error of just 3 PPTs. The 6-meter timed crossover hop had an MDC of 0.42 s, the medial hop had an MDC of 20.81 cm, and the lateral hop had an MDC of 22.62 cm. However, because the MIC was not calculated, the grade assigned was indeterminate.
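For readers unfamiliar with the MDC, the sketch below shows one commonly used way of deriving it from a test's standard deviation and reliability coefficient, and of checking an observed change against it. The review does not state which formula the cited study used, so the formula here (standard error of measurement multiplied by 1.96 and by the square root of 2, giving a 95% confidence MDC) is an assumption, as are the function names and the example standard deviation and ICC.

```python
import math


def sem(sd, icc):
    """Standard error of measurement from a sample SD and a reliability coefficient."""
    return sd * math.sqrt(1.0 - icc)


def mdc95(sd, icc):
    """Minimal detectable change at 95% confidence (one common formula)."""
    return 1.96 * math.sqrt(2.0) * sem(sd, icc)


def exceeds_mdc(change, mdc):
    """True if an observed change is larger than measurement error alone would explain."""
    return abs(change) > mdc


# Hypothetical inputs: SD of 10 cm and ICC of 0.75.
print(round(mdc95(10.0, 0.75), 2))

# Using the medial hop MDC reported above (20.81 cm):
print(exceeds_mdc(25.0, 20.81))  # a 25 cm change exceeds the MDC
print(exceeds_mdc(15.0, 20.81))  # a 15 cm change does not
```

A change that exceeds the MDC is detectable, but without the MIC it remains unclear whether that change matters to the athlete, which is why the grade above was indeterminate.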
Hypothesis testing/construct validity
The quality rating for the measurement property construct validity is positive almost without exception. The dominant themes are that our studied PPTs correlate well with return to activity, and are able to differentiate between an injured lower extremity and a healthy one: within an athlete and between athletes. Exceptions to these chief themes are the shuttle run and the triple crossover hop for distance which may not be able to distinguish between an injured and uninjured ankle,6 and the 6-meter timed crossover hop and lateral hop which may not be able to distinguish between a painful and non-painful hip within the same athlete.7
The quality rating for criterion validity was negative for almost all PPTs in this review, largely because of the inability of the tests to predict injury.8 ,18 Also, the medial and lateral hop tests demonstrated no correlation with isokinetic testing results.7 The exception was the modified SEBT which had a positive rating based on the association of deficits from side-to-side with increased risk of lower extremity injury.9
The quality of the responsiveness of the PPTs was generally positive due largely to the fact that their performance decreased with restrictive braces or tape and with ice application, and increased with sport-specific training27 and plyometrics37 (but the effect of combining these two interventions may not be cumulative).28
Best evidence synthesis by PPT
The best evidence synthesis (table 2) combines considerations from the methodological quality of the included articles and the quality of the measurement properties of each PPT. The additional scrutiny provided at this highest level summary is in the areas of sample size and methodological quality. Studies of sample size 30 or less, or that had a ‘poor’ methodological quality rating, were eliminated as we pooled the results from all articles for each PPT.
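The elimination step described above can be expressed as a simple filter. This is a minimal sketch: the dictionary keys `n` and `quality` and the example records are hypothetical and do not correspond to specific included studies.

```python
def eligible_for_synthesis(studies):
    """Keep studies with more than 30 participants and a rating better than 'poor'."""
    return [s for s in studies if s["n"] > 30 and s["quality"] != "poor"]


# Hypothetical study records for one PPT.
studies = [
    {"id": "A", "n": 20, "quality": "good"},  # dropped: sample size 30 or less
    {"id": "B", "n": 45, "quality": "poor"},  # dropped: poor methodology
    {"id": "C", "n": 60, "quality": "fair"},  # kept
]

print([s["id"] for s in eligible_for_synthesis(studies)])  # ['C']
```

This is why a measurement property can be rated ‘unknown’ in the synthesis even when studies of good methodological quality exist.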
Star excursion balance test
The reliability of the SEBT was examined in two studies of good methodological quality30 ,31 but small sample sizes caused them to be eliminated from consideration, leaving the reliability of the SEBT unknown. Also in question are the responsiveness and the agreement of the SEBT since, to the best of our knowledge, no studies of fair or better quality involving athletes have been conducted. However, there is moderate evidence that the SEBT can detect differences between unstable and normal ankles within and between participants.5 There is strong evidence that the modified 3-direction SEBT can predict injury. Both a composite reach score difference of less than 94% and an anterior reach difference of 4 cm or greater are associated with increased injury risk.9
Sprint test: 40 yards
There was one study of better than poor quality and this study examined responsiveness of the 40-yard sprint.36 However, the study examined only 20 healthy individuals and so the evidence in total for the 40-yard dash leaves us to conclude that we know nothing about the measurement properties of the 40-yard dash in athletes.
The shuttle run, based on one study of fair methodological quality,33 has limited and conflicting evidence with regard to construct validity. College freshmen with previous injury have slower times than freshmen without a prior injury. This relationship did not hold true for sophomores, juniors or seniors. Because of sample size and methodological quality issues, the reliability, agreement, criterion validity and responsiveness of the 20 m shuttle run are unknown.
There is no evidence as to the reliability, agreement or construct validity of the vertical leap. There is strong evidence based on one excellent study that the vertical jump does not predict injury in female soccer players.18 There is also strong but conflicting evidence that the vertical jump is responsive. The vertical jump increases with sport-specific training in female volleyball players27 but does not respond to plyometric training beyond sport-specific training in male rugby players.28
One leg hop for distance
There is moderate evidence of the construct (discriminant) validity of the one leg hop for distance which provides different results between athletes who have ankle instability and those who do not.23 There is strong evidence that this PPT is not a predictor of injury based on one high quality study.18 There is limited evidence of the responsiveness of the one leg hop for distance. This PPT does not change in athletes with functional ankle instability with changes in isokinetic strength.20 There is no evidence as to the reliability or agreement of the one leg hop.
One leg hop for distance: 3 hops/triple hop
Similar to the one leg hop for distance test, there is limited evidence that the triple hop is not responsive to a change in isokinetic strength of the ankle.20 There is no evidence of the reliability, agreement, construct validity or criterion validity of the triple hop.
Triple crossover hop for distance
There is no evidence for the use of this PPT in the lower extremity with an athletic population.
Six-meter timed hop
There is no evidence for the use of this PPT in the lower extremity with an athletic population.
Six-meter timed crossover hop
This PPT, a combination of the triple crossover hop and the 6-meter timed hop, has moderate evidence that it cannot discriminate between a painful and non-painful hip in dancers.7 There is limited evidence that changes in the 6-meter timed crossover hop do not correlate with changes in isokinetic strength.20 There is no evidence as to the reliability, agreement or criterion validity of this PPT.
There is no evidence of the reliability, agreement, criterion validity or responsiveness of the hexagon hop test, but there is moderate evidence of the ability of this PPT to discriminate between military athletes with ankle instability and those without.
Medial hop test
The medial hop test has moderate evidence to support that this PPT can discriminate between a painful and non-painful hip in dancers.7
Lateral hop test
The lateral hop test has moderate evidence that it cannot discriminate between a painful and non-painful hip in dancers.7
Multistage fitness or beep test
Although there is no evidence with regard to the reliability, agreement or construct validity of the beep test, there is moderate evidence that the test is responsive to sport-specific training.27 There is also strong evidence that the beep test cannot predict injury.18
PPTs are used by coaches, strength and conditioning experts, and healthcare professionals to estimate function, gauge progress after surgery or injury, predict which athletes are at a greater risk for injury, and also in the return to play decision. We evaluated 31 studies pertaining to 14 PPTs of the lower extremity in athletes.
How do our findings add value to the sports medicine community?
The naming of PPTs, how they are performed, their warm-up, and their final scoring vary enough to cause confusion and to limit generalisability. We found similar problems in our review of PPTs in the knee.3 Thus, a primary conclusion from our study is to call for an international consortium to develop consistency in terminology and methodology of commonly used PPTs.
The most frequent rating of study methods was ‘poor’. The most immediate and achievable need is for adequately powered studies that examine the validity and both the intra-rater and inter-rater reliability of PPTs.
There are gaps in the current knowledge base about PPTs for athletes that might make these tests unhelpful for practical use. Tests that lack validity should be dropped.10 The National Football League (NFL) uses the vertical jump and 40-yard sprint as part of the NFL combine/predraft testing, and the National Basketball Association (NBA) uses the vertical jump as part of the NBA combine/predraft testing. These tests lack proven validity. The vertical leap is unable to predict injury and there is conflicting evidence of its responsiveness, casting doubt on the ability of the vertical leap to provide valuable information as an outcome measure. Some contend that the goal of the combine tests is to predict performance but the NFL combine tests do not appear to possess this ability either.39
What we appear to know currently about PPTs of the lower extremity for athletes is:
There is strong evidence that the vertical leap and single leg hop are NOT predictors of injury.
The SEBT performed in three directions (anterior, posteromedial, and posterolateral) appears to be the only test associated with increased injury risk, specifically when the normalised composite right reach distance is 94% or less or the anterior right/left reach distance difference is 4 cm or more.
There is moderate evidence that the one leg hop for distance provides different results between athletes who have ankle instability and those who do not, which strengthens the argument for this test as an outcome measure in the rehabilitation of athletes with ankle instability.
There is moderate evidence that the 6-meter timed crossover hop has no ability to discern between a painful and non-painful hip in dancers.
There is moderate evidence that the hexagon hop can distinguish between normal and unstable ankles in military academy athletes.
There is moderate evidence that the medial hop can distinguish between painful and normal hips in dancers.
There is strong evidence that the beep test has no ability to predict injury and moderate evidence that this test is responsive to sport-specific training.
The current body of knowledge should leave the clinician-scientist sceptical about the use of these tests for preseason screening, as predictors of injury, and as outcome measures after injury or surgery. There is both an opportunity, and an urgent need, for further research on all of these tests.
As with any systematic review, we were limited in our findings by the quantity and quality of the original articles. There is very limited information about the use of PPTs in patients with hip and thigh pathologies. Because many of the original articles were of a small sample size, much of the information gained in examining methodological quality was lost in the production of the synthesis of evidence. We may have overlooked some articles because we excluded those not written in English, and because there is no accepted search strategy for PPTs. Further, focus on individual PPTs that required little technology eliminated the examination of clusters of PPTs and those tests that take advantage of more advanced technology like three-dimensional motion capture and force plates. However, this limitation increased the generalisability and clinical utility of our findings. Finally, our use of the COSMIN as the tool to judge methodological quality, a vital component of the overall evidence synthesis, can be questioned since the measurement properties of the COSMIN itself are not well understood.15
PPTs of the lower extremity
There is a plethora of PPTs used as assessments of function, measures of symmetry or in an effort to predict which athletes might become injured. In the lower extremity, only the modified SEBT has the ability to predict injury in high school basketball players. There is moderate evidence that the one leg hop for distance and the hexagon hop tests provide different results between athletes who have ankle instability and those who do not. Also, there is moderate evidence that the medial hop can distinguish between painful and normal hips in dancers. Finally, there is moderate evidence that the beep test is responsive to sport-specific training.
PPTs of the knee
From part 1 of our systematic review, the only finding of moderate evidence or better is that the one leg hop for distance is responsive in that test results improve as rehabilitation after anterior cruciate ligament (ACL) reconstruction progresses, strengthening the use of this PPT as an outcome measure.
In our review and summary of 60 articles pertaining to the lower extremity in athletes, test naming, description and methodology are, at best, variant and, at worst, thoroughly confusing. The one leg hop for distance was the single test of use at the knee and ankle since it is responsive to rehabilitation after ACL reconstruction and discriminant in cases of ankle instability. Further, only one test, the modified SEBT, has strong evidence of the ability to predict injury in the lower extremity.9 Finally, only the medial hop has shown utility at the hip, a vastly understudied region.
We call for an international consortium comprised of experts in sports to standardise the descriptions and methodologies of PPTs, and pursue a research agenda (table 3) to establish the psychometric properties of the most common PPTs so that healthcare professionals, coaches, trainers and sporting organisations can discover whether these tests may be used with confidence as measures of function, as outcome measures, or as predictive factors, or whether—alternatively—they are simply a waste of time and resources.
What are new findings
There are 14 physical performance tests (PPTs) pertinent to the lower extremity and 6 to the knee that have been substantially studied so that we have some idea of their measurement properties in an athletic population.
The naming and methodology of PPTs in the entire lower extremity are not consistent.
The one leg hop for distance was the single test of use at the knee and ankle since it is responsive to rehabilitation after anterior cruciate ligament reconstruction and discriminant in cases of ankle instability.
Only one test, the modified star excursion balance test (SEBT), has shown strong evidence of the ability to predict injury in the lower extremity.
The hip region is understudied. Only the medial hop has shown utility at the hip where this test can discriminate between a painful and non-painful hip in dancers.
How might it impact on clinical practice
Caution is urged in making any firm clinical conclusions based on the results of PPTs when testing the lower extremity and in deciding whether an observed change in these outcome measures is meaningful in athletes.
The medial hop appears able to discriminate between a painful and non-painful hip and may have utility as an outcome measure.
Poor performance on the modified SEBT seems to predict injury based on the results of one study.
The authors would like to acknowledge Ms Connie Schardt, MLS, AHIP and FMLA for her assistance with search strategies.
Contributors EJH, SMM, and CB planned the study, reviewed the citations, examined the articles for quality, and edited the final manuscript. DB and CEC examined articles for quality and edited the manuscript. EJH wrote the manuscript.
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement The authors are happy to share data on receipt of a written request to the corresponding author.