Background Several physical assessment protocols have been described that aim to identify intrinsic, movement-related risk factors for injury. The Functional Movement Screen (FMS) is a standardised, field-expedient test battery intended to assess movement quality; it has been used clinically, in preparticipation screening and in sports injury research.
Aim To critically appraise and summarise research investigating the reliability of scores obtained using the FMS battery.
Study design Systematic literature review.
Methods Systematic search of Google Scholar, Scopus (including ScienceDirect and PubMed), EBSCO (including Academic Search Complete, AMED, CINAHL, Health Source: Nursing/Academic Edition), MEDLINE and SPORTDiscus. Studies meeting eligibility criteria were assessed by 2 reviewers for risk of bias using the Quality Appraisal of Reliability Studies checklist. Overall quality of evidence was determined using van Tulder's levels of evidence approach.
Results 12 studies were appraised. Overall, there was a ‘moderate’ level of evidence in favour of ‘acceptable’ (intraclass correlation coefficient ≥0.6) inter-rater and intra-rater reliability for composite scores derived from live scoring. For inter-rater reliability of composite scores derived from video recordings there was ‘conflicting’ evidence, and ‘limited’ evidence for intra-rater reliability. For inter-rater reliability based on live scoring of individual subtests there was ‘moderate’ evidence of ‘acceptable’ reliability (κ≥0.4) for 4 subtests (Deep Squat, Shoulder Mobility, Active Straight-leg Raise, Trunk Stability Push-up) and ‘conflicting’ evidence for the remaining 3 (Hurdle Step, In-line Lunge, Rotary Stability).
Conclusions This review found ‘moderate’ evidence that raters can achieve acceptable levels of inter-rater and intra-rater reliability of composite FMS scores when using live ratings. Overall, there were few high-quality studies, and the quality of several studies was impacted by poor study reporting, particularly in relation to rater blinding.
The purpose of screening in sports injury prevention is to identify the presence of risk factors that may predispose a person to injury and require further, more detailed, assessment. Screening is, therefore, of particular interest to injury researchers,1 physical therapists, coaches, strength and conditioning specialists, and sports medicine practitioners.2,3 The Functional Movement Screen (FMS), as described by Cook et al,4,5 is a movement screening test battery intended to provide a clinically interpretable measure of ‘movement quality’,6–8 using visual assessment of seven active movement tasks and three clearing tests with standardised scoring criteria. The FMS appears to be gaining international acceptance as an injury risk screening measure,2 and has also been incorporated into other screening batteries.9 To date, there is emergent evidence from cohort studies of an association between FMS scores and injury risk in several populations, including American football players,10,11 university athletes,12 military personnel13,14 and firefighters.15 However, the use of the FMS as a predictor variable in studies of injury risk, or in clinical practice, should be predicated on acceptable psychometric properties. For any given clinical measure, the limits of validity are constrained by reliability;16 reliability is therefore a prerequisite for both research and clinical application.17 In particular, field-expedient measures such as the FMS battery are often administered by different raters and at different time points, so it is necessary to demonstrate acceptable reliability both within and between raters, as well as within and between sessions.1 To date, there has been no systematic evaluation of the overall quality of these reliability studies that would allow clinicians and researchers to make informed decisions about the potential use of these movement quality test measures.
Therefore, the aim of this systematic review was to critically appraise and summarise research investigating inter-rater and intra-rater reliability of FMS scores.
A systematic review of the literature was undertaken taking into consideration the reporting requirements of the PRISMA statement.18 The focus of the review was on the reliability of scores obtained by visual ratings of movement quality using the FMS battery as described by Cook et al.4,5 The battery consists of seven test movements (Deep Squat, Hurdle Step, In-line Lunge, Shoulder Mobility, Active Straight-leg Raise, Trunk Stability Push-up and Rotary Stability) and three additional pain provocation ‘clearing tests’ (Shoulder Impingement, Spinal Extension and Spinal Flexion).4,5
The search strategy and syntax (see figure 1) were developed in consultation with a specialist librarian with an initial search undertaken on 1 December 2013 and repeated at regular intervals until 4 February 2015. The reference lists of retrieved full-text articles were hand-searched, and the citation history of each full-text article reviewed using citation tracking (Scopus, Elsevier B.V.) to identify any additional records.
One author (RWM) conducted all database searches and undertook preliminary screening of search results based on article title and abstract. All articles that made reference to ‘functional movement screen*’ or ‘FMS’ in the title or abstract were saved using reference management software and duplicates were removed. Titles and abstracts of each identified article were independently considered by two authors (RWM and AGS), and a composite list of articles that satisfied the following eligibility criteria was compiled: (1) the primary aim was to investigate inter-rater or intra-rater reliability of the FMS; (2) reliability data were derived from visual assessment of live or video recordings; and (3) article was published in English. There were no limits imposed on vocation, academic or professional qualifications, or level of clinical experience of raters; or on the characteristics of the participant sample. The full text of each article was retrieved and then independently reviewed by two of the authors (RWM and AGS) using the eligibility criteria. Articles that met all three criteria were selected for quality appraisal. A third author (SJS) was available to resolve disagreement about eligibility or selection. Reliability data reported in conference abstracts, or in methods sections of studies in which reliability was not the primary aim, were not eligible for appraisal because of the high likelihood of insufficient methodological detail being available to permit a robust appraisal.
Assessment for risk of bias was undertaken using the Quality Appraisal for Reliability Studies (QAREL) checklist.19 The 11-item QAREL checklist has previously demonstrated good test-retest reliability,20 and has been used in recent systematic reviews of rater reliability.21–23 Each QAREL item is equally weighted and scored as ‘Yes’, ‘No’ or ‘Unclear’. Before appraisal started, each QAREL item was operationally defined within the context of the FMS (see online supplementary materials S1).
Two reviewers (RWM and KMM) piloted the QAREL extraction template and checklist, and discussed how each item would be interpreted based on the operational definitions. One QAREL item (item 5 ‘blinding to results of accepted reference standard’) was excluded from appraisal because an accepted reference standard for the FMS does not exist. Reviewers met to compare findings after independently appraising batches of 1–3 articles. Disagreements in scoring between reviewers were resolved by consensus following further consideration of the operational definitions. A third reviewer (AGS) was available to resolve disagreement about quality appraisal.
Summary statistics for reliability of scores obtained from FMS ratings were extracted from appraised studies at two levels: (A) coefficients of agreement (κ or a similar statistic) for categorical data (the 4-point ordinal scale, 0–3, for each of the 7 FMS subtests); and (B) coefficients of agreement (intraclass correlation coefficient (ICC)) for overall FMS composite scores (the sum of scores from each of the 7 subtests). ‘Acceptable reliability’ was operationally defined as ≥0.4 for κ and ≥0.6 for ICC. A κ≥0.4 corresponds to at least ‘moderate’ agreement,24 and ICC≥0.6 has been defined as the minimum useful level of agreement.25 When extracting reliability values, the lower limit of the CI was employed;26 however, if CIs were not reported, the value of the reliability coefficient was extracted. Further to the appraisal of individual studies, we interpreted the overall quality of evidence across all appraised studies using an approach similar to that employed in other recent systematic reviews of rater reliability,22,23,27 based on an adapted version23 of van Tulder et al's28 levels of evidence (table 1). In defining study quality, previous systematic reviews of rater reliability employing QAREL have used ≥50%,22,27,29 ≥60%23,27,30 and ≥70%27 cut-points. In the absence of a single accepted cut-point for defining study quality, and because conclusions about overall levels of evidence can be sensitive to operational definitions of study quality,31 we conducted analyses using three cut-points for defining ‘high quality’: studies were defined as high quality if ≥50%, ≥60% or ≥70% of applicable QAREL checklist items were scored ‘Yes’. When considering the levels of evidence, ratings made from observations of video recordings were considered separately from live ratings.
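The extraction rule above can be sketched in code. The function and values below are illustrative only (none are drawn from an appraised study): the lower 95% CI limit is used when reported, otherwise the point estimate, and the result is classified against the ‘acceptable’ thresholds (κ≥0.4, ICC≥0.6).

```python
def acceptable(statistic, point, ci_lower=None):
    """Classify an extracted reliability coefficient as acceptable or not.

    statistic: "kappa" or "icc"; point: reported coefficient;
    ci_lower: lower 95% CI limit, if the study reported one.
    """
    thresholds = {"kappa": 0.4, "icc": 0.6}
    # Prefer the lower CI limit when available, per the extraction rule.
    value = ci_lower if ci_lower is not None else point
    return value >= thresholds[statistic]

print(acceptable("icc", 0.76, ci_lower=0.63))  # lower CI limit clears 0.6 -> True
print(acceptable("icc", 0.76, ci_lower=0.55))  # fails at the lower limit -> False
print(acceptable("kappa", 0.45))               # no CI reported, point estimate -> True
```

This makes explicit the conservative choice in the review: a study whose point estimate looks acceptable can still fail the threshold once CI uncertainty is taken into account.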
Results of the database search are shown in figure 1. Of the 23 full-text articles assessed for eligibility, 11 reported reliability data but did not meet the inclusion criteria and were excluded. Reasons for exclusion were: not published in English;32 rater reliability was not the primary study aim;9,33–35 source was an unpublished thesis;36–38 or record was published in abstract rather than full-text form.39–41 Thus, 12 studies met the criteria for inclusion and underwent appraisal using the QAREL checklist.42–53 After independent appraisal, the two reviewers agreed on 84.9% of appraised items (κ=0.77, 95% CI 0.68 to 0.86), and achieved consensus on the remaining items after discussion and consideration of the operational definitions. Characteristics of appraised studies are displayed in table 2. Appraised studies were characterised by ratings of live performances of the FMS battery and/or ratings made from video recordings of participant performances. Across the 12 studies reviewed, there were six different combinations of inter-rater/intra-rater reliability and live/video observation, indicating a high level of diversity in study design (figure 2). Reliability coefficients for both composite FMS scores and each individual subtest were extracted and plotted for inter-rater (figure 3) and intra-rater reliability (figure 4).
Results of the quality appraisal of each study are reported in table 3. Eight of the 12 studies met the operational definition of ‘high quality’ based on satisfying ≥50% of the applicable QAREL items. However, when applying the ≥60% quality threshold, there were three high-quality studies,47,50,52 and only one when applying the ≥70% threshold.52
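The three quality cut-points can be illustrated with a short sketch; the QAREL tallies below are hypothetical, not the appraised studies' actual scores.

```python
def quality_flags(yes_count, applicable_items=10):
    """Flag a study as 'high quality' at each review cut-point.

    yes_count: number of applicable QAREL items scored 'Yes';
    applicable_items: 10 in this review (item 5 was excluded).
    """
    pct = 100.0 * yes_count / applicable_items
    return {cut: pct >= cut for cut in (50, 60, 70)}

print(quality_flags(5))  # {50: True, 60: False, 70: False}
print(quality_flags(7))  # {50: True, 60: True, 70: True}
```

A study scoring 'Yes' on 5 of 10 items is high quality only at the most lenient cut-point, which is why the count of high-quality studies fell from eight to three to one as the threshold was raised.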
Levels of evidence for composite FMS scores
Inter-rater reliability of composite scores: ratings made while viewing video recordings
Inter-rater reliability of composite FMS scores was investigated in five studies using ratings made while viewing video recordings.42,43,45,47,50 However, Minick et al47 did not report agreement coefficients for composite scores, and Butler et al42 investigated a modified scoring system intended for research purposes rather than clinical use (table 2); these two studies were therefore not considered. Based on the remaining three studies, the overall level of evidence was ‘conflicting’, irrespective of the threshold for ‘high quality’.
Inter-rater reliability of composite scores: ratings made from live observation
Four studies reported inter-rater reliability of composite scores established from live ratings.48,49,51,52 Regardless of the threshold for study quality (≥50%, ≥60% or ≥70%), the overall level of evidence was ‘moderate’.
Intra-rater reliability of composite scores: ratings made from viewing of video recordings
Three studies reported intra-rater reliability based on repeated viewing of video recordings.44,49,50 The intra-rater component of Shultz et al50 was considered methodological (a comparison of one rater's FMS scores derived from live ratings with those made from video recordings) with limited clinical application, and was therefore not included in the analysis (table 2). Applying the ≥50% threshold of study quality resulted in a ‘moderate’ level of evidence; however, at the ≥60% and ≥70% thresholds, the overall level of evidence was ‘limited’.
Intra-rater reliability of composite scores: ratings made from live observation
Six studies reported intra-rater reliability of composite scores from live ratings.46,48,50–53 The findings of Shultz et al50 were excluded on the basis of limited clinical applicability, and those of Waldron et al53 because an interpretable reliability coefficient was not reported (table 2). Applying the ≥50% threshold of study quality resulted in a ‘strong’ level of evidence; however, at the ≥60% and ≥70% thresholds, the overall level of evidence was ‘moderate’.
Levels of evidence for FMS subtests
The levels of evidence for individual FMS subtests are summarised in table 4. Apart from one exception (Hurdle Step), the levels of evidence for all subtests were robust with regard to increasing the threshold of study quality. For inter-rater reliability, live ratings out-performed those made from video with ‘moderate’ evidence of acceptable reliability for four subtests (Deep Squat, Shoulder Mobility, Active Straight-leg Raise, Trunk Stability Push-up), and ‘conflicting’ evidence for the remaining three (Hurdle Step, In-line Lunge, Rotary Stability). The levels of evidence for inter-rater reliability from video ratings were ‘conflicting’ for all subtests except for Hurdle Step—which decreased from ‘strong’ to ‘limited’ when increasing the threshold of study quality. For intra-rater reliability, the level of evidence was ‘moderate’ for all subtests except for Rotary Stability—which was ‘conflicting’. There were no published intra-rater reliability data for ratings of individual subtests made from video.
Our systematic review indicates a ‘moderate’ level of evidence in favour of acceptable inter-rater and intra-rater reliability for composite scores derived from live scoring of the FMS battery. For composite scores derived from viewing video recordings, there is conflicting evidence of ‘acceptable’ inter-rater reliability, and limited evidence for intra-rater reliability.
The FMS is one of several test batteries that purport to assess movement patterns using visual observation.8,9,35,54,55 Formal data about the extent to which the FMS is used by practitioners in injury prevention programmes are sparse; however, one recent survey of international professional football (soccer) clubs identified the FMS as the test most commonly used to identify injury risk.2
Clinical interpretation of what constitutes an acceptable level of rater reliability is widely considered to be context dependent and somewhat arbitrary.56,17 In defining thresholds of ‘acceptable’ reliability for this review, we adopted coefficient values (ICC≥0.6, κ≥0.4) corresponding to ‘moderate’ reliability as sufficient for observing human movement for screening purposes. Increasing the thresholds above these levels would decrease the number of studies reporting acceptable reliability, and the overall levels of evidence would be downgraded accordingly.
Methodological issues identified in the studies reviewed
A wide range of study designs is available to investigate rater reliability.26 Of the 12 studies included in this review, there were six different combinations of inter-rater/intra-rater reliability and live/video observation; this diversity of study designs precluded meta-analysis. Of the studies included, only three reached a QAREL score ≥60% and one a score ≥70%, highlighting that the majority of studies were at risk of methodological bias. Close inspection of the QAREL appraisal results shows a substantial number of ‘unclear’ ratings arising as a consequence of poor study reporting (table 3). In appraising the 12 studies using the 10 applicable QAREL items, we assigned an ‘unclear’ rating to approximately half of the total applicable ratings (54 of 111 rated items). An ‘unclear’ rating arises when authors fail to report sufficient procedural or methodological detail, and a substantial proportion (35/63) were related to issues of blinding (see table 3, QAREL items 3, 4, 6, 7).
Guidelines to improve the reporting of clinical studies are now well established in clinical research, including those for studies of reliability and agreement.26 The poor reporting in the studies we reviewed reinforces previous calls for investigators in sports injury prevention research to comply with reporting guidelines.57 Given the proportion of ‘unclear’ ratings attributable to poor reporting, it is possible that the true quality of the existing reliability literature is higher than we have appraised. We elected not to contact authors personally to clarify these details because, like other reviewers,22 we considered this could introduce a high risk of recall bias.
In addition to issues related to blinding, there were three other notable methodological weaknesses in the studies:
Representativeness of participants (QAREL item 1): The small samples (n≤5) employed in several studies43,44 are unlikely to provide a sufficient spectrum of ratings; this can threaten internal validity when using the κ statistic,58 and also limits external validity, because reliability should be investigated in a sample representing scores across the full scale.56 Although methods of calculating sample size for reliability studies are available for both raters59 and participants,60 none of the 12 studies reported such calculations or provided a rationale in support of their sample size.
Representativeness of raters (QAREL item 2): Rater training and experience in FMS administration were not systematically reported in the studies reviewed. The often incomplete description of rater characteristics precluded analysis of the influence of prior training and experience in FMS; however, it appears that raters with relatively little experience can achieve acceptable reliability. Rater training and experience should be systematically itemised, including vocational designation, the extent of formal or informal instruction, and perhaps most importantly, the level of clinical experience in administering the FMS.
Appropriate statistical treatment (QAREL item 11): There are well-accepted statistical conventions for analysing and reporting rater reliability data,26 and compliance is important so that findings are correctly interpreted.17 It is unfortunate, though preventable, that several studies employed unsuitable methods of analysis, such as inappropriate statistics for inter-rater agreement on individual subtests,45 ICCs to report agreement for subtests scored on a 4-point ordinal scale,51 or non-chance-corrected statistics.53 Failure to report confidence limits for reliability coefficients was also apparent.42,43,47
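To illustrate why non-chance-corrected statistics are problematic, the sketch below computes Cohen's κ for two hypothetical raters scoring one subtest on the 0–3 scale (the data are invented for illustration): raw percent agreement overstates reliability relative to the chance-corrected coefficient.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical scores of the same subjects."""
    n = len(rater_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

a = [2, 2, 3, 1, 2, 2, 0, 3, 2, 2]  # hypothetical subtest scores, rater A
b = [2, 2, 3, 1, 2, 1, 0, 3, 2, 3]  # hypothetical subtest scores, rater B
# Raw agreement is 8/10 = 0.8, but kappa is lower once chance is removed.
print(round(cohen_kappa(a, b), 2))  # -> 0.7
```

Because most FMS subtest scores cluster around 2, two raters agree often by chance alone, which is exactly the situation in which an uncorrected percent-agreement figure is misleading.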
Implications for practitioners
Based on our findings, practitioners using the FMS as a movement quality test can expect to achieve acceptable inter-rater and intra-rater reliability for deriving a composite score from live ratings, assuming they have training and experience comparable to the raters in the studies reviewed. Given the conflicting level of evidence for reliability based on video ratings, live ratings may be preferred over ratings made from video recordings.
Cook et al61–63 indicate that practitioners should interpret the composite score in conjunction with the scores of individual subtests and the number of left-right asymmetries. Although acceptable levels of inter-rater and intra-rater reliability for composite scores are achievable, recent explorations of the underlying factor structure of the FMS battery in military personnel64 and elite athletes65 suggest it may not be unidimensional, and that summation of subtest scores into a single composite score may not be justified. We therefore recommend that practitioners interpret the findings of each individual subtest separately. Practitioners should note, however, that the level of evidence for live inter-rater reliability is conflicting for three subtests (Hurdle Step, In-line Lunge, Rotary Stability); in circumstances where multiple practitioners work collaboratively (such as multistation preparticipation screening of large training squads), more judicious interpretation of these subtests is recommended. Whenever possible, practitioners working together in the same setting should review test administration and scoring criteria in order to calibrate among themselves.
Recommendations for further research
Researchers designing clinical trials make decisions that affect internal and external validity based on a design ‘attitude’ that may be pragmatic or explanatory in orientation.66 Similarly, researchers designing rater reliability studies make design decisions that affect internal and external validity. None of the appraised studies reported designs in which individual raters independently administered the FMS in a manner that closely resembled typical administration in a clinical or field setting. Studies designed for execution in a laboratory or simulated clinical setting can produce findings with acceptable internal validity, but these may not be sufficiently representative of the usual practice environment for the findings to be generalised. To improve internal validity, investigators may introduce conditions that reduce the risk of bias within the study; however, these more controlled conditions compromise the representativeness of the context in which the test is usually administered. For instance, the use of video in reliability studies controls for within-participant variability, but is not representative of the live test administration typical of clinical settings.67 Because none of the appraised studies simulated the conditions in which the FMS is normally conducted in routine practice, the ecological validity of the studies reviewed is questionable. In addition to improved study reporting, we recommend that future investigations incorporate design characteristics that more closely resemble conditions encountered in typical clinical applications.
Examples include: (1) raters administer the FMS independently, in isolation from other raters; (2) all subtests and clearing tests are included; (3) raters manipulate FMS test equipment, including setting hurdle height and establishing the cut-off measurement for Shoulder Mobility; (4) raters verbalise their own instructions; (5) raters enquire about the presence of pain during each subtest; and (6) raters notate test outcomes and interpret findings. Given that reliability is a property of the measurement and not of the instrument,68 we recommend that authors employing FMS scoring in injury prevention research consider establishing reliability data for their own raters, as appropriate to the study or practice context. To aid practical interpretation, investigators reporting reliability data should also report measurement error in the form of the SE of measurement and the minimum detectable change.17 Of the studies reviewed, only Teyhen et al52 reported these data.
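The standard formulas for the SE of measurement (SEM = SD·√(1 − ICC)) and the 95% minimum detectable change (MDC95 = 1.96·√2·SEM) can be sketched as follows; the SD and ICC values used are hypothetical, not data from any appraised study.

```python
import math

def sem(sd, icc):
    """SE of measurement: SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)

def mdc95(sd, icc):
    """95% minimum detectable change: MDC95 = 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2) * sem(sd, icc)

sd_composite, icc = 1.8, 0.76  # hypothetical composite-score SD and ICC
print(round(sem(sd_composite, icc), 2))    # SEM, in composite-score points
print(round(mdc95(sd_composite, icc), 2))  # change needed to exceed measurement error
```

For these illustrative values, the MDC95 exceeds two points on the 21-point composite scale, which shows why reporting SEM and MDC alongside reliability coefficients matters for interpreting individual change.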
This review set specific eligibility criteria that excluded theses and other non-indexed, non-peer-reviewed grey literature, and included only English-language studies that investigated reliability as a primary study aim. This approach may have excluded good-quality studies that exist outside these parameters.
This review found ‘moderate’ evidence that raters can achieve acceptable levels of inter-rater and intra-rater reliability of composite FMS scores when using live ratings, but ‘limited’ and ‘conflicting’ evidence for scores derived from video recordings. Overall, there were few high-quality studies, with just one study satisfying the most rigorous definition of high quality. The quality of several studies was negatively impacted by poor reporting, particularly in relation to rater blinding, which is critically important in reliability designs.
What are the new findings?
This study is the first systematic review of rater reliability for Functional Movement Screen (FMS) scores.
There is a ‘moderate’ level of evidence in favour of acceptable inter-rater and intra-rater reliability for composite scores derived from live scoring.
Ratings made from live observation were superior to those made from viewing of video recordings.
- Data supplement 1 - Online supplement
Contributors RWM conceived the idea for the systematic review. RWM undertook the literature search. RWM and KMM appraised the articles. AGS took the final decision on appraisal decisions when not agreed on by RWM and KMM. RWM drafted the manuscript, and AGS and SJS revised it critically for intellectual content. All authors approved the final version. RWM submitted the article.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.