# Enhancing the efficacy of the 20 m multistage shuttle run test

^{1}Faculty of Applied Health Sciences, Brock University, St Catharines, Ontario, Canada L2S 3A1

^{2}School of Sports, University of Wolverhampton, Wolverhampton, UK

^{3}Department of Sport and Exercise Science, University of Thessaly, Trikala, Greece

- Correspondence to: Yiannis Koutedakis, University of Thessaly, Department of Sport and Exercise Science, Karies, Trikala GR42100, Greece; y.koutedakis@uth.gr

- Accepted 8 June 2004

## Abstract

**Objective:** Maximal oxygen uptake (Vo_{2max}) of 44 ml kg^{−1} min^{−1} is an accepted criterion (Vo_{2CR}) below which health and fitness for young male adults may be compromised. The aim was to develop new algorithms for the 20 m multistage shuttle run test (20mMST) and to validate them for Vo_{2CR} screening.

**Methods:** Vo_{2max} was assessed in 110 males using a stationary gas analyser in a treadmill test (TT) and in 40 of these subjects using a portable
gas analyser in the 20mMST. Vo_{2max} predicted from the 20mMST in 70 subjects was used for cross validation. Two equations predicting Vo_{2max} during 20mMST (EQ_{MST}) and TT (EQ_{TT}) were developed.

**Results:** Energy cost differed significantly between TT and 20mMST (p<0.001); this energy cost variance (EC_{V}) correlated significantly with subject height and was a significant predictor of Vo_{2max} differences between TT and 20mMST. The *r*^{2} of EQ_{MST} was 0.92 (p<0.001). Predicted Vo_{2max} values from EQ_{MST} correlated with directly measured 20mMST Vo_{2max} at *r* = 0.96 (p<0.001). ANOVA detected no mean difference (p>0.05) between predicted and measured values. Prevalence of low fitness
based on Vo_{2CR} was 0.37. McNemar χ^{2} indicated significant differences in sensitivity (p<0.001) and specificity (p<0.05) between the original 20mMST equation
(EQ_{LÉG}) and EQ_{TT}, regarding Vo_{2CR} screening. Cohen’s κ demonstrated higher agreement with TT Vo_{2max} for EQ_{TT} (p<0.001) than EQ_{LÉG} (p<0.05). TT Vo_{2max} correlated with the end result of both EQ_{LÉG} and EQ_{TT} at *r* = 0.75 (p<0.001). Unlike EQ_{TT} (p>0.05), mean predicted Vo_{2max} from EQ_{LÉG} was significantly higher compared to TT Vo_{2max} (p<0.001).

**Conclusion:** These algorithms increase the efficacy of 20mMST to accurately evaluate aspects of health and fitness.

- ANOVA, analysis of variance
- CF, cardiorespiratory fitness
- CI, confidence interval
- CV, coefficient of variation
- EC, energy cost
- GEE, generalised estimating equations
- GLM, general linear model
- LIM_{AG}, limits of agreement analysis
- MAS, maximal attained speed
- ROC, receiver operating characteristics
- SD, standard deviation
- TT, treadmill test
- 20mMST, 20 m multistage shuttle run test

Despite the vast amount of research on cardiorespiratory fitness (CF) assessment and the acceptance of specific CF cut offs in national health guidelines,^{1,}^{2} statistical screening methodology, such as receiver operating characteristics (ROC) curve analysis, has not hitherto been employed in CF assessment. ROC curve analysis is used extensively in epidemiology as a graphic means of assessing the accuracy of a diagnostic instrument.^{3} The difficulty in adopting ROC curves in sports medicine lies mainly in the fact that most outcome measures are continuous. However, such biomarkers can be dichotomised using dummy variables according to a clinically accepted critical value Q, with each outcome defined as positive or negative according to whether it is greater or less than Q. For instance, a maximal oxygen uptake (Vo_{2max}) of 44 ml kg^{−1} min^{−1} for young male adults (18–29 years of age) has been generally accepted as a criterion (Vo_{2CR}) below which both health and fitness may be compromised.^{1,}^{4,}^{5}
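The dichotomisation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' software: the function names are hypothetical, "positive" is taken to mean a value below Q (low fitness), following the sensitivity and specificity definitions given under Statistical analyses, and the example values are invented.

```python
Q_VO2CR = 44.0  # criterion Vo2max (ml kg^-1 min^-1) for young male adults


def is_positive(vo2max, cutoff=Q_VO2CR):
    """Dummy-code a continuous Vo2max: True (positive) if below the cut off."""
    return vo2max < cutoff


def sensitivity_specificity(reference, predicted, cutoff=Q_VO2CR):
    """Return (sensitivity, specificity) of `predicted` against `reference`.

    Assumes at least one reference-positive and one reference-negative case.
    """
    tp = fn = tn = fp = 0
    for ref, pred in zip(reference, predicted):
        ref_pos, pred_pos = is_positive(ref, cutoff), is_positive(pred, cutoff)
        if ref_pos and pred_pos:
            tp += 1
        elif ref_pos:
            fn += 1
        elif pred_pos:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Invented example: treadmill (reference) v field-predicted Vo2max
reference = [40.1, 42.5, 45.0, 50.2, 43.9, 47.3]
predicted = [41.0, 44.5, 46.2, 49.0, 43.0, 43.5]
se, sp = sensitivity_specificity(reference, predicted)
print(f"sensitivity={se:.2f}, specificity={sp:.2f}")  # sensitivity=0.67, specificity=0.67
```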

The 20 m multistage shuttle run test (20mMST)^{6} represents an acceptable field assessment tool for CF and has been employed repeatedly in different health^{7,}^{8} and fitness^{9} settings. Its popularity, however, stems mainly from its practicality for testing large groups simultaneously, and studies evaluating its accuracy in predicting laboratory Vo_{2max} have reported contradictory results.^{9–}^{11} More importantly, the efficacy (that is, the extent to which a specific procedure produces a valid classification of data in relation to established criteria) of the original 20mMST model in screening for CF remains unknown.

From a statistical standpoint, the limited accuracy of the 20mMST may be attributed to the repeated measures design used in the original study.^{6} It is well known that the inherent dependency of within-subject observations can reduce the power of prediction models.^{12} Moreover, the theoretical basis of the original 20mMST model may have been further compromised by the use of generally large and heterogeneous samples in the validation procedures,^{6} as sample heterogeneity is known to produce severely biased linear relationships.^{13}

From a physiological viewpoint, the limited ability of the original 20mMST model to predict treadmill Vo_{2max} might be attributed to differences between the exercise modes used in the validation procedures (that is, shuttle running *v* forward running). Recent investigations suggested that Vo_{2max} during the 20mMST is significantly higher than during a treadmill test.^{14,}^{15} Therefore, a prediction model controlling for differences in energy cost (EC) between the reference standard laboratory assessment and the proxy 20mMST may predict Vo_{2max} more accurately and screen for Vo_{2CR} more efficaciously. The objective of the present investigation was to develop a new Vo_{2max} prediction algorithm for the 20mMST using data collected via portable indirect calorimetry and statistical procedures that account for within-subject observation dependency. Thereafter, the efficacy of both the original and the novel models in predicting standard treadmill Vo_{2max} and screening for Vo_{2CR} was assessed.

## METHODS

### Subjects and procedures

A total of 110 healthy males (age 21.6 (SD 2.5) years; BMI 23.6 (2.2) kg m^{−2}) volunteered. Exclusion criteria included smoking and any muscular or skeletal injuries. Written informed consent was obtained from all participants after full explanation of the procedures involved. The cohort was arbitrarily divided into model (n = 40) and validation (n = 70) groups. Analysis of variance (ANOVA) revealed no significant difference between the two groups in anthropometric characteristics.

Within a 14 day period, all participants underwent a treadmill Vo_{2max} assessment and performed the 20mMST in an indoor rubber floored gymnasium. Unlike the validation group, participants in the model group also had their Vo_{2max} measured during the 20mMST using a portable gas analyser. Special care was taken to maintain similar environmental conditions at both measurement sites during assessment. Prior to data collection, subjects were familiarised with all assessment protocols. They were also advised to avoid stressful activities for 36–48 h before each data collection visit. Tests were conducted in random order, by the same investigators, and at the same time of day for each subject, either between 9:00 and 12:00 h or between 14:00 and 17:00 h. The study was approved by the Research Ethics Board of the University of Wolverhampton.

### Data collection

#### Laboratory assessment of Vo_{2max} (TT)

A modified Bruce treadmill test (TT) to exhaustion was used.^{16} Treadmill running speed was adjusted individually so as to bring each subject to exhaustion within 7–10 min. The treadmill inclination was increased by 2.5° every 3 min from an initial 3.5°. Oxygen uptake (Vo_{2} (ml kg^{−1} min^{−1})) was measured via open circuit spirometry using an automated gas analyser (Vmax 29, SensorMedics, Yorba Linda, CA) previously calibrated with standard gases. Respiratory parameters were recorded every 20 s during testing, while subjects inspired room air through a low resistance two-way Rudolph valve. To ensure that subjects achieved Vo_{2max}, measurements were considered for further analysis only when at least two of the following criteria were met: (i) maximal heart rate greater than 185 bpm, (ii) respiratory exchange ratio greater than 1.1, and (iii) detection of a plateau in the Vo_{2} curve. EC in kcal was calculated for each individual minute/stage as the product of mean Vo_{2} (l min^{−1}) and the corresponding caloric equivalent.^{17}
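The "at least two of three" acceptance rule above can be expressed as a small check. This is a sketch under stated assumptions: the function name is hypothetical, and the plateau criterion is passed in as a yes/no judgement because the plateau definition used in the study is not given here.

```python
def vo2max_attained(hr_max_bpm, rer_max, plateau_detected):
    """Return True when at least two of the three criteria are met.

    plateau_detected is a judgement made elsewhere (assumed input); the
    study's own plateau definition is not reproduced here.
    """
    criteria = (
        hr_max_bpm > 185,    # (i) maximal heart rate greater than 185 bpm
        rer_max > 1.1,       # (ii) respiratory exchange ratio greater than 1.1
        plateau_detected,    # (iii) plateau in the Vo2 curve
    )
    return sum(bool(c) for c in criteria) >= 2


print(vo2max_attained(191, 1.08, True))   # True: two criteria met
print(vo2max_attained(182, 1.08, True))   # False: only one criterion met
```

Both thresholds are strict inequalities, mirroring the wording "greater than".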

#### Field assessment of Vo_{2max} (20mMST)

This test was conducted according to established procedures.^{6} In the model group, a portable gas analyser (K4b^{2}, Cosmed, Rome, Italy) was used to record respiratory parameters every 20 s during testing, while subjects inspired room air through a facemask. Maximal oxygen uptake was the main parameter determined using the open circuit method. Prior to measurement, the gas analyser was calibrated with standard gases. Exhaustion was confirmed when at least two of the following criteria were met: (i) maximal heart rate greater than 185 bpm, (ii) respiratory exchange ratio greater than 1.1, and (iii) detection of a plateau in the Vo_{2} curve. EC in kcal was calculated for each individual minute/stage as the product of mean Vo_{2} (l min^{−1}) and the corresponding caloric equivalent.^{17} In the validation group, Vo_{2max} was predicted from 20mMST performance according to established procedures.^{6}
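The per-minute EC calculation can be sketched as follows. The caloric equivalent of oxygen depends on the respiratory exchange ratio (RER); the table of reference 17 is not reproduced here, so this sketch assumes a linear interpolation between approximately 4.686 kcal per litre O2 at RER 0.70 and 5.047 kcal per litre at RER 1.00 (the end points of the classic non-protein RQ table), with RER values above 1.00 clamped to 1.00. Function names and data are hypothetical.

```python
# Assumed end points of a non-protein RQ table (kcal per litre O2);
# reference 17 may tabulate intermediate values differently.
RER_LOW, KCAL_LOW = 0.70, 4.686
RER_HIGH, KCAL_HIGH = 1.00, 5.047


def caloric_equivalent(rer):
    """kcal per litre O2, linearly interpolated; RER clamped to 0.70-1.00."""
    rer = min(max(rer, RER_LOW), RER_HIGH)
    frac = (rer - RER_LOW) / (RER_HIGH - RER_LOW)
    return KCAL_LOW + frac * (KCAL_HIGH - KCAL_LOW)


def energy_cost_per_minute(mean_vo2_l_min, mean_rer):
    """EC (kcal) for each 1 min stage: mean Vo2 (l/min) x caloric equivalent."""
    return [vo2 * caloric_equivalent(rer)
            for vo2, rer in zip(mean_vo2_l_min, mean_rer)]


# Hypothetical minute-by-minute means for one subject
vo2 = [2.1, 2.6, 3.2, 3.8]
rer = [0.88, 0.95, 1.03, 1.12]
ec = energy_cost_per_minute(vo2, rer)
print([round(v, 2) for v in ec], round(sum(ec), 1))
```

Summing the per-minute values gives a test-level EC, which is one way the mean EC of TT and 20mMST could then be compared.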

The K4b^{2} gas analyser weighed 475 g and was not expected to significantly alter the subjects’ energy demands. A pilot study using five subjects (age 21.6 (SD 1.3) years; BMI 24.3 (1.5) kg m^{−2}) was conducted in order to investigate additional energy demands and ensure that significant agreement existed between the two gas analysers employed. The subjects, who did not take part in the main investigation, performed the previously described TT twice using both gas analysers. Results showed no significant difference (p>0.05) between the mean Vo_{2max} values recorded by the stationary (Vmax 29, SensorMedics) and the portable (K4b^{2}, Cosmed) gas analysers (48.7 (SD 3.1) *v* 49.1 (3.5) ml kg^{−1} min^{−1}, respectively), with an average absolute error of 0.51 (SD 0.18) ml kg^{−1} min^{−1}.
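The pilot agreement statistics (mean absolute error and a paired comparison of means) can be sketched with the Python standard library. The function name and the five-subject data below are hypothetical; the p value would come from a t distribution with n − 1 degrees of freedom, which the standard library does not provide.

```python
import math
import statistics


def analyser_agreement(stationary, portable):
    """Paired comparison of Vo2max from two analysers on the same subjects.

    Returns the mean (SD) absolute error and the paired t statistic
    (n - 1 degrees of freedom) for the mean difference portable - stationary.
    """
    diffs = [p - s for s, p in zip(stationary, portable)]
    abs_err = [abs(d) for d in diffs]
    n = len(diffs)
    t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return statistics.mean(abs_err), statistics.stdev(abs_err), t_stat


# Hypothetical pilot data for five subjects (ml kg^-1 min^-1)
stationary = [45.0, 47.0, 49.0, 51.0, 52.0]
portable = [45.5, 47.3, 49.6, 51.4, 52.7]
mae, mae_sd, t = analyser_agreement(stationary, portable)
print(f"absolute error {mae:.2f} (SD {mae_sd:.2f}); t(4) = {t:.2f}")
```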

### Statistical analyses

ANOVA was used to compare mean EC between TT and 20mMST. The effect of energy-cost variance between TT and 20mMST (EC_{V}) on the original 20mMST prediction model (EQ_{LÉG}^{6}) was assessed via a simultaneous general linear model (GLM). This model aimed to predict Vo_{2max} differences/errors between TT and EQ_{LÉG} using mean EC_{V} as an independent variable. In addition, Pearson’s correlation coefficients were used to detect linearity between EC_{V} and various anthropometrical characteristics.
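With a single independent variable, the GLM described above reduces to an ordinary least squares regression of the Vo_{2max} error on EC_{V}. The sketch below (hypothetical names and invented data) returns the coefficients, *r*^{2}, the *F* statistic with (1, n − 2) degrees of freedom, and the standard error of the estimate.

```python
import statistics


def fit_simple_glm(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x.

    Returns (b0, b1, r2, F, df_resid, see), where F has (1, n - 2) df and
    see is the standard error of the estimate.
    """
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    df_resid = n - 2
    f_stat = r2 / ((1 - r2) / df_resid)   # single predictor: F = r2 / ((1 - r2) / df)
    see = (ss_res / df_resid) ** 0.5
    return b0, b1, r2, f_stat, df_resid, see


# Hypothetical: mean EC_V (kcal) v Vo2max error (ml kg^-1 min^-1)
ec_v = [1.0, 2.0, 3.0, 4.0]
error = [3.1, 4.9, 7.2, 8.8]
b0, b1, r2, f_stat, df_resid, see = fit_simple_glm(ec_v, error)
print(round(b0, 2), round(b1, 2), round(r2, 3), round(f_stat, 1), df_resid)
```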

For the calculation of the novel prediction model, the generalised estimating equations (GEE)^{18} approach was employed to account for subject specific dependency between the repeated observations. GEE extends generalised linear models to response variables that are *dependent* (for example, repeated measures within a subject), including non-normally distributed ones.^{18} A GLM framework with GEE estimation was used to generate an equation (EQ_{MST}) predicting Vo_{2max} measured during the 20mMST from the model group data (n = 40), with the maximal attained speed (MAS) during the 20mMST as the independent variable. Thereafter, a second GLM with GEE estimation generated the EQ_{TT} model, which predicts the reference standard TT Vo_{2max} (dependent variable) from the end result of EQ_{MST} (independent variable). This procedure was employed to produce a 20mMST Vo_{2max} model that accounts for EC_{V}. To ensure that this procedure was indeed superior to the traditional approach, a GLM (EQ_{MAS}) was also calculated using TT Vo_{2max} (dependent variable) and MAS (independent variable). ANOVA and Pearson’s correlation coefficients were used to detect possible bias between the mean actual and predicted Vo_{2max} values for the three models.
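A minimal sketch of a linear GEE fit, assuming a Gaussian family, identity link and an *independence* working correlation (the working correlation structure used in the study is not stated here). Under these assumptions the point estimates equal ordinary least squares, while the robust "sandwich" covariance sums score contributions per subject, so the repeated stages of one subject are not treated as independent. No small-sample correction is applied; dedicated implementations such as the `GEE` class in statsmodels also support other working structures (for example, exchangeable). Function names and data are hypothetical.

```python
import math
from collections import defaultdict


def _inv2(m):
    """Inverse of a 2 x 2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def _matmul2(p, q):
    """Product of two 2 x 2 matrices."""
    return [[sum(p[r][k] * q[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def gee_independence(subject_ids, x, y):
    """Linear GEE (Gaussian, identity link, independence working correlation).

    Returns ([intercept, slope], [robust SE intercept, robust SE slope]).
    """
    n, sx = len(x), sum(x)
    sxx = sum(v * v for v in x)
    bread_inv = _inv2([[n, sx], [sx, sxx]])           # (X'X)^-1
    xty = [sum(y), sum(a * b for a, b in zip(x, y))]  # X'y
    beta = [sum(bread_inv[r][k] * xty[k] for k in range(2)) for r in range(2)]

    # Per-subject score vectors X_i' e_i (all stages of a subject pooled)
    scores = defaultdict(lambda: [0.0, 0.0])
    for sid, xi, yi in zip(subject_ids, x, y):
        resid = yi - (beta[0] + beta[1] * xi)
        scores[sid][0] += resid
        scores[sid][1] += xi * resid

    meat = [[sum(u[r] * u[c] for u in scores.values()) for c in range(2)]
            for r in range(2)]
    cov = _matmul2(_matmul2(bread_inv, meat), bread_inv)  # sandwich estimator
    se = [math.sqrt(max(cov[i][i], 0.0)) for i in range(2)]
    return beta, se


# Hypothetical stage data: three subjects, three speeds each
subjects = [1, 1, 1, 2, 2, 2, 3, 3, 3]
mas = [9.0, 10.0, 11.0] * 3
deviation = [1.0, 1.2, 0.8, -1.0, -1.2, -0.8, 0.0, 0.0, 0.0]
vo2 = [6.65 * s - 35.8 + d for s, d in zip(mas, deviation)]
beta, se = gee_independence(subjects, mas, vo2)
print([round(b, 3) for b in beta], [round(s, 3) for s in se])
```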

Data from the remaining 70 subjects (referred to as the validation group) were used to cross validate EQ_{TT} and the original EQ_{LÉG} model. Correlation coefficients, ANOVA, 95% limits of agreement analyses (LIM_{AG}) and percent coefficients of variation (CV_{%}) were adopted to validate the two models according to established procedures.^{19} Ninety five percent confidence intervals (CI_{95%}) and ROC curve analysis were calculated using statistical software incorporated in SAS/Macro/IML. The latter software is
designed specifically to fit ROC curves using dummy variables for data obtained from repeated measures designs. The area under
the ROC curve was estimated using the Wilcoxon non-parametric method.^{20} The demarcation point for Vo_{2CR} was set at 44 ml kg^{−1} min^{−1} according to available guidelines.^{1,}^{4,}^{5} Calculated sensitivity and specificity with corresponding CI_{95%} were used to determine the efficacy of the two equations in screening for Vo_{2CR}. Sensitivity (S_{E}) was defined as the proportion of subjects below the Vo_{2CR} who demonstrated a 20mMST predicted value below 44 ml kg^{−1} min^{−1}. Specificity (S_{P}) was defined as the proportion of subjects above the Vo_{2CR} who revealed a 20mMST predicted value above or equal to 44 ml kg^{−1} min^{−1}. McNemar χ^{2} analysis examined the differences between calculated sensitivity and specificity at the cut off point for both equations.
Cohen’s κ statistic was used to evaluate the agreement between the prediction models and the reference standard test. Finally,
ANOVA and Pearson’s correlation coefficients were used to detect possible bias between the mean actual and predicted values.
All statistical analyses were carried out with SPSS (version 11.5; SPSS, Chicago, IL) and SAS (version 8.2; SAS Institute,
Cary, NC, USA) statistical software packages. The level of significance was set at p<0.05.
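The McNemar comparison of paired sensitivities (or specificities) and Cohen's κ can be sketched with the standard library. Assumptions: the χ^{2} is computed without continuity correction by default (the variant used in the study is not stated), the 1 df p value uses the identity P(χ^{2}₁ > x) = erfc(√(x/2)), and all names and data are hypothetical.

```python
import math


def mcnemar(test_a, test_b, continuity=False):
    """McNemar chi-square (1 df) and p value for two paired binary tests.

    Compare sensitivities by passing only reference-positive subjects,
    specificities by passing only reference-negative subjects.
    """
    only_a = sum(1 for a, b in zip(test_a, test_b) if a and not b)
    only_b = sum(1 for a, b in zip(test_a, test_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 0.0, 1.0
    diff = max(abs(only_a - only_b) - (1 if continuity else 0), 0)
    chi2 = diff ** 2 / discordant
    p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p


def cohen_kappa(class_1, class_2):
    """Cohen's kappa for two binary classifications of the same subjects.

    Undefined when expected agreement equals 1 (both all-positive or all-negative).
    """
    n = len(class_1)
    observed = sum(1 for a, b in zip(class_1, class_2) if a == b) / n
    p1, p2 = sum(class_1) / n, sum(class_2) / n
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    return (observed - expected) / (1 - expected)


# Hypothetical reference-positive subjects screened by two equations
eq_a = [True, True, False, False, False]
eq_b = [True, True, True, True, False]
chi2, p = mcnemar(eq_a, eq_b)
kappa = cohen_kappa([True, True, False, False], [True, False, False, False])
print(chi2, round(p, 3), kappa)  # 2.0 0.157 0.5
```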

## RESULTS

### Effect of energy-cost variance on EQ_{LÉG}

ANOVA detected significant differences in EC and Vo_{2max} between TT and EQ_{LÉG} (p<0.001; fig 1). Further, GLM results indicated that mean EC_{V} was a significant predictor of Vo_{2max} differences between TT and EQ_{LÉG} (*r*^{2} = 0.25, *F*_{1, 38} = 28.89, p<0.001). A significant linear relationship was also detected between EC_{V} and subject height (*r* = 0.94, p<0.001).

### Prediction of Vo_{2max} achieved via 20mMST and TT

Table 1 shows relevant statistics for the calculated models (that is, EQ_{MAS}, EQ_{MST}, and EQ_{TT}). Routine pre-analysis screening procedures were used to assess whether the data conformed to the assumptions of GLM. Although normally distributed, the variables used in these analyses were not independent of one another. Examination of residual scatterplots detected no violation of normality, linearity, or homoscedasticity between predicted Vo_{2max} scores and errors of prediction. Mahalanobis distance of each case to the centroid of all cases, evaluated against the χ^{2} distribution at p<0.001, detected no multivariate outliers. As expected, the variables used were multicollinear, being similar measures of the same parameter (that is, Vo_{2max}). As a significant linear relationship was detected between EC_{V} and subject height (see previous section), initial calculations for EQ_{MST} and EQ_{TT} included height as a covariate. However, height was not a significant predictor (p>0.05) for either model.

[EQ_{MAS}] Vo_{2max} = MAS×6.87−39.54

[EQ_{MST}] Vo_{2max} = MAS×6.65−35.8

[EQ_{TT}] Vo_{2max} = EQ_{MST}×0.95+0.182

Thus,

[EQ_{TT}] Vo_{2max} = (MAS×6.65−35.8)×0.95+0.182
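The three equations above can be transcribed directly into code as a check on the chained form. Expanding EQ_{TT} gives Vo_{2max} = 6.3175 × MAS − 33.828 (since 0.95 × 6.65 = 6.3175 and −0.95 × 35.8 + 0.182 = −33.828). The sketch assumes MAS is expressed in the units used to fit the equations (presumably km h^{−1}, which is an assumption); the function names are hypothetical.

```python
def eq_mas(mas):
    """EQ_MAS: Vo2max regressed directly on maximal attained speed."""
    return mas * 6.87 - 39.54


def eq_mst(mas):
    """EQ_MST: Vo2max reached during the 20mMST."""
    return mas * 6.65 - 35.8


def eq_tt(mas):
    """EQ_TT: treadmill-equivalent Vo2max, chained through EQ_MST."""
    return eq_mst(mas) * 0.95 + 0.182


def eq_tt_expanded(mas):
    """Algebraically expanded EQ_TT: 6.3175 x MAS - 33.828."""
    return 6.3175 * mas - 33.828


for speed in (11.0, 12.0, 13.0):
    # speed, EQ_MAS, EQ_MST, EQ_TT, screened below Vo2CR by EQ_TT?
    print(speed, round(eq_mas(speed), 2), round(eq_mst(speed), 2),
          round(eq_tt(speed), 2), eq_tt(speed) < 44.0)
```

Setting EQ_{TT} equal to 44 ml kg^{−1} min^{−1} and solving gives the MAS at the Vo_{2CR} demarcation point: (44 + 33.828)/6.3175 ≈ 12.32.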

### Model cross validation

Means (SD) and comparisons of various performance indices from the TT and the 20mMST, as well as results for LIM_{AG} and CV_{%}, appear in table 2. Preliminary analyses for LIM_{AG} revealed no positive relationship between the differences/errors (either (EQ_{LÉG}–TT) or (EQ_{TT}–TT)) and the size of the measurements (given by the mean of EQ_{LÉG} and TT, or the mean of EQ_{TT} and TT, respectively). Thus, the LIM_{AG} can be reported as absolute measurements.^{21} Finally, the mean difference (error) between estimates from EQ_{LÉG} and TT was significantly biased (*t* = −8.86, p<0.001), whereas that between EQ_{TT} and TT was not (*t* = 1.46, p>0.05).

Relevant univariate statistics and ROC curve analyses for the designated cut off point (that is, 44 ml kg^{−1} min^{−1}) appear in table 3 and fig 2. Twenty six subjects (37.1%; CI_{95%}: 0.9%) were diagnosed below the Vo_{2CR} using the reference standard TT. In contrast, EQ_{LÉG} and EQ_{TT} identified six and 29 subjects below the Vo_{2CR}, respectively. Cohen’s κ statistic demonstrated significant agreement with the TT measurement for both the EQ_{LÉG} (p<0.05) and the EQ_{TT} (p<0.001).
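The area under the ROC curve estimated by the Wilcoxon non-parametric method equals the probability that a randomly chosen positive (reference value below the cut off) receives a more "positive" score than a randomly chosen negative, with ties counting one half. The sketch below assumes that a lower predicted Vo_{2max} indicates a positive, so predictions are negated to act as scores; names and data are hypothetical, and confidence intervals are not computed.

```python
def roc_auc_wilcoxon(reference, predicted, cutoff=44.0):
    """Non-parametric (Wilcoxon/Mann-Whitney) area under the ROC curve.

    Positives are reference values below `cutoff`; a lower predicted Vo2max
    should indicate a positive, so predictions are negated to act as scores.
    Ties between a positive and a negative count one half.
    """
    pos = [-p for r, p in zip(reference, predicted) if r < cutoff]
    neg = [-p for r, p in zip(reference, predicted) if r >= cutoff]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


reference = [40.0, 42.0, 45.0, 50.0]   # hypothetical treadmill Vo2max
predicted = [41.0, 46.0, 44.0, 49.0]   # hypothetical field predictions
print(roc_auc_wilcoxon(reference, predicted))  # 0.75
```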

## DISCUSSION

Sedentary lifestyle is a common phenomenon in modern societies, representing a major risk factor for numerous pathologies.^{22} Consequently, screening for, and evaluation of, CF has become important for both health and fitness. The aim of the present
investigation was to utilise the most salient physiological and epidemiological procedures in order to enhance the efficacy
of the 20mMST for CF screening. Results suggested that the developed prediction models significantly increased the efficacy
of the 20mMST to discern subjects according to Vo_{2CR}. To our knowledge, the present study represents the first direct clinical appraisal of the 20mMST as a screening tool for
specific CF cut off points such as Vo_{2CR}.

To account for the increased energy requirements of shuttle running compared to forward treadmill running,^{14,}^{15} we developed a prediction equation which incorporates indirect calorimetry data collected while the subjects performed the 20mMST. The newly developed model predicted Vo_{2max} more accurately, with a smaller standard error of the estimate (1.9 ml kg^{−1} min^{−1}) than the original EQ_{LÉG} and EQ_{MAS} (4.4 and 2.7 ml kg^{−1} min^{−1}, respectively).^{6} Although the limits of agreement for EQ_{TT} are still relatively wide, this range is more likely to be acceptable than those of EQ_{LÉG} and EQ_{MAS}. Further, as illustrated by the present CV_{%} indices, the traditional Vo_{2max} prediction can be up to 1.2 times as unreliable as the prediction of EQ_{TT}. ROC curve analysis indicated that both EQ_{TT} and EQ_{LÉG} were highly specific in discriminating individuals according to Vo_{2CR}. However, sensitivity was significantly higher for the former than for the latter model (81% *v* 23%).

**What is already known on this topic**

The 20 m multistage shuttle run test (20mMST) is an acceptable field assessment tool for cardiorespiratory fitness but its original prediction model is subject to significant bias.

The theoretical basis of the EQ_{TT} model is advantageous in that it seeks to parallel the energy utilisation of the human body during the 20mMST and the TT,
rather than relying on statistical inference from a generally large and heterogeneous sample. The cohort consisted entirely
of males to avoid the well known phenomenon of severely biased (that is, nonsense or spurious) linear relationships attributed
to sample heterogeneity.^{13} This phenomenon has been demonstrated explicitly by Anderson^{23} who examined various factors associated with prediction power in the original 20mMST model. Anderson concluded that research
utilising large heterogeneous samples in the validation process of predictive tests of aerobic capacity must be suspect. It
seems reasonable to suggest that the prediction models developed using these procedures are rather generalised, representing
merely vague indicators of the true values. These hypotheses are supported in the present study by the reduced accuracy of the EQ_{MAS} prediction model compared with EQ_{TT}.

On another note, the present results are in line with previous studies suggesting increased energy demands during shuttle running compared to treadmill running.^{14,}^{15} This may well be attributed to differences in factors such as intensity, exercise mode, technique, and musculature employed between the two conditions. These factors should be considered when designing physical training programmes that incorporate shuttle running elements, including training for sports such as football, basketball, and rugby. In addition, the present results suggest that EC_{V} increases with body stature; various biomechanical complexities of shuttle running may account for this. The EQ_{MST} model developed herein to predict Vo_{2max} during the 20mMST can be used to calculate the oxygen transport demands of shuttle running when such information is required.

It is important to acknowledge, however, that the 20mMST is a test requiring maximal effort and therefore may not be suitable for populations with specific diseases. In addition, the novel EQ_{TT} model represents a strict means of assessing CF: three subjects with CF above the Vo_{2CR} in our cross validation sample were mis-screened as performing below it. Such strict screening may be beneficial in circumstances where adequate levels of CF are crucial (for example, military training). The applications of the present investigation would be further increased by calculating additional prediction models for both males and females of various age groups. It is also worth mentioning that the present results are subject to some variability among different models of metabolic carts.^{24} Within the limits of the present investigation, it is concluded that the developed models can be valuable tools that increase the efficacy of the 20mMST to discern subjects according to Vo_{2CR}.

**What this study adds**

The prediction models introduced in the present study increase the efficacy of 20mMST thus providing increased accuracy in evaluating aspects of health and fitness.