Purpose The authors evaluated the accuracy of three automated accelerometer wear-time estimation algorithms against self-report. Direct effects on sedentary time (<100 cpm) and indirect effects on moderate-to-vigorous physical activity (MVPA, ≥1952 cpm) time were examined.
Methods A subsample from the 2004/2005 Australian Diabetes, Obesity and Lifestyle Study (n=148) completed activity logs and wore accelerometers for a total of 987 days. A published algorithm that allows movement within non-wear periods (Algorithm 1) was compared with one that allows less movement (Algorithm 2) or no movement (Algorithm 3). Implications for population estimates were examined using 2003/2004 US National Health and Nutrition Examination Survey data.
Results Mean difference per day between the criterion and estimated wear time was negligible for all three algorithms (≤11 min), but 95% limits of agreement (LOA) were wide (±≥2 h). Respectively, the algorithms (1, 2 and 3) misclassified sedentary time as non-wear on 31.9%, 19.4% and 18% of days and misclassified non-wear time as sedentary on 42.8%, 43.7% and 51.3% of days. Use of Algorithm 2 (compared with Algorithm 1) affected population estimates of sedentary time (higher by 20 min/day) but not MVPA time. Agreement between Algorithms 1 and 2 was good for MVPA time (mean difference −0.08, LOA: −2.08, 1.91 min), but not for wear time or sedentary time.
Conclusion Accelerometer wear time can be estimated accurately on average; however, misclassification can be substantial for individuals. Algorithm choice affects estimates of sedentary time. Allowing very limited movement within non-wear periods can improve accuracy.
Accelerometers are increasingly used to provide valid, objective assessments of sedentary time1 and physical activity2 3 in free-living populations, including in large-scale population monitoring.4 5 Good correlation or agreement between accelerometer output and activity intensity or energy expenditure has been established6; however, other issues need to be considered, particularly how best to determine accelerometer wear time. Study protocols typically specify wearing the accelerometer during waking hours only, and removing for any water-based activities. Thus, wear time varies between days and between participants. Collecting self-report data can assist in determining wear time, but adds to participant and researcher burden. Automated wear-time estimation algorithms are, thus, especially desirable for large-scale studies, but are not standardised7 and their accuracy remains to be established.
Automated estimations classify prolonged periods of non-movement (eg, ≥60 min or ≥20 min at zero intensity) as non-wear time. However, non-moving periods could be either non-wear time or sedentary time. Estimations cannot detect non-wear periods that are shorter than the minimum length of the algorithm criteria (eg, <60 or <20 min) or of a higher intensity than the algorithm allows (eg, where failure of the accelerometer filtration process or external movement has occurred). To overcome this latter problem, spurious data can be defined and removed8; alternatively, some wear-time estimation algorithms1 allow a limited amount of movement (non-zero counts) to occur within a block of non-wear time.
In addition to the direct impact on sedentary time estimates, achieving the criteria for valid data can be affected by wear-time estimation. Thus, all accelerometer measures can be affected, even if they do not directly include very low-intensity activities in their calculation (eg, moderate-to-vigorous physical activity (MVPA) time). Studies often employ a minimum daily wear-time criterion (typically 10 h) and often require a minimum number of valid days (commonly ≥4). Potentially, bias can be introduced if the automated estimation process erroneously excludes days on which participants are most sedentary (and possibly also least physically active) and/or excludes the participants who are most sedentary (and least physically active).
Wear-time estimation algorithms produce varied results.7 8 Defining non-wear time as all blocks of non-movement ≥20 min, relative to ≥60 min, leads to lower estimates of wear time, lower sedentary time and higher average counts (a marker of overall activity intensity).8 One study concluded that ≥60 min was preferable to the ≥20 min criterion, but lacked a referent assessment.8 No studies that assessed automated estimation against a criterion method in free-living persons were found in database searches of PubMed and ProQuest (June 2010).
We examined the validity and compared the performance for adults of three automated accelerometer wear-time estimation algorithms. We also examined whether misclassification varied with sociodemographic characteristics or bodyweight. Finally, we examined the net impact of misclassification on population estimates of sedentary time, and also on MVPA time.
Automated estimation algorithms
All accelerometer data were analysed in SAS 9.1 via a program developed by the National Cancer Institute.4 The program's wear-time component (Algorithm 1) has been used to generate US population estimates of sedentary time1 and physical activity.5 The program was adapted to produce Algorithms 2 and 3. In view of the apparent advantage of longer (≥60 min) over shorter (≥20 min) criteria for identifying non-wear periods,8 we focused on algorithms that use the ≥60 min criterion, and compared the effect of the extent of interruptions they allowed within the non-wear period. Allowing more interruptions retains less spurious data but discards more true sedentary time; allowing fewer or no interruptions does the reverse. The interruptions allowed by the algorithms were:
▶ <100 cpm in intensity, with no more than two occurring consecutively (Algorithm 1);
▶ <50 cpm in intensity, with no more than two occurring per non-wear period (Algorithm 2) and
▶ no interruptions allowed, 0 cpm only (Algorithm 3).
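The three interruption rules above can be sketched in code. The following is a minimal illustration in Python, not the SAS program used in the study. It assumes 1-min epochs, a minimum non-wear length of 60 min, periods that begin and end on a zero-count minute, and a simple left-to-right scan that may miss an overlapping, longer candidate period. The function and parameter names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Interruption rules as read from the list above. The parameter names are
# hypothetical; the original program was written in SAS and is not shown here.
ALGORITHMS = {
    1: dict(max_intensity=100, max_consecutive=2, max_total=None),
    2: dict(max_intensity=50, max_consecutive=None, max_total=2),
    3: dict(max_intensity=0, max_consecutive=None, max_total=0),
}


def nonwear_periods(
    counts: Sequence[float],
    max_intensity: float,
    max_consecutive: Optional[int],
    max_total: Optional[int],
    min_length: int = 60,
) -> List[Tuple[int, int]]:
    """Return (start, end) minute indices, end exclusive, of non-wear periods."""
    periods = []
    start = last_zero = None
    consecutive = total = 0
    # The trailing sentinel is never an allowed interruption, so it closes
    # any period that is still open when the data end.
    for i, c in enumerate(list(counts) + [float("inf")]):
        if c == 0:
            if start is None:
                start, total = i, 0
            last_zero, consecutive = i, 0
            continue
        if start is None:
            continue  # non-zero minute outside any candidate period
        consecutive += 1
        total += 1
        allowed = (
            c < max_intensity
            and (max_consecutive is None or consecutive <= max_consecutive)
            and (max_total is None or total <= max_total)
        )
        if not allowed:
            # Periods start and end on a zero minute; keep those >= min_length.
            if last_zero - start + 1 >= min_length:
                periods.append((start, last_zero + 1))
            start = None
    return periods
```

For example, `nonwear_periods(counts, **ALGORITHMS[2])` applies the Algorithm 2 rule to one day of minute-by-minute counts. Minutes outside the returned periods would then be treated as wear time.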
We used data from a substudy of the 2004/2005 Australian Diabetes Obesity and Lifestyle Study (AusDiab)9 and the 2003/2004 US National Health and Nutrition Examination Surveys (NHANES; http://www.cdc.gov/nchs/nhanes.htm).10 Both studies obtained ethical approval from relevant parties and written informed consent from participants. For both studies, accelerometers (ActiGraph model 7164; ActiGraph LLC, Fort Walton Beach, Florida, USA) were set to record in 1-min epochs and participants were instructed to wear the accelerometer on the right hip for seven consecutive days during all waking hours, unless doing water-based activities.
AusDiab accelerometer substudy
Detailed methods for the AusDiab substudy are reported elsewhere.9 Participants (n=202) recorded the times that they wore the accelerometer (on/off times) in an activity log, plus times that they removed it for a period of 15 min or more. Data from 148 participants with 987 matching days for accelerometer wear and the activity log were available for analyses. Data were excluded if the accelerometer failed (n=6) or the participant withdrew (n=7), and from all observed days where the accelerometer was not worn or the activity log was poorly completed (n=41 participants, n=336 days). Specifically, data that were suspicious (eg, all on/off times occurred on the hour), ambiguous (eg, cannot be certain that the times recorded referred to AM or PM), or missing time or date information were excluded.
Determining non-wear time from activity logs and accelerometer
The beginning and end times of periods classed as non-wear by the algorithms were extracted and compared with the activity logs. Self-report was used as a criterion of whether participants were wearing the accelerometer or not at any given time; however, allowances were made for imprecision in self-reported times (of up to 30 min). The beginning and end times of our criterion non-wear periods were defined in one of three ways depending on the discrepancies between self-report times from the log and those derived from the algorithms (figure 1). If the discrepancy was less than 30 min (assumed to be imprecise time reporting), then the times identified by whichever algorithm most closely matched the activity log were used. If the discrepancy was 30 min or more, the algorithm was assumed to have failed and the times reported in the activity log were used. Times from the activity log were also used if the algorithms did not detect a reported non-wear period.
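The decision rule for a single criterion time point can be illustrated as follows. This is a hedged sketch only: it assumes times are expressed as minutes since midnight and that the rule is applied separately to each start or end time. The function name `criterion_time` is hypothetical and does not come from the study's software.

```python
from typing import Sequence


def criterion_time(
    log_time: float, algorithm_times: Sequence[float], tolerance: float = 30
) -> float:
    """Pick the criterion time (minutes since midnight) for one on/off event."""
    if not algorithm_times:
        # No algorithm detected the reported non-wear period: trust the log.
        return log_time
    # Algorithm time that most closely matches the activity log.
    closest = min(algorithm_times, key=lambda t: abs(t - log_time))
    if abs(closest - log_time) < tolerance:
        # Small discrepancy: assumed to be imprecise self-reported timing.
        return closest
    # Discrepancy of 30 min or more: the algorithm is assumed to have failed.
    return log_time
```

With a logged removal at 08:00 (480) and algorithm times of 470, 495 and 530, the sketch returns 470. With a single algorithm time of 520, it falls back to the logged 480.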
We calculated the overall agreement between the algorithms and the criterion measure in their assessment of each epoch as non-wear/wear. Although agreement with our imperfect criterion should not be interpreted in terms of diagnostic accuracy, we reported our results in terms of sensitivity and specificity instead of the traditionally used κ, as these statistics can distinguish between types of misclassification and are unaffected by the amount of time that is non-wear according to the criterion. The best combination of sensitivity and specificity was judged by highest Youden's J (sensitivity + specificity −1). In view of the non-independence of observations, we used a cluster bootstrap method to assess 95% CIs (STATA v11). To explore the consistency of the performance of the algorithms, we also calculated sensitivity and specificity for each day, and reported the range observed.
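As an illustration of these statistics, the sketch below computes sensitivity, specificity and J from epoch-level classifications, then resamples whole participants to obtain an interval. It is not the STATA procedure used in the study. It assumes that non-wear is treated as the positive class and that a simple percentile interval is acceptable; the type of bootstrap interval used in the analysis is not stated. All names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Each epoch is (criterion says non-wear, algorithm says non-wear).
Epochs = List[Tuple[bool, bool]]


def sens_spec_j(epochs: Epochs) -> Tuple[float, float, float]:
    """Sensitivity, specificity and J, with non-wear as the positive class."""
    tp = sum(1 for crit, est in epochs if crit and est)
    fn = sum(1 for crit, est in epochs if crit and not est)
    tn = sum(1 for crit, est in epochs if not crit and not est)
    fp = sum(1 for crit, est in epochs if not crit and est)
    # Assumes both classes are present; otherwise a division by zero occurs.
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, sens + spec - 1


def cluster_bootstrap_ci(
    by_participant: Dict[str, Epochs], n_boot: int = 1000, seed: int = 0
) -> Tuple[float, float]:
    """Percentile 95% interval for J, resampling participants, not epochs."""
    rng = random.Random(seed)
    ids = list(by_participant)
    js = []
    for _ in range(n_boot):
        # Draw participants with replacement, keeping each one's epochs together.
        sample = rng.choices(ids, k=len(ids))
        pooled = [e for pid in sample for e in by_participant[pid]]
        js.append(sens_spec_j(pooled)[2])
    js.sort()
    return js[int(0.025 * n_boot)], js[int(0.975 * n_boot) - 1]
```

Resampling at the participant level keeps all epochs from one person together, which is what makes the interval respect the non-independence of observations within a participant.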
We descriptively compared Algorithms 2 and 3 with Algorithm 1 in terms of the percentage of days in which misclassification occurred and the amount of misclassification for days in which misclassification occurred. Due to skewness, the latter were reported as medians and ranges. Because of the direct substitution between non-wear and sedentary time, we report wear time that was misclassified as non-wear and sedentary time that was misclassified as non-wear. Similarly, a failure to detect non-wear time was reported as non-wear time that was misclassified as sedentary. Using Bland–Altman analysis11 we also looked at mean differences and 95% limits of agreement (LOA) between estimated and criterion wear time for all three algorithms. To explore whether misclassification was systematic, we looked at bivariate associations of sociodemographic characteristics (age, gender, employment, education, income) and body mass index (BMI, kg/m2) with the amount of sedentary time misclassified as non-wear (0, <1 h, 1 to <2 h, 2 to <3 h or 3 h+) using generalised estimating equations (GEE; in view of the repeated measures).
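The Bland–Altman quantities referred to above can be computed as in the following minimal sketch. It uses the conventional definition (mean difference ± 1.96 SD of the differences) and ignores the repeated days per participant, so it is a simplification rather than the exact analysis performed. The function name is hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def bland_altman(
    estimated: Sequence[float], criterion: Sequence[float]
) -> Tuple[float, float, float]:
    """Mean difference and 95% limits of agreement (estimated minus criterion)."""
    diffs = [e - c for e, c in zip(estimated, criterion)]
    d = mean(diffs)
    s = stdev(diffs)  # sample standard deviation of the differences
    return d, d - 1.96 * s, d + 1.96 * s
```

A mean difference near zero with wide limits, as reported for wear time in this study, indicates good agreement on average but potentially large errors for individual days.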
National Health and Nutrition Examination Surveys
NHANES used a complex, multistage design and its 2003–2004 cycle included an accelerometer component10 for which all ambulatory participants at least 6 years of age who attended the Mobile Examination Centre (MEC) were eligible. Because we examined algorithm validity for adults, we focus only on data for adults (n=4741 MEC participants aged ≥20 years). To estimate the potential impact of algorithm choice on population estimates, we compared the algorithm that has, to date, been used on the NHANES data (Algorithm 1)1 5 with the algorithm that agreed most with the criterion. Daily values (minutes) and valid averages were calculated for wear time, total sedentary time (worn time of intensity <100 cpm)1 and MVPA time (worn time of intensity ≥1952 cpm).12 As with other analyses of the NHANES accelerometer data,1 5 valid averages include only data from monitors that were returned in calibration and from days with ≥10 h of wear time. We further removed days when excessively high counts were encountered (≥20 000 cpm) as these may indicate unreliable data.
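The daily measures and validity rules described above can be sketched as follows. This is an illustration under stated assumptions: counts are per-minute values for one day, `wear` is a per-minute flag produced by a wear-time algorithm, and the ≥20 000 cpm check is applied to all minutes of the day. The calibration check is omitted, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def summarise_day(counts: Sequence[float], wear: Sequence[bool]) -> Dict[str, object]:
    """Daily wear, sedentary and MVPA minutes, plus a validity flag."""
    worn = [c for c, w in zip(counts, wear) if w]
    wear_min = len(worn)
    return {
        "wear_min": wear_min,
        # Sedentary: worn minutes below 100 cpm.
        "sedentary_min": sum(c < 100 for c in worn),
        # MVPA: worn minutes at or above 1952 cpm.
        "mvpa_min": sum(c >= 1952 for c in worn),
        # Valid: at least 10 h of wear and no excessively high counts.
        "valid": wear_min >= 600 and max(counts, default=0) < 20000,
    }
```

Because sedentary minutes are counted only within worn time, any non-wear period that an algorithm fails to detect is added directly to the sedentary total, whereas MVPA minutes are affected only through the validity flag.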
The algorithms were compared descriptively in terms of the population averages estimated, and the valid sample of participants and days from which these averages were derived. Population figures were calculated using linearised methods and appropriate sample weights.13 Population averages were based on all available valid data. Bland–Altman analysis was used to examine agreement between the algorithms in their estimates of wear time (min/day), MVPA (min/day) and sedentary time (min/day, percentage of time worn and as min/day with correction for wear time by the residuals method14). Agreement was examined for 3078 participants with sufficient data for reliable estimates15; that is, at least 4 valid days according to both estimation algorithms.
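One common form of the residuals method regresses sedentary time on wear time and adds each participant's residual to the value predicted at mean wear time. The sketch below follows that form with a simple least-squares fit. It is an assumption that this matches the exact specification used in the analysis, and it ignores the survey weights. The function name is hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def residual_adjust(sedentary: Sequence[float], wear: Sequence[float]) -> List[float]:
    """Sedentary time corrected for wear time by the residuals method."""
    mx, my = mean(wear), mean(sedentary)
    sxx = sum((x - mx) ** 2 for x in wear)
    slope = sum((x - mx) * (y - my) for x, y in zip(wear, sedentary)) / sxx
    # Residual plus predicted value at mean wear time simplifies to
    # y - slope * (x - mean wear).
    return [y - slope * (x - mx) for x, y in zip(wear, sedentary)]
```

The corrected values keep the same mean as the raw sedentary time but are uncorrelated with wear time in the sample, so differences between algorithms that remain after correction are not simply due to differences in estimated wear time.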
AusDiab subsample: validity of automated estimation against diaries
Participants from the AusDiab accelerometer substudy were aged between 30 and 87 years (mean=54.2 years, SD 12). The sample included men and women diverse in age range, weight status and sociodemographic characteristics (table 1).
Agreement with the criterion was excellent for all three algorithms. On average, all three algorithms had high sensitivity and specificity (all >95%); however, the values observed for the days with least sensitivity (0–43%) indicate that performance was not consistently good (table 2). Algorithm 1 had the most sensitivity and the least specificity, Algorithm 3 had the least sensitivity and the most specificity, whereas Algorithm 2 showed the best balance of both, by a very small amount. When misclassification occurred, sedentary time was misclassified as non-wear for 72–78 min on average whereas only 44–50 min on average of non-wear time was misclassified as sedentary time.
Misclassification of sedentary time as non-wear occurred on approximately one-third (31.9%) of all observed days using Algorithm 1. Algorithm 2 reduced this to 19.4% without altering the percentage of days on which non-wear time was misclassified as sedentary time. Algorithm 3 also reduced the occurrence of misclassified sedentary time (to 18%) but at the same time increased misclassification of non-wear time as sedentary time (51.3% vs 42.8% for Algorithm 1).
For all algorithms, the mean differences between estimated and criterion wear time were negligible (≤11 min) but LOA were wide, spanning approximately 2–3 h. The LOA were wider for Algorithm 1 than the others (table 2). The Bland–Altman plots (figure 2) further indicated that all three algorithms tended to have more underestimation of wear time than overestimation, and that the most extreme underestimation occurred at lower values of wear time.
Correlates of misclassification
Significantly more misclassification occurred among overweight or obese participants compared with those of normal or underweight BMI (table 3). Based on the crude percentages (with differences of ≥5% considered noteworthy) and the GEE analysis, there was otherwise very little evidence that the amount of misclassification of sedentary time by the original method (Algorithm 1) varied across sociodemographic groups (table 3). Noteworthy but non-significant differences in crude percentages were seen only for participants who were working full-time and those in the 40–49-year age bracket, when compared with their respective counterparts.
NHANES: effect of algorithm choice on US population estimates
Table 4 compares estimates for the adult US population that result from using Algorithm 2 versus Algorithm 1. Compared with Algorithm 1, Algorithm 2 generated higher estimates of wear time on average and consequently classified 581 more observed days as valid, and more participants as having valid data (using either a 1-day or 4-day criterion). Mean population estimates of sedentary time were higher when using Algorithm 2 than Algorithm 1, even when correcting for wear time or when examining sedentary time as a percentage of worn time. The magnitude of the difference in estimates was modest: approximately 20 min or 1%. By contrast, the estimates of average time spent in MVPA were not affected by choice of wear-time algorithm, with the median [minimum, maximum] for Algorithms 1 and 2 being 18 [0, 215.5] and 17.6 [0, 215.5] min, respectively.
Figure 3 shows the Bland–Altman plots of agreement between Algorithms 1 and 2 for wear time, sedentary time and MVPA. Agreement between the algorithms was poor for wear time and for all measures of sedentary time. For wear time, sedentary time and percentage sedentary time, some heteroscedasticity was evident (ie, the amount of misclassification increased with the mean), so the Bland–Altman plots are displayed for the log-transformed data. The back-transformed mean difference (1.02) and LOA (0.94, 1.11) indicate that Algorithm 2 produced estimates of wear time that were 2% higher on average than Algorithm 1, and anywhere between 6% lower and 11% higher for 95% of people. Relative to Algorithm 1, Algorithm 2 also generated higher estimates of sedentary time (+4%, LOA −8%, +17%) and percentage worn time spent sedentary (+2%, LOA −3%, +7%). For corrected sedentary time, the mean difference was 20.4 min (LOA: −12.7, 53.5). Although the mean differences were small, the wide LOA and large outliers show that the two algorithms do not yield equivalent estimates of sedentary time or wear time. In contrast, there was a good agreement for MVPA estimated by the two algorithms (mean difference −0.08, LOA: −2.08, 1.91 min).
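The back-transformation used for the log-scale plots can be illustrated as follows. The sketch computes Bland–Altman limits on natural-log differences and exponentiates them, giving ratios that can be read as percentage differences. The use of natural logs and the function names are assumptions; any log base gives the same ratios after back-transformation.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def log_bland_altman(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float, float]:
    """Back-transformed mean ratio (a / b) and 95% limits of agreement."""
    diffs = [math.log(x) - math.log(y) for x, y in zip(a, b)]
    d, s = mean(diffs), stdev(diffs)
    return math.exp(d), math.exp(d - 1.96 * s), math.exp(d + 1.96 * s)


def ratio_to_percent(ratio: float) -> float:
    """Express a ratio as a percentage difference, e.g. 1.02 -> about +2%."""
    return (ratio - 1) * 100
```

Applied to the reported wear-time figures, `ratio_to_percent` maps 1.02, 0.94 and 1.11 to approximately +2%, −6% and +11%, which is how the ratios in the text are interpreted.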
All three wear-time estimation algorithms showed excellent agreement with the criterion. However, for some days, the algorithms had poor sensitivity and specificity and generated estimates that were incorrect by several hours. In addition to the problems with short periods of spurious data, long bouts of time were often misclassified. Reducing the amount of movement permitted within non-wear time (Algorithm 2) reduced the misclassification of sedentary time as non-wear without affecting the detection of non-wear time. By contrast, allowing no movement within non-wear periods (Algorithm 3) had a similar effect on misclassification of sedentary time, but at a cost of failing to detect ‘true’ non-wear time. One problem for all algorithms was true non-wear periods shorter than 60 min, which commonly occurred when the accelerometer was removed after 23:00. Overall, allowing very limited interruptions (ie, <50 cpm, no more than two per non-wear period) appeared optimal, although the benefit over no interruptions (ie, 0 cpm) was only slight.
It is possible that misclassification resulting from wear-time estimation algorithms is differential; however, our analysis of this issue was limited by the small sample. The association of misclassification with BMI suggests that the algorithms may perform better for normal-weight adults than for adults who are overweight or obese. One possible explanation is under-detection of movement for overweight or obese persons on whom accelerometers tend to sit at the wrong angle.16
Algorithm 1 was developed for surveillance purposes to estimate population means, primarily MVPA, in large-scale population studies, such as NHANES.4 The use of this versus Algorithm 2, which has also been used in analysing the NHANES data,17 affected the estimates of total wear time, and thus the number of valid days and participants included in analysis. This did not translate into an impact on MVPA time, and the impact on population estimates of sedentary time was modest. However, agreement in sedentary time estimates across algorithms was poor, even when wear time was supposedly ‘controlled’ either via the residuals method14 or by conversion to percentages. Overall, the implications for research studies are that algorithm choice may be of little importance when obtaining descriptions of population levels, but studies aiming to examine factors associated with sedentary time or to detect within-person change may be affected by misclassification.
This study adds to the relevant research literature7 8 by using a referent assessment method in free-living participants. Our findings complement those of a recent laboratory study, which established that several automated estimations with long minimum durations (60 or 90 min) perform adequately, particularly when allowing interruptions.17 There is no gold-standard criterion for free-living populations and although our criterion was not ideal, it was unlikely to favour any particular algorithm and was likely adequate for comparison purposes. The sample was not fully population representative,9 thus generalisability is not certain. Other populations including children, adolescents, young adults and people from various racial and ethnic backgrounds should also be examined. Given the rapid advances in accelerometer technology (which includes the collection of large amounts of raw data), appropriate algorithms also need to be validated for different epoch lengths and also for accelerometers using dual-axis or triaxial modes. This study used three ‘cut-points’ for allowable interruptions; further exploration may reveal better algorithms.18
Conclusions and recommendations
Automated accelerometer wear-time estimation has acceptable validity for adults for many purposes, with the better results achieved by allowing non-wear periods to contain very limited movement (Algorithm 2) rather than extensive interruptions (Algorithm 1). However, further achievable improvements are needed, particularly when accurate sedentary time measures are necessary, such as identifying and removing spurious data8 and reducing the failure to detect short non-wear periods (<60 min) by allowing non-wear bouts to continue past midnight.2 Estimation algorithms are a time-efficient and feasible option for large-scale population monitoring, but associated measurement error in sedentary time is substantial and needs consideration.
What is already known on this topic
▶ Wear-time estimation algorithms are resource-efficient and useful tools for large-scale accelerometer studies.
▶ Different wear-time estimation algorithms yield markedly varied results for wear time and some physical activity measures, for example, average accelerometer counts.
▶ There is no evidence on the accuracy of wear-time estimation algorithms against a criterion in free-living participants.
What this study adds
▶ This study provides evidence on the accuracy of three wear-time estimation algorithms against self-report in free-living participants. All had good accuracy on average, but could be incorrect by several hours on some observed days.
▶ This study shows that using different wear-time estimation algorithms affects estimates of sedentary time, even when sedentary time is corrected for wear time, but has minimal effect on estimates of time spent in moderate-to-vigorous physical activity.
Data from the AusDiab study were used (for full acknowledgments of the many funding sources, see ref 12). Data used in this study (NHANES) were collected by the National Center for Health Statistics, Centers for Disease Control and Prevention.
Funding BKC, PAG, GNH, EAHW and NO are supported by a Queensland Health Core Research Infrastructure grant and by NHMRC Program Grant funding (#569940). PAG is supported by a Heart Foundation of Australia (# PP 06B 2889). BKC is supported by an Australian Post-graduate Award. GNH is also supported by a NHMRC (#569861)/National Heart Foundation of Australia (PH 374 08B 3905) Postdoctoral Fellowship.
Competing interests None.
Ethics approval The study uses secondary data from two studies, each of which obtained ethical approval from relevant parties. AusDiab substudy: The University of Queensland, Ethics Committee of the International Diabetes Institute. NHANES (publicly available data) had ethical approval from the NHANES Institutional Review Board/NCHS Research Ethics Review Board.
Provenance and peer review Not commissioned; externally peer reviewed.