What statistical data of observational performance can tell us and what they cannot: the case of Dutee Chand v. AFI & IAAF
  1. Simon Franklin1,
  2. Jonathan Ospina Betancurt2,
  3. Silvia Camporesi3
  1. 1 Centre for Economic Performance, London School of Economics and Political Science, London, UK
  2. 2 Faculty of Health Sciences, Physical Activity and Sports Sciences, Isabel I University, Burgos, Spain
  3. 3 Department of Global Health & Social Medicine, King’s College London, London, UK
  1. Correspondence to Dr Simon Franklin, Centre for Economic Performance, London School of Economics and Political Science, London, UK; S.Franklin1{at}

How can performance data resolve the arbitration of sensitive matters in the world of sports? In the absence of experimental data (ie, clinical trials), researchers must build an argument based on correlations in observational data. Such data are often not widely available. The Dutee Chand v. AFI & IAAF case is a case in point (box).1


Background on the Dutee Chand v. AFI & IAAF case

The International Association of Athletics Federations (IAAF) Hyperandrogenism Regulations were in place from 1 May 2011 to 24 July 2015, when they were suspended by the Court of Arbitration for Sport (CAS). The regulations stated that female athletes who naturally produced testosterone at levels >10 nmol/L were not eligible to compete in the female category and needed to take androgen-suppressive drugs to resume competition. Dutee Chand, an Indian sprinter, was asked to abide by these regulations in July 2014 and appealed to CAS on the grounds that the regulations unfairly discriminated against women who naturally produced higher levels of testosterone. CAS was not satisfied with the evidence the IAAF provided and hence suspended the regulations on 24 July 2015, but allowed the IAAF up to 2 years (later extended) to submit additional evidence on the correlation between endogenous levels of testosterone and athletic performance. The regulations currently remain suspended until 19 July 2018.

Bermon and Garnier2 use correlations between free testosterone (fT) and athletic performance across 21 women’s events to claim that women with high fT have a performance advantage in a very specific subset of athletic events. On 29 September 2017, their data were filed as evidence by the International Association of Athletics Federations (IAAF) to the Court of Arbitration for Sport (CAS) in the Dutee Chand v. AFI & IAAF case. These data were submitted alongside draft revised regulations (not available in the public domain) that would apply only to female track events over distances of between 400 m and 1 mile.

The CAS Panel has made no ruling (as of 28 January 2018) about the sufficiency of the evidence put forward by the IAAF. According to a press release dated 19 January 2018, the regulations remain suspended for an additional 6 months, after which the IAAF is to advise CAS on how it intends to implement the regulations moving forward.3

We argue that the evidence put forward by Bermon and Garnier2 is not sufficient to sustain even draft revised regulations applying only to specific events. Our reanalysis of the available data presented by Bermon and Garnier2 suggests, at the very least, that further analysis is required to establish the claims made in the paper.

The application of statistical techniques, and interpretation of results, in such studies is not neutral, nor standardised; correlations in observational data require careful interpretation by independent researchers with access to the original data. Can the data used in Bermon and Garnier2 tell us something about whether testosterone confers an advantage in particular events, individually, or just whether there is an overall correlation across all events?

We argue that, given the sample sizes used for each event and the number of statistical tests conducted by the authors, any particular significant result in an event is more likely to have arisen by chance than its nominal P value suggests. Unfortunately, without publicly available raw data, it is not possible to perform all the desired robustness checks on the data. Lacking access to such data, we performed a Fisher’s combination test using the P values calculated from the published data. After performing such a test, we were unable to reject the global null hypothesis that all null hypotheses are true; that is, the pattern of P values is consistent with there being no advantage to high fT women in any one of the events. In simpler terms: it is reasonably likely that the correlations presented in the paper (even the largest ones) occurred by chance.
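To make the mechanics of a Fisher’s combination test concrete, the following is a minimal sketch in Python. It is not the code used for the reanalysis, and the P values in the example are hypothetical placeholders, not the published ones. The function name `fisher_combined_p` is our own. The method assumes the individual tests are independent, which may not hold if the same athletes appear in several events.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's combination test for k independent P values.

    Returns (statistic, combined_p), where the statistic is
    X = -2 * sum(ln p_i), which follows a chi-square distribution
    with 2k degrees of freedom under the global null hypothesis.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P values must lie in the interval (0, 1]")

    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)

    # For an even number of degrees of freedom (2k), the chi-square
    # survival function has a closed form:
    #   P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = statistic / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return statistic, math.exp(-half) * total


if __name__ == "__main__":
    # Hypothetical P values for illustration only (NOT the published data).
    example_p = [0.03, 0.20, 0.45, 0.61, 0.08, 0.90, 0.33]
    stat, combined = fisher_combined_p(example_p)
    print(f"Fisher statistic: {stat:.3f}, combined P value: {combined:.3f}")
```

A large combined P value means the global null hypothesis (no effect in any event) cannot be rejected, even if one or two of the individual P values fall below 0.05.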

Given the number of tests performed, the few significant findings detected could have arisen without there being a true correlation between testosterone and performance for female athletes. To guard against these chance findings (also known as false positives), appropriate multiple hypothesis testing corrections ought to be applied.
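The scale of the multiple testing problem can be illustrated with a short sketch. The text above does not say which correction is appropriate, so the Holm step-down procedure shown here is our assumption, chosen as one common option. The count of 21 tests (one per event) is likewise an assumption for illustration, the P values are hypothetical, and the function names are our own.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def holm_adjust(p_values):
    """Holm step-down adjusted P values, returned in the input order.

    The i-th smallest P value (i = 0, 1, ...) is multiplied by (m - i),
    capped at 1, and forced to be non-decreasing in rank order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Assumption: one test per event, 21 events, all independent.
    print(f"Family-wise error rate: {familywise_error_rate(21):.3f}")

    # Hypothetical P values for illustration only (NOT the published data).
    raw = [0.01, 0.04, 0.03, 0.20, 0.50]
    for p_raw, p_adj in zip(raw, holm_adjust(raw)):
        print(f"raw {p_raw:.2f} -> Holm-adjusted {p_adj:.2f}")
```

Under these assumptions, 21 independent tests at the 5% level have roughly a two in three chance of producing at least one false positive when no true effect exists. After adjustment, a result counts as significant only if its adjusted P value is below the chosen threshold.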

Our reanalysis of the data (see online supplementary web content for additional details) suggests that this correction would not yield a robust and significant correlation in any event. Given these findings, we believe that it is scientifically incorrect to draw the conclusions in the Bermon and Garnier2 paper from the statistical results presented. Their paper claims that certain athletes have an advantage in precisely the five events where a significant effect was found: we calculate that a high share of those five significant effects are likely to be false positives. The overall range of coefficients across all events is large: we estimate the average advantage for high testosterone women to be 0.7%, with a minimum of −2.6% and a maximum of 4.5% across the 21 events. Only in 12 (57%) of the events do higher fT athletes perform better on average. With access to the data, a more sensible test might be one single test of correlation conducted across all events.
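As a rough illustration of why 12 of 21 events favouring higher fT athletes is weak evidence, the count can be compared with what a fair coin would produce. The sketch below is a simple one-sided sign test. It is not part of the reanalysis, it ignores the size of each effect, and it assumes the events are independent, which may not hold. The function name is our own.

```python
from math import comb


def sign_test_upper_tail(successes, n):
    """One-sided sign test: probability of observing at least
    `successes` positive signs out of n when each sign is equally
    likely to be positive or negative (binomial with p = 0.5)."""
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    favourable = sum(comb(n, k) for k in range(successes, n + 1))
    return favourable / 2 ** n


if __name__ == "__main__":
    # 12 of 21 events show higher fT athletes performing better on average.
    p = sign_test_upper_tail(12, 21)
    print(f"P(at least 12 of 21 by chance) = {p:.3f}")
```

Under these assumptions, a split at least as favourable as 12 of 21 would arise by chance about one time in three, so the direction of the effects alone does not distinguish a real advantage from noise.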

Supplementary file 1

In light of our reanalysis, we conclude that

  1. Raw data used in such studies, which will have direct implications for real-world outcomes, should be made publicly available for other researchers to analyse.

  2. Interpretation of estimated correlations should also be conducted with great caution and be referred to independent statisticians. While we do not claim to play the role of such an independent statistical arbitrator in this case (especially since we have not had access to the raw data), our statistical analysis already allows one to conclude that, without further analysis, the article by Bermon and Garnier2 does not meet the standard of proof set by the CAS. Independent analysis is necessary in this situation, and in others like it.



  • Contributors SF performed the statistical reanalysis of Bermon and Garnier’s data. JOB provided original data from his doctoral dissertation. SC conceived the idea for the paper, coordinated the teamwork, and was responsible for ensuring coherence among the different parts and for the final draft of the manuscript. All authors provided feedback at all stages and approved the final version of the manuscript.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement The doctoral dissertation by JOB (in Spanish) is available for download here ( with username and password to be requested from the author.
