In this issue of BJSM, September et al1 report DNA variants within the COL5A1 gene among patients with Achilles tendinopathy (the cases) and among controls with no tendinopathy, matched for age and country of origin. Their findings suggest that a common DNA variation in the COL5A1 gene may be a risk factor for Achilles tendinopathy. Replication of their results among larger cohorts will be necessary to validate this finding.
It can be a challenge for the busy sports medicine practitioner to distil the clinical relevance of association studies such as this one. Although genetic testing for mendelian (single-gene) disorders is widely available in many countries, genetic testing for single-nucleotide polymorphisms (SNPs) is generally unavailable outside research laboratories. With certain notable exceptions (eg, the link between ApoE variants and cardiovascular and neurological disease),2-4 statistical associations between common SNPs and complex diseases have not been borne out by further study. Numerous pitfalls occur in both the design and interpretation of these studies, which has resulted in a poor track record for independent replication. Thus, a brief summary of these pitfalls may be useful to BJSM’s readership.
By way of background, there are just over 14 million SNPs known to exist in the human genome.5 Most of these exist in two possible forms, reflecting variation in the sequence that arose as a new mutation many generations ago. At some loci, any of three or even all four DNA bases (adenine, guanine, cytosine and thymine; A, G, C and T) may occur at measurable frequency in a population. Once a variant’s frequency reaches 1% of all alleles in the population, such a variant is no longer considered a “mutation,” but rather a “polymorphism.” Recent advances in the rapidity and cost-effectiveness of DNA sequencing technologies have enabled the assessment of such variants as risk factors for a broad range of medical conditions. Association studies have been used for many years to search for genes that predispose to aetiologically complex diseases such as obesity, type 2 diabetes, heart disease and adult-onset dementia. Case–control studies that examine SNPs at “candidate” genes are often used, in which a gene is selected as a candidate based on the plausibility of its involvement in a molecular pathway relevant to the disease under study.
One often-overlooked pitfall of such an approach is that of the a priori equivalence of the two SNP variants. For most SNPs, no obvious functional effect will be discernible from visual inspection of the DNA sequence. If the SNP occurs in a protein-coding sequence and terminates the protein prematurely, one can have reasonable confidence that it has a true functional effect. Unfortunately, our ability to use theoretical algorithms to predict the biological effect of a particular SNP is relatively poor, apart from this rare situation. Most SNPs occur outside of coding areas, and most (even coding SNPs) will not change the amount or activity of any gene product. From a prior hypothesis standpoint, then, either SNP allele is equally likely to confer disease risk. On a biological basis, however, only one variant allele can confer increased risk. If the risk factor were of any other type (smoking, viral exposure, etc.), this variant would have to be defined as “exposed” before the study was carried out.
Consider the example of a C or T polymorphism at a specific locus. A human subject can then have SNP genotypes CC, CT or TT. With no obvious reason to prefer the C or T allele as the “at-risk” allele, a statistical association with the C or with the T allele remains equally plausible. This is somewhat akin to a situation in which “exposure to virus” is equally believable as a risk factor for heart disease as “lack of exposure to virus.” Epidemiological studies concluding that “lack of exposure to virus” conferred risk for heart disease (equivalent to saying that exposure to virus protected against heart disease) would be rejected as flawed, unless exemplary in methodology and highly significant. Without a way to assess which SNP allele is likely to interfere with gene function until after the study is completed, roughly twice as many studies will achieve significance (at whatever p value is agreed upon) as “should” achieve it. This phenomenon of a priori equivalence acts to increase the false discovery rate if epidemiological data alone are considered. In this scenario, supportive experiments using in vitro systems such as transfected cells become very useful as an additional, independent line of evidence. In their article, September et al have identified two putative microRNA recognition sequences in the 3′ untranslated region of the COL5A1 gene. They suggest that suboptimum microRNA binding at these sites may promote increased translation of COL5A1 mRNA into protein, thereby interfering with healing. Laboratory studies that examine the binding of microRNAs to transcripts bearing these recognition sequences have yet to be performed, and would be useful to support the assertion that the identified SNPs truly affect COL5A1 biology in a meaningful way (fig 1).
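The doubling of significant results that follows from a priori equivalence can be shown with a small calculation. The sketch below is illustrative only: it assumes a test statistic that follows a standard normal distribution when the SNP has no true effect, and the function name and the 0.05 threshold are hypothetical choices that do not come from the study under discussion.

```python
from statistics import NormalDist


def false_positive_rates(alpha=0.05):
    """Return (prespecified, post_hoc) false-positive rates under the null.

    Hypothetical helper for illustration. Assumes the test statistic is
    standard normal when the SNP has no true effect on disease risk.
    """
    null = NormalDist()  # standard normal distribution
    critical = null.inv_cdf(1 - alpha)  # cut-off for a one-directional test
    upper_tail = 1 - null.cdf(critical)

    # Risk allele named before the study: only one tail counts as a "hit".
    prespecified = upper_tail
    # Risk allele chosen after seeing the data: either tail counts as a "hit".
    post_hoc = 2 * upper_tail
    return prespecified, post_hoc


pre, post = false_positive_rates(0.05)
print(f"risk allele named in advance:      {pre:.3f}")
print(f"risk allele chosen after the data: {post:.3f}")
```

With a threshold of 0.05, the chance of a spurious "significant" result is 0.05 when the at-risk allele is specified in advance and 0.10 when either allele may be declared at-risk once the data are in.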
Another pitfall is the fact that there are numerous potential sources of bias that may affect the SNP allele frequencies observed in the study population. Perhaps the most insidious of these is population substructure (also known as population stratification). Most human populations are genetically more heterogeneous than they appear at first glance. Thus, matching of cases and controls based on ethnic origin is a robust technique only in the context of populations that have minimal admixture (interbreeding) with other populations. Briefly, if two founding populations interbreed to create a third population, the genetic architecture of the new population will reflect the relative contributions of the different SNP allele frequencies that were present in each of the founding populations (fig 2). If the two source populations differ in the prevalence of the medical condition under study (such as Achilles tendinopathy), a statistical association may arise between the condition and one of the SNP alleles. This association would arise from the allele frequencies in the source populations multiplied by the proportion of each that contributed to the admixed population. Given that the genome has many possible variants, the resulting statistical association is unlikely to reflect biological causality, even if the SNP-bearing candidate gene has a high biological plausibility. Among admixed populations such as South African Caucasians (known from historical records to include genetic contributions from Dutch Afrikaner, British and other European populations) and Australian Caucasians (known from historical records to include significant genetic contributions from the British Isles and mainland Europe), spurious allelic associations can easily arise. This erodes the power to detect a significant effect of a SNP on a disease without bringing in additional evidence such as functional data.
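A worked example may make the mechanism of a spurious association concrete. The sketch below is a simplification under stated assumptions: it pools two subpopulations that have not fully mixed, it assumes that within each subpopulation the allele has no effect whatsoever on disease, and every number in it (population shares, allele frequencies, prevalences) is invented for illustration. The function name is hypothetical.

```python
def pooled_odds_ratio(populations):
    """Allelic odds ratio for disease in a sample pooled across subpopulations.

    populations: list of (share, allele_freq, prevalence) tuples, where
    share is the fraction of the pooled sample drawn from that subpopulation.
    Assumption: within each subpopulation the allele and the disease are
    statistically independent (the allele has no causal effect).
    """
    case_allele = sum(s * prev * f for s, f, prev in populations)
    case_other = sum(s * prev * (1 - f) for s, f, prev in populations)
    control_allele = sum(s * (1 - prev) * f for s, f, prev in populations)
    control_other = sum(s * (1 - prev) * (1 - f) for s, f, prev in populations)
    return (case_allele * control_other) / (case_other * control_allele)


# One homogeneous population: allele frequency 0.2, prevalence 5%.
single = pooled_odds_ratio([(1.0, 0.2, 0.05)])

# Two equally represented subpopulations that differ in both the
# allele frequency (0.2 vs 0.6) and the disease prevalence (5% vs 15%).
mixed = pooled_odds_ratio([(0.5, 0.2, 0.05), (0.5, 0.6, 0.15)])

print(f"odds ratio, homogeneous sample: {single:.2f}")
print(f"odds ratio, stratified sample:  {mixed:.2f}")
```

In the homogeneous sample the odds ratio is 1, as expected for an allele with no effect. In the stratified sample the same inert allele shows an odds ratio of about 1.57, purely because it is more common in the subpopulation with the higher disease prevalence.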
Debate exists regarding the actual magnitude of the bias introduced by population stratification, and whether this source of bias is significant when care is taken to match cases with controls from the same ancestral population (as was done by September et al). The topic of population stratification is usefully reviewed in an article by Cardon and Palmer, which also discusses genetic methods of detecting and quantifying stratification.6
When considering the problem of population stratification, it is worthwhile noting the significance of Hardy–Weinberg equilibrium (HWE) and of departures from it. Its presence is traditionally taken as evidence of random mating within a population over a period of time that is sufficient for the allele frequencies to equilibrate. If HWE is present, the proportion of homozygotes and heterozygotes for the alleles in question closely matches the proportion that would be predicted solely on the basis of the individual allele frequencies (ie, there is neither a relative excess nor a deficiency of a particular genotype). If HWE is not present, there may be natural selection favouring a particular genotype (such as that shown for heterozygous carriers of sickle-cell anaemia, thalassaemia and other haemoglobinopathies), but the most likely explanation for a lack of HWE is population substructure. Other possibilities include assortative mating (the tendency of individuals of like genotype to mate with each other) and errors in classification.7 In case–control studies such as that of September et al, artificial selection (the very process of ascertaining cases and controls from a sports medicine clinic) may result in some departure from HWE. The presence of HWE among the controls and its absence among the cases is reassuring, but is not in itself definitive proof that the statistical association of disease with SNP truly reflects a causative biological association. Such proof of causation must await replication studies among much larger populations.
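For readers who wish to see how a departure from HWE is assessed in practice, the following is a minimal sketch of the standard chi-square goodness-of-fit test for a two-allele SNP. It is not the analysis used by September et al; the function name and the genotype counts are hypothetical, and the sketch assumes that both alleles are present in the sample (otherwise an expected count would be zero).

```python
import math


def hwe_chi_square(n_cc, n_ct, n_tt):
    """Chi-square test (1 degree of freedom) for Hardy-Weinberg equilibrium.

    Takes observed genotype counts for a C/T SNP and returns
    (chi_square, p_value). Assumes both alleles occur in the sample.
    """
    n = n_cc + n_ct + n_tt
    p = (2 * n_cc + n_ct) / (2 * n)  # frequency of the C allele
    q = 1 - p                        # frequency of the T allele

    # Genotype counts predicted from the allele frequencies alone.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_cc, n_ct, n_tt)

    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Invented counts: the first sample matches HWE expectations exactly,
# the second shows a marked deficiency of heterozygotes.
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(40, 20, 40))
```

A deficiency of heterozygotes of the kind shown in the second sample is the pattern expected when two subpopulations with different allele frequencies are pooled, which is why a departure from HWE is often read as a warning sign of population substructure.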
Consensus guidelines for such replication studies have been published.8 Examples of high-quality studies include those recently published associating SNPs near the FTO and MC4R genes with obesity phenotypes.9 10 Even then, the highest degree of confidence will be achieved only when long-term prospective follow-up studies are completed alongside other in vitro work that elucidates the biological mechanism by which the SNP (or a closely linked DNA variant) influences cellular processes.
Because large studies require significant scientific and financial resources, it is worth considering whether other lines of genetic evidence might strengthen confidence in results obtained from case–control studies. Data from rare patients who present “extreme phenotypes” (including rare monogenic diseases) are useful in this regard. Precisely what is considered to be an “extreme phenotype” will vary from one disorder to the next. Broadly speaking, features such as earlier onset and the presence of multiple affected family members often indicate a higher “genetic load.” In the case of overuse tendinopathy, it might be possible to determine the prevalence of an at-risk allele among patients with early-onset, bilateral, severe and/or recurrent tendinopathy, as well as among those with multiple other tendinous sites affected. If there exists a gender difference in the prevalence of the disease under study, it might also be expected that the at-risk allele would be overrepresented among patients of the gender affected less often. In other words, if the disease is found more rarely among women than among men, women who do present it are likely to harbour additional risk factors that offset the protection ordinarily offered by their gender. Cases with a family history of tendinopathy could be particularly valuable in this regard, because when additional family members (both affected and unaffected) can be collected, it is possible to apply tests such as the transmission disequilibrium test (TDT),11 which tests the hypothesis that the at-risk allele will be transmitted more often to offspring who present the disease than to those who do not.
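The arithmetic behind the TDT is simple enough to sketch. In its classic form, the test considers parents who are heterozygous for the candidate allele and counts how often that allele is, or is not, passed to an affected child; under the null hypothesis each outcome is equally likely. The code below is a minimal illustration of that classic statistic, with a hypothetical function name and invented counts.

```python
import math


def tdt(transmitted, not_transmitted):
    """Classic transmission disequilibrium test statistic.

    transmitted / not_transmitted: number of times heterozygous parents
    did or did not pass the candidate allele to an affected offspring.
    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    b, c = transmitted, not_transmitted
    chi_square = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Invented example: 100 heterozygous parents, 60 of whom transmitted
# the candidate allele to their affected child.
chi_square, p_value = tdt(60, 40)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.3f}")
```

Because the comparison is made within families, each parent's untransmitted allele serves as its own matched control, which is what makes this family-based design less vulnerable to population stratification than a case–control comparison.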
Knowledge of rare mendelian disorders can also be beneficial. If a gene’s activity is critical for normal functioning, a complete loss-of-function mutation will almost invariably have a much more noticeable effect than that of a common variant such as an SNP. Were this not the case, the SNP itself would have been identified years ago and classified as a relatively common genetic disorder in its own right (as is arguably the case for ApoE). In the situation of COL5A1, loss-of-function mutations cause Ehlers–Danlos syndrome (EDS), which includes recurrent tendinopathy as a feature. Because EDS is rare, full loss-of-function mutations in COL5A1 are unlikely to account for a significant fraction of Achilles tendinopathy in the general population.12 When reading SNP association studies, consideration should be given to whether mutations in the candidate gene cause any known mendelian disorders, and whether the features of these disorders include the common disease under study. If patients with a well-recognised mendelian disorder do not show an increased prevalence of a common disease, it is unlikely that SNPs in or around that gene will confer a high population-attributable risk. Association studies that claim otherwise should be carefully scrutinised for potential sources of bias.
Using a similar rationale, data from transgenic mouse models can be incorporated into the selection of candidate genes for SNP association studies, and in the evaluation of their results. For example, mice deficient in the myostatin gene appear to have increased tendon fragility.13 This suggests that the study of SNPs in or near the MSTN gene for association with Achilles tendinopathy may yield interesting results. The same is true for growth differentiation factor-5 deficiency.14 15 Conversely, if transgenic mice missing a candidate gene do not have a particular disease, the likelihood that SNPs in or near that gene contribute to the disease in human populations is significantly reduced.
The hypothesis that genetic risk factors contribute to susceptibility to overuse injuries is highly plausible. Additional work needs to be carried out to identify specifically which genes and which SNPs (or other variants such as insertion/deletion polymorphisms) confer the greatest proportion of the genetic risk for these disorders. Ultimately, the clinical use of DNA-based testing to refine outcome predictions and/or modify rehabilitation regimens will have to wait until larger case–control studies and long-term follow-up studies have been performed.
Competing interests: none.