
PCA and Clustering Reveal Alternate mtDNA Phylogeny of N and M Clades


Abstract

Phylogenetic trees based on mtDNA polymorphisms are often used to infer the history of recent human migrations. However, there is no consensus on which method to use. Most methods make strong assumptions which may bias the choice of polymorphisms and impose computational complexity that limits the analysis to a few samples/polymorphisms. For example, parsimony minimizes the number of mutations, which biases the results toward minimizing homoplasy events. Such biases may miss the global structure of the polymorphisms altogether, with the risk of identifying a “common” polymorphism as ancient without an internal check on whether it is homoplasic or is identified as ancient because of sampling bias (from oversampling the population carrying the polymorphism). A signature of this problem is that different methods applied to the same data, or the same method applied to different datasets, result in different tree topologies. When the results of such analyses are combined, the consensus trees have a low internal branch consensus. We determine human mtDNA phylogeny from 1737 complete sequences using a new, direct method based on principal component analysis (PCA) and unsupervised consensus ensemble clustering. PCA identifies polymorphisms representing robust variations in the data, and consensus ensemble clustering creates stable haplogroup clusters. The tree is obtained from the bifurcating network obtained when the data are split into k = 2, 3, 4, …, k_max clusters, with equal sampling from each haplogroup. Our method assumes only that the data can be clustered into groups based on mutations; it is fast, stable to sample perturbation, uses all significant polymorphisms in the data, works for arbitrary sample sizes, and avoids sample-choice and haplogroup-size bias. The internal branches of our tree have a 90% consensus accuracy. In conclusion, our tree recreates the standard phylogeny of the N, M, L0/L1, L2, and L3 clades, confirming the African origin of modern humans and showing that the M and N clades arose in almost coincident migrations. However, the N clade haplogroups split along an East-West geographic divide, with a “European R clade” containing the haplogroups H, V, H/V, J, T, and U and a “Eurasian N subclade” including the haplogroups B, R5, F, A, N9, I, W, and X. The haplogroup pairs (N9a, N9b) and (M7a, M7b) within N and M are placed in non-nearest locations, in agreement with their expected large TMRCA from studies of their migrations into Japan. For comparison, we also construct consensus maximum likelihood, parsimony, neighbor joining, and UPGMA-based trees using the same polymorphisms and show that these methods give consistent results only for the clade tree; for recent branches, their consensus accuracy is in the range of 1–20%. From a comparison of our haplogroups to two chimpanzee and one bonobo sequences, and assuming a chimp-human coalescent time of 5 million years before present, we find a human mtDNA TMRCA of 206,000 ± 14,000 years before present.



Acknowledgments

G.B. and M.T. thank Dr. K. Shinoda for insight into the Eastern migrations of the N and M haplogroups and Dr. Cabrera for a critical reading of an early version of the manuscript. G.A. and G.B. acknowledge discussions with many colleagues at IBM Research, where this study was initiated in 2005, and at the Aspen Center for Physics in 2007, where it was concluded.


Correspondence to G. Bhanot.


G. Alexe and R. Vijaya Satya—Joint first authors.


Appendices

Appendix A: Method for Finding Robust Clusters Using Consensus Ensemble Clustering

Unsupervised clustering algorithms divide data into meaningful groups, or clusters, such that intracluster similarity is maximized and intercluster similarity is minimized [A1]. Clustering is an NP-hard problem, but many heuristic methods exist; they can be categorized into hierarchical, partitioning, and grid-based methods. We apply all of these methods to the data in an unsupervised way, i.e., without assuming predefined labels on the objects to be classified. Unsupervised clustering is known to produce unstable solutions that are sensitive to data parameters and/or perturbations and to the clustering techniques used [A2]. A relatively recent remedy for this instability is consensus ensemble clustering [A2, A3]: given several clusterings of the data, find a combination of them that is of higher quality than any individual solution.

This problem can be broken into two parts: a method that generates a collection of clustering solutions, and a consensus function that combines them into a single output clustering of the data. The implicit assumption is that combining the results of several clustering techniques gives groupings that are more reliable and less biased toward any particular technique. This has been demonstrated for supervised classification schemes, where it was shown that combining multiple solutions can reduce the variance of the error and, at the same time, increase the robustness of the result [A4, A5]. The ensemble clustering technique was introduced in [A2], and the effects of consensus clustering were described in several subsequent studies [A4, A6, A7].

In our study the challenge posed to the ensemble consensus clustering approach was to identify meaningful clusters which were stable and robust both to perturbations of the data and to the choice of clustering methods used. This goal was approached in two ways.

  1. If the method was stochastic, we reduced the effect of stochastic variation by applying the method repeatedly and taking an appropriate average.

  2. To reduce sensitivity to random variation in the data, we applied each clustering method to multiple sample datasets obtained by bootstrapping both the mutations used in the clustering and the samples.

Clustering High-Dimensional Data

Many mutations occur only in individual samples and merely add noise to the data, and many mutations travel together; to correct for both effects, we cluster on subspaces of attributes rather than on the entire space. The subset of attributes (mtSNPs) on which data are clustered may have an important influence on the clustering solution. Since mtDNA data are high dimensional, we restricted the clustering procedures to those attributes which were determined to be “discriminatory” through an initial principal component analysis (PCA).

The details of our clustering method are as follows.

Step 1

For each k = 2, …, 50, we created k clusters on resampled and randomly projected datasets using the individual clustering methods. We generated 150 datasets as follows: 50 datasets were created by bootstrapping the samples, 50 by projecting the data onto subsets of mtSNPs bootstrapped from the data, and 50 by first projecting the data onto a bootstrapped subset of mtSNPs and then bootstrapping samples from the resulting dataset (a sketch follows). We then applied representative methods for each major class of known clustering techniques, discussed briefly below.
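
A minimal numpy sketch of this resampling scheme, assuming the input is a binary samples × mtSNPs matrix X (the function name, seed, and per-type counts are illustrative):

```python
import numpy as np

def make_resampled_datasets(X, n_each=50, seed=0):
    # Generate the 150 perturbed datasets of Step 1: 50 by bootstrapping
    # samples (rows), 50 by projecting onto a bootstrapped subset of
    # mtSNPs (columns), and 50 by doing both.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    datasets = []
    for _ in range(n_each):                      # bootstrap samples
        datasets.append(X[rng.integers(0, n, size=n), :])
    for _ in range(n_each):                      # bootstrap mtSNPs
        cols = np.unique(rng.integers(0, p, size=p))
        datasets.append(X[:, cols])
    for _ in range(n_each):                      # both
        cols = np.unique(rng.integers(0, p, size=p))
        rows = rng.integers(0, n, size=n)
        datasets.append(X[rows][:, cols])
    return datasets
```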

Partitioning Relocation Methods

These methods divide data into several subsets and use certain greedy heuristics in the form of iterative optimization to reassign points between the k clusters. The optimization is applied to an objective function defined on unique cluster representatives (e.g., centroid, medoid), which is usually a dissimilarity measure.

We applied the following algorithms.

  i. Partition around medoids (PAM) [A8]: an iterative optimization that relocates points between prospective clusters by renominating points as potential medoids.

  ii. CLARA [A8]: draws several samples of the data and subjects each of them to PAM. The full dataset is then assigned to the resulting medoids, and the best system of medoids is retained.

  iii. K-means [A9]: associates with each cluster the mean (centroid) of its points and uses as the objective function the sum of the distances between each point and its centroid.

  iv. Graph partitioning [A10]: the points (samples) are associated with vertices in a graph, and each point is connected to its closest neighbor. The resulting graph is then split into k clusters by a min-cut approach.

Centroid-based methods (k-means, PAM, CLARA) assign samples to clusters that form spheroidal groups. Thus, they are suitable for clustering datasets with uniform and relatively low variation among samples. Graph partitioning methods add a sample to a cluster if it is “close” to at least one sample already in the candidate cluster; they can therefore identify clusters with unequal variance along the feature coordinates (i.e., they can find “long” shapes).

Agglomerative Methods

These methods build clusters gradually, establishing a hierarchical order [A1]. One starts by assigning each sample to its own cluster and then recursively merges the most similar clusters until a stopping criterion is fulfilled. The similarity between clusters is usually computed from a linkage metric which reflects the connectivity and similarity between the clusters. In our study we applied hierarchical clustering techniques based on the following metrics (a code sketch follows the list).

  • Average linkage metric: Computes the distance between two clusters as the average of the distances between the pairs of points in these clusters.

  • Complete linkage metric: Computes the distance between two clusters as the maximum distance between the pairs of points in the two clusters.

  • Single linkage metric: Computes the distance between two clusters as the minimum distance between the pairs of points in the two clusters.

  • McQuitty metric: Computes the distance between two clusters as the average distance between the subclusters of the two clusters.

  • Centroid metric: Computes the distance between two clusters as the distance between the centroids of the two clusters.

  • Ward metric: Computes the distance between two clusters as the squared distance between their centroids, weighted by n_A n_B/(n_A + n_B) for clusters of sizes n_A and n_B; equivalently, each merge is chosen to minimize the increase in within-cluster variance.
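
These linkage rules map directly onto standard hierarchical-clustering libraries. The sketch below uses scipy (our choice of library, not the paper's); note that scipy's 'weighted' method corresponds to the McQuitty metric:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def agglomerative_clusters(X, k, method="average"):
    # `method` may be 'single', 'complete', 'average', 'weighted'
    # (McQuitty), 'centroid', or 'ward'; the last two assume Euclidean
    # input distances. The dendrogram is cut into exactly k clusters.
    Z = linkage(pdist(X, metric="euclidean"), method=method)
    return fcluster(Z, t=k, criterion="maxclust")
```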

In addition, we applied a hybrid biased agglomerative method (bagglo) which combines partitional clustering with the agglomerative hierarchical approach [A11]. For n samples, we start with an initial partition into n^(1/2) clusters and augment the original feature space by adding n^(1/2) dimensions corresponding to the initial clusters. The agglomerative clustering approach is then applied to this augmented dataset.

Methods Based on Probability

In these methods, the data are treated as a sample drawn independently from a mixture of several probability distributions, and each cluster is associated with the region around the mean of one distribution. Each point is assumed to belong to a unique cluster. Probabilistic clustering optimizes the log-likelihood of the data under the mixture model.

In our approach we applied the expectation-maximization (EM) method [A12, A13]. EM alternates two steps: an expectation step, which estimates for each point the probability of belonging to each cluster, and a maximization step, which updates the mixture-model parameters to increase the log-likelihood. The two steps are iterated until convergence.
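
The appendix cites EM generically [A12, A13]; as one concrete, illustrative stand-in, scikit-learn's GaussianMixture fits a k-component mixture by EM and hard-assigns each sample:

```python
from sklearn.mixture import GaussianMixture

def em_clusters(X, k, seed=0):
    # Fit a k-component Gaussian mixture by EM and assign each sample
    # to its most probable component (hard assignment, as in the text).
    gm = GaussianMixture(n_components=k, covariance_type="diag",
                         random_state=seed).fit(X)
    return gm.predict(X)
```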

Entropy-Based Clustering (ENCLUST)

This method [A14] starts by dividing the interval associated with each attribute into one-dimensional bins (cells) and retaining only the cells with high density. The iterative step creates higher-dimensional cells by joining lower-dimensional ones, retaining only those cells whose entropy falls below a threshold deemed optimal for clustering.

Clustering on Subsets of Attributes (COSA)

This method [A15] uses an idea similar to that in ENCLUST by preferentially clustering on subsets of attributes with low variability across the samples in the clusters.

Self-Organizing Maps (SOM)

This method [A16] is both a data-visualization and a clustering technique: it reduces the dimensionality of the data using self-organizing neural networks, producing a map of (usually) one or two dimensions that reflects the similarity structure of the data by grouping similar data points together.

Step 2

Each method was applied 50 times with different parameter initializations on the full dataset and once on each of the 150 datasets obtained as described in Step 1. From the resulting 200 clusterings, we constructed an agreement matrix for each method whose entries m_ij represent the fraction of times the pair of samples (i, j) occurred in the same cluster, out of the total number of times the pair was selected across the 200 datasets.

Using d_ij = 1 − m_ij as the distance between samples (i, j), we apply simulated annealing [A17] to find k “consensus” clusters which maximize the average internal similarity and the average external dissimilarity. The cost function minimized by the simulated annealing is given below.

$$ \frac{{\sum\limits_{l} {\sqrt {\sum\limits_{{i,j \in C_{l} }} {d_{ij} } } } }}{{\sum\limits_{l} {n_{l} \tfrac{{\sum\limits_{{i \in C_{l} ,j \in S}} {d_{ij} } }}{{\sqrt {\sum\limits_{{i,j \in C_{l} }} {d_{ij} } } }}} }} $$
(1)

where d_ij represents the distance between samples (i, j), C_1, …, C_k are the k clusters to be determined, n_l is the size of cluster C_l (l = 1, …, k), and S is the set of all samples.
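
A minimal sketch of this optimization, assuming a precomputed n × n distance matrix D with entries d_ij = 1 − m_ij (the cooling schedule, move set, and function names are our illustrative choices, not the paper's):

```python
import numpy as np

def consensus_cost(D, labels, eps=1e-12):
    # Implements Eq. (1): D is the n x n matrix of distances
    # d_ij = 1 - m_ij; `labels` assigns each sample to one of k clusters.
    num, den = 0.0, 0.0
    for l in np.unique(labels):
        members = np.flatnonzero(labels == l)
        internal = max(np.sqrt(D[np.ix_(members, members)].sum()), eps)
        external = D[members, :].sum()   # i in C_l, j in S (all samples)
        num += internal
        den += len(members) * external / internal
    return num / den

def anneal_labels(D, k, steps=20000, t0=1.0, seed=0):
    # A toy Metropolis search over cluster assignments; lower cost means
    # tighter clusters relative to their external dissimilarity.
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=D.shape[0])
    cost = consensus_cost(D, labels)
    for step in range(steps):
        t = max(t0 * (1.0 - step / steps), 1e-3)   # linear cooling
        i = rng.integers(D.shape[0])
        old = labels[i]
        labels[i] = rng.integers(0, k)             # propose a move
        new = consensus_cost(D, labels)
        if new <= cost or rng.random() < np.exp((cost - new) / t):
            cost = new                             # accept
        else:
            labels[i] = old                        # reject and restore
    return labels, cost
```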

At the end of this step, each method will give us its best clustering into k clusters.

Step 3

For each k, we combine the results from Step 2 by using an agreement matrix to create a consensus of all the individual clustering techniques and, once again, use simulated annealing to optimize the clustering.

In comparison with the traditional methods which use a single clustering technique, the consensus ensemble clustering approach, in combination with PCA, has better average performance across datasets and a lower sensitivity to noise, outliers, and sampling variation.

References A

A1. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

A2. Strehl A, Ghosh J (2002) Cluster ensembles: a knowledge reuse framework for combining partitionings. In: Eighteenth National Conference on Artificial Intelligence, Edmonton, Alberta, Canada, July 28–August 1, 2002, pp 93–98

A3. Monti S, Tamayo P, Mesirov JP, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52(1–2):91–118

A4. Alexe G, Alexe S, Crama Y, Foldes S, Hammer PL, Simeone B (2004) Consensus algorithms for the generation of all maximal bicliques. Discrete Appl Math 145(1):11–21

A5. Prodromidis AL, Stolfo SJ (1999) A comparative evaluation of meta-learning strategies over large and distributed data sets. In: Workshop on Meta-Learning, Sixteenth International Conference on Machine Learning

A6. Topchy A, Jain AK, Punch W (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12):1866–1876

A7. Topchy A, Minaei-Bidgoli B, Jain AK, Punch WF (2004) Adaptive clustering ensembles. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), pp 272–275

A8. Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis, 1st edn. Wiley, New York

A9. Hartigan JA (1975) Clustering algorithms. Wiley, New York

A10. Karypis G, Kumar V (1995) Multilevel graph partitioning schemes. In: Proceedings of the 24th International Conference on Parallel Processing. CRC Press, New York, pp 113–122

A11. Rasmussen M, Karypis G (2004) gCLUTO: an interactive clustering, visualization, and analysis system. University of Minnesota Technical Report UMN-CS TR-04-021

A12. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Stat Soc Ser B 39:1–38

A13. Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611–631

A14. Cheng CH, Fu AW, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-99), San Diego, CA

A15. Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes. J Roy Stat Soc Ser B 66:1–25

A16. Kohonen T (2001) Self-organizing maps, vol 30. Springer, New York

A17. Kirkpatrick S, Gelatt C, Vecchi M (1983) Optimization by simulated annealing. Science 220(4598):671–680

Appendix B: Building Consensus Trees Using Maximum Parsimony, Maximum Likelihood, Neighbor Joining, and UPGMA

Preprocessing

There were a total of 1737 complete mtDNA sequences, classified into 61 haplogroups. We removed samples classified into the six bulk clusters: B*, BULK_D*, BULK_M*, BULK_N*, H*, and U*. The number of samples was reduced to 1222 and the number of polymorphisms was reduced to 2321.

Building the Consensus Clade Trees

We built one consensus tree for the five clades (L0/L1, L2, L3, M, and N) for each of the four methods: maximum parsimony (MP), maximum likelihood (ML), neighbor joining (NJ), and unweighted pair group method with arithmetic mean (UPGMA).

We created 100 data sets, each consisting of one sample picked randomly from each clade. The polymorphisms used were the same 34 mtSNPs identified by the cluster analysis on the clades (Supplementary Table STII).

Procedure

We used the Phylip software package [B.1] to construct the trees; Table 1 lists the Phylip programs used. For each of the four methods, one tree was created for each of the 100 datasets, and a consensus tree (see Appendix B Fig. 7a–d) was obtained for each method by combining the 100 trees with Phylip's consensus program (consense). Each branch is labeled with the percentage of datasets in which the corresponding split appeared, as shown in the figures; for each branch, the algorithm automatically reports the split with maximum support.
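
For readers without Phylip, the majority-rule consensus step can be sketched with Biopython instead (our substitution for Phylip's consense, not the paper's toolchain; the file names are hypothetical):

```python
from Bio import Phylo
from Bio.Phylo.Consensus import majority_consensus

# Read the 100 bootstrap trees (hypothetical file) and build a
# majority-rule consensus, analogous to Phylip's consense.
trees = list(Phylo.parse("bootstrap_trees.nwk", "newick"))
consensus = majority_consensus(trees, cutoff=0.5)
Phylo.write(consensus, "consensus_tree.nwk", "newick")
```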

Fig. 7

(a) The reliable (high bootstrap agreement) and unreliable (low bootstrap agreement) parts of the MP clade tree, estimated from the consensus fraction on branches across bootstrap replicates in this consensus tree. The tree was constructed from 100 bootstrap datasets, each with one sample from each clade, using the 34 mtSNPs listed in Supplementary Table STII, and building the consensus tree from them. The older branches have 100% consensus across bootstrap replicates and may be considered robust and reliable. However, the sequence in which the L3/M/N clades split is not determined to any reliable accuracy; in 67% of the bootstrap results, the MP algorithm reported this split as a trifurcation. (b) The corresponding consensus ML clade tree; once again, the oldest branches are reliable, while the L3/M/N clade split is not. (c) The corresponding consensus NJ clade tree; once again, the oldest branches are reliable, while the L3/M/N clade split is not. (d) The corresponding consensus UPGMA clade tree. Although the oldest branches are more accurate than the L3/M/N split, UPGMA is the only one of the four methods used here which reliably suggests that the N migration preceded the M migration. However, as the other methods do not resolve this split reliably, we can only conclude that the L3/M/N split should be shown as a trifurcation

Table 1 The four Phylip programs used in this analysis

Discussion

While the accuracy of the L0/L1 and L2 branches was always 100%, the frequency with which each possible resolution of the M/N/L3 split appeared is reported in Table 2.

Table 2 Frequencies for the four ways of resolving the M/N/L3 trifurcation for each method: MP, ML, NJ, and UPGMA

Building Complete Consensus Trees

We built a consensus tree for all 55 haplogroups with each of the four phylogenetic tree building methods: MP, ML, NJ, and UPGMA.

We created 100 datasets by selecting one sample from each of the 55 haplogroups that remained after eliminating the six bulk clusters: B*, BULK_D*, BULK_M*, BULK_N*, H*, and U*. There were 869 polymorphisms left from the combined set of polymorphisms identified by the global PCA as well as the PCA for N, M, L. All these were used to build 100 trees for each of MP, ML, NJ, and UPGMA. From these, consensus trees were obtained for each of the methods. They are shown in Fig. 8a–d. The MP algorithm generated many trees with the same optimum weight for the same dataset. All trees with the same weight were first combined for each dataset before they were combined across datasets. We used the tree plotting software TreeGraph [B.2] to draw the trees.

Fig. 8

(a) The consensus MP tree obtained using 869 mtSNPs from the union of all polymorphisms identified by PCA and clustering for the clades and the M, N, and L haplogroups. The tree shown is the consensus over trees from 100 datasets, each created by selecting one sample at random from each of the 55 haplogroups shown. The branch labels are a measure of the reliability of each branch, estimated as the fraction of sampled datasets in which the branch splits the downstream haplogroups into the sets shown in the tree. Ancient branches, corresponding to the clade tree in Fig. 2, are reliably reproduced, as are some recent branches; the middle branches have lower reliability, with branch accuracies of <10% in many cases, making the overall reliability of the tree very low. (b–d) The corresponding consensus ML, NJ, and UPGMA trees, respectively, constructed and labeled in the same way; the same pattern of reliable ancient branches and unreliable middle branches holds for all three

References B

B.1. Felsenstein J (2004) PHYLIP (Phylogeny Inference Package), version 3.6. University of Washington, Seattle

B.2. Müller J, Müller K (2004) TreeGraph: automated drawing of complex tree figures using an extensible tree description format. Mol Ecol Notes 4:786–788

Appendix C: Methods to Select the Root of the Network

In this paper, we used two methods to root the networks identified by our procedure. The first uses the mtDNA sequence of an “outgroup” species (such as chimpanzee or bonobo): the root is identified as the internal node of the network that minimizes the number of loci, over a robust subset of mtSNPs, at which the sequence at the internal node differs from the outgroup sequence. The second defines the root as the internal node that is equidistant from the leaves across all possible trees, treating the number of polymorphisms on each internal branch as one instantiation of a Poisson process.

Method I: Rooting with Respect to an Outgroup

The first method is standard and depends only on the availability of an appropriate outgroup sequence. In our case, we used the consensus of the two chimpanzee sequences and one bonobo sequence in Supplementary Table STI and limited the analysis to the 435 robust mtSNPs used in labeling the tree (this list of mtSNPs is given in Supplementary Table STIV). For illustration, we consider three internal nodes, R1, R2, and R3. In Fig. 2, R1 is the node at which the L0/L1 superclade splits from the rest of the tree, R2 is the node at which the L2 superclade splits from the rest of the tree, and R3 is the node separating the L0/L1/L2 subtree from the rest of the tree. The number of mtSNPs that differ between these internal nodes and the chimp-bonobo consensus sequence is DR1 = 111, DR2 = 118, and DR3 = 121, which identifies R1 as the root of the tree.

Method II: Rooting Using Poisson Statistics for the Number of Mutations on Edges

The second method is novel and is based on the fundamental observation that the same amount of time has elapsed from the root to each leaf. Hence, if the mutation rate is assumed to be fixed (as it is here), then the number of mtSNPs from the root to all leaves must be (approximately) the same. We also impose the constraint that the root be the most probable choice to satisfy this criterion (equal distance from leaves) across all possible evolutionary scenarios on the internal branches, as described in greater detail below.

If all loci are equally likely to mutate and the mutation rate is low, the number of mtSNPs on an internal branch is a Poisson variable with a Poisson parameter proportional to the time spanned by the branch, and the observed number of mtSNPs on the branch is an unbiased estimator of this parameter. To find the root, we created a number of equivalent evolutionary networks by simulating these Poisson variables, using the observed number of mtSNPs on each edge as the Poisson parameter. The distance D(R → Li) of a leaf Li from a node R is the sum of the numbers of mtSNPs labeling the edges on the path from R to Li. We simulated 1000 evolutionary scenarios, and for each scenario we computed the distances D(R → Li) for all paths from each possible internal root to the leaves. For each scenario, we then computed the mean and standard deviation (SD) of D over these paths, and from these, the distribution of SDs over the 1000 evolutionary scenarios for each internal node. Figure 9 shows the distribution of SD for the three internal nodes R1, R2, and R3. If we make the reasonable assumption that the best root is the one with the lowest possible SD averaged over all possible evolutionary scenarios, then it is clear from Fig. 9 that, once again, R1 is the preferred root.
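
A compact simulation of this procedure, assuming the network is supplied as an adjacency map from each node to its (neighbour, observed mtSNP count) pairs (all names below are illustrative, not from the paper's code):

```python
import numpy as np

def root_score(adj, root, leaves, n_scenarios=1000, seed=0):
    # Method II sketch: adj maps node -> [(neighbour, observed count)].
    # Each scenario redraws every edge count as Poisson(observed) and
    # records the SD of the root-to-leaf path lengths.
    rng = np.random.default_rng(seed)
    obs = {frozenset((u, v)): w
           for u, nbrs in adj.items() for v, w in nbrs}
    sds = []
    for _ in range(n_scenarios):
        sampled = {e: rng.poisson(w) for e, w in obs.items()}
        dist, stack, seen = {root: 0}, [root], {root}
        while stack:  # depth-first accumulation of path lengths
            u = stack.pop()
            for v, _ in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    dist[v] = dist[u] + sampled[frozenset((u, v))]
                    stack.append(v)
        sds.append(np.std([dist[leaf] for leaf in leaves]))
    return float(np.mean(sds))  # the preferred root minimizes this
```

Evaluating root_score at each candidate internal node (here R1, R2, and R3) and choosing the node with the smallest mean SD reproduces the selection criterion described above.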

Fig. 9

Standard deviation (SD) of the distances from different roots. Distribution of the SDs of the distances from three possible roots to the leaves over 1000 possible evolutionary scenarios, generated by considering the number of mutations on the internal edges as a Poisson variable with Poisson parameter equal to the observed number of mutations on the edge

Although both methods give the same result for the root of the mtDNA tree, we prefer the second method, because the averaging over equally probable evolutionary scenarios provides an additional measure of confidence in the identification.

Appendix D: Software to Reproduce Figs. 1 and 2 and Supplementary Table STII

The software is described below and is available for download at: https://biomaps.rutgers.edu/wiki/upload/9/93/MtDNA_utility.tar.gz.

The code uses Python (http://www.python.org) with C extensions. The README file gives specific instructions on how to install the libraries and compile the code. The code takes as input 1737 aligned mtDNA sequences over 3177 polymorphic loci and from them creates a binary “mutation matrix” B, whose rows represent samples and whose columns represent mtDNA loci. Each entry is assigned the value 1 if the nucleotide matches rCRS and the value 0 if it does not. The process begins by calling migration.py, which uses parsers.ParseSTI to read Table STI.txt. This parsing class creates a SampleData object containing each of the 1737 samples, their aligned sequences over the 3177 polymorphic loci, and their haplogroups. Each sample is then compared locus by locus to the rCRS sequence to create the binary matrix B. Next, the matrix is centered so that each column has zero mean, using a call to pca.py; this class also performs a singular value decomposition of the matrix, and the first two eigenvectors, corresponding to the largest eigenvalues, are used to generate a plot (saved to disk via display.py) of each sample in principal-component space. This plot corresponds to Fig. 1 in the text. From this plot, clade membership is assigned using predefined cut-points on the coordinates.
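
The mutation-matrix and PCA steps reduce to a few lines of numpy; a sketch under the encoding described above (function names are ours, not those in the released code):

```python
import numpy as np

def mutation_matrix(seqs, rcrs):
    # Binary matrix B: rows are samples, columns are polymorphic loci;
    # 1 where the sample's nucleotide matches rCRS, 0 otherwise.
    return np.array([[1 if s[j] == rcrs[j] else 0
                      for j in range(len(rcrs))] for s in seqs])

def pc_scores(B, n_components=2):
    # Mean-center the columns, then take principal-component scores
    # from the SVD; the first two columns give the Fig. 1 coordinates.
    Bc = B - B.mean(axis=0)
    U, s, Vt = np.linalg.svd(Bc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]
```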

Once clade membership is assigned, migration.py performs a frequency analysis of the mtSNPs in each clade. An mtSNP is selected if it appears in at least 95% of the samples in one clade and in <5% of the samples in every other clade. This identifies the 34 mtSNPs given in Supplementary Table STII; a sketch of the filter follows.
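
A sketch of this frequency filter, assuming M is a samples × loci 0/1 matrix in which 1 means the sample carries the mtSNP (the encoding orientation and the names are our assumptions):

```python
import numpy as np

def discriminating_snps(M, clades, hi=0.95, lo=0.05):
    # Keep loci present in >= 95% of one clade's samples and in < 5%
    # of the samples of every other clade.
    clades = np.asarray(clades)
    labels = np.unique(clades)
    freq = np.vstack([M[clades == c].mean(axis=0) for c in labels])
    keep = []
    for j in range(M.shape[1]):
        f = freq[:, j]
        # second-largest frequency < lo implies all other clades < lo
        if f.max() >= hi and np.sort(f)[-2] < lo:
            keep.append(j)
    return np.array(keep)
```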

The call to migration.py also identifies the 410 mtSNPs which occur with high weights (top 25% by absolute value) as coefficients in the 166 eigenvectors corresponding to the largest eigenvalues, which together capture 85% of the variance in the data. These mtSNPs are listed by clade in Supplementary Table STIX. migration.py then randomly samples 20 individuals from each of the five clades and reduces their data vectors to just these reliable mtSNPs. These 100 samples are then used as input to cluster.py, which repeatedly resamples from the set and uses k-means to divide the samples into k = 2, 3, 4, 5, and 6 clusters. To create each resampled dataset, cluster.py either randomly selects 80% of the samples, or randomly selects 80% of the mtSNPs for each sample, or randomly selects 80% of the samples and 80% of the mtSNPs, or leaves the samples and data vectors unmodified. This is repeated 300 times, and the results for each k are combined into an agreement matrix M whose entries M_ij give the fraction of times samples i and j were clustered together across the 300 samplings; (1 − M_ij) may be considered the “distance” between samples i and j. Using this definition of distance, the samples are clustered again using hierarchical clustering with average linkage to assign the final clusters for each k. The agreement matrix is then reordered using simulated annealing to maximize the similarity between adjacent samples within a cluster and the dissimilarity between samples farther apart. The reordered matrix is recorded to disk by display.py as a heatmap, in which bright spots represent samples that were frequently clustered together and dark spots represent samples that were rarely clustered together. A sketch of the consensus step follows.
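
A condensed sketch of this agreement-matrix consensus (the bookkeeping for runs that resample only 80% of the samples is omitted; the names are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_from_runs(run_labels, k):
    # run_labels: list of length-n integer label vectors, one per run.
    # Build the agreement matrix M, convert it to distances 1 - M, and
    # cut an average-linkage tree into k final clusters.
    n = run_labels[0].shape[0]
    M = np.zeros((n, n))
    for lab in run_labels:
        M += (lab[:, None] == lab[None, :])
    M /= len(run_labels)
    D = 1.0 - M
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=k, criterion="maxclust")
```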

At this point all five clades have been split, and the sequence of splits yields the (unrooted) tree of Fig. 2, which can then be rooted using the procedure described in Appendix C.

Cite this article

Alexe, G., Vijaya Satya, R., Seiler, M. et al. PCA and Clustering Reveal Alternate mtDNA Phylogeny of N and M Clades. J Mol Evol 67, 465–487 (2008). https://doi.org/10.1007/s00239-008-9148-7
