Visualization of Pametric Models

Load the package scaleboot.

library(scaleboot)

The methods are explained in Shimodaira and Terada (2019).

Hidetoshi Shimodaira and Yoshikazu Terada. Selective Inference for Testing Trees and Edges in Phylogenetics. 2019.

Phylogenetic Analysis of 15 trees of 6 taxa

As a working example, we estimate the phylogenetic tree from the same dataset previously analyzed in Shimodaira and Hasegawa (1999), Shimodaira (2001, 2002) using the same model of evolution. The dataset consists of mitochondrial protein sequences of six mammalian species with \(n=3414\) amino acids The taxa are Homo sapiens (human), Phoca vitulina (seal), Bos taurus (cow), Oryctolagus cuniculus (rabbit), Mus musculus (mouse), and Didelphis virginiana (opossum). The software package PAML (Yang 1997) was used to calculate the site-wise log-likelihoods for the trees. The mtREV model (Adachi and Hasgawa 1996) was used for amino acid substitutions, and the site-heterogeneity was modeled by the discrete-gamma distribution (Yang 1996).

We first perform phylogentic tree selection. Look at the other RMarkdown document phylo.Rmd for the details. Here, we consider only 15 trees of 6 taxa, by fixing the clade (seal, cow).

We first run a phylogenetic package, such as PAML, to calculate site-wise log-likelihood for trees. The tree topology file is mam15.tpl, and the site-wise log-likelhiood file is mam15.mt. The mam15.mt file is converted from mam15.lnf (output from PAML) by seqmt program in CONSEL. We also run treeass in CONSEL to get mam15.ass and mam15.log from mam15.tpl. We use CONSEL only for preparing mt and ass files. All these files are found in mam15 folder.

Instead of using the program consel in CONSEL to compute p-values, we use scaleboot here. First, read the following two files. Then run relltest (internally calling scaleboot function) to perform multiscale bootstrap resampling.

### dont run
nb.rell = 100000
nb.pvclust = 10000
library(parallel)
length(cl <- makeCluster(detectCores()))
mam15.mt <- read.mt("mam15-files/mam15.mt")
mam15.ass <- read.ass("mam15-files/mam15.ass")
sa <- 9^seq(-1,1,length=13) # specify scales for multiscale bootstrap
mam15.relltest <- relltest(mam15.mt,nb=nb.rell,sa=sa,ass=mam15.ass,cluster=cl)

We have run the above command in makedata.R preveously. To get the results, simply do below, which will also load other objects.

data(mam15) # load mam15, mam26, mam105
ls() # look at the objects
##  [1] "mam105.ass"      "mam105.aux"      "mam105.mt"      
##  [4] "mam105.relltest" "mam15.ass"       "mam15.aux"      
##  [7] "mam15.mt"        "mam15.relltest"  "mam26.ass"      
## [10] "mam26.aux"       "mam26.mt"

We have auxiliary information in mam15.aux. The topologies are in the order of mam15.tpl (the same order as mam15.mt). The edges are in the order of mam15.cld (extracted from mam15.log, which is the log file of treeass).

names(mam15.aux)
## [1] "tpl" "cld" "tax"
mam15.aux$tpl[1:3] # topologies (the first three trees, in the order of mam15.tpl file)
##                                             t1 
## "((Homsa,(Phovi,Bosta)),Orycu,(Musmu,Didvi));" 
##                                             t2 
## "(Homsa,Orycu,((Phovi,Bosta),(Musmu,Didvi)));" 
##                                             t3 
## "(Homsa,((Phovi,Bosta),Orycu),(Musmu,Didvi));"
mam15.aux$cld[1:3] # edges  (the first three edges, in the order of  mam15.cld file)
##       e1       e2       e3 
## "+++---" "++++--" "+--+--"
mam15.aux$tax # taxa, the order corresponds to the positions of + and - in the clade pattern.
## [1] "Homsa" "Phovi" "Bosta" "Orycu" "Musmu" "Didvi"

The output of relltest includes the results of trees and edges. We separate them, and also reorder the trees and edges in decreasing order of likelhiood values below. We can also specify the auxiliary information in sbphylo.

mam15 <- sbphylo(mam15.relltest, mam15.ass, treename=mam15.aux$tpl,edgename=mam15.aux$cld,taxaname=mam15.aux$tax)

This includes the multiscale bootstrap probability. The order can be checked as follows. T1, T2, T3, … are sorted tree (in decreasing order of likelhiood). t1, t2, t3, … are the original order of trees. E1, E2, E3, … are sorted edges, and e1, e2, e3, … are the original order of edges.

mam15$order.tree  # sorted tree to original tree
##  T1  T2  T3  T4  T5  T6  T7  T8  T9 T10 T11 T12 T13 T14 T15 
##   1   3   2   5   6   7   4  15   8  14  13   9  11  10  12
mam15$invorder.tree  # original tree to sorted tree
##  t1  t2  t3  t4  t5  t6  t7  t8  t9 t10 t11 t12 t13 t14 t15 
##   1   3   2   7   4   5   6   9  12  14  13  15  11  10   8
mam15$order.edge # sorted edge to original edge
##  E1  E2  E3  E4  E5  E6  E7  E8  E9 E10 
##   2   1   4   3   5   7   6   9   8  10
mam15$invorder.edge # original edge to sorted edge
##  e1  e2  e3  e4  e5  e6  e7  e8  e9 e10 
##   2   1   4   3   5   7   6   9   8  10

The \(p\)-values are calculated by the summary method.

mam15.pv <- summary(mam15)
mam15.pv$tree$value[1:5,] # p-values of the best 5 trees
##        raw        k.1        k.2       sk.1      sk.2      beta0     beta1
## T1 0.57493 0.56098585 0.74549688 0.12197169 0.3636738 -0.4069275 0.2534584
## T2 0.31884 0.30366056 0.46386781 0.60732111 0.7945775  0.3022976 0.2116035
## T3 0.03667 0.03653161 0.13098002 0.07306322 0.2074724  1.4571030 0.3353325
## T4 0.01324 0.01352179 0.07834786 0.02704357 0.1197193  1.8135806 0.3973077
## T5 0.03212 0.03158660 0.12442520 0.06317320 0.1951006  1.5055815 0.3523969
##         stat  shtest
## T1 -2.663776 0.94307
## T2  2.663776 0.80232
## T3  7.397870 0.57806
## T4 17.565482 0.17444
## T5 18.933775 0.14571
mam15.pv$edge$value[1:5,] # p-values of the best 5 edges
##        raw        k.1        k.2       sk.1      sk.2      beta0
## E1 0.93044 0.93062184 0.95585769 0.86124367 0.9030685 -1.5924796
## E2 0.58817 0.58068944 0.71769918 0.16137889 0.3375573 -0.3898388
## E3 0.32507 0.31756417 0.43388481 0.63512834 0.7731364  0.3205067
## E4 0.03683 0.03628412 0.12619454 0.07256825 0.2010770  1.4700512
## E5 0.06074 0.05905474 0.07380517 0.11810948 0.1411530  1.5053911
##         beta1
## E1 0.11204127
## E2 0.18618125
## E3 0.15401446
## E4 0.32548541
## E5 0.05736709

We make latex tables by the following code.

table2latex <- function(x) {
  rn <- rownames(x)
  cn <- colnames(x); cl <- length(cn)
  cat("\n\\begin{tabular}{",paste(rep("c",cl+1),collapse=""),"}\n",sep="")
  cat("\\hline\n")
  cat("&",paste(cn,collapse=" & "),"\\\\\n")
  for(i in seq(along=rn)) {
    cat(rn[i],"&",paste(x[i,],collapse=" & "),"\\\\\n")
  }
  cat("\\hline\n")
  cat("\\end{tabular}\n")  
}

In the tree table below, the first two columns are computed by sbphylo: stat (log-likelihood difference), shtest (Shimodaira-Hasegawa test \(p\)-value). The other values are computed by the summary method: k.1 (BP, bootstrap probability), k.2 (AU, approximately unbiased \(p\)-value), sk.2 (SI, selective inference \(p\)-value), beta0 (\(\beta_0\), signed distance), beta1 (\(\beta_1\), mean curvature), edge (the associated edges).

table2latex(mam15.pv$tree$character) # all the 15 trees
## 
## \begin{tabular}{cccccccccc}
## \hline
## & stat & shtest & k.1 & k.2 & sk.2 & beta0 & beta1 & tree & edge \\
## T1 & -2.66 & 0.943 (0.001) & 0.561 (0.000) & 0.745 (0.001) & 0.364 (0.001) & -0.41 (0.00) & 0.25 (0.00) & ((Homsa,(Phovi,Bosta)),Orycu,(Musmu,Didvi)); & E1,E2 \\
## T2 &  2.66 & 0.802 (0.001) & 0.304 (0.000) & 0.464 (0.001) & 0.795 (0.001) &  0.30 (0.00) & 0.21 (0.00) & (Homsa,((Phovi,Bosta),Orycu),(Musmu,Didvi)); & E1,E3 \\
## T3 &  7.40 & 0.578 (0.002) & 0.037 (0.000) & 0.131 (0.002) & 0.207 (0.003) &  1.46 (0.01) & 0.34 (0.00) & (Homsa,Orycu,((Phovi,Bosta),(Musmu,Didvi))); & E1,E4 \\
## T4 & 17.57 & 0.174 (0.001) & 0.014 (0.000) & 0.078 (0.002) & 0.120 (0.003) &  1.81 (0.01) & 0.40 (0.01) & ((Homsa,(Phovi,Bosta)),(Orycu,Musmu),Didvi); & E2,E5 \\
## T5 & 18.93 & 0.146 (0.001) & 0.032 (0.000) & 0.124 (0.002) & 0.195 (0.003) &  1.51 (0.01) & 0.35 (0.00) & (Homsa,((Phovi,Bosta),(Orycu,Musmu)),Didvi); & E5,E6 \\
## T6 & 20.11 & 0.117 (0.001) & 0.005 (0.000) & 0.035 (0.002) & 0.055 (0.003) &  2.18 (0.02) & 0.38 (0.01) & (Homsa,(((Phovi,Bosta),Orycu),Musmu),Didvi); & E3,E6 \\
## T7 & 20.60 & 0.110 (0.001) & 0.015 (0.000) & 0.103 (0.003) & 0.152 (0.003) &  1.72 (0.01) & 0.45 (0.01) & (Homsa,(Orycu,Musmu),((Phovi,Bosta),Didvi)); & E5,E7 \\
## T8 & 22.22 & 0.075 (0.001) & 0.001 (0.000) & 0.010 (0.001) & 0.014 (0.001) &  2.76 (0.03) & 0.42 (0.01) & ((Homsa,Musmu),((Phovi,Bosta),Orycu),Didvi); & E3,E8 \\
## T9 & 25.38 & 0.032 (0.001) & 0.000 (0.000) & 0.001 (0.000) & 0.001 (0.000) &  3.73 (0.11) & 0.42 (0.05) & (((Homsa,(Phovi,Bosta)),Musmu),Orycu,Didvi); & E2,E9 \\
## T10 & 26.32 & 0.033 (0.001) & 0.002 (0.000) & 0.029 (0.003) & 0.042 (0.004) &  2.37 (0.03) & 0.47 (0.02) & ((Homsa,Musmu),Orycu,((Phovi,Bosta),Didvi)); & E7,E8 \\
## T11 & 28.86 & 0.017 (0.000) & 0.000 (0.000) & 0.004 (0.001) & 0.005 (0.001) &  3.19 (0.06) & 0.49 (0.03) & (Homsa,Orycu,(((Phovi,Bosta),Didvi),Musmu)); & E4,E7 \\
## T12 & 31.64 & 0.006 (0.000) & 0.000 (0.000) & 0.001 (0.001) & 0.001 (0.001) &  3.67 (0.11) & 0.52 (0.06) & (((Homsa,Musmu),(Phovi,Bosta)),Orycu,Didvi); & E8,E9 \\
## T13 & 31.75 & 0.006 (0.000) & 0.000 (0.000) & 0.000 (0.000) & 0.000 (0.000) &  4.05 (0.13) & 0.49 (0.05) & (Homsa,(((Phovi,Bosta),Musmu),Orycu),Didvi); & E6,E10 \\
## T14 & 34.74 & 0.002 (0.000) & 0.000 (0.000) & 0.000 (0.000) & 0.000 (0.000) &  5.51 (0.33) & 0.37 (0.11) & (Homsa,Orycu,(((Phovi,Bosta),Musmu),Didvi)); & E4,E10 \\
## T15 & 36.25 & 0.001 (0.000) & 0.000 (0.000) & 0.000 (0.000) & 0.000 (0.000) &  5.45 (0.38) & 0.45 (0.13) & ((Homsa,((Phovi,Bosta),Musmu)),Orycu,Didvi); & E9,E10 \\
## \hline
## \end{tabular}
table2latex(mam15.pv$edge$character) # all the 10 edges
## 
## \begin{tabular}{cccccccc}
## \hline
## & k.1 & k.2 & sk.2 & beta0 & beta1 & edge & tree \\
## E1 & 0.931 (0.000) & 0.956 (0.001) & 0.903 (0.001) & -1.59 (0.00) & 0.11 (0.00) & ++++-- & T1,T2,T3 \\
## E2 & 0.581 (0.001) & 0.718 (0.001) & 0.338 (0.001) & -0.39 (0.00) & 0.19 (0.00) & +++--- & T1,T4,T9 \\
## E3 & 0.318 (0.000) & 0.434 (0.001) & 0.773 (0.001) &  0.32 (0.00) & 0.15 (0.00) & -+++-- & T2,T6,T8 \\
## E4 & 0.036 (0.000) & 0.126 (0.002) & 0.201 (0.002) &  1.47 (0.00) & 0.33 (0.00) & +--+-- & T3,T11,T14 \\
## E5 & 0.059 (0.000) & 0.074 (0.001) & 0.141 (0.002) &  1.51 (0.00) & 0.06 (0.00) & ---++- & T4,T5,T7 \\
## E6 & 0.037 (0.000) & 0.092 (0.002) & 0.156 (0.002) &  1.56 (0.01) & 0.23 (0.00) & -++++- & T5,T6,T13 \\
## E7 & 0.017 (0.000) & 0.070 (0.002) & 0.113 (0.003) &  1.79 (0.01) & 0.32 (0.01) & +--++- & T7,T10,T11 \\
## E8 & 0.003 (0.000) & 0.015 (0.001) & 0.024 (0.002) &  2.47 (0.02) & 0.28 (0.01) & +---+- & T8,T10,T12 \\
## E9 & 0.000 (0.000) & 0.001 (0.000) & 0.001 (0.000) &  3.65 (0.08) & 0.34 (0.04) & +++-+- & T9,T12,T15 \\
## E10 & 0.000 (0.000) & 0.000 (0.000) & 0.000 (0.000) &  4.20 (0.12) & 0.40 (0.04) & -++-+- & T13,T14,T15 \\
## \hline
## \end{tabular}

Visualization of the 15 trees (simple method)

We use the following mam15.mt which is already used for computing the \(p\)-values for the 15 trees. It is a matrix of size \(n\times m\), where \(n=3414\) is sample size and \(m=15\) is the number of bifurcating trees. We sort trees by the likelhiood value.

mt15 <- mam15.mt[,mam15$order.tree] # Now T1, T2, ...
colnames(mt15) <- names(mam15$order.tree)
dim(mt15)
## [1] 3414   15

The matrix consists of the site-wise log-likelihood of tree \(i\) at site \(t\) as \(\xi_{ti} = \log p_i(\boldsymbol{x}_t | \boldsymbol{\hat\theta}_i)\), \(t=1,\ldots,n\), \(i=1,\ldots, m\). The matrix is written as \((\boldsymbol{\xi}_1,\ldots,\boldsymbol{\xi}_m)\) for the vectors \(\boldsymbol{\xi}_i \in \mathbb{R}^n\), \(i=1,\ldots, m\). We compute the mean vector as \(\boldsymbol{\bar\xi} = \sum_{i=1}^m \boldsymbol{\xi}_i / m\). Then models are represented by \(\boldsymbol{a}_i := \boldsymbol{\xi}_i - \boldsymbol{\bar\xi}\). We apply PCA to \(\boldsymbol{A} = (\boldsymbol{a}_1, \ldots, \boldsymbol{a}_m)\) and biolot for visualization.

mt0 <- apply(mt15,1,mean) # using the mean vector
xx <- mt15 - mt0 # centering for rows
f <- prcomp(xx,center=FALSE,scale=FALSE)
f$x[,1] <- -f$x[,1];f$rotation[,1] <- -f$rotation[,1] # change the sign of PC1
biplot(f,cex=c(0.4,0.8),choices=c(1,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC1, PC2)

biplot(f,cex=c(0.4,0.8),choices=c(3,2),col=c(rgb(0,0,0,alpha=0.3),rgb(1,0,0,alpha=0.5)) ) # (PC3, PC2)