Summary

Hierarchical Clustering of Lung Data

Introduction

Our method for computing selective inference \(p\)-values via multiscale bootstrap is explained in Shimodaira and Terada (2019). The theory for the selective inference behind the method is given in Terada and Shimodaira (2017). The pvclust package is originally described in Suzuki and Shimodaira (2006) for non-selective inference, and the multiscale bootstrap method of scaleboot is originally described in Shimodaira (2008).

The selective inference is also known as post-selection inference. This takes account of the fact that the clusters in the dendrogram are selected by looking at data. This contradicts the basic assumption of the conventional statistical hypothesis testing, which requires null hypotheses are selected before looking at the data. The newly proposed si values are adjusting the bias of this effect, but the conventional au values suffer from the selection bias.

pvclust and scaleboot packages

We use the following two packages here. Both packages implement the multiscale bootstrap method. Older versions of pvclust and scaleboot compute only BP and AU, but newer versions (pvclust>=2.2-0, scaleboot>=1.0-1) compute SI as well.

library(pvclust)  # computing p-values for clusters in hierarchical clustering 
library(scaleboot)  # computing p-values for general setttings

Using multi-core CPU

If your pc has cpu with many cores, then we can speed up bootstrap computation.

### dont run
library(parallel)
length(cl <- makeCluster(detectCores()))

Using pvclust package

lung data

We use the sample data of microarray expression profiles. It is \(n\times m\) matrix for \(n=916\) genes and \(m=73\) tumors. We compute clusters of tumors.

data(lung)  # in pvclust
dim(lung)
## [1] 916  73

run pvclust

We may run pvclust as follows. The default scale is specifed as r=seq(.5,1.4,by=.1) in pvlcust. It is equivalent to \(\sigma^{-2} = 0.5, 0.6, \ldots, 1.4\) for multiscale bootstrap.

### dont run
lung.pvclust <- pvclust(lung, nboot=10000, parallel=cl)

The above code takes a long time. You can simply download the preveously performed result. The following code loads lung.pvclust, lung73.pvclust, lung.sb, and lung73.sb (see help(lung73)).

data(lung73)

The default plot of cluster dendrogram shows two types of \(p\)-values (au, bp), and the significant clusters with au > 0.95 are shown as red boxes. au is the approximately unbiased \(p\)-value, and bp is the boostrap probability.

plot(lung.pvclust, cex=0.5, cex.pv=0.5) 
pvrect(lung.pvclust)  # au > 0.95