# Summary

• New versions of pvclust (>=2.2-0) and scaleboot (>=1.0-1) can compute selective inference (si) $$p$$-values as well as the conventional approximately unbiased (au) $$p$$-values and the bootstrap probability (bp).
• We recommend si values in most cases instead of au and bp values.
• Several estimating methods of $$p$$-values are implemented in the software. The default pvclust setting (e.g., lung.pvclust below) is a simple yet good enough method; this can be computed without scaleboot.
• More accurate, less biased versions of $$p$$-values (i.e., $$k=3$$) can be computed by scaleboot. The setting of the wide range of scales (e.g., lung73.k3 below) is then recommended.

# Hierarchical Clustering of Lung Data

## Introduction

Our method for computing selective inference $$p$$-values via multiscale bootstrap is explained in Shimodaira and Terada (2019). The theory for the selective inference behind the method is given in Terada and Shimodaira (2017). The pvclust package is originally described in Suzuki and Shimodaira (2006) for non-selective inference, and the multiscale bootstrap method of scaleboot is originally described in Shimodaira (2008).

The selective inference is also known as post-selection inference. This takes account of the fact that the clusters in the dendrogram are selected by looking at data. This contradicts the basic assumption of the conventional statistical hypothesis testing, which requires null hypotheses are selected before looking at the data. The newly proposed si values are adjusting the bias of this effect, but the conventional au values suffer from the selection bias.

## pvclust and scaleboot packages

We use the following two packages here. Both packages implement the multiscale bootstrap method. Older versions of pvclust and scaleboot compute only BP and AU, but newer versions (pvclust>=2.2-0, scaleboot>=1.0-1) compute SI as well.

library(pvclust)  # computing p-values for clusters in hierarchical clustering
library(scaleboot)  # computing p-values for general setttings

## Using multi-core CPU

If your pc has cpu with many cores, then we can speed up bootstrap computation.

### dont run
library(parallel)
length(cl <- makeCluster(detectCores()))

# Using pvclust package

## lung data

We use the sample data of microarray expression profiles. It is $$n\times m$$ matrix for $$n=916$$ genes and $$m=73$$ tumors. We compute clusters of tumors.

data(lung)  # in pvclust
dim(lung)
## [1] 916  73

## run pvclust

We may run pvclust as follows. The default scale is specifed as r=seq(.5,1.4,by=.1) in pvlcust. It is equivalent to $$\sigma^{-2} = 0.5, 0.6, \ldots, 1.4$$ for multiscale bootstrap.

### dont run
lung.pvclust <- pvclust(lung, nboot=10000, parallel=cl)

The above code takes a long time. You can simply download the preveously performed result. The following code loads lung.pvclust, lung73.pvclust, lung.sb, and lung73.sb (see help(lung73)).

data(lung73)

The default plot of cluster dendrogram shows two types of $$p$$-values (au, bp), and the significant clusters with au > 0.95 are shown as red boxes. au is the approximately unbiased $$p$$-value, and bp is the boostrap probability.

plot(lung.pvclust, cex=0.5, cex.pv=0.5)
pvrect(lung.pvclust)  # au > 0.95