João F. Matias Rodrigues

Postdoctoral Researcher

Christian von Mering's lab

Institute of Molecular Life Sciences

University of Zürich

Switzerland

# Software :: HPC-CLUST

## Motivation

HPC-CLUST is a set of tools designed to cluster large numbers (>1 million) of pre-aligned nucleotide sequences. It clusters the sequences using the Hierarchical Clustering Algorithm (HCA). Three clustering metrics are currently implemented: single-linkage, complete-linkage, and average-linkage. Four sequence distance functions are currently implemented: identity (gap-gap counts as a match), nogap (gap-gap positions are ignored), nogap-single (like nogap, but consecutive gap-nogap positions count as a single mismatch), and tamura (the distance takes into account that transitions are more likely than transversions).
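To make the difference between the first two distance functions concrete, here is a minimal Python sketch. It is not HPC-CLUST's code: the function names, the use of `-` as the gap character, and the choice to return the fraction of matching columns are all assumptions made for illustration. The nogap-single and tamura functions are left out.

```python
GAP = "-"  # assumed gap character in the pre-aligned sequences


def identity(a, b):
    """Fraction of matching columns; a gap-gap column counts as a match."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def nogap(a, b):
    """Fraction of matching columns; gap-gap columns are skipped entirely."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # ignored: neither a match nor a mismatch
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0
```

For the aligned pair `AC-GT-` and `AC-GAT`, `identity` counts 4 matching columns out of 6 (the shared gap included), while `nogap` drops the shared gap and counts 3 matches out of 5 compared columns.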

One advantage of HCA over other algorithms is that, instead of producing only the clustering at a given threshold, it produces the set of merges that happen at each threshold. From this set of merges, the clusters for any threshold can be computed very quickly with little extra computation. It also makes it possible to plot how the number of clusters varies with the clustering threshold without running the clustering for each threshold independently.
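The idea of deriving clusters for any threshold from a single list of merges can be sketched in a few lines of Python. The merge record used here, a tuple `(similarity, i, j)` meaning "the clusters containing sequences `i` and `j` merge at this similarity", is a hypothetical format chosen for illustration and is not HPC-CLUST's actual output format. The function names are likewise assumptions.

```python
def clusters_at(n, merges, threshold):
    """Return the clusters of n sequences after applying every merge
    whose similarity is at or above the given threshold."""
    parent = list(range(n))  # union-find forest, one node per sequence

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for sim, i, j in merges:
        if sim < threshold:
            continue  # this merge happens only at a looser threshold
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


def cluster_counts(n, merges, thresholds):
    """Number of clusters at each threshold, from the same merge list."""
    return {t: len(clusters_at(n, merges, t)) for t in thresholds}


# Hypothetical merge list for 5 sequences.
merges = [(0.99, 0, 1), (0.97, 2, 3), (0.95, 0, 2), (0.90, 0, 4)]
counts = cluster_counts(5, merges, [0.98, 0.96, 0.93, 0.90])
```

With this example merge list, `clusters_at(5, merges, 0.96)` gives `[[0, 1], [2, 3], [4]]`, and `counts` maps the four thresholds to 4, 3, 2, and 1 clusters. The expensive pairwise distance computation is never repeated; only the short merge list is replayed.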

Another feature of the HPC-CLUST implementation is that the single-, complete-, and average-linkage clusterings can be computed in a single run with little overhead.