Clustering
C++20 header-only: DBSCAN, HDBSCAN, k-means.
Loading...
Searching...
No Matches
clustering::hdbscan Namespace Reference

Classes

struct  MstEdge
 One edge of the minimum spanning tree of mutual-reachability distances. More...
struct  MstOutput
 Frozen output contract of every MST backend. More...
class  AutoMstBackend
 MST backend that dispatches between Prim, Boruvka, and NN-Descent on input shape. More...
class  BoruvkaMstBackend
 KDTree-accelerated Boruvka MST backend over mutual-reachability distances. More...
struct  NnDescentMstConfig
 Tuning knobs for the NN-Descent MST backend. More...
class  NnDescentMstBackend
 Approximate minimum-spanning-tree backend over mutual-reachability distance via an NN-Descent-built kNN graph plus Kruskal, with a connectivity fallback that closes any disconnected components. More...
class  PrimMstBackend
 Exact minimum-spanning-tree backend over mutual-reachability distance, streaming Prim. More...

Concepts

concept  MstBackendStrategy
 Contract for an MST backend satisfying the frozen clustering::hdbscan::MstOutput shape.

Enumerations

enum class  ClusterSelectionMethod : std::uint8_t { kEom , kLeaf }
 Cluster extraction method on the condensed tree. More...
enum class  MinSamplesConvention : std::uint8_t { kSklearn , kCampello }
 Semantics of the minSamples parameter at core-distance extraction. More...

Variables

constexpr std::size_t kPrimMaxN = std::size_t{16384}
 Compute budget that gates the streaming Prim backend, expressed as the maximum point count it will accept.
constexpr std::size_t kPrimMrdMatrixByteBudget = kPrimMaxN * kPrimMaxN * sizeof(float)
 Equivalent byte-budget phrasing of kPrimMaxN, kept so callers that gate on n*n*sizeof(T) <= kPrimMrdMatrixByteBudget compile unchanged.
constexpr std::size_t kPrimDenseCoreMinN = 1024
 Thresholds gating the dense symmetric core-distance pass.
constexpr std::size_t kPrimDenseCoreMinD = 17
constexpr std::size_t kPrimDenseCoreMaxMinSamples = 64
constexpr std::size_t kPrimPersistentRelaxMinWorkers = 8
constexpr std::size_t kPrimPersistentRelaxMinOpsPerWorker = std::size_t{1} << 14

Enumeration Type Documentation

◆ ClusterSelectionMethod

enum class clustering::hdbscan::ClusterSelectionMethod : std::uint8_t
strong

Cluster extraction method on the condensed tree.

The two methods differ in traversal behaviour, not in output shape. Excess-of-mass is the default and optimizes per-cluster stability over the full tree; leaf returns every leaf of the condensed tree as its own cluster.

Enumerator
kEom 

Excess-of-mass selection (the published default).

kLeaf 

Leaf-cluster selection; every condensed-tree leaf becomes a cluster.

Definition at line 37 of file hdbscan.h.

◆ MinSamplesConvention

enum class clustering::hdbscan::MinSamplesConvention : std::uint8_t
strong

Semantics of the minSamples parameter at core-distance extraction.

Campello 2015 defines core_k(p) as the distance from p to its k-th nearest neighbour excluding p itself. scikit-learn and the original hdbscan package count p itself as one of the k neighbours, so for a given min_samples = k those implementations return what Campello would call core_{k-1}. Both conventions are widely deployed and produce different MSTs on high-dimensional near-uniform inputs; the enum is the API boundary between them.

Enumerator
kSklearn 

minSamples counts the query point itself as one of the k neighbours.

The core distance returned is the Campello core_{minSamples-1}. Matches scikit-learn and the hdbscan package.

kCampello 

minSamples counts only non-self neighbours, matching Campello 2015 directly.

Definition at line 52 of file hdbscan.h.

Variable Documentation

◆ kPrimDenseCoreMaxMinSamples

std::size_t clustering::hdbscan::kPrimDenseCoreMaxMinSamples = 64
inlineconstexpr

Definition at line 53 of file prim_mst_backend.h.

◆ kPrimDenseCoreMinD

std::size_t clustering::hdbscan::kPrimDenseCoreMinD = 17
inlineconstexpr

Definition at line 52 of file prim_mst_backend.h.

◆ kPrimDenseCoreMinN

std::size_t clustering::hdbscan::kPrimDenseCoreMinN = 1024
inlineconstexpr

Thresholds gating the dense symmetric core-distance pass.

The dense pass computes all n*(n-1)/2 squared distances once and feeds both endpoints into a per-row top-minSamples tracker. Its total work is O(n^2 * d) distance compute plus O(n^2 * minSamples) top-k bookkeeping. It beats the KDTree kNN fallback once minSamples is small enough for the top-k write amp to stay bounded and (n, d) is large enough to amortise the symmetric scan's fixed cost. Below these bounds the KDTree path wins; above kPrimDenseCoreMaxMinSamples the top-k rescan per update dominates.

Definition at line 51 of file prim_mst_backend.h.

◆ kPrimMaxN

std::size_t clustering::hdbscan::kPrimMaxN = std::size_t{16384}
inlineconstexpr

Compute budget that gates the streaming Prim backend, expressed as the maximum point count it will accept.

The streaming Prim variant materialises only O(n) state (no n*n MRD matrix), but the inner relax recomputes d-wide squared-Euclidean distances per popped vertex, costing O(n^2 * d) total scalar work. Beyond this n the dispatcher should prefer NN-Descent (high d) or Boruvka (low d), both of which scale better than O(n^2).

Definition at line 34 of file prim_mst_backend.h.

◆ kPrimMrdMatrixByteBudget

std::size_t clustering::hdbscan::kPrimMrdMatrixByteBudget = kPrimMaxN * kPrimMaxN * sizeof(float)
inlineconstexpr

Equivalent byte-budget phrasing of kPrimMaxN, kept so callers that gate on n*n*sizeof(T) <= kPrimMrdMatrixByteBudget compile unchanged.

Numerically equal to kPrimMaxN squared, scaled by sizeof(float).

Definition at line 39 of file prim_mst_backend.h.

◆ kPrimPersistentRelaxMinOpsPerWorker

std::size_t clustering::hdbscan::kPrimPersistentRelaxMinOpsPerWorker = std::size_t{1} << 14
inlineconstexpr

Definition at line 55 of file prim_mst_backend.h.

◆ kPrimPersistentRelaxMinWorkers

std::size_t clustering::hdbscan::kPrimPersistentRelaxMinWorkers = 8
inlineconstexpr

Definition at line 54 of file prim_mst_backend.h.