|
Clustering
C++20 header-only: DBSCAN, HDBSCAN, k-means.
|
Enumerations | |
| enum class | PairwisePath : std::uint8_t { Simd , Gemm } |
| Tag identifying which inner kernel executed for a pairwise distance request. More... | |
| enum class | ArgminPath : std::uint8_t { Fused , Materialized } |
| Tag identifying which outer driver executed for a pairwiseArgminSqEuclidean request. More... | |
Functions | |
| float | horizontalSumAvx2 (__m256 v) noexcept |
| double | horizontalSumAvx2 (__m256d v) noexcept |
| float | sqEuclideanRowAvx2 (const float *xRow, const float *yRow, std::size_t d) noexcept |
| double | sqEuclideanRowAvx2 (const double *xRow, const double *yRow, std::size_t d) noexcept |
| template<class T, Layout LX, Layout LY> | |
| T | sqEuclideanRow (const NDArray< T, 2, LX > &X, std::size_t i, const NDArray< T, 2, LY > &Y, std::size_t j) noexcept |
| float | sqNormRowAvx2 (const float *xRow, std::size_t d) noexcept |
| double | sqNormRowAvx2 (const double *xRow, std::size_t d) noexcept |
| template<class T, Layout LX> | |
| T | sqNormRow (const NDArray< T, 2, LX > &X, std::size_t i) noexcept |
| template<class T, Layout LX> | |
| void | rowNormsSq (const NDArray< T, 2, LX > &X, NDArray< T, 1 > &norms, Pool pool) |
| Row-wise sum of squares: norms(i) = sum_k X(i, k)^2. | |
| template<class T, Layout LX, Layout LY> | |
| void | pairwiseSqEuclideanGemm (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y, NDArray< T, 2 > &out, Pool pool) |
| Large-path pairwise squared Euclidean via the GEMM identity. | |
| template<class T, Layout LX, Layout LY> | |
| void | pairwiseSqEuclideanSimd (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y, NDArray< T, 2 > &out, Pool pool) |
| Small-path pairwise squared Euclidean via SIMD accumulation per (i, j) pair. | |
| template<class T, Layout LX = Layout::Contig, Layout LY = Layout::Contig> | |
| PairwisePath | pairwiseSqEuclideanWithDispatchInfo (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y, NDArray< T, 2 > &out, Pool pool) |
| Test-only: runs the same dispatch as pairwiseSqEuclidean and reports which kernel fired. | |
| template<class T, Layout LX, Layout LY> | |
| bool | canUseFusedThreshold (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y) noexcept |
| Runtime predicate: true when the fused AVX2 threshold path is eligible. | |
| template<class T, Layout LX, Layout LY, class Emit> | |
| void | pairwiseSqEuclideanThresholdedMaterialized (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y, T radiusSq, Pool pool, Emit &&emit) |
| Materialized fallback for the thresholded-emit API: compute each pair's squared distance via sqEuclideanRow and walk survivors. | |
| template<class T, Layout LX, Layout LC> | |
| void | pairwiseArgminMaterializedWithScratch (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LC > &C, NDArray< std::int32_t, 1 > &labels, NDArray< T, 1 > &outMinSq, NDArray< T, 2 > &distsScratch, Pool pool) |
Compute per-row argmin + minimum squared distance over n in 256-row strips using a caller-owned distance tile. | |
| template<class T, Layout LX, Layout LC> | |
| void | pairwiseArgminMaterialized (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LC > &C, NDArray< std::int32_t, 1 > &labels, NDArray< T, 1 > &outMinSq, Pool pool) |
| Compute per-row argmin and minimum squared distance via the materialized two-step. | |
| template<class T, Layout LX, Layout LC> | |
| bool | canUseFusedArgmin (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LC > &C, const NDArray< T, 1 > &cSqNorms) noexcept |
| Runtime predicate: true when the fused AVX2 path is eligible for this call. | |
| template<class T, Layout LX = Layout::Contig, Layout LC = Layout::Contig> | |
| ArgminPath | pairwiseArgminSqEuclideanWithDispatchInfo (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LC > &C, const NDArray< T, 1 > &cSqNorms, NDArray< std::int32_t, 1 > &labels, NDArray< T, 1 > &outMinDistSq, Pool pool) |
| Test-only: runs the same dispatch as pairwiseArgminSqEuclidean and reports which outer driver fired. | |
Variables | |
| template<class T> | |
| constexpr std::size_t | kAvx2Lanes = std::is_same_v<T, float> ? 8 : 4 |
|
strong |
Tag identifying which outer driver executed for a pairwiseArgminSqEuclidean request.
Test surface only. Mirrors PairwisePath so dispatch tests can pin the branch crisply without resorting to timing or numerical fingerprinting.
| Enumerator | |
|---|---|
| Fused | |
| Materialized | |
Definition at line 51 of file pairwise_argmin.h.
|
strong |
Tag identifying which inner kernel executed for a pairwise distance request.
Test surface only. The public pairwiseSqEuclidean dispatches between the SIMD-per-pair path and the GEMM-identity path and exposes no indication of which ran; this enum lets the dispatch tests pin the branch crisply without inspecting wall-clock or numerics.
| Enumerator | |
|---|---|
| Simd | |
| Gemm | |
Definition at line 37 of file pairwise.h.
|
noexcept |
Runtime predicate: true when the fused AVX2 path is eligible for this call.
Dispatch criteria (all must hold):
CLUSTERING_USE_AVX2 is defined and T is float.X and C are Layout::Contig.X and C are runtime 32-byte aligned (the AVX2 microkernel issues aligned loads).d is non-zero and at most defaults::pairwiseArgminMaxD.n and k are non-zero.Any failure falls through to the materialized path.
Definition at line 173 of file pairwise_argmin.h.
|
noexcept |
Runtime predicate: true when the fused AVX2 threshold path is eligible.
Dispatch criteria (all must hold):
CLUSTERING_USE_AVX2 is defined and T is float.X and Y are Layout::Contig.X and Y are runtime 32-byte aligned (the AVX2 microkernel issues aligned loads).d is non-zero, at most detail::kThresholdMaxD, and at least 8 (one AVX2 lane's worth of features, below which the blocked packer is net-negative).n and m are non-zero. Definition at line 432 of file pairwise.h.
|
inlinenoexcept |
Definition at line 41 of file pairwise.h.
|
inlinenoexcept |
Definition at line 49 of file pairwise.h.
| void clustering::math::detail::pairwiseArgminMaterialized | ( | const NDArray< T, 2, LX > & | X, |
| const NDArray< T, 2, LC > & | C, | ||
| NDArray< std::int32_t, 1 > & | labels, | ||
| NDArray< T, 1 > & | outMinSq, | ||
| Pool | pool ) |
Compute per-row argmin and minimum squared distance via the materialized two-step.
Convenience wrapper that allocates the chunked distance scratch and forwards to pairwiseArgminMaterializedWithScratch. Used as the fall-through path when d exceeds defaults::pairwiseArgminMaxD and as the correctness oracle for the fused path in tests.
Definition at line 146 of file pairwise_argmin.h.
| void clustering::math::detail::pairwiseArgminMaterializedWithScratch | ( | const NDArray< T, 2, LX > & | X, |
| const NDArray< T, 2, LC > & | C, | ||
| NDArray< std::int32_t, 1 > & | labels, | ||
| NDArray< T, 1 > & | outMinSq, | ||
| NDArray< T, 2 > & | distsScratch, | ||
| Pool | pool ) |
Compute per-row argmin + minimum squared distance over n in 256-row strips using a caller-owned distance tile.
For each chunk of up to pairwiseArgminChunkRows rows of X, calls pairwiseSqEuclidean against the full centroid matrix to fill the distsScratch tile, then scalar-argmins each row into labels + outMinSq. The tile is sized so one full chunk's worth of distances stays L2-resident across the GEMM and the argmin scan; keeping the output matrix off DRAM is the load-bearing win over materializing (n, k) up front.
Chunks are independent so the outer chunk loop parallelises freely; per-row argmin carries no reduction across rows, so threaded and serial runs produce bit-identical labels and within-reassociation-tolerance outMinSq.
| X | Data matrix (n x d). |
| C | Centroid matrix (k x d). |
| labels | Output labels of length n; must be mutable. |
| outMinSq | Output minimum squared distances of length n; must be mutable. |
| distsScratch | Distance tile of shape chunkedMaterializedScratchShape; the function reuses it across chunks. |
| pool | Parallelism injection; forwarded to the per-chunk pairwiseSqEuclidean. |
Definition at line 79 of file pairwise_argmin.h.
| ArgminPath clustering::math::detail::pairwiseArgminSqEuclideanWithDispatchInfo | ( | const NDArray< T, 2, LX > & | X, |
| const NDArray< T, 2, LC > & | C, | ||
| const NDArray< T, 1 > & | cSqNorms, | ||
| NDArray< std::int32_t, 1 > & | labels, | ||
| NDArray< T, 1 > & | outMinDistSq, | ||
| Pool | pool ) |
Test-only: runs the same dispatch as pairwiseArgminSqEuclidean and reports which outer driver fired.
Shares the public entry's preconditions and branch; the return on empty input is ArgminPath::Materialized by convention (neither driver runs; the cheap path at (n, k) == (0, 0) short-circuits before dispatch).
Definition at line 274 of file pairwise_argmin.h.
| void clustering::math::detail::pairwiseSqEuclideanGemm | ( | const NDArray< T, 2, LX > & | X, |
| const NDArray< T, 2, LY > & | Y, | ||
| NDArray< T, 2 > & | out, | ||
| Pool | pool ) |
Large-path pairwise squared Euclidean via the GEMM identity.
Computes out(i, j) = ||X(i) - Y(j)||^2 by reconstructing ||x||^2 + ||y||^2 - 2 x . y^T. The dot-product matrix is evaluated with the public gemm entry at alpha = -2 and beta = 0; row-norm vectors are produced by rowNormsSq; a final elementwise broadcast adds the norms and clamps to zero.
| T | Element type (float or double). |
| LX | Layout tag of X; CTAD-resolved. |
| LY | Layout tag of Y; CTAD-resolved. |
| X | Left operand (n x d). |
| Y | Right operand (m x d). |
| out | Output matrix (n x m); isMutable() must be true. |
| pool | Parallelism injection; forwarded to rowNormsSq and gemm and used for the broadcast-and-clamp sweep. |
Definition at line 233 of file pairwise.h.
| void clustering::math::detail::pairwiseSqEuclideanSimd | ( | const NDArray< T, 2, LX > & | X, |
| const NDArray< T, 2, LY > & | Y, | ||
| NDArray< T, 2 > & | out, | ||
| Pool | pool ) |
Small-path pairwise squared Euclidean via SIMD accumulation per (i, j) pair.
Iterates every row pair and invokes sqEuclideanRow, which picks an AVX2 reduction on contiguous layouts with d at least one lane or falls back to a scalar loop. The outer row loop fans out over pool when shouldParallelize clears; per-cell arithmetic order is untouched across the fan-out so serial and threaded outputs are bit-identical.
| T | Element type (float or double). |
| LX | Layout tag of X; CTAD-resolved. |
| LY | Layout tag of Y; CTAD-resolved. |
| X | Left operand (n x d). |
| Y | Right operand (m x d). |
| out | Output matrix (n x m); isMutable() must be true. |
| pool | Parallelism injection for the outer row loop. |
Definition at line 296 of file pairwise.h.
| void clustering::math::detail::pairwiseSqEuclideanThresholdedMaterialized | ( | const NDArray< T, 2, LX > & | X, |
| const NDArray< T, 2, LY > & | Y, | ||
| T | radiusSq, | ||
| Pool | pool, | ||
| Emit && | emit ) |
Materialized fallback for the thresholded-emit API: compute each pair's squared distance via sqEuclideanRow and walk survivors.
Used when the fused-path eligibility check rejects (strided view, wrong type, tiny d, AVX2 off). Iterates in global (row, col) ascending order so emit fires row-major – a tighter guarantee than the fused path gives. No intermediate (n x m) tile is materialized: callers with small n (e.g. single-seed range queries) avoid per-call heap traffic that a materialized-then-mask path would incur.
Definition at line 472 of file pairwise.h.
| PairwisePath clustering::math::detail::pairwiseSqEuclideanWithDispatchInfo | ( | const NDArray< T, 2, LX > & | X, |
| const NDArray< T, 2, LY > & | Y, | ||
| NDArray< T, 2 > & | out, | ||
| Pool | pool ) |
Test-only: runs the same dispatch as pairwiseSqEuclidean and reports which kernel fired.
Shares the public entry's preconditions, workload metric, and branch. Empty inputs short-circuit before the branch is evaluated; the return in that case is PairwisePath::Simd by convention, which matches the cheap path any future dispatch refinement would pick for a zero-cell problem.
| T | Element type (float or double). |
| LX | Layout tag of X; CTAD-resolved. |
| LY | Layout tag of Y; CTAD-resolved. |
| X | Left operand (n x d). |
| Y | Right operand (m x d). |
| out | Output matrix (n x m); isMutable() must be true. |
| pool | Parallelism injection; forwarded to the selected kernel. |
Definition at line 394 of file pairwise.h.
| void clustering::math::detail::rowNormsSq | ( | const NDArray< T, 2, LX > & | X, |
| NDArray< T, 1 > & | norms, | ||
| Pool | pool ) |
Row-wise sum of squares: norms(i) = sum_k X(i, k)^2.
Inner reduction mirrors sqEuclideanRow: AVX2 when LX is Layout::Contig and the row fills at least one lane, scalar otherwise. The outer row loop fans out over pool when the workload clears shouldParallelize; per-row arithmetic is untouched across the fan-out.
| T | Element type (float or double). |
| LX | Layout tag of X; CTAD-resolved so strided views (e.g. Z.t()) bind without explicit template arguments. |
| X | Input matrix (n x d). |
| norms | Rank-1 output of length n; isMutable() must be true. |
| pool | Parallelism injection for the outer row loop. |
Definition at line 187 of file pairwise.h.
|
inlinenoexcept |
Definition at line 99 of file pairwise.h.
|
inlinenoexcept |
Definition at line 75 of file pairwise.h.
|
inlinenoexcept |
Definition at line 56 of file pairwise.h.
|
inlinenoexcept |
Definition at line 154 of file pairwise.h.
|
inlinenoexcept |
Definition at line 136 of file pairwise.h.
|
inlinenoexcept |
Definition at line 121 of file pairwise.h.
|
constexpr |
Definition at line 96 of file pairwise.h.