Clustering
C++20 header-only: DBSCAN, HDBSCAN, k-means.
Loading...
Searching...
No Matches
clustering::math::detail Namespace Reference

Enumerations

enum class  PairwisePath : std::uint8_t { Simd , Gemm }
 Tag identifying which inner kernel executed for a pairwise distance request. More...
enum class  ArgminPath : std::uint8_t { Fused , Materialized }
 Tag identifying which outer driver executed for a pairwiseArgminSqEuclidean request. More...

Functions

float horizontalSumAvx2 (__m256 v) noexcept
double horizontalSumAvx2 (__m256d v) noexcept
float sqEuclideanRowAvx2 (const float *xRow, const float *yRow, std::size_t d) noexcept
double sqEuclideanRowAvx2 (const double *xRow, const double *yRow, std::size_t d) noexcept
template<class T, Layout LX, Layout LY>
sqEuclideanRow (const NDArray< T, 2, LX > &X, std::size_t i, const NDArray< T, 2, LY > &Y, std::size_t j) noexcept
float sqNormRowAvx2 (const float *xRow, std::size_t d) noexcept
double sqNormRowAvx2 (const double *xRow, std::size_t d) noexcept
template<class T, Layout LX>
sqNormRow (const NDArray< T, 2, LX > &X, std::size_t i) noexcept
template<class T, Layout LX>
void rowNormsSq (const NDArray< T, 2, LX > &X, NDArray< T, 1 > &norms, Pool pool)
 Row-wise sum of squares: norms(i) = sum_k X(i, k)^2.
template<class T, Layout LX, Layout LY>
void pairwiseSqEuclideanGemm (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y, NDArray< T, 2 > &out, Pool pool)
 Large-path pairwise squared Euclidean via the GEMM identity.
template<class T, Layout LX, Layout LY>
void pairwiseSqEuclideanSimd (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y, NDArray< T, 2 > &out, Pool pool)
 Small-path pairwise squared Euclidean via SIMD accumulation per (i, j) pair.
template<class T, Layout LX = Layout::Contig, Layout LY = Layout::Contig>
PairwisePath pairwiseSqEuclideanWithDispatchInfo (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y, NDArray< T, 2 > &out, Pool pool)
 Test-only: runs the same dispatch as pairwiseSqEuclidean and reports which kernel fired.
template<class T, Layout LX, Layout LY>
bool canUseFusedThreshold (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y) noexcept
 Runtime predicate: true when the fused AVX2 threshold path is eligible.
template<class T, Layout LX, Layout LY, class Emit>
void pairwiseSqEuclideanThresholdedMaterialized (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LY > &Y, T radiusSq, Pool pool, Emit &&emit)
 Materialized fallback for the thresholded-emit API: compute each pair's squared distance via sqEuclideanRow and walk survivors.
template<class T, Layout LX, Layout LC>
void pairwiseArgminMaterializedWithScratch (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LC > &C, NDArray< std::int32_t, 1 > &labels, NDArray< T, 1 > &outMinSq, NDArray< T, 2 > &distsScratch, Pool pool)
 Compute per-row argmin + minimum squared distance over n in 256-row strips using a caller-owned distance tile.
template<class T, Layout LX, Layout LC>
void pairwiseArgminMaterialized (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LC > &C, NDArray< std::int32_t, 1 > &labels, NDArray< T, 1 > &outMinSq, Pool pool)
 Compute per-row argmin and minimum squared distance via the materialized two-step.
template<class T, Layout LX, Layout LC>
bool canUseFusedArgmin (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LC > &C, const NDArray< T, 1 > &cSqNorms) noexcept
 Runtime predicate: true when the fused AVX2 path is eligible for this call.
template<class T, Layout LX = Layout::Contig, Layout LC = Layout::Contig>
ArgminPath pairwiseArgminSqEuclideanWithDispatchInfo (const NDArray< T, 2, LX > &X, const NDArray< T, 2, LC > &C, const NDArray< T, 1 > &cSqNorms, NDArray< std::int32_t, 1 > &labels, NDArray< T, 1 > &outMinDistSq, Pool pool)
 Test-only: runs the same dispatch as pairwiseArgminSqEuclidean and reports which outer driver fired.

Variables

template<class T>
constexpr std::size_t kAvx2Lanes = std::is_same_v<T, float> ? 8 : 4

Enumeration Type Documentation

◆ ArgminPath

enum class clustering::math::detail::ArgminPath : std::uint8_t
strong

Tag identifying which outer driver executed for a pairwiseArgminSqEuclidean request.

Test surface only. Mirrors PairwisePath so dispatch tests can pin the branch crisply without resorting to timing or numerical fingerprinting.

Enumerator
Fused 
Materialized 

Definition at line 51 of file pairwise_argmin.h.

◆ PairwisePath

enum class clustering::math::detail::PairwisePath : std::uint8_t
strong

Tag identifying which inner kernel executed for a pairwise distance request.

Test surface only. The public pairwiseSqEuclidean dispatches between the SIMD-per-pair path and the GEMM-identity path and exposes no indication of which ran; this enum lets the dispatch tests pin the branch crisply without inspecting wall-clock or numerics.

Enumerator
Simd 
Gemm 

Definition at line 37 of file pairwise.h.

Function Documentation

◆ canUseFusedArgmin()

template<class T, Layout LX, Layout LC>
bool clustering::math::detail::canUseFusedArgmin ( const NDArray< T, 2, LX > & X,
const NDArray< T, 2, LC > & C,
const NDArray< T, 1 > & cSqNorms )
noexcept

Runtime predicate: true when the fused AVX2 path is eligible for this call.

Dispatch criteria (all must hold):

  • CLUSTERING_USE_AVX2 is defined and T is float.
  • X and C are Layout::Contig.
  • X and C are runtime 32-byte aligned (the AVX2 microkernel issues aligned loads).
  • d is non-zero and at most defaults::pairwiseArgminMaxD.
  • n and k are non-zero.

Any failure falls through to the materialized path.

Definition at line 173 of file pairwise_argmin.h.

◆ canUseFusedThreshold()

template<class T, Layout LX, Layout LY>
bool clustering::math::detail::canUseFusedThreshold ( const NDArray< T, 2, LX > & X,
const NDArray< T, 2, LY > & Y )
noexcept

Runtime predicate: true when the fused AVX2 threshold path is eligible.

Dispatch criteria (all must hold):

  • CLUSTERING_USE_AVX2 is defined and T is float.
  • X and Y are Layout::Contig.
  • X and Y are runtime 32-byte aligned (the AVX2 microkernel issues aligned loads).
  • d is non-zero, at most detail::kThresholdMaxD, and at least 8 (one AVX2 lane's worth of features, below which the blocked packer is net-negative).
  • n and m are non-zero.

Definition at line 432 of file pairwise.h.

◆ horizontalSumAvx2() [1/2]

float clustering::math::detail::horizontalSumAvx2 ( __m256 v)
inlinenoexcept

Definition at line 41 of file pairwise.h.

◆ horizontalSumAvx2() [2/2]

double clustering::math::detail::horizontalSumAvx2 ( __m256d v)
inlinenoexcept

Definition at line 49 of file pairwise.h.

◆ pairwiseArgminMaterialized()

template<class T, Layout LX, Layout LC>
void clustering::math::detail::pairwiseArgminMaterialized ( const NDArray< T, 2, LX > & X,
const NDArray< T, 2, LC > & C,
NDArray< std::int32_t, 1 > & labels,
NDArray< T, 1 > & outMinSq,
Pool pool )

Compute per-row argmin and minimum squared distance via the materialized two-step.

Convenience wrapper that allocates the chunked distance scratch and forwards to pairwiseArgminMaterializedWithScratch. Used as the fall-through path when d exceeds defaults::pairwiseArgminMaxD and as the correctness oracle for the fused path in tests.

Definition at line 146 of file pairwise_argmin.h.

◆ pairwiseArgminMaterializedWithScratch()

template<class T, Layout LX, Layout LC>
void clustering::math::detail::pairwiseArgminMaterializedWithScratch ( const NDArray< T, 2, LX > & X,
const NDArray< T, 2, LC > & C,
NDArray< std::int32_t, 1 > & labels,
NDArray< T, 1 > & outMinSq,
NDArray< T, 2 > & distsScratch,
Pool pool )

Compute per-row argmin + minimum squared distance over n in 256-row strips using a caller-owned distance tile.

For each chunk of up to pairwiseArgminChunkRows rows of X, calls pairwiseSqEuclidean against the full centroid matrix to fill the distsScratch tile, then scalar-argmins each row into labels + outMinSq. The tile is sized so one full chunk's worth of distances stays L2-resident across the GEMM and the argmin scan; keeping the output matrix off DRAM is the load-bearing win over materializing (n, k) up front.

Chunks are independent so the outer chunk loop parallelises freely; per-row argmin carries no reduction across rows, so threaded and serial runs produce bit-identical labels and within-reassociation-tolerance outMinSq.

Template Parameters
TElement type (float or double).
LXLayout tag of X.
LCLayout tag of C.
Parameters
XData matrix (n x d).
CCentroid matrix (k x d).
labelsOutput labels of length n; must be mutable.
outMinSqOutput minimum squared distances of length n; must be mutable.
distsScratchDistance tile of shape chunkedMaterializedScratchShape; the function reuses it across chunks.
poolParallelism injection; forwarded to the per-chunk pairwiseSqEuclidean.

Definition at line 79 of file pairwise_argmin.h.

◆ pairwiseArgminSqEuclideanWithDispatchInfo()

template<class T, Layout LX = Layout::Contig, Layout LC = Layout::Contig>
ArgminPath clustering::math::detail::pairwiseArgminSqEuclideanWithDispatchInfo ( const NDArray< T, 2, LX > & X,
const NDArray< T, 2, LC > & C,
const NDArray< T, 1 > & cSqNorms,
NDArray< std::int32_t, 1 > & labels,
NDArray< T, 1 > & outMinDistSq,
Pool pool )

Test-only: runs the same dispatch as pairwiseArgminSqEuclidean and reports which outer driver fired.

Shares the public entry's preconditions and branch; the return on empty input is ArgminPath::Materialized by convention (neither driver runs; the cheap path at (n, k) == (0, 0) short-circuits before dispatch).

Definition at line 274 of file pairwise_argmin.h.

◆ pairwiseSqEuclideanGemm()

template<class T, Layout LX, Layout LY>
void clustering::math::detail::pairwiseSqEuclideanGemm ( const NDArray< T, 2, LX > & X,
const NDArray< T, 2, LY > & Y,
NDArray< T, 2 > & out,
Pool pool )

Large-path pairwise squared Euclidean via the GEMM identity.

Computes out(i, j) = ||X(i) - Y(j)||^2 by reconstructing ||x||^2 + ||y||^2 - 2 x . y^T. The dot-product matrix is evaluated with the public gemm entry at alpha = -2 and beta = 0; row-norm vectors are produced by rowNormsSq; a final elementwise broadcast adds the norms and clamps to zero.

Template Parameters
TElement type (float or double).
LXLayout tag of X; CTAD-resolved.
LYLayout tag of Y; CTAD-resolved.
Parameters
XLeft operand (n x d).
YRight operand (m x d).
outOutput matrix (n x m); isMutable() must be true.
poolParallelism injection; forwarded to rowNormsSq and gemm and used for the broadcast-and-clamp sweep.

Definition at line 233 of file pairwise.h.

◆ pairwiseSqEuclideanSimd()

template<class T, Layout LX, Layout LY>
void clustering::math::detail::pairwiseSqEuclideanSimd ( const NDArray< T, 2, LX > & X,
const NDArray< T, 2, LY > & Y,
NDArray< T, 2 > & out,
Pool pool )

Small-path pairwise squared Euclidean via SIMD accumulation per (i, j) pair.

Iterates every row pair and invokes sqEuclideanRow, which picks an AVX2 reduction on contiguous layouts with d at least one lane or falls back to a scalar loop. The outer row loop fans out over pool when shouldParallelize clears; per-cell arithmetic order is untouched across the fan-out so serial and threaded outputs are bit-identical.

Template Parameters
TElement type (float or double).
LXLayout tag of X; CTAD-resolved.
LYLayout tag of Y; CTAD-resolved.
Parameters
XLeft operand (n x d).
YRight operand (m x d).
outOutput matrix (n x m); isMutable() must be true.
poolParallelism injection for the outer row loop.

Definition at line 296 of file pairwise.h.

◆ pairwiseSqEuclideanThresholdedMaterialized()

template<class T, Layout LX, Layout LY, class Emit>
void clustering::math::detail::pairwiseSqEuclideanThresholdedMaterialized ( const NDArray< T, 2, LX > & X,
const NDArray< T, 2, LY > & Y,
T radiusSq,
Pool pool,
Emit && emit )

Materialized fallback for the thresholded-emit API: compute each pair's squared distance via sqEuclideanRow and walk survivors.

Used when the fused-path eligibility check rejects (strided view, wrong type, tiny d, AVX2 off). Iterates in global (row, col) ascending order so emit fires row-major – a tighter guarantee than the fused path gives. No intermediate (n x m) tile is materialized: callers with small n (e.g. single-seed range queries) avoid per-call heap traffic that a materialized-then-mask path would incur.

Definition at line 472 of file pairwise.h.

◆ pairwiseSqEuclideanWithDispatchInfo()

template<class T, Layout LX = Layout::Contig, Layout LY = Layout::Contig>
PairwisePath clustering::math::detail::pairwiseSqEuclideanWithDispatchInfo ( const NDArray< T, 2, LX > & X,
const NDArray< T, 2, LY > & Y,
NDArray< T, 2 > & out,
Pool pool )

Test-only: runs the same dispatch as pairwiseSqEuclidean and reports which kernel fired.

Shares the public entry's preconditions, workload metric, and branch. Empty inputs short-circuit before the branch is evaluated; the return in that case is PairwisePath::Simd by convention, which matches the cheap path any future dispatch refinement would pick for a zero-cell problem.

Template Parameters
TElement type (float or double).
LXLayout tag of X; CTAD-resolved.
LYLayout tag of Y; CTAD-resolved.
Parameters
XLeft operand (n x d).
YRight operand (m x d).
outOutput matrix (n x m); isMutable() must be true.
poolParallelism injection; forwarded to the selected kernel.
Returns
Which inner kernel executed.

Definition at line 394 of file pairwise.h.

◆ rowNormsSq()

template<class T, Layout LX>
void clustering::math::detail::rowNormsSq ( const NDArray< T, 2, LX > & X,
NDArray< T, 1 > & norms,
Pool pool )

Row-wise sum of squares: norms(i) = sum_k X(i, k)^2.

Inner reduction mirrors sqEuclideanRow: AVX2 when LX is Layout::Contig and the row fills at least one lane, scalar otherwise. The outer row loop fans out over pool when the workload clears shouldParallelize; per-row arithmetic is untouched across the fan-out.

Template Parameters
TElement type (float or double).
LXLayout tag of X; CTAD-resolved so strided views (e.g. Z.t()) bind without explicit template arguments.
Parameters
XInput matrix (n x d).
normsRank-1 output of length n; isMutable() must be true.
poolParallelism injection for the outer row loop.

Definition at line 187 of file pairwise.h.

◆ sqEuclideanRow()

template<class T, Layout LX, Layout LY>
T clustering::math::detail::sqEuclideanRow ( const NDArray< T, 2, LX > & X,
std::size_t i,
const NDArray< T, 2, LY > & Y,
std::size_t j )
inlinenoexcept

Definition at line 99 of file pairwise.h.

◆ sqEuclideanRowAvx2() [1/2]

double clustering::math::detail::sqEuclideanRowAvx2 ( const double * xRow,
const double * yRow,
std::size_t d )
inlinenoexcept

Definition at line 75 of file pairwise.h.

◆ sqEuclideanRowAvx2() [2/2]

float clustering::math::detail::sqEuclideanRowAvx2 ( const float * xRow,
const float * yRow,
std::size_t d )
inlinenoexcept

Definition at line 56 of file pairwise.h.

◆ sqNormRow()

template<class T, Layout LX>
T clustering::math::detail::sqNormRow ( const NDArray< T, 2, LX > & X,
std::size_t i )
inlinenoexcept

Definition at line 154 of file pairwise.h.

◆ sqNormRowAvx2() [1/2]

double clustering::math::detail::sqNormRowAvx2 ( const double * xRow,
std::size_t d )
inlinenoexcept

Definition at line 136 of file pairwise.h.

◆ sqNormRowAvx2() [2/2]

float clustering::math::detail::sqNormRowAvx2 ( const float * xRow,
std::size_t d )
inlinenoexcept

Definition at line 121 of file pairwise.h.

Variable Documentation

◆ kAvx2Lanes

template<class T>
std::size_t clustering::math::detail::kAvx2Lanes = std::is_same_v<T, float> ? 8 : 4
constexpr

Definition at line 96 of file pairwise.h.