Clustering
C++20 header-only: DBSCAN, HDBSCAN, k-means.
Loading...
Searching...
No Matches
clustering::kmeans::detail Namespace Reference

Classes

struct  BlockPartition
 Work-unit partition for the label-grouped fold over n points into numBlocks bins.

Functions

std::size_t greedyKmppLocalTrials (std::size_t k) noexcept
 Compute the local-trials count used by greedy k-means++.
constexpr std::size_t greedyKmppTransposedWidth (std::size_t L) noexcept
 Round L up to the nearest multiple of 8 used by the transposed scoring layout.
template<std::size_t B>
void sqEuclideanRowToBatchAvx2Fixed (const float *x, const float *candData, std::size_t d, float *out) noexcept
 Compile-time batched scoring kernel: stream x once across B parallel AVX2 accumulators to compute B squared distances against candData rows.
template<std::size_t B>
void sqEuclideanRowToBatchAvx2Fixed (const double *x, const double *candData, std::size_t d, double *out) noexcept
template<class T>
void sqEuclideanRowToBatchAvx2 (const T *x, const T *candData, std::size_t L, std::size_t d, T *out) noexcept
 Compute L squared Euclidean distances against an (L, d) row-batched candidate layout in a single streaming pass over the x row.
void sqEuclideanRowAgainst8Transposed (const float *x, const float *candData, std::size_t d, float *out) noexcept
 Compute L squared distances against an (d, 8) transposed candidate layout with one streaming pass over the x row.
void sqEuclideanRowAgainst16Transposed (const float *x, const float *candData, std::size_t d, float *out) noexcept
 Compute two 8-way squared distance slabs against an (d, 16) transposed candidate layout in one streaming pass over the x row.
void sqEuclideanRowAgainst8TransposedStrided (const float *x, const float *candData, std::size_t d, std::size_t rowStride, float *out) noexcept
 Compute one 8-way squared distance slab against an (d, W) transposed candidate layout with an explicit row stride W.
template<class T>
void sqEuclideanRowToBatch (const T *x, const T *candData, std::size_t L, std::size_t d, T *out) noexcept
 Squared Euclidean distance from one x row to a batch of L candidate rows.

Variables

constexpr std::size_t kDirectArgminMaxD = 8
 Maximum d for the direct-compute argmin hot path.

Function Documentation

◆ greedyKmppLocalTrials()

std::size_t clustering::kmeans::detail::greedyKmppLocalTrials ( std::size_t k)
inlinenodiscardnoexcept

Compute the local-trials count used by greedy k-means++.

Matches sklearn's convention 2 + floor(ln(k)). Gives k=8 -> L=4, k=64 -> L=6, k=256 -> L=7, k=1000 -> L=8. The natural-log form keeps inertia within the per-pick scoring envelope while trimming ~30% off the seeder's candidate work at high k.

Returns
Local-trials count; always at least 1.

Definition at line 42 of file greedy_kmpp_seeder.h.

◆ greedyKmppTransposedWidth()

std::size_t clustering::kmeans::detail::greedyKmppTransposedWidth ( std::size_t L)
nodiscardconstexprnoexcept

Round L up to the nearest multiple of 8 used by the transposed scoring layout.

The transposed kernel operates on chunks of 8 candidates; the candidate pack and the per-(point, candidate) distance cache are padded to this width so the chunked scoring path can index with a fixed row stride. The commit-step minSq refresh only reads L lanes.

Definition at line 57 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowAgainst16Transposed()

void clustering::kmeans::detail::sqEuclideanRowAgainst16Transposed ( const float * x,
const float * candData,
std::size_t d,
float * out )
inlinenoexcept

Compute two 8-way squared distance slabs against an (d, 16) transposed candidate layout in one streaming pass over the x row.

Unrolls sqEuclideanRowAgainst8Transposed across two adjacent lane groups so each x[k] broadcast folds 16 candidate distances per FMA pair. At L in (8, 16] with d <= 8 this shaves half the broadcast + load traffic versus looping the 8-wide kernel twice.

Definition at line 235 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowAgainst8Transposed()

void clustering::kmeans::detail::sqEuclideanRowAgainst8Transposed ( const float * x,
const float * candData,
std::size_t d,
float * out )
inlinenoexcept

Compute L squared distances against an (d, 8) transposed candidate layout with one streaming pass over the x row.

The (d, 8) layout puts the k-th feature of every candidate at cand[k*8 .. k*8 + 8), so a single _mm256_load_ps fetches all 8 candidates' k-th component; broadcasting x[k] then folds 8 squared-distance contributions in one FMA. At d < 8 this collapses the per-row scoring from L*d scalar ops to d SIMD ops.

Definition at line 215 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowAgainst8TransposedStrided()

void clustering::kmeans::detail::sqEuclideanRowAgainst8TransposedStrided ( const float * x,
const float * candData,
std::size_t d,
std::size_t rowStride,
float * out )
inlinenoexcept

Compute one 8-way squared distance slab against an (d, W) transposed candidate layout with an explicit row stride W.

Generalizes sqEuclideanRowAgainst8Transposed to the L > 16 regime where the transposed pack keeps W = ceil(L/8) * 8 columns so chunked scoring can slide an 8-wide window across it.

Definition at line 260 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowToBatch()

template<class T>
void clustering::kmeans::detail::sqEuclideanRowToBatch ( const T * x,
const T * candData,
std::size_t L,
std::size_t d,
T * out )
inlinenoexcept

Squared Euclidean distance from one x row to a batch of L candidate rows.

Routes to the AVX2 batched kernel when the build is AVX2-enabled and d clears one lane width; otherwise dispatches L scalar sqEuclideanRowPtr calls. The batched AVX2 path streams the x row through L parallel accumulators so the inner loop reads each x byte once across the whole batch.

Definition at line 284 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowToBatchAvx2()

template<class T>
void clustering::kmeans::detail::sqEuclideanRowToBatchAvx2 ( const T * x,
const T * candData,
std::size_t L,
std::size_t d,
T * out )
inlinenoexcept

Compute L squared Euclidean distances against an (L, d) row-batched candidate layout in a single streaming pass over the x row.

Streams the x row through L parallel fmadd accumulators so each x byte is read from memory once for all L distances. Dispatches to sqEuclideanRowToBatchAvx2Fixed at compile-time-bound B for the common batch sizes.

Definition at line 170 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowToBatchAvx2Fixed() [1/2]

template<std::size_t B>
void clustering::kmeans::detail::sqEuclideanRowToBatchAvx2Fixed ( const double * x,
const double * candData,
std::size_t d,
double * out )
inlinenoexcept

Definition at line 129 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowToBatchAvx2Fixed() [2/2]

template<std::size_t B>
void clustering::kmeans::detail::sqEuclideanRowToBatchAvx2Fixed ( const float * x,
const float * candData,
std::size_t d,
float * out )
inlinenoexcept

Compile-time batched scoring kernel: stream x once across B parallel AVX2 accumulators to compute B squared distances against candData rows.

Templating on B unblocks the compiler's full unroll of the inner candidate loop, the load-bearing optimisation for the seeder's candidate scoring. The runtime entry sqEuclideanRowToBatchAvx2 dispatches to the B=8 specialisation for the common L=8 case (k in [16, 31]) and to other compile-time B values for adjacent batches.

Definition at line 75 of file greedy_kmpp_seeder.h.

Variable Documentation

◆ kDirectArgminMaxD

std::size_t clustering::kmeans::detail::kDirectArgminMaxD = 8
inlineconstexpr

Maximum d for the direct-compute argmin hot path.

At d <= this threshold the fused argmin-GEMM driver's packA + packB overhead dominates the handful of FMAs the microkernel performs, so the direct ||x - c||^2 formula with 8-row SIMD accumulators beats the packed-GEMM path. Measured on Zen5: crossover sits near d == 8 where the two paths tie; below that the direct path wins by the pack cost.

Definition at line 81 of file lloyd_fused_gemm.h.