|
Clustering
C++20 header-only: DBSCAN, HDBSCAN, k-means.
|
Classes | |
| struct | BlockPartition |
Work-unit partition for the label-grouped fold over n points into numBlocks bins. | |
Functions | |
| std::size_t | greedyKmppLocalTrials (std::size_t k) noexcept |
| Compute the local-trials count used by greedy k-means++. | |
| constexpr std::size_t | greedyKmppTransposedWidth (std::size_t L) noexcept |
Round L up to the nearest multiple of 8 used by the transposed scoring layout. | |
| template<std::size_t B> | |
| void | sqEuclideanRowToBatchAvx2Fixed (const float *x, const float *candData, std::size_t d, float *out) noexcept |
Compile-time batched scoring kernel: stream x once across B parallel AVX2 accumulators to compute B squared distances against candData rows. | |
| template<std::size_t B> | |
| void | sqEuclideanRowToBatchAvx2Fixed (const double *x, const double *candData, std::size_t d, double *out) noexcept |
| template<class T> | |
| void | sqEuclideanRowToBatchAvx2 (const T *x, const T *candData, std::size_t L, std::size_t d, T *out) noexcept |
Compute L squared Euclidean distances against an (L, d) row-batched candidate layout in a single streaming pass over the x row. | |
| void | sqEuclideanRowAgainst8Transposed (const float *x, const float *candData, std::size_t d, float *out) noexcept |
Compute L squared distances against an (d, 8) transposed candidate layout with one streaming pass over the x row. | |
| void | sqEuclideanRowAgainst16Transposed (const float *x, const float *candData, std::size_t d, float *out) noexcept |
Compute two 8-way squared distance slabs against an (d, 16) transposed candidate layout in one streaming pass over the x row. | |
| void | sqEuclideanRowAgainst8TransposedStrided (const float *x, const float *candData, std::size_t d, std::size_t rowStride, float *out) noexcept |
Compute one 8-way squared distance slab against an (d, W) transposed candidate layout with an explicit row stride W. | |
| template<class T> | |
| void | sqEuclideanRowToBatch (const T *x, const T *candData, std::size_t L, std::size_t d, T *out) noexcept |
Squared Euclidean distance from one x row to a batch of L candidate rows. | |
Variables | |
| constexpr std::size_t | kDirectArgminMaxD = 8 |
Maximum d for the direct-compute argmin hot path. | |
|
inlinenodiscardnoexcept |
Compute the local-trials count used by greedy k-means++.
Matches sklearn's convention 2 + floor(ln(k)). Gives k=8 -> L=4, k=64 -> L=6, k=256 -> L=7, k=1000 -> L=8. The natural-log form keeps inertia within the per-pick scoring envelope while trimming ~30% off the seeder's candidate work at high k.
Definition at line 42 of file greedy_kmpp_seeder.h.
|
nodiscardconstexprnoexcept |
Round L up to the nearest multiple of 8 used by the transposed scoring layout.
The transposed kernel operates on chunks of 8 candidates; the candidate pack and the per-(point, candidate) distance cache are padded to this width so the chunked scoring path can index with a fixed row stride. The commit-step minSq refresh only reads L lanes.
Definition at line 57 of file greedy_kmpp_seeder.h.
|
inlinenoexcept |
Compute two 8-way squared distance slabs against an (d, 16) transposed candidate layout in one streaming pass over the x row.
Unrolls sqEuclideanRowAgainst8Transposed across two adjacent lane groups so each x[k] broadcast folds 16 candidate distances per FMA pair. At L in (8, 16] with d <= 8 this shaves half the broadcast + load traffic versus looping the 8-wide kernel twice.
Definition at line 235 of file greedy_kmpp_seeder.h.
|
inlinenoexcept |
Compute L squared distances against an (d, 8) transposed candidate layout with one streaming pass over the x row.
The (d, 8) layout puts the k-th feature of every candidate at cand[k*8 .. k*8 + 8), so a single _mm256_load_ps fetches all 8 candidates' k-th component; broadcasting x[k] then folds 8 squared-distance contributions in one FMA. At d < 8 this collapses the per-row scoring from L*d scalar ops to d SIMD ops.
Definition at line 215 of file greedy_kmpp_seeder.h.
|
inlinenoexcept |
Compute one 8-way squared distance slab against an (d, W) transposed candidate layout with an explicit row stride W.
Generalizes sqEuclideanRowAgainst8Transposed to the L > 16 regime where the transposed pack keeps W = ceil(L/8) * 8 columns so chunked scoring can slide an 8-wide window across it.
Definition at line 260 of file greedy_kmpp_seeder.h.
|
inlinenoexcept |
Squared Euclidean distance from one x row to a batch of L candidate rows.
Routes to the AVX2 batched kernel when the build is AVX2-enabled and d clears one lane width; otherwise dispatches L scalar sqEuclideanRowPtr calls. The batched AVX2 path streams the x row through L parallel accumulators so the inner loop reads each x byte once across the whole batch.
Definition at line 284 of file greedy_kmpp_seeder.h.
|
inlinenoexcept |
Compute L squared Euclidean distances against an (L, d) row-batched candidate layout in a single streaming pass over the x row.
Streams the x row through L parallel fmadd accumulators so each x byte is read from memory once for all L distances. Dispatches to sqEuclideanRowToBatchAvx2Fixed at compile-time-bound B for the common batch sizes.
Definition at line 170 of file greedy_kmpp_seeder.h.
|
inlinenoexcept |
Definition at line 129 of file greedy_kmpp_seeder.h.
|
inlinenoexcept |
Compile-time batched scoring kernel: stream x once across B parallel AVX2 accumulators to compute B squared distances against candData rows.
Templating on B unblocks the compiler's full unroll of the inner candidate loop, the load-bearing optimisation for the seeder's candidate scoring. The runtime entry sqEuclideanRowToBatchAvx2 dispatches to the B=8 specialisation for the common L=8 case (k in [16, 31]) and to other compile-time B values for adjacent batches.
Definition at line 75 of file greedy_kmpp_seeder.h.
|
inlineconstexpr |
Maximum d for the direct-compute argmin hot path.
At d <= this threshold the fused argmin-GEMM driver's packA + packB overhead dominates the handful of FMAs the microkernel performs, so the direct ||x - c||^2 formula with 8-row SIMD accumulators beats the packed-GEMM path. Measured on Zen5: crossover sits near d == 8 where the two paths tie; below that the direct path wins by the pack cost.
Definition at line 81 of file lloyd_fused_gemm.h.