Classes
struct	BlockPartition
	Work-unit partition for the label-grouped fold over `n` points into `numBlocks` bins.

Functions
std::size_t	greedyKmppLocalTrials (std::size_t k) noexcept
	Compute the local-trials count used by greedy k-means++.
constexpr std::size_t	greedyKmppTransposedWidth (std::size_t L) noexcept
	Round `L` up to the nearest multiple of 8 used by the transposed scoring layout.
template<std::size_t B>
void	sqEuclideanRowToBatchAvx2Fixed (const float x, const float candData, std::size_t d, float *out) noexcept
	Compile-time batched scoring kernel: stream `x` once across `B` parallel AVX2 accumulators to compute `B` squared distances against `candData` rows.
template<std::size_t B>
void	sqEuclideanRowToBatchAvx2Fixed (const double x, const double candData, std::size_t d, double *out) noexcept
template<class T>
void	sqEuclideanRowToBatchAvx2 (const T x, const T candData, std::size_t L, std::size_t d, T *out) noexcept
	Compute `L` squared Euclidean distances against an (L, d) row-batched candidate layout in a single streaming pass over the `x` row.
void	sqEuclideanRowAgainst8Transposed (const float x, const float candData, std::size_t d, float *out) noexcept
	Compute `L` squared distances against an (d, 8) transposed candidate layout with one streaming pass over the `x` row.
void	sqEuclideanRowAgainst16Transposed (const float x, const float candData, std::size_t d, float *out) noexcept
	Compute two 8-way squared distance slabs against an (d, 16) transposed candidate layout in one streaming pass over the `x` row.
void	sqEuclideanRowAgainst8TransposedStrided (const float x, const float candData, std::size_t d, std::size_t rowStride, float *out) noexcept
	Compute one 8-way squared distance slab against an (d, W) transposed candidate layout with an explicit row stride `W`.
template<class T>
void	sqEuclideanRowToBatch (const T x, const T candData, std::size_t L, std::size_t d, T *out) noexcept
	Squared Euclidean distance from one `x` row to a batch of `L` candidate rows.

Variables
constexpr std::size_t	kDirectArgminMaxD = 8
	Maximum `d` for the direct-compute argmin hot path.

Function Documentation

◆ greedyKmppLocalTrials()

std::size_t clustering::kmeans::detail::greedyKmppLocalTrials ( std::size_t k )

inlinenodiscardnoexcept

Compute the local-trials count used by greedy k-means++.

Matches sklearn's convention 2 + floor(ln(k)). Gives k=8 -> L=4, k=64 -> L=6, k=256 -> L=7, k=1000 -> L=8. The natural-log form keeps inertia within the per-pick scoring envelope while trimming ~30% off the seeder's candidate work at high k.

Returns: Local-trials count; always at least 1.

Definition at line 42 of file greedy_kmpp_seeder.h.

◆ greedyKmppTransposedWidth()

std::size_t clustering::kmeans::detail::greedyKmppTransposedWidth ( std::size_t L )

nodiscardconstexprnoexcept

Round L up to the nearest multiple of 8 used by the transposed scoring layout.

The transposed kernel operates on chunks of 8 candidates; the candidate pack and the per-(point, candidate) distance cache are padded to this width so the chunked scoring path can index with a fixed row stride. The commit-step minSq refresh only reads L lanes.

Definition at line 57 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowAgainst16Transposed()

void clustering::kmeans::detail::sqEuclideanRowAgainst16Transposed	(	const float *	x,
		const float *	candData,
		std::size_t	d,
		float *	out )

inlinenoexcept

Compute two 8-way squared distance slabs against an (d, 16) transposed candidate layout in one streaming pass over the x row.

Unrolls sqEuclideanRowAgainst8Transposed across two adjacent lane groups so each x[k] broadcast folds 16 candidate distances per FMA pair. At L in (8, 16] with d <= 8 this shaves half the broadcast + load traffic versus looping the 8-wide kernel twice.

Definition at line 235 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowAgainst8Transposed()

void clustering::kmeans::detail::sqEuclideanRowAgainst8Transposed	(	const float *	x,
		const float *	candData,
		std::size_t	d,
		float *	out )

inlinenoexcept

Compute L squared distances against an (d, 8) transposed candidate layout with one streaming pass over the x row.

The (d, 8) layout puts the k-th feature of every candidate at cand[k*8 .. k*8 + 8), so a single _mm256_load_ps fetches all 8 candidates' k-th component; broadcasting x[k] then folds 8 squared-distance contributions in one FMA. At d < 8 this collapses the per-row scoring from L*d scalar ops to d SIMD ops.

Definition at line 215 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowAgainst8TransposedStrided()

void clustering::kmeans::detail::sqEuclideanRowAgainst8TransposedStrided	(	const float *	x,
		const float *	candData,
		std::size_t	d,
		std::size_t	rowStride,
		float *	out )

inlinenoexcept

Compute one 8-way squared distance slab against an (d, W) transposed candidate layout with an explicit row stride W.

Generalizes sqEuclideanRowAgainst8Transposed to the L > 16 regime where the transposed pack keeps W = ceil(L/8) * 8 columns so chunked scoring can slide an 8-wide window across it.

Definition at line 260 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowToBatch()

template<class T>

void clustering::kmeans::detail::sqEuclideanRowToBatch	(	const T *	x,
		const T *	candData,
		std::size_t	L,
		std::size_t	d,
		T *	out )

inlinenoexcept

Squared Euclidean distance from one x row to a batch of L candidate rows.

Routes to the AVX2 batched kernel when the build is AVX2-enabled and d clears one lane width; otherwise dispatches L scalar sqEuclideanRowPtr calls. The batched AVX2 path streams the x row through L parallel accumulators so the inner loop reads each x byte once across the whole batch.

Definition at line 284 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowToBatchAvx2()

template<class T>

void clustering::kmeans::detail::sqEuclideanRowToBatchAvx2	(	const T *	x,
		const T *	candData,
		std::size_t	L,
		std::size_t	d,
		T *	out )

inlinenoexcept

Compute L squared Euclidean distances against an (L, d) row-batched candidate layout in a single streaming pass over the x row.

Streams the x row through L parallel fmadd accumulators so each x byte is read from memory once for all L distances. Dispatches to sqEuclideanRowToBatchAvx2Fixed at compile-time-bound B for the common batch sizes.

Definition at line 170 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowToBatchAvx2Fixed() [1/2]

template<std::size_t B>

void clustering::kmeans::detail::sqEuclideanRowToBatchAvx2Fixed	(	const double *	x,
		const double *	candData,
		std::size_t	d,
		double *	out )

inlinenoexcept

Definition at line 129 of file greedy_kmpp_seeder.h.

◆ sqEuclideanRowToBatchAvx2Fixed() [2/2]

template<std::size_t B>

void clustering::kmeans::detail::sqEuclideanRowToBatchAvx2Fixed	(	const float *	x,
		const float *	candData,
		std::size_t	d,
		float *	out )

inlinenoexcept

Compile-time batched scoring kernel: stream x once across B parallel AVX2 accumulators to compute B squared distances against candData rows.

Templating on B unblocks the compiler's full unroll of the inner candidate loop, the load-bearing optimisation for the seeder's candidate scoring. The runtime entry sqEuclideanRowToBatchAvx2 dispatches to the B=8 specialisation for the common L=8 case (k in [16, 31]) and to other compile-time B values for adjacent batches.

Definition at line 75 of file greedy_kmpp_seeder.h.

Variable Documentation

◆ kDirectArgminMaxD

std::size_t clustering::kmeans::detail::kDirectArgminMaxD = 8

inlineconstexpr

Maximum d for the direct-compute argmin hot path.

At d <= this threshold the fused argmin-GEMM driver's packA + packB overhead dominates the handful of FMAs the microkernel performs, so the direct ||x - c||^2 formula with 8-row SIMD accumulators beats the packed-GEMM path. Measured on Zen5: crossover sits near d == 8 where the two paths tie; below that the direct path wins by the pack cost.

Definition at line 81 of file lloyd_fused_gemm.h.

Classes

Functions

Variables

Function Documentation

◆ greedyKmppLocalTrials()

◆ greedyKmppTransposedWidth()

◆ sqEuclideanRowAgainst16Transposed()

◆ sqEuclideanRowAgainst8Transposed()

◆ sqEuclideanRowAgainst8TransposedStrided()

◆ sqEuclideanRowToBatch()

◆ sqEuclideanRowToBatchAvx2()

◆ sqEuclideanRowToBatchAvx2Fixed() [1/2]

◆ sqEuclideanRowToBatchAvx2Fixed() [2/2]

Variable Documentation

◆ kDirectArgminMaxD