The cr.yp.to microblog: 2022.06.04 23:39:13

2022.06.04 23:39:13 (1533201734369103872) from Daniel J. Bernstein:

Tried Google's new vectorized quicksort code vqsort on Skylake, and timed Sorter() as ~8000 cycles for int32[256] (big chunk of code for a size-specific sorting network), ~19000 cycles for int32[1024] (non-constant-time). djbsort is 1230, 6286 (ct). Did I misuse vqsort somehow?

2022.06.04 23:44:01 (1533202940818640897) from Daniel J. Bernstein:

I included algo-inl.h and vqsort.h, did auto s = Sorter() and SortAscending order, and then did s(x,N,order) on an array x of N int32_t values. Size seems ok; I checked that the output is sorted correctly and (in a separate run to not slow things down) that asan doesn't complain.

2022.06.04 23:50:39 (1533204610311106560) from Daniel J. Bernstein:

I also checked in gdb that the dispatch is calling the AVX2 code (presumably the fastest vqsort option for Skylake, as opposed to machines with AVX-512 such as Skylake-X). I didn't notice an API with lower per-call overhead, and per-call can't explain the jump from 8000 to 19000.

2022.06.05 00:04:21 (1533208059811536896) from Daniel J. Bernstein:

I also looked briefly at the vqsort paper https://arxiv.org/pdf/2205.05982.pdf, which says "outperforms state of the art architecture-specific algorithms". Section 3 describes vqsort's size-256 sorting network, but is missing microbenchmarks and comparisons to, e.g., sid1607, djbsort, vxsort.

2022.06.05 00:08:37 (1533209131904925697) from Daniel J. Bernstein:

Oops, sorry, the "outperforms" quote is from the accompanying blog post https://opensource.googleblog.com/2022/06/Vectorized%20and%20performance%20portable%20Quicksort.html. The paper says that vqsort is "the fastest sorting implementation known to us for commercially available shared-memory machines". Should I maybe have added some extra cmake options?

2022.06.05 00:12:23 (1533210079540875264) from Daniel J. Bernstein:

Another theory is that the speeds depend on a newer compiler version than what I tried, but this theory doesn't seem compatible with the "performance portability" and "production-readiness" parts of the advertising. I also tried a Broadwell with gcc 10.2.1, which isn't so old.