The cr.yp.to microblog: 2022.06.05 06:51:30

2022.06.05 06:51:30 (1533310520798740480) from Daniel J. Bernstein, replying to "Danilo (@oak_doak)" (1533247813538045952):

The growing corner of CPUs with AVX-512 can definitely do better with that than using similar AVX2 code, but the paper says "fastest sort for individual (non-tuple) keys on AVX2 and AVX-512", which I understand to mean fastest on CPUs with AVX-512 _and_ on CPUs with just AVX2.

Context

2022.06.04 23:39:13 (1533201734369103872) from Daniel J. Bernstein:

Tried Google's new vectorized quicksort code vqsort on Skylake, and timed Sorter() as ~8000 cycles for int32[256] (big chunk of code for a size-specific sorting network), ~19000 cycles for int32[1024] (non-constant-time). djbsort is 1230, 6286 (ct). Did I misuse vqsort somehow?
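For orientation, here is a minimal harness sketch, not the measurement setup behind the numbers above. It assumes the Sorter interface that vqsort exposed in Highway's hwy/contrib/sort/vqsort.h around this time, and it uses RDTSC reference cycles rather than a calibrated core-cycle counter.

    // Hedged sketch of timing vqsort's Sorter() on int32 arrays.
    // Assumptions: Highway's hwy/contrib/sort/vqsort.h provides hwy::Sorter
    // and hwy::SortAscending (as in 2022-era Highway); __rdtsc counts
    // reference cycles, which only approximate core cycles.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>
    #include <x86intrin.h>
    #include "hwy/contrib/sort/vqsort.h"

    static void bench(size_t n) {
      std::vector<int32_t> keys(n);
      std::mt19937 rng(12345);
      std::uniform_int_distribution<int32_t> dist;
      hwy::Sorter sorter;            // reusable sorter object (assumed API)
      uint64_t best = UINT64_MAX;
      for (int trial = 0; trial < 1000; ++trial) {
        for (auto& k : keys) k = dist(rng);  // fresh random keys each trial
        uint64_t t0 = __rdtsc();
        sorter(keys.data(), n, hwy::SortAscending());
        uint64_t t1 = __rdtsc();
        best = std::min(best, t1 - t0);
      }
      std::printf("int32[%zu]: ~%llu reference cycles (best of 1000)\n",
                  n, (unsigned long long)best);
    }

    int main() {
      bench(256);   // the ~8000-cycle case in the tweet
      bench(1024);  // the ~19000-cycle case in the tweet
      return 0;
    }

A serious comparison against djbsort would pin the clock frequency, read the actual core-cycle counter, and report medians over many input arrays; the sketch only shows where the Sorter() call sits inside the timing loop.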

2022.06.05 02:42:19 (1533247813538045952) from "Danilo (@oak_doak)":

What % of cycles were backend-bound in your tests? You can get this from perf's "stalled-cycles-backend" metric. Google likely benchmarked on Skylake server, which has a wildly different SoC architecture from the client (consumer) version (see wikichip).
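
For reference, a hedged way to collect that metric with perf stat, on CPUs and kernels where the generic event is mapped (./vqsort_bench is just a placeholder name for whatever benchmark binary is being measured):

    perf stat -e cycles,instructions,stalled-cycles-backend ./vqsort_bench

Dividing the reported stalled-cycles-backend count by the cycle count gives the backend-bound percentage being asked about.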