The cr.yp.to microblog: 2022.06.05 18:15:39

2022.06.05 18:15:39 (1533482694482403328) from Daniel J. Bernstein, replying to "Jacob Christian Munch-Andersen (@NoHatCoder)" (1533358075314323456):

Ran a loop of 33 rdtsc+vqsort calls, each >8000 cycles for the smaller size that I mentioned. One always expects initial calls to be outliers (not just for AVX2 ramp-up; the big starting issue is code caching); djbsort's int32-speed (https://sorting.cr.yp.to/speed.html) reports medians and quartiles.

2022.06.05 18:19:57 (1533483773773221888) from Daniel J. Bernstein:

AVX2 usage has also become so pervasive in typical code that it's not surprising for the CPU to always have the AVX2 unit warmed up; cooldown is triggered after millions of non-AVX2 cycles. But the more important point is to always check for variations across many measurements.
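
As an illustration of the measurement approach described in the two posts above (many consecutive timings, summarized by median and quartiles rather than a single number), here is a minimal sketch. It is not the benchmark code used in the thread: time.perf_counter_ns stands in for rdtsc, Python's built-in list sort stands in for the routine under test, and the names time_calls and summarize are hypothetical helpers introduced only for this example.

```python
import random
import statistics
import time


def time_calls(fn, make_input, runs=33):
    """Time `runs` consecutive calls of fn, each on a fresh input.

    Returns one elapsed-time sample per call, in nanoseconds.
    (Assumption: perf_counter_ns is a stand-in for a cycle counter.)
    """
    samples = []
    for _ in range(runs):
        data = make_input()  # build the input outside the timed region
        start = time.perf_counter_ns()
        fn(data)
        samples.append(time.perf_counter_ns() - start)
    return samples


def summarize(samples):
    """Return (first quartile, median, third quartile) of the samples."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q1, median, q3


if __name__ == "__main__":
    rng = random.Random(0)

    def make_input():
        # 256 random 32-bit signed integers, mirroring the int32[256] case.
        return [rng.randrange(-2**31, 2**31) for _ in range(256)]

    samples = time_calls(lambda a: a.sort(), make_input)
    q1, median, q3 = summarize(samples)
    # Print the first few samples separately: initial calls are often outliers.
    print("first calls (ns):", samples[:3])
    print("quartiles (ns):", q1, median, q3)
```

Keeping every sample, instead of only an average, makes warm-up outliers visible in the first few entries while the median and quartiles describe the steady state.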

Context

2022.06.04 23:39:13 (1533201734369103872) from Daniel J. Bernstein:

Tried Google's new vectorized quicksort code vqsort on Skylake, and timed Sorter() as ~8000 cycles for int32[256] (big chunk of code for a size-specific sorting network), ~19000 cycles for int32[1024] (non-constant-time). djbsort is 1230, 6286 (ct). Did I misuse vqsort somehow?

2022.06.05 00:40:19 (1533217109274140672) from "Ruben Kelevra (@RubenKelevra)":

Which type of processor did you use? Some Intel processors reduce their clock speed if you use a lot of AVX instructions for a short while, which may show up as wrong numbers.

2022.06.05 03:39:55 (1533262309555965952) from Daniel J. Bernstein, replying to "Ruben Kelevra (@RubenKelevra)" (1533217109274140672):

Intel Xeon E3-1220 v5, pinned at 3GHz. Turbo Boost (which would be 3.5GHz) disabled. No evidence of any AVX2 throttling. Reasonable cooling, no evidence of thermal throttling, plus these were very short single-core runs. Both of the pieces of code being benchmarked were AVX2.

2022.06.05 10:00:28 (1533358075314323456) from "Jacob Christian Munch-Andersen (@NoHatCoder)":

What do you mean "very short"? There is a delay from when you first start issuing 256-bit instructions until the core has powered on the relevant circuitry. Even if the clock doesn't go down you will generally take a hit of several µs.