The cr.yp.to microblog: 2022.06.05 21:27:27

2022.06.05 21:27:27 (1533530960297201664) from Daniel J. Bernstein, replying to "Jacob Christian Munch-Andersen (@NoHatCoder)" (1533519967488024577):

Um, page 155 of https://agner.org/optimize/microarchitecture.pdf reports full-power AVX2 after 56000 cycles on almost the same CPU. I measure first vqsort int32[256] call above 50000 cycles, then ~10000 for the next three runs, then rapidly settling down to around 8000. (djbsort: 4615, 2026, 1361, etc.)

2022.06.05 21:38:27 (1533533728403644416) from Daniel J. Bernstein:

First-call performance in this type of benchmark isn't interesting for applications that keep their main-loop code size under control; that's why I reported the stable ~8000-cycle figure. For people familiar with the Skylake performance characteristics, >30 runs are ample data.

2022.06.05 21:49:48 (1533536587488690176) from Daniel J. Bernstein:

I understand that many people aren't immersed in CPU microarchitecture, so I've now run a 3-second sequence of 1048576 calls to rdtsc+vqsort int32[256] on the same Skylake. An average call takes 8292 cycles, 6x slower than djbsort. (rdtsc and other loop overheads use <30 cycles.)
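
The figures in this post can be cross-checked with a little arithmetic. The sketch below is not part of the original measurement. It assumes a 3 GHz clock (the value used later in the thread) and takes the 1361-cycle number quoted earlier as a stand-in for djbsort's settled time.

```python
# Figures quoted in the posts above.
CALLS = 1_048_576        # number of rdtsc+vqsort calls in the long run
AVG_CYCLES = 8292        # average cycles per vqsort int32[256] call
DJBSORT_CYCLES = 1361    # assumption: third djbsort figure, used as the settled time
CLOCK_HZ = 3.0e9         # assumption: 3 GHz clock, as used later in the thread

total_cycles = CALLS * AVG_CYCLES
seconds = total_cycles / CLOCK_HZ      # wall-clock time of the whole run
ratio = AVG_CYCLES / DJBSORT_CYCLES    # vqsort time relative to djbsort

print(f"total cycles: {total_cycles}")
print(f"run time at 3 GHz: {seconds:.2f} s")
print(f"vqsort/djbsort: {ratio:.2f}x")
```

Under these assumptions the run lasts just under 3 seconds and the ratio is slightly above 6, consistent with the "3-second" and "6x" figures in the post.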

2022.06.05 22:25:29 (1533545567858462720) from Daniel J. Bernstein:

Tweaking the bench_sort from vqsort to use M = 256 reports 364 MB/s, i.e., 8.24 cycles/byte at 3GHz, which is around 8400 cycles. M = 1024 gives 645 MB/s, i.e., 4.68 cycles/byte, above 19000 cycles. Looks like a bit more timing overhead than my vqsort test, but basically matches.
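
The conversion from throughput to cycles per call can be sketched as follows. The function name is hypothetical, and the sketch assumes that M counts int32 elements (4 bytes each) and that MB means 10^6 bytes.

```python
CLOCK_HZ = 3.0e9       # 3 GHz, as stated in the post
BYTES_PER_INT32 = 4

def cycles_per_call(mb_per_s, n_elements, clock_hz=CLOCK_HZ):
    """Convert throughput in MB/s (assumed 10**6 bytes/s) to cycles per sort call.

    Returns (cycles_per_byte, cycles_per_call).
    """
    cycles_per_byte = clock_hz / (mb_per_s * 1e6)
    return cycles_per_byte, cycles_per_byte * n_elements * BYTES_PER_INT32

for m, mbps in [(256, 364), (1024, 645)]:
    cpb, cyc = cycles_per_call(mbps, m)
    print(f"M = {m}: {cpb:.2f} cycles/byte, about {cyc:.0f} cycles per call")
```

With these assumptions, M = 256 gives about 8.24 cycles/byte and roughly 8400 cycles per call. For M = 1024 the computed value is near 4.65 cycles/byte rather than the quoted 4.68; either value puts the call above 19000 cycles.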

Context

2022.06.05 10:00:28 (1533358075314323456) from "Jacob Christian Munch-Andersen (@NoHatCoder)":

What do you mean "very short"? There is a delay from you first start issuing 256 bit instructions until the core has powered on the relevant circuitry. Even if the clock doesn't go down you will generally take a hit of several µs.

2022.06.05 18:15:39 (1533482694482403328) from Daniel J. Bernstein, replying to "Jacob Christian Munch-Andersen (@NoHatCoder)" (1533358075314323456):

Ran a loop of 33 rdtsc+vqsort, each >8000 cycles for the smaller size that I mentioned. One always expects initial calls to be outliers (not just for AVX2 ramp-up; the big starting issue is code caching); djbsort's int32-speed (https://sorting.cr.yp.to/speed.html) says medians and quartiles.
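
The idea of timing many consecutive calls and reporting medians and quartiles, rather than trusting any single call, can be illustrated with a minimal sketch. This is not the harness used in the thread: `time.perf_counter_ns` stands in for rdtsc, Python's built-in `sorted` stands in for the sorting routine, and the helper names are hypothetical.

```python
import statistics
import time

def time_calls(fn, runs=33):
    """Time `runs` consecutive calls of fn; return per-call durations in ns."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples

def summarize(samples):
    """Return (first quartile, median, third quartile) of the samples."""
    q1, med, q3 = statistics.quantiles(samples, n=4)
    return q1, med, q3

data = list(range(256, 0, -1))     # hypothetical workload: 256 integers
samples = time_calls(lambda: sorted(data))
q1, med, q3 = summarize(samples)

# The first call is printed separately because it is often an outlier.
print(f"first call: {samples[0]} ns")
print(f"quartiles: {q1:.0f} / {med:.0f} / {q3:.0f} ns")
```

Quartiles are much less sensitive to a slow first call than a mean over the same samples, which is why they suit short benchmark loops like this one.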

2022.06.05 18:19:57 (1533483773773221888) from Daniel J. Bernstein:

AVX2 usage has also become so pervasive in typical code that it's not surprising for the CPU to always have the AVX2 unit warmed up; cooldown is triggered after millions of non-AVX2 cycles. But the more important point is to always check for variations across many measurements.

2022.06.05 20:43:46 (1533519967488024577) from "Jacob Christian Munch-Andersen (@NoHatCoder)":

So 33*8000 cycles, that is a tiny benchmark. I'm not sure why one algorithm would hit a consistent hiccup, and the other wouldn't, but stranger things have happened. As for AVX2, latest Steam hardware survey says 88% adoption. Most modern code is single path 128 bit.