2022.06.07 20:01:14 (1534234038621245440) from Daniel J. Bernstein, replying to "Cat (@eigma)" (1534159835201257474):
The 16x8 clarification is useful, but the lack of timings is surprising, as is the "some use cases may be interested in smaller arrays" comment. Handling smaller arrays faster would make vqsort faster for _all_ sizes. The vqsort paper+code spend serious effort on the base case.
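[Editorial sketch, not part of the thread.] The claim that a faster base case helps every input size can be illustrated with a toy quicksort. This is not vqsort's code: the cutoff `BASE_CASE = 256`, the helper names, and the use of `sorted()` as a stand-in for a sorting network are all assumptions made for illustration. The sketch only counts how many elements end up being sorted by the base case.

```python
import random

BASE_CASE = 256  # hypothetical cutoff, not vqsort's actual threshold


def quicksort(a, stats):
    """Toy quicksort that records how much work reaches the base case."""
    if len(a) <= BASE_CASE:
        stats["calls"] += 1
        stats["elements"] += len(a)
        return sorted(a)  # stand-in for a size-specific sorting network
    pivot = a[random.randrange(len(a))]
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less, stats) + equal + quicksort(greater, stats)


def base_case_stats(n, seed=0):
    """Sort n random 31-bit integers and return the base-case counters."""
    random.seed(seed)
    data = [random.randrange(1 << 31) for _ in range(n)]
    stats = {"calls": 0, "elements": 0}
    result = quicksort(data, stats)
    assert result == sorted(data)
    return stats


if __name__ == "__main__":
    for n in (200, 1024, 100_000):
        s = base_case_stats(n)
        print(n, s["calls"], s["elements"] / n)
```

In this toy model almost every element is eventually sorted by a base-case call, whatever the total size, so any speedup in the base case carries over to large inputs as well.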
2022.06.04 23:39:13 (1533201734369103872) from Daniel J. Bernstein:
Tried Google's new vectorized quicksort code vqsort on Skylake, and timed Sorter() as ~8000 cycles for int32[256] (big chunk of code for a size-specific sorting network), ~19000 cycles for int32[1024] (non-constant-time). djbsort takes 1230 and 6286 cycles respectively (constant-time). Did I misuse vqsort somehow?
2022.06.07 15:06:22 (1534159835201257474) from "Cat (@eigma)":
Response here: https://github.com/google/highway/issues/736