The cr.yp.to microblog: 2020.06.20 20:42:09

2020.06.20 20:42:09 (1274412239512981505) from Daniel J. Bernstein, replying to "Travis Downs (@trav_downs)" (1274077488617000962):

It's not just in sorting that papers systematically downplay the competition! For crypto we've built a centralized cross-platform benchmarking framework with an easy way for people to submit improved code. Sorting small int32 arrays is one part of this: https://bench.cr.yp.to/impl-sort/int32.html

2020.06.20 20:54:44 (1274415403838529536) from Daniel J. Bernstein:

The graphs there are for sorting int32[768], a typical size of interest in crypto. Min-max instructions and vectorization make sorting networks much faster than radix sort on Intel chips. For much larger arrays the main optimization game would be to reduce cache misses.

2020.06.20 21:03:17 (1274417555851972609) from Daniel J. Bernstein:

Am I correctly understanding that VxSort's AVX2 code takes 4ns on 2.8GHz Kaby Lake to sort int32[1000], around 11000 cycles? The graph shows 4897 Kaby Lake cycles (without Turbo Boost) for int32[768]; the same code (avx2 from djbsort) takes 6054 Kaby Lake cycles for int32[1024].

Context

2020.06.19 13:42:45 (1273944305363824645) from "damageboy (@damageboy)":

Does anyone know If there is a world record for single threaded primitive sorting? Or am I allowed to just proclaim the title and wait for someone to call my BS and have a throw down In a caged ring? (Asking for a friend who has a long night of #avx512 debugging ahead of...)

2020.06.19 22:26:31 (1274076112780492800) from "Travis Downs (@trav_downs)", replying to "damageboy (@damageboy)" (1273944305363824645):

claims djbsort holds (or held at the moment that was written) the record for in-memory 32-bit integer sorting: https://sorting.cr.yp.to/ IIUC it's AVX2, not sure if there is an AVX-512 version.

2020.06.19 22:31:59 (1274077488617000962) from "Travis Downs (@trav_downs)", replying to "Travis Downs (@trav_downs)" (1274076112780492800):

You can find many papers which claim their sort is fastest *out of the competitors they tried*, which is always a non-exhaustive list, and probably using a favorable input distribution. Here's an AVX-512 one with promising results: https://arxiv.org/pdf/1704.08579.pdf There are many more.