2018.02.08 18:53:55 (961659370382741506) from Daniel J. Bernstein, replying to "Allan MacKinnon (@pixelio)" (961644000682721280):
This is int32_sort_avx2(int32 *x,long long xlen). Works for any xlen; constant time for each xlen. I've seen previous AVX2 sorting networks for specific small xlen but nothing competitive for larger xlen such as 1000 (or 1000000). Of course people use sorting networks on GPUs.
2018.02.08 17:11:35 (961633619067527169) from Daniel J. Bernstein:
10000 Haswell cycles to sort 1024 32-bit integers: 13x faster than radix sort in Intel's Integrated Performance Primitives library, 2.6x faster than sorting code from NTRU Prime paper. Working on formal verification.
2018.02.08 17:52:50 (961644000682721280) from "Allan MacKinnon (@pixelio)":
I think you'll find others have implemented nearly ideal block sorters (you call constant time) using bitonic networks and AVX2. It's a critical primitive in a general purpose sorting algorithm. Many valid choices in *merging* sorted blocks follow...