Replace comparison-based pdqsort with LSD radix sort on the uint64
sort key. Radix sort is O(n) vs O(n log n) and avoids pointer-chasing
cache misses in the comparison function. Sort scratch buffer is reused
across iterations to reduce GC pressure.
Benchmark (single-threaded, Chromium file list):
- linux query (180K matches): ~16% faster
- src query (high match count): ~31% faster
- Rare matches: equivalent (falls back to pdqsort for n < 128)
Extend the uint64 rank comparison trick (comparing [4]uint16 as a
single uint64) to arm64 builds. ARM64 is little-endian like x86, so
the same unsafe.Pointer cast produces correct lexicographic ordering.
This replaces a 4-iteration loop with a single uint64 comparison,
speeding up the sort phase.
Chromium file list, single-threaded:
linux: 126ms -> 126ms (sort not dominant)
src: 462ms -> 438ms (-5%, sort-heavy)