Files
fzf/src/algo/SIMD.md
T
2026-03-02 15:49:28 +09:00

4.4 KiB
Raw Blame History

SIMD byte search: indexByteTwo / lastIndexByteTwo

What these functions do

indexByteTwo(s []byte, b1, b2 byte) int — returns the index of the first occurrence of b1 or b2 in s, or -1.

lastIndexByteTwo(s []byte, b1, b2 byte) int — returns the index of the last occurrence of b1 or b2 in s, or -1.

They are used by the fuzzy matching algorithm (algo.go) to skip ahead during case-insensitive search. Instead of calling bytes.IndexByte twice (once for lowercase, once for uppercase), a single SIMD pass finds both at once.

File layout

File Purpose
indexbyte2_arm64.go Go declarations (//go:noescape) for ARM64
indexbyte2_arm64.s ARM64 NEON assembly (32-byte aligned blocks, syndrome extraction)
indexbyte2_amd64.go Go declarations + AVX2 runtime detection for AMD64
indexbyte2_amd64.s AMD64 AVX2/SSE2 assembly with CPUID dispatch
indexbyte2_other.go Pure Go fallback for all other architectures
indexbyte2_test.go Unit tests, exhaustive tests, fuzz tests, and benchmarks

How the SIMD implementations work

ARM64 (NEON):

  • Broadcasts both needle bytes into NEON registers (VMOV).
  • Processes 32-byte aligned chunks. For each chunk, compares all bytes against both needles (VCMEQ), ORs the results (VORR), and builds a 64-bit syndrome with 2 bits per byte.
  • indexByteTwo uses RBIT + CLZ to find the lowest set bit (first match).
  • lastIndexByteTwo scans backward and uses CLZ on the raw syndrome to find the highest set bit (last match).
  • Handles alignment and partial first/last blocks with bit masking.
  • Adapted from Go's internal/bytealg/indexbyte_arm64.s.

AMD64 (AVX2 with SSE2 fallback):

  • At init time, cpuHasAVX2() checks CPUID + XGETBV for AVX2 and OS YMM support. The result is cached in _useAVX2.
  • AVX2 path (inputs >= 32 bytes, when available):
    • Broadcasts both needles via VPBROADCASTB.
    • Processes 32-byte blocks: VPCMPEQB against both needles, VPOR, then VPMOVMSKB to get a 32-bit mask.
    • 5 instructions per loop iteration (vs 7 for SSE2) at 2x the throughput.
    • VZEROUPPER before every return to avoid SSE/AVX transition penalties.
  • SSE2 fallback (inputs < 32 bytes, or CPUs without AVX2):
    • Broadcasts via PUNPCKLBW + PSHUFL.
    • Processes 16-byte blocks: PCMPEQB, POR, PMOVMSKB.
    • Small inputs (<16 bytes) are handled with page-boundary-safe loads.
  • Both paths use BSFL (forward) / BSRL (reverse) for bit scanning.
  • Adapted from Go's internal/bytealg/indexbyte_amd64.s.

Fallback (other platforms):

  • indexByteTwo uses two bytes.IndexByte calls with scope-limiting (search b1 first, then limit the b2 search to s[:i1]).
  • lastIndexByteTwo uses a simple backward for loop.

Running tests

# Unit + exhaustive tests
go test ./src/algo/ -run 'TestIndexByteTwo|TestLastIndexByteTwo' -v

# Fuzz tests (run for 10 seconds each)
go test ./src/algo/ -run '^$' -fuzz FuzzIndexByteTwo -fuzztime 10s
go test ./src/algo/ -run '^$' -fuzz FuzzLastIndexByteTwo -fuzztime 10s

# Cross-architecture: test amd64 on an arm64 Mac (via Rosetta)
GOARCH=amd64 go test ./src/algo/ -run 'TestIndexByteTwo|TestLastIndexByteTwo' -v
GOARCH=amd64 go test ./src/algo/ -run '^$' -fuzz FuzzIndexByteTwo -fuzztime 10s
GOARCH=amd64 go test ./src/algo/ -run '^$' -fuzz FuzzLastIndexByteTwo -fuzztime 10s

Running micro-benchmarks

# All indexByteTwo / lastIndexByteTwo benchmarks
go test ./src/algo/ -bench 'IndexByteTwo' -benchmem

# Specific size
go test ./src/algo/ -bench 'IndexByteTwo_1000'

Each benchmark compares the SIMD asm implementation against reference implementations (2xIndexByte using bytes.IndexByte, and a simple loop).

Correctness verification

The assembly is verified by three layers of testing:

  1. Table-driven tests — known inputs with expected outputs.
  2. Exhaustive tests — all lengths 0256, every match position, no-match cases, and both-bytes-present cases, compared against a simple loop reference.
  3. Fuzz tests — randomized inputs via testing.F, compared against the same loop reference.