diff --git a/src/algo/SIMD.md b/src/algo/SIMD.md
new file mode 100644
index 00000000..0cffe612
--- /dev/null
+++ b/src/algo/SIMD.md
@@ -0,0 +1,99 @@
+# SIMD byte search: `indexByteTwo` / `lastIndexByteTwo`
+
+## What these functions do
+
+`indexByteTwo(s []byte, b1, b2 byte) int` returns the index of the
+**first** occurrence of `b1` or `b2` in `s`, or `-1`.
+
+`lastIndexByteTwo(s []byte, b1, b2 byte) int` returns the index of the
+**last** occurrence of `b1` or `b2` in `s`, or `-1`.
+
+They are used by the fuzzy matching algorithm (`algo.go`) to skip ahead
+during case-insensitive search. Instead of calling `bytes.IndexByte` twice
+(once for lowercase, once for uppercase), a single SIMD pass finds both at
+once.
+
+## File layout
+
+| File | Purpose |
+| ------ | --------- |
+| `indexbyte2_arm64.go` | Go declarations (`//go:noescape`) for ARM64 |
+| `indexbyte2_arm64.s` | ARM64 NEON assembly (32-byte aligned blocks, syndrome extraction) |
+| `indexbyte2_amd64.go` | Go declarations + AVX2 runtime detection for AMD64 |
+| `indexbyte2_amd64.s` | AMD64 AVX2/SSE2 assembly with CPUID dispatch |
+| `indexbyte2_other.go` | Pure Go fallback for all other architectures |
+| `indexbyte2_test.go` | Unit tests, exhaustive tests, fuzz tests, and benchmarks |
+
+## How the SIMD implementations work
+
+**ARM64 (NEON):**
+- Broadcasts both needle bytes into NEON registers (`VMOV`).
+- Processes 32-byte aligned chunks. For each chunk, compares all bytes
+  against both needles (`VCMEQ`), ORs the results (`VORR`), and builds a
+  64-bit syndrome with 2 bits per byte.
+- `indexByteTwo` uses `RBIT` + `CLZ` to find the lowest set bit (first match).
+- `lastIndexByteTwo` scans backward and uses `CLZ` on the raw syndrome to
+  find the highest set bit (last match).
+- Handles alignment and partial first/last blocks with bit masking.
+- Adapted from Go's `internal/bytealg/indexbyte_arm64.s`.
+
+**AMD64 (AVX2 with SSE2 fallback):**
+- At init time, `cpuHasAVX2()` checks CPUID + XGETBV for AVX2 and OS YMM
+  support. The result is cached in `_useAVX2`.
+- **AVX2 path** (inputs >= 32 bytes, when available):
+  - Broadcasts both needles via `VPBROADCASTB`.
+  - Processes 32-byte blocks: `VPCMPEQB` against both needles, `VPOR`, then
+    `VPMOVMSKB` to get a 32-bit mask.
+  - 5 instructions per loop iteration (vs 7 for SSE2) while covering twice as many bytes.
+  - `VZEROUPPER` before every return to avoid SSE/AVX transition penalties.
+- **SSE2 fallback** (inputs < 32 bytes, or CPUs without AVX2):
+  - Broadcasts via `PUNPCKLBW` + `PSHUFL`.
+  - Processes 16-byte blocks: `PCMPEQB`, `POR`, `PMOVMSKB`.
+  - Small inputs (<16 bytes) are handled with page-boundary-safe loads.
+- Both paths use `BSFL` (forward) / `BSRL` (reverse) for bit scanning.
+- Adapted from Go's `internal/bytealg/indexbyte_amd64.s`.
+
+**Fallback (other platforms):**
+- `indexByteTwo` uses two `bytes.IndexByte` calls with scope-limiting
+  (search `b1` first, then limit the `b2` search to `s[:i1]`).
+- `lastIndexByteTwo` uses a simple backward `for` loop.
+
+## Running tests
+
+```bash
+# Unit + exhaustive tests
+go test ./src/algo/ -run 'TestIndexByteTwo|TestLastIndexByteTwo' -v
+
+# Fuzz tests (run for 10 seconds each)
+go test ./src/algo/ -run '^$' -fuzz FuzzIndexByteTwo -fuzztime 10s
+go test ./src/algo/ -run '^$' -fuzz FuzzLastIndexByteTwo -fuzztime 10s
+
+# Cross-architecture: test amd64 on an arm64 Mac (via Rosetta)
+GOARCH=amd64 go test ./src/algo/ -run 'TestIndexByteTwo|TestLastIndexByteTwo' -v
+GOARCH=amd64 go test ./src/algo/ -run '^$' -fuzz FuzzIndexByteTwo -fuzztime 10s
+GOARCH=amd64 go test ./src/algo/ -run '^$' -fuzz FuzzLastIndexByteTwo -fuzztime 10s
+```
+
+## Running micro-benchmarks
+
+```bash
+# All indexByteTwo / lastIndexByteTwo benchmarks
+go test ./src/algo/ -bench 'IndexByteTwo' -benchmem
+
+# Specific size
+go test ./src/algo/ -bench 'IndexByteTwo_1000'
+```
+
+Each benchmark compares the SIMD `asm` implementation against reference
+implementations (`2xIndexByte` using `bytes.IndexByte`, and a simple `loop`).
+
+## Correctness verification
+
+The assembly is verified by three layers of testing:
+
+1. **Table-driven tests**: known inputs with expected outputs.
+2. **Exhaustive tests**: all lengths 0–256, every match position, no-match
+   cases, and both-bytes-present cases, compared against a simple loop
+   reference.
+3. **Fuzz tests**: randomized inputs via `testing.F`, compared against the
+   same loop reference.