Follow-up to PR #978 (which resolves #975 — CPU flash-attention regression).
PR #978 fixes the CPU-FA regression on MoE models (beats upstream on Qwen3-30B-A3B-Q4_K_M by +12 % pp / +3 % tg on Xeon Gold 6338 with -fa on). On the dense Llama-3.1-8B-Instruct-Q4_K_M reproducer from #975, however, our build still trails upstream by ~22 % pp and ~7 % tg — and the gap is present equally with -fa on and -fa off, so it's not FA-related.
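Bench (Xeon Gold 6338, Llama-3.1-8B-Instruct-Q4_K_M, --gpu disable):

| Build | -fa on (pp / tg) | -fa off (pp / tg) |
|---|---|---|
| upstream llama.cpp @ same submodule SHA (7b8443ac7) | 93.72 ± 0.26 / 12.73 ± 0.06 | 87.59 ± 1.76 / 12.11 ± 0.37 |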
The #975 FA regression is genuinely fixed (FA-on pp 30.54 → 72.93 = +139 %, FA-on tg 6.12 → 11.85 = +94 %) and FA is again a slight pp win in our build, matching upstream's shape. So this isn't an FA issue — it's a separate dense-path issue exposed once FA is fixed.
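As a quick sanity check, the quoted percentages can be re-derived from the raw throughput figures. This assumes the ~22 % / ~7 % gaps compare our post-#978 FA-on numbers (72.93 pp, 11.85 tg) against upstream's FA-on numbers (93.72 pp, 12.73 tg):

```latex
\begin{aligned}
\text{FA-on pp gain from \#978:} \quad & \frac{72.93}{30.54} - 1 \approx 1.39 \;(+139\,\%) \\
\text{FA-on tg gain from \#978:} \quad & \frac{11.85}{6.12} - 1 \approx 0.94 \;(+94\,\%) \\
\text{remaining pp gap vs upstream:} \quad & 1 - \frac{72.93}{93.72} \approx 0.22 \\
\text{remaining tg gap vs upstream:} \quad & 1 - \frac{11.85}{12.73} \approx 0.07
\end{aligned}
```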
Hypothesis
The same root cause #978 fixed for three helpers — cosmocc compiles ggml-cpu/ops.cpp and vec.cpp at baseline x86_64 ISA (AVX2 at best, no AVX-512) so the APE binary stays portable — applies to many other helpers in those files. Upstream's cmake -DGGML_NATIVE=ON build gets -march=native → AVX-512 for all of them. On MoE the MoE-matmul (handled by our iqk kernels with proper AVX-512) dominates the profile, hiding the rest. On dense, the residual helpers take a bigger share.
Candidate helpers (in priority order, guesses pending perf-record)
tinyBLAS<f32, f32, f32>::gemm — the f32×f32 sgemm path used by attention's Q@K^T and softmax@V matmuls. Already showed up at 0.49 % in the patched-FA-on Xeon perf; on dense it's probably much higher.
ggml_compute_forward_rms_norm — per-token RMS norm, runs once per layer. Already showed up at 0.18 % on MoE; relatively larger on dense.
ggml_vec_soft_max_f32 — softmax inner loop. Showed up at 0.67 % on MoE pp.
ggml_compute_forward_rope_flt<float> — RoPE applied to Q and K per layer.
Smaller ones: ggml_vec_scale_f32, ggml_vec_add_f32, ggml_compute_forward_mul, etc.
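For orientation, here is a scalar reference for the kind of work the RMS-norm candidate does, using the standard definition y = x / sqrt(mean(x²) + eps) without the learned weight multiply. It is an illustrative sketch, not ggml's implementation; `rms_norm_ref` is a hypothetical name. A reference like this is also what a multi-arch wrapper could be diffed against.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar RMS norm over one row: y[i] = x[i] / sqrt(mean(x^2) + eps).
// The sum of squares is accumulated in double to keep the reference stable.
inline std::vector<float> rms_norm_ref(const std::vector<float>& x, float eps) {
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += static_cast<double>(v) * v;
    }
    const double mean_sq = x.empty() ? 0.0 : sum_sq / static_cast<double>(x.size());
    const float scale = static_cast<float>(1.0 / std::sqrt(mean_sq + eps));

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale;
    }
    return y;
}
```

Both passes over the row (the reduction and the scale) are simple streaming loops, which is why their throughput depends so directly on the vector width the compiler was allowed to use.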
Proposed approach
Same architectural pattern as #978's fa_helpers and fa_simd_gemm:
Profile dense Llama-3.1-8B -fa off on Xeon (perf record -F 99 -g --call-graph dwarf); diff top symbols vs upstream.
For each helper that dominates and isn't getting AVX-512 codegen in our build, add a llamafile-side multi-arch wrapper (compiled with -Xx86_64-mavx512f etc. via cosmocc per-target flags) and a thin hook in the call site. Runtime-dispatched via the existing sgemm.cpp GemmFuncs struct. APE-portable.
Add numerical-equivalence tests to tests/fa_helpers_test.cpp for each new wrapper.
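The steps above can be sketched roughly as follows for a single small helper. This is a minimal illustration, not the #978 code: the `HelperFuncs` struct, `pick_helpers`, and `vec_scale_matches_reference` are hypothetical names, and the layout of the real `GemmFuncs` struct is not assumed. It also uses the GCC/Clang `target("avx512f")` function attribute in place of cosmocc's per-target flags so that it fits in one file.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_X86_DISPATCH 1
#endif

// Portable fallback, compiled at whatever baseline ISA the build uses.
static void vec_scale_scalar(int n, float* y, float v) {
    for (int i = 0; i < n; ++i) y[i] *= v;
}

#ifdef SKETCH_HAVE_X86_DISPATCH
// AVX-512 variant. The target attribute enables AVX-512F for this function
// only, so the rest of the file can stay at the portable baseline.
__attribute__((target("avx512f")))
static void vec_scale_avx512(int n, float* y, float v) {
    const __m512 vv = _mm512_set1_ps(v);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        _mm512_storeu_ps(y + i, _mm512_mul_ps(_mm512_loadu_ps(y + i), vv));
    }
    for (; i < n; ++i) y[i] *= v;  // scalar tail
}
#endif

// Hypothetical dispatch table, standing in for the real one.
struct HelperFuncs {
    void (*vec_scale)(int n, float* y, float v);
};

// Picks the widest implementation the running CPU supports.
static HelperFuncs pick_helpers() {
    HelperFuncs f{vec_scale_scalar};
#ifdef SKETCH_HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) f.vec_scale = vec_scale_avx512;
#endif
    return f;
}

// Numerical-equivalence check: dispatched path vs scalar reference.
static bool vec_scale_matches_reference(int n, float v, float tol) {
    std::vector<float> ref(n), out(n);
    for (int i = 0; i < n; ++i) ref[i] = out[i] = 0.25f * i - 3.0f;
    vec_scale_scalar(n, ref.data(), v);
    pick_helpers().vec_scale(n, out.data(), v);
    for (int i = 0; i < n; ++i) {
        if (std::fabs(ref[i] - out[i]) > tol) return false;
    }
    return true;
}
```

An elementwise scale should match the reference exactly. Helpers that reduce (RMS norm, softmax) sum in a different order under SIMD, so their equivalence tests would likely need a small non-zero tolerance.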
Target: close the ~22 % pp gap on dense models to match upstream's CPU performance.
Hardware / artifacts
/tmp/bench-llama8b.sh on the Xeon box.
7b8443ac7 checked out at ~/workspace/llama.cpp/ (the same submodule SHA we ship, for apples-to-apples).
Related