
Follow-up to #978: dense-model CPU performance gap vs upstream llama.cpp #980

@aittalam

Follow-up to PR #978 (which resolves #975 — CPU flash-attention regression).

PR #978 fixes the CPU-FA regression on MoE models: with -fa on, our build beats upstream on Qwen3-30B-A3B-Q4_K_M by +12 % pp (prompt processing) / +3 % tg (token generation) on Xeon Gold 6338. On the dense Llama-3.1-8B-Instruct-Q4_K_M reproducer from #975, however, our build still trails upstream by ~22 % pp and ~7 % tg. The gap is present with both -fa on and -fa off, so it's not FA-related.

Bench (Xeon Gold 6338, Llama-3.1-8B-Instruct-Q4_K_M, --gpu disable)

| build | -fa on (pp / tg) | -fa off (pp / tg) |
|---|---|---|
| llamafile 0.9.3 (from #975) | 63.90 / 13.04 | 76.29 / 16.32 |
| llamafile 0.10.1 + #974 (from #975) | 30.54 / 6.12 | 63.70 / 13.84 |
| llamafile + #978 | 72.93 ± 0.17 / 11.85 ± 0.22 | 68.31 ± 1.48 / 11.43 ± 0.23 |
| upstream llama.cpp @ same submodule SHA (7b8443ac7) | 93.72 ± 0.26 / 12.73 ± 0.06 | 87.59 ± 1.76 / 12.11 ± 0.37 |

The #975 FA regression is genuinely fixed (FA-on pp 30.54 → 72.93 = +139 %, FA-on tg 6.12 → 11.85 = +94 %) and FA is again a slight pp win in our build, matching upstream's shape. So this isn't an FA issue — it's a separate dense-path issue exposed once FA is fixed.

Hypothesis

The root cause that #978 fixed for three helpers also applies to many other helpers in the same files: cosmocc compiles ggml-cpu/ops.cpp and vec.cpp at baseline x86_64 ISA (AVX2 at best, no AVX-512) so the APE binary stays portable. Upstream's cmake -DGGML_NATIVE=ON build gets -march=native, and therefore AVX-512 on this Xeon, for all of them. On MoE models the MoE matmul (handled by our iqk kernels with proper AVX-512) dominates the profile and hides the rest. On dense models the residual helpers take a bigger share.
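
To illustrate why compile flags matter here: to the extent a helper picks its vector width through predefined compiler macros, that choice is fixed when the translation unit is built, regardless of the CPU it later runs on. A minimal sketch, assuming GCC/Clang-style predefined macros; compiled_simd_level is a hypothetical helper, not part of llamafile or ggml.

```cpp
#include <string>

// Widest x86 SIMD level this translation unit was compiled for.
// These macros reflect compiler flags (-mavx2, -mavx512f, -march=native),
// not the capabilities of the host CPU at run time.
std::string compiled_simd_level() {
#if defined(__AVX512F__)
    return "avx512f";
#elif defined(__AVX2__)
    return "avx2";
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSE2__)
    return "sse2";
#else
    return "scalar-or-non-x86";
#endif
}
```

Dropping a call like this into a file built by each toolchain would be one cheap way to confirm the hypothesis before profiling: a file built with -march=native on an AVX-512 machine should report "avx512f", while a baseline build should not.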

Candidate helpers (in priority order, guesses pending perf-record)

  1. tinyBLAS<f32, f32, f32>::gemm — the f32×f32 sgemm path used by attention's Q@K^T and softmax@V matmuls. Already showed up at 0.49 % in the patched-FA-on Xeon perf profile; on dense it's probably much higher.
  2. ggml_compute_forward_rms_norm — per-token RMS norm, run in every layer (twice per layer in Llama: before attention and before the FFN). Already showed up at 0.18 % on MoE; likely a larger share on dense.
  3. ggml_vec_soft_max_f32 — softmax inner loop. Showed up at 0.67 % on MoE pp.
  4. ggml_compute_forward_rope_flt<float> — RoPE applied to Q and K per layer.
  5. Smaller ones: ggml_vec_scale_f32, ggml_vec_add_f32, ggml_compute_forward_mul, etc.
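
For orientation, the per-row arithmetic behind candidate 2 is a sum-of-squares reduction followed by a scaling pass, both of which benefit from wider vectors. The following is a scalar reference sketch only; rms_norm_ref is a hypothetical name and this is not ggml's code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar reference for one row: y[i] = x[i] / sqrt(mean(x^2) + eps).
std::vector<float> rms_norm_ref(const std::vector<float>& x, float eps) {
    std::vector<float> y(x.size());
    if (x.empty()) return y;

    // Pass 1: reduction. Accumulate in double to limit rounding error.
    double sum_sq = 0.0;
    for (float v : x) sum_sq += static_cast<double>(v) * v;

    const float mean = static_cast<float>(sum_sq / static_cast<double>(x.size()));
    const float scale = 1.0f / std::sqrt(mean + eps);

    // Pass 2: element-wise scale.
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale;
    return y;
}
```

A reference like this could also serve as the ground truth when checking a vectorized wrapper for numerical equivalence.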

Proposed approach

Same architectural pattern as #978's fa_helpers and fa_simd_gemm:

  • Profile dense Llama-3.1-8B -fa off on Xeon (perf record -F 99 -g --call-graph dwarf); diff top symbols vs upstream.
  • For each helper that dominates and isn't getting AVX-512 codegen in our build, add a llamafile-side multi-arch wrapper (compiled with -Xx86_64-mavx512f etc. via cosmocc per-target flags) and a thin hook in the call site. Runtime-dispatched via the existing sgemm.cpp GemmFuncs struct. APE-portable.
  • Add numerical-equivalence tests to tests/fa_helpers_test.cpp for each new wrapper.
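
The wrapper-plus-dispatch idea and the equivalence check can be sketched in a few lines. This is an illustration under stated assumptions, not the proposed implementation: it uses the GCC/Clang per-function target attribute and __builtin_cpu_supports in place of cosmocc's per-target flags and the GemmFuncs struct, and every name in it (scale_f32_baseline, scale_f32_avx512, select_scale_f32, max_abs_diff_vs_baseline) is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_X86 1
#endif

// Portable baseline; stands in for the helper as compiled today.
static void scale_f32_baseline(float* y, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i) y[i] *= s;
}

#ifdef SKETCH_X86
// AVX-512 variant, enabled per function so the rest of the file
// stays at the baseline ISA.
__attribute__((target("avx512f")))
static void scale_f32_avx512(float* y, std::size_t n, float s) {
    const __m512 vs = _mm512_set1_ps(s);
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        _mm512_storeu_ps(y + i, _mm512_mul_ps(_mm512_loadu_ps(y + i), vs));
    }
    for (; i < n; ++i) y[i] *= s;  // scalar tail
}
#endif

using scale_fn = void (*)(float*, std::size_t, float);

// Pick an implementation based on the host CPU at run time.
scale_fn select_scale_f32() {
#ifdef SKETCH_X86
    if (__builtin_cpu_supports("avx512f")) return scale_f32_avx512;
#endif
    return scale_f32_baseline;
}

// Largest absolute element-wise difference between the dispatched
// implementation and the baseline on the same input.
float max_abs_diff_vs_baseline(std::vector<float> data, float s) {
    std::vector<float> ref = data;
    scale_f32_baseline(ref.data(), ref.size(), s);
    select_scale_f32()(data.data(), data.size(), s);
    float worst = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i) {
        worst = std::fmax(worst, std::fabs(data[i] - ref[i]));
    }
    return worst;
}
```

In a real wrapper the selection would happen once and the pointer would be cached, and the test would compare against a tolerance suited to the helper. An element-wise multiply should match the baseline closely, while reductions such as RMS norm or softmax can differ slightly because the summation order changes.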

Target: close the ~22 % pp gap on dense models to match upstream's CPU performance.

Hardware / artifacts

  • Reproducer environment: Xeon Gold 6338 (Ice Lake-SP, AVX-512F + AVX-512-VNNI, no AVX-512-FP16), 16 threads. Same machine used for the perf work in #978 (CPU flash-attention fixes for #975: workaround + AVX-512 helpers + simd_gemm).
  • Existing bench script: /tmp/bench-llama8b.sh on the Xeon box.
  • Upstream llama.cpp at 7b8443ac7 checked out at ~/workspace/llama.cpp/ (the same submodule SHA we ship, for apples-to-apples).
