Conversation

@jdemeule commented Dec 8, 2025

With #15906, I noticed an important regression when using the Metal backend on an eGPU.
This commit restores the previous behavior and adds an option to force its activation.
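
For illustration, the kind of gating described here could look like the following Objective-C sketch. The environment-variable and function names are hypothetical (not the identifiers from this PR), and MTLDevice.removable is used only as a plausible way to detect an eGPU:

```objc
// Illustrative sketch only: disable the new path by default on removable
// (external) GPUs, but let an environment variable force it back on.
// GGML_METAL_FORCE_FAST_PATH and ggml_metal_use_fast_path are made-up names.
#import <Metal/Metal.h>
#include <stdbool.h>
#include <stdlib.h>

static bool ggml_metal_use_fast_path(id<MTLDevice> device) {
    // Forced activation takes precedence over the device heuristic.
    if (getenv("GGML_METAL_FORCE_FAST_PATH") != NULL) {
        return true;
    }
    // MTLDevice.removable is YES for eGPUs in Thunderbolt enclosures; keep
    // the pre-#15906 behavior there, where the new path triggers DMA traffic.
    return !device.removable;
}
```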

Before #15906, llama-bench on gemma 3 gave me this kind of result:

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         48.72 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          5.95 ± 0.00 |

build: 33daece86 (6440)

So above 45 t/s on the pp test, and more than 5 t/s on the tg test.

After #15906, the pp test has improved but the tg test has been divided by 2.

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         60.66 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          2.84 ± 0.00 |

build: 0f0a3c285 (6441)

Launching the benchmark with "Metal System Trace" in Instruments.app reveals some usage of the DMA1 channel, which introduces a lot of latency (at least, this is how I interpret it).

With this PR, performance on eGPU is back to what it was before, and no other configuration (dGPU and M1-M5) should be impacted.

$ ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           pp512 |         47.24 ± 0.00 |
| gemma3 4B Q4_K - Medium        |   2.31 GiB |     3.88 B | Metal,BLAS |       6 |           tg128 |          6.07 ± 0.00 |

build: b0db6483b (7327)
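
On a configuration where the default heuristic guesses wrong, the option added by this PR could then be used to force the new behavior back on, e.g. (with the hypothetical variable name from the sketch above):

$ GGML_METAL_FORCE_FAST_PATH=1 ./llama-bench --model ggml-org_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf -r 1 --no-warmup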

@ggerganov (Member) commented:

I'm not familiar with the concept of eGPU - is this running on an Intel Mac?
