
Conversation


@ServeurpersoCom ServeurpersoCom commented Dec 8, 2025


Summary

This PR implements INI-based per-model configuration for llama-server router mode, as discussed in #17850.

Motivation

This POC targets multi-model inference servers run by small/medium teams: declarative, user-friendly configuration with zero operational friction.

Implementation

Core Features

  • Auto-generated config.ini: Created at /config.ini on first run, one [vendor/model] section per discovered model (HF-style layout)
  • CLI to INI templating: All router flags (except blacklisted ones such as --port and -m) are converted to LLAMA_ARG_* env var names and injected as the initial config template for each model (see the sketch after this list)
  • Standard INI format: Booleans stored as =true, regular values as =value, users can override with =false for explicit opt-out
  • Config priority over CLI: Existing user modifications in INI are preserved (never overwritten by new CLI args)
  • Hot-sync new args: When operators add CLI flags to the router, they're automatically synced to all model sections in the INI (if not already present), making it easy to discover and apply llama.cpp arguments without manual editing
  • Env var passthrough: Child processes receive config as LLAMA_ARG_*= environment variables (empty for bools), respecting llama.cpp's native conventions
  • Per-model customization: Operators edit INI to override any parameter per model (e.g., desired quantization, --n-cpu-moe, --ctx-size)
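As a quick illustration of the flag-to-env-name mapping, here is a hypothetical stand-in; the PR's real entry point is the new common_arg_get_env_name() API described under Technical Details:

```cpp
// Hypothetical sketch of the CLI -> env-name conversion; to_env_name() is
// illustrative only, not the PR's actual implementation.
#include <cctype>
#include <string>

static std::string to_env_name(const std::string & long_flag) {
    std::string name = "LLAMA_ARG_";
    for (char c : long_flag.substr(2)) { // drop the leading "--"
        name += (c == '-') ? '_' : (char) std::toupper((unsigned char) c);
    }
    return name; // e.g. "--ctx-size" -> "LLAMA_ARG_CTX_SIZE"
}
```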

Technical Details

  • Uses existing PEG parser from common/peg-parser.h (thanks @aldehir for the grammar suggestion in Proposal: allow arg.cpp to import/export configs from/to INI file #17850)
  • LLAMA_ARG_* env var naming simplifies CLI->INI conversion by avoiding ambiguity between short/long form flags (thanks @ngxson for pointing this out)
  • New API: common_arg_get_env_name() to map CLI flags to env var names
  • Improved model discovery (see the sketch after this list):
    • Recursive scan supporting vendor/model/*.gguf layouts
    • Picks smallest GGUF per directory (for quantization variants)
    • Auto-detects mmproj with priority: BF16 > F16 > F32
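A minimal sketch of these selection rules, with assumed names and case-sensitive tag matching (not the PR's actual code):

```cpp
// Pick the smallest .gguf per directory (quantization variants) and the
// preferred mmproj (BF16 > F16 > F32). Assumes uppercase type tags in names.
#include <climits>
#include <cstdint>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

static int mmproj_rank(const std::string & name) {
    if (name.find("BF16") != std::string::npos) return 0; // check before F16
    if (name.find("F16")  != std::string::npos) return 1;
    if (name.find("F32")  != std::string::npos) return 2;
    return 3;
}

static void pick_files(const fs::path & dir, fs::path & model, fs::path & mmproj) {
    std::uintmax_t best_size = UINTMAX_MAX;
    int best_rank = INT_MAX;
    for (const auto & entry : fs::directory_iterator(dir)) {
        if (!entry.is_regular_file() || entry.path().extension() != ".gguf") {
            continue;
        }
        const std::string fname = entry.path().filename().string();
        if (fname.find("mmproj") != std::string::npos) {
            const int rank = mmproj_rank(fname);
            if (rank < best_rank) { best_rank = rank; mmproj = entry.path(); }
        } else if (entry.file_size() < best_size) {
            best_size = entry.file_size();
            model = entry.path();
        }
    }
}
```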

Example config.ini

```ini
LLAMA_CONFIG_VERSION=1

[ggml-org/gemma-3-4b-it-qat-GGUF]
LLAMA_ARG_MODEL=ggml-org/gemma-3-4b-it-qat-GGUF/gemma-3-4b-it.Q6_K.gguf
LLAMA_ARG_N_GPU_LAYERS=999
LLAMA_ARG_CTX_SIZE=32768
LLAMA_ARG_FLASH_ATTN=true

[ggml-org/gemma-3-12b-it-qat-GGUF]
LLAMA_ARG_MODEL=ggml-org/gemma-3-12b-it-qat-GGUF/gemma-3-12b-it.Q4_K_M.gguf
LLAMA_ARG_N_GPU_LAYERS=50
LLAMA_ARG_CTX_SIZE=16384
LLAMA_ARG_FLASH_ATTN=false  # Override: disable for this model
```

Use Case Example

A small dev team runs an inference server with 10+ models. The sysadmin sets global defaults via the router CLI:

```
llama-server --models-dir ./models -ngl 999 -fa -ctk q8_0 -ctv q8_0
```

Then fine-tunes per-model settings in config.ini:

  • Adjust quantization and context sizes for different models
  • Disable flash-attn for models with compatibility issues

With this system you can override any parameter per model to optimize each configuration for your GPU, and reset everything by simply deleting the INI file! It also lets beginners discover llama.cpp arguments as they go.

Testing

Tested with personal GGUF collection:

  • Multiple vendors/models with various quantizations
  • mmproj auto-detection working correctly
  • Embedding pipeline use case testing in progress

Future Work

This is a POC. Potential improvements:

  • EDIT: Auto-reload will be a separate PR; the core PR focuses on INI-based per-model config generation.
  • Support for --config mycfg.ini CLI arg (alternative to --models-dir root)
  • GUI administration interface for editing config (env var passthrough eliminates shell injection risks from user-edited INI values)
  • Validation of LLAMA_ARG_* keys against actual arg definitions

Related

#17850
#17470
#10932

Replace flat directory scan with recursive traversal using
std::filesystem::recursive_directory_iterator. Supports
nested vendor/model layouts (e.g. vendor/model/*.gguf).
The model name now reflects the relative path within --models-dir
instead of just the filename. Files are aggregated by parent
directory via std::map before constructing local_model.
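A rough sketch of the traversal this commit describes, with illustrative names (note that a later review comment flags cycle-safety concerns with recursive scans):

```cpp
// Group .gguf files by their parent directory, keyed on the path relative
// to --models-dir, e.g. "vendor/model".
#include <filesystem>
#include <map>
#include <string>
#include <vector>

namespace fs = std::filesystem;

static std::map<std::string, std::vector<fs::path>> scan_models(const fs::path & root) {
    std::map<std::string, std::vector<fs::path>> by_dir;
    for (const auto & entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file() && entry.path().extension() == ".gguf") {
            const fs::path rel = fs::relative(entry.path().parent_path(), root);
            by_dir[rel.generic_string()].push_back(entry.path());
        }
    }
    return by_dir;
}
```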
@ngxson ngxson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks interesting, I will clean this up a bit and push a commit

Collaborator


Will be nice if we can move part of this file into common/preset.cpp|h, so it can be reused by other tools

Comment on lines 502 to 507
```cpp
if (value == "false") {
    continue;
}

if (value == "true" || value.empty()) {
    child_env.push_back(key + "=");
```
Collaborator


I think leaving the original value for bool should be good? We can already handle these values using is_falsey / is_truthy in arg.cpp

Collaborator Author


Good point! I'll simplify the bool handling to pass through the original values (=true/=false) and let is_truthy/is_falsey handle the conversion.
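A tiny sketch of what that simplification amounts to, assuming the child env keeps values verbatim and the existing truthy/falsey parsing in arg.cpp does the rest:

```cpp
// Pass INI values through unchanged; "true"/"false" are interpreted later
// by is_truthy/is_falsey in arg.cpp (per the discussion above).
#include <map>
#include <string>
#include <vector>

static std::vector<std::string> build_child_env(const std::map<std::string, std::string> & kv) {
    std::vector<std::string> env;
    for (const auto & [key, value] : kv) {
        env.push_back(key + "=" + value); // no special-casing of bools
    }
    return env;
}
```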


aldehir commented Dec 8, 2025

@ServeurpersoCom

Here is a line-oriented approach for the parser:

```cpp
static const auto ini_parser = build_peg_parser([](auto & p) {
    // newline ::= "\r\n" / "\n" / "\r"
    auto newline = p.rule("newline", p.literal("\r\n") | p.literal("\n") | p.literal("\r"));

    // ws ::= [ \t]*
    auto ws = p.rule("ws", p.chars("[ \t]", 0, -1));

    // comment ::= [;#] (!newline .)*
    auto comment = p.rule("comment", p.chars("[;#]", 1, 1) + p.zero_or_more(p.negate(newline) + p.any()));

    // eol ::= ws comment? (newline / EOF)
    auto eol = p.rule("eol", ws + p.optional(comment) + (newline | p.end()));

    // ident ::= [a-zA-Z_] [a-zA-Z0-9_.-]*
    auto ident = p.rule("ident", p.chars("[a-zA-Z_]", 1, 1) + p.chars("[a-zA-Z0-9_.-]", 0, -1));

    // value ::= (!eol-start .)*
    auto eol_start = p.rule("eol-start", ws + (p.chars("[;#]", 1, 1) | newline | p.end()));
    auto value = p.rule("value", p.zero_or_more(p.negate(eol_start) + p.any()));

    // header-line ::= "[" ws ident ws "]" eol
    auto header_line = p.rule("header-line", "[" + ws + p.tag("section-name", p.chars("[^]]")) + ws + "]" + eol);

    // kv-line ::= ident ws "=" ws value eol
    auto kv_line = p.rule("kv-line", p.tag("key", ident) + ws + "=" + ws + p.tag("value", value) + eol);

    // comment-line ::= ws comment (newline / EOF)
    auto comment_line = p.rule("comment-line", ws + comment + (newline | p.end()));

    // blank-line ::= ws (newline / EOF)
    auto blank_line = p.rule("blank-line", ws + (newline | p.end()));

    // line ::= header-line / kv-line / comment-line / blank-line
    auto line = p.rule("line", header_line | kv_line | comment_line | blank_line);

    // ini ::= line* EOF
    auto ini = p.rule("ini", p.zero_or_more(line) + p.end());

    return ini;
});
```

I assume the changes were because of the weirdness in consuming spaces/comments. This should alleviate those concerns.

And the visitor can really be something as simple as this:

```cpp
std::map<std::string, std::map<std::string, std::string>> cfg;

std::string current_section = "default";
std::string current_key;

ctx.ast.visit(result, [&](const auto & node) {
    if (node.tag == "section-name") {
        current_section = std::string(node.text);
        cfg[current_section] = {};
    } else if (node.tag == "key") {
        current_key = std::string(node.text);
    } else if (node.tag == "value" && !current_key.empty()) {
        cfg[current_section][current_key] = std::string(node.text);
        current_key.clear();
    }
});
```

ServeurpersoCom and others added 2 commits December 8, 2025 12:29
PEG parser usage improvements:
- Simplify parser instantiation (remove arena indirection)
- Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping)
- Fix last line without newline bug (+ operator instead of <<)
- Remove redundant end position check

Feature scope:
- Remove auto-reload feature (will be separate PR per @ngxson)
- Keep config.ini auto-creation and template generation
- Preserve per-model customization logic

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
Co-authored-by: ngxson <ngxson@users.noreply.github.com>
Complete rewrite of INI parser grammar and visitor:
- Use p.chars(), p.negate(), p.any() instead of p.until()
- Support end-of-line comments (key=value # comment)
- Handle EOF without trailing newline correctly
- Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]*)
- Simplified visitor (no pending state, no trim needed)
- Grammar handles whitespace natively via eol rule

Business validation preserved:
- Reject section names starting with LLAMA_ARG_*
- Accept only keys starting with LLAMA_ARG_*
- Require explicit section before key-value pairs

Co-authored-by: aldehir <aldehir@users.noreply.github.com>
@aldehir aldehir left a comment


Looks good as far as parsing is concerned!

I will need to add an expect() helper to provide helpful error messages to users when they make a mistake. I can do that separately in another PR.

@ServeurpersoCom ServeurpersoCom marked this pull request as draft December 8, 2025 11:57
Children now receive minimal CLI args (executable, model, port, alias)
instead of inheriting all router args. Global settings pass through
LLAMA_ARG_* environment variables only, eliminating duplicate config
warnings.

Fixes: Router args like -ngl, -fa were passed both via CLI and env,
causing 'will be overwritten' warnings on every child spawn
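Schematically, the spawn described here might look like the following (invented names; a sketch of the behavior, not the PR's code):

```cpp
// Children get a minimal argv; everything else travels as LLAMA_ARG_* env.
#include <string>
#include <utility>
#include <vector>

struct child_spawn {
    std::vector<std::string> argv; // binary, model, port, alias only
    std::vector<std::string> env;  // e.g. "LLAMA_ARG_CTX_SIZE=16384"
};

static child_spawn make_spawn(const std::string & bin, const std::string & model,
                              int port, const std::string & alias,
                              const std::vector<std::pair<std::string, std::string>> & kv) {
    child_spawn s;
    s.argv = { bin, "-m", model, "--port", std::to_string(port), "--alias", alias };
    for (const auto & [k, v] : kv) {
        s.env.push_back(k + "=" + v); // no CLI duplication -> no overwrite warnings
    }
    return s;
}
```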
@ServeurpersoCom
Collaborator Author

It's now in a basic working state with the new line-based PEG parser. I'm testing my entire per-model configuration on the server to catch edge cases, and then there's the @ngxson refactoring to do.

@ServeurpersoCom ServeurpersoCom marked this pull request as ready for review December 8, 2025 12:34

ServeurpersoCom commented Dec 8, 2025

Missing sampling parameters need .set_env() in common/arg.cpp (--temp, --top-p, --top-k, --min-p have no LLAMA_ARG_ env vars yet). Successfully migrated a llama-swap (YAML) config to config.ini via LLM: llama-server preserved all custom parameters (ctx-size, n-cpu-moe, mmproj, -m ....Q6_K), applied global CLI defaults (-ngl 999, -fa, --mlock, -ctk/-ctv, etc.) to all models, and automatically reorganized sections/keys alphabetically to maintain a normalized format.


ngxson commented Dec 8, 2025

Missing sampling parameters need .set_env() in common/arg.cpp (--temp, --top-p, --top-k, --min-p have no LLAMA_ARG_ env vars yet).

Hmm yeah I didn't notice that some env vars are missing. I think it will be cleaner if we default to using the longest arg (for example, --ctx-size instead of -c)

Internally, the parser can accept all 3 forms: env, short arg, and long arg; there is no chance that they will collide anyway. I'll push the change for this.

@ServeurpersoCom
Collaborator Author

Yes, it just needs the missing .set_env("LLAMA_ARG_TEMP") calls, etc. I'll wait for your change while I run some tests.

```
llama-server --models-dir ./models_directory
```

The directory is scanned recursively, so nested vendor/model layouts such as `vendor_name/model_name/*.gguf` are supported. The model name in the router UI matches the relative path inside `--models-dir` (for example, `vendor_name/model_name`).
Collaborator


For visibility, I will remove recursive support from this PR because it's not related to config support - it should be added later via a dedicated PR

Collaborator Author

@ServeurpersoCom ServeurpersoCom Dec 8, 2025


Yes, no worries (I have to keep it on my side, otherwise it breaks my integration server). I'll adapt the configuration on my side to test this feature-atomic PR if necessary.


ngxson commented Dec 8, 2025

I moved most of the code inside server-config.cpp to common/preset.cpp

We're now using the term "preset", so I think it's easier to name the file presets.ini now (it can be extended for use outside of the server)

Since I'm now using the same common_arg to handle everything, including parsing and merging args, edge cases like deduplication of the short form -a and long form --abc are also handled

We don't yet support repeated args or args with 2 values (like --lora-scaled), but that can be added in the future

The /v1/models API endpoint is also extended to include the args and the INI preset, which will be quite useful for debugging

Things that still need improvement:

  • add falsey and truthy checks for input from the INI
  • add documentation and an example


Alternatively, you can also add a GGUF-based preset (see next section)

### Model presets
Collaborator


@ServeurpersoCom I updated the docs with an example - lmk if this works in your case

- Sanitize model names: replace / and \ with _ for display (sketched after this list)
- Recursive directory scan with relative path storage
- Convert relative paths to absolute when spawning children
- Filter router control args from child processes
- Refresh args after port assignment for correct port value
- Fallback preset lookup for compatibility
- Fix missing argv[0]: store server binary path before base_args parsing
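The first bullet, as a minimal sketch (illustrative only):

```cpp
// Replace path separators so a relative model path like "vendor/model"
// becomes a flat display name such as "vendor_model".
#include <string>

static std::string sanitize_name(std::string name) {
    for (char & c : name) {
        if (c == '/' || c == '\\') c = '_';
    }
    return name;
}
```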
```cpp
first_shard_file = file;
} else {
model_file = file;
std::function<void(const std::string &, const std::string &)> scan_subdir =
```
Collaborator

@ngxson ngxson Dec 8, 2025


Please remove the recursive implementation - it's unrelated to the current PR, and it's also unsafe as it doesn't handle the case where there's a circular symlink or circular mount points

Collaborator Author

@ServeurpersoCom ServeurpersoCom Dec 8, 2025


You can pick up the rest (except the recursion) and push --force; I won't touch the branch before tomorrow (rebase/test).
A two-level browsing system will be perfect for all cases (separate PR)


emjomi commented Dec 8, 2025

Hey guys! Sorry to interrupt, but are the LLAMA_ARG_ prefixes required? I think they make the config a bit noisy.

One more thing: maybe it's better to put the config in ~/.config/llama.cpp/ on Linux, as specified in https://specifications.freedesktop.org/basedir/latest/?

Thank you so much for what you're doing!

@ServeurpersoCom
Collaborator Author

Hey guys! Sorry to interrupt, but are the LLAMA_ARG_ prefixes required? I think they make the config a bit noisy.

One more thing: maybe it's better to put the config in ~/.config/llama.cpp/ on Linux, as specified in https://specifications.freedesktop.org/basedir/latest/?

Thank you so much for what you're doing!

No worries! With the latest refactor the LLAMA_ARG_ prefixes are optional: you can use the short argument forms (e.g., ngl, c) or long forms with dashes (e.g., n-gpu-layers, ctx-size) instead. All three formats are supported.
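For illustration, with a hypothetical section, any one of these lines would set the same option:

```ini
[vendor/model]
LLAMA_ARG_CTX_SIZE=16384  ; env-var form
ctx-size=16384            ; long-flag form
c=16384                   ; short-flag form
```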

Regarding config location: the preset file path is fully customizable via --models-preset, so you can place it wherever you prefer, including ~/.config/llama.cpp/presets.ini if that fits your workflow better.

This is a WIP; I'll update the first message soon.

Co-authored-by: aldehir <hello@alde.dev>
