Add Support for Guided Decoding to On Device Sampling #624

quic-sanising · 2025-11-19T01:59:19Z

✨ Add Support for Guided Decoding to On Device Sampling

📌 Overview

This PR introduces guided decoding capabilities in On Device Sampling for QEffForCausalLM and QEffCausalLMForTextImageToTextModel models.

🚀 Motivation

As outlined in this blog on structured decoding, structured decoding represents a fundamental shift in controlling LLM outputs. Instead of relying on post-processing, constraints are enforced during token generation via logits manipulation. This approach ensures:

Format compliance at generation time.
Reduced error rates for structured outputs.
Performance improvements through optimized backends like XGrammar, which can deliver up to 5× faster token generation under load.

The constraints are provided through token_bitmasks, a Boolean matrix of shape (batch_size, vocab_size) in which each element indicates whether the corresponding token should be kept (1) or masked (0). During sampling, this mask is applied to the logits before token selection, so only allowed tokens can be chosen.

By performing this operation directly on the device, we eliminate host-device transfers, reduce latency, and improve throughput for structured decoding workloads.
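The masking step described above can be sketched on the host with NumPy. This is a minimal illustration of the idea, not the code added in this PR: the helper names bitmask_from_allowed_ids and apply_token_bitmask are hypothetical, and it assumes masked tokens are excluded by setting their logits to negative infinity before a greedy argmax.

```python
import numpy as np


def bitmask_from_allowed_ids(allowed_ids, vocab_size):
    """Build a Boolean (batch_size, vocab_size) mask from per-request allowed token IDs.

    Hypothetical helper for illustration only; in practice a grammar backend
    would normally produce the mask.
    """
    mask = np.zeros((len(allowed_ids), vocab_size), dtype=bool)
    for row, ids in enumerate(allowed_ids):
        mask[row, list(ids)] = True
    return mask


def apply_token_bitmask(logits, token_bitmasks):
    """Keep logits where the mask is True and set all others to -inf (assumed convention)."""
    logits = np.asarray(logits, dtype=np.float32)
    mask = np.asarray(token_bitmasks, dtype=bool)
    if logits.shape != mask.shape:
        raise ValueError(f"shape mismatch: logits {logits.shape} vs mask {mask.shape}")
    return np.where(mask, logits, -np.inf)


# Toy batch: 2 requests, vocabulary of 4 tokens.
logits = np.array([[2.0, 5.0, 1.0, 0.5],
                   [0.1, 0.2, 3.0, 4.0]])
token_bitmasks = bitmask_from_allowed_ids([[0, 2, 3], [0, 1, 2]], vocab_size=4)

masked_logits = apply_token_bitmask(logits, token_bitmasks)
next_tokens = masked_logits.argmax(axis=-1)

# Unmasked greedy decoding would pick tokens [1, 3];
# with the mask applied, the picks become [0, 2].
print(next_tokens.tolist())
```

Because disallowed logits become negative infinity, they also receive zero probability under softmax, so the same masking would carry over to top-k, top-p, or random sampling performed after it.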

🛠️ Implementation Details

The guided decoding logic is injected via include_guided_decoding=True during model loading. No changes to the model architecture are required.

from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

# Load model with On Device Sampler enabled
qeff_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
        "include_guided_decoding": True,
    },
)

# Compile as usual
qeff_model.compile(
    prefill_seq_length=128,
    ctx_len=256,
    full_batch_size=16,
    num_devices=4,
    num_speculative_tokens=0,
    mxint8_kv_cache=True,
    mxfp6_matmul=True,
)

To disable guided decoding, simply set include_guided_decoding=False.

quic-sanising · 2025-11-19T01:59:46Z

Depends on #597

quic-sanising · 2025-11-21T21:04:54Z

Ready for review

QEfficient/transformers/sampler/sampler.py

tests/transformers/sampler/test_sampler.py

examples/performance/on_device_sampling.py

QEfficient/transformers/models/pytorch_transforms.py

quic-hemagnih

LGTM

quic-xiyushi and others added 13 commits November 10, 2025 09:16

Extend on-device sampling support for dual QPC VLMs

409da24

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Fix random_numbers shape

e06e175

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Update example with new random sampling logic

3e242ce

Signed-off-by: quic-sanising <sanising@qti.qualcomm.com> Signed-off-by: sanising <sanising@qti.qualcomm.com>

Update to align with recent VLM CB changes

1a01d57

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Update tests with new random sampling logic

30d6061

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Add code to perform guided decoding

78ef180

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Add bitmask to example inputs and dynamic axes

1fafcdb

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Rename bitmask to token_bitmasks

18ab856

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Fix typo

b1c049c

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Merge branch 'main' into guided_decoding_simple

e16e846

Add flag to enable guided decoding

1515497

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Merge remote-tracking branch 'origin/main' into HEAD

d02d04d

Add flag to enable guided decoding

97e4baf

Signed-off-by: sanising <sanising@qti.qualcomm.com>

sanising and others added 6 commits November 19, 2025 14:05

Update test_sampler_transform for guided decoding

7b7677b

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Refactor

7cf106e

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Add unit tests

45aed11

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Clean up

6273ab5

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Merge remote-tracking branch 'origin/main' into HEAD

ef9ae14

Add test for guided decoding

60312b3

Signed-off-by: sanising <sanising@qti.qualcomm.com>

quic-sanising marked this pull request as ready for review November 20, 2025 19:37

quic-sanising requested review from ochougul, quic-amitraj, quic-hemagnih and quic-rishinr as code owners November 20, 2025 19:37

quic-sanising changed the title ~~Add Guided Decoding~~ Add Support for Guided Decoding to On Device Sampling Nov 20, 2025

quic-xiyushi and others added 4 commits November 20, 2025 13:24

Update test_sampler.py

3789d5a

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Merge branch 'on-device-sampling-vlm' into guided_decoding_simple

251099f

Enable guided decoding in vlm generation

a24a55d

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Fix bug

55e76e9

Signed-off-by: sanising <sanising@qti.qualcomm.com>

sanising and others added 5 commits November 20, 2025 17:48

Fix bug

f9355d4

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Fix hash for VLM's language decoder to include qaic_config

5e2afb7

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Merge branch 'on-device-sampling-vlm' into guided_decoding_simple

e672701

Enable guided decoding test for vlms

eee5314

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Use different config for each vlm

60cf5ec

Signed-off-by: sanising <sanising@qti.qualcomm.com>

sanising and others added 6 commits November 21, 2025 15:07

Update type

a71ee65

Signed-off-by: sanising <sanising@qti.qualcomm.com>

Merge remote-tracking branch 'origin/main' into HEAD

df06617

Fix bug in getting vocab_size and missing ccl in forward

10990a9

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Merge branch 'on-device-sampling-vlm' into guided_decoding_simple

b47b633

Merge branch 'main' into guided_decoding_simple

b5a7b99

Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>

Merge branch 'main' into guided_decoding_simple

3fcd9eb

Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>

quic-mamta reviewed Dec 8, 2025

quic-mamta and others added 5 commits December 10, 2025 11:59

Merge branch 'main' into on-device-sampling-vlm

98cfadf

Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>

Merge branch 'main' into on-device-sampling-vlm

a60e7ce

Support prefix-caching with on-device sampling

b22af54

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Modify tests to use internvl 1b for quicker CI

2533262

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Merge remote-tracking branch 'origin/on-device-sampling-vlm' into HEAD

5457075

quic-xiyushi force-pushed the guided_decoding_simple branch from ac48615 to 5457075 on December 16, 2025 07:07

quic-xiyushi added 4 commits December 15, 2025 23:08

Merge branch 'main' into on-device-sampling-vlm

8698651

Fix compilation error on Llama3.1 8B due to changes in presence penalty

86aaad2

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Update tests

a2d4fb4

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Merge remote-tracking branch 'origin/on-device-sampling-vlm' into HEAD

eaf21c0

quic-xiyushi force-pushed the guided_decoding_simple branch from b4ff8a0 to eaf21c0 on December 16, 2025 19:31

quic-xiyushi added 2 commits December 16, 2025 21:20

Extend on-device sampling support to llava, granite, gemma, and llama4

feeaa37

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>

Merge remote-tracking branch 'origin/main' into HEAD

5f716ef

quic-xiyushi force-pushed the guided_decoding_simple branch from e9b2d4f to 5f716ef on December 17, 2025 05:37

quic-hemagnih approved these changes Dec 18, 2025

Merge branch 'main' into guided_decoding_simple

96e13a8

quic-hemagnih merged commit 46ed92b into quic:main Dec 18, 2025
4 checks passed
