@quic-sanising
Contributor

@quic-sanising quic-sanising commented Nov 19, 2025

✨ Add Support for Guided Decoding to On Device Sampling

📌 Overview

This PR introduces guided decoding capabilities in On Device Sampling for QEffForCausalLM and QEffCausalLMForTextImageToTextModel models.



🚀 Motivation

As outlined in this blog on structured decoding, the technique changes how LLM outputs are controlled: instead of relying on post-processing, constraints are enforced during token generation by manipulating the logits. This approach ensures:

  • Format compliance at generation time.
  • Reduced error rates for structured outputs.
  • Performance improvements through optimized backends like XGrammar, which can deliver up to 5× faster token generation under load.

The constraints are provided through token_bitmasks, a Boolean matrix of shape (batch_size, vocab_size) in which each element indicates whether a token should be kept (1) or masked (0). During sampling, the mask is applied to the logits before token selection, so only allowed tokens are considered.
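The masking step can be illustrated with a minimal host-side sketch in plain Python. This is not the on-device implementation from this PR; the helper names apply_token_bitmask and greedy_pick are hypothetical, and greedy selection stands in for whatever sampling strategy is configured.

```python
import math


def apply_token_bitmask(logits, token_bitmasks):
    """Return a copy of logits with disallowed tokens set to -inf.

    Both arguments are nested lists of shape (batch_size, vocab_size).
    """
    if len(logits) != len(token_bitmasks):
        raise ValueError("batch sizes differ")
    masked = []
    for row_logits, row_mask in zip(logits, token_bitmasks):
        if len(row_logits) != len(row_mask):
            raise ValueError("vocab sizes differ")
        # Keep the logit where the mask is truthy, otherwise push it to -inf
        # so the token can never be selected.
        masked.append([
            logit if keep else -math.inf
            for logit, keep in zip(row_logits, row_mask)
        ])
    return masked


def greedy_pick(rows):
    """Pick the index of the highest-scoring token in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]


# Toy batch: 2 sequences, vocabulary of 4 tokens.
logits = [
    [2.0, 0.5, 1.0, -1.0],
    [0.1, 3.0, 0.2, 0.3],
]
token_bitmasks = [
    [False, True, True, False],  # sequence 0 may only emit tokens 1 or 2
    [True, False, False, True],  # sequence 1 may only emit tokens 0 or 3
]

masked = apply_token_bitmask(logits, token_bitmasks)
next_tokens = greedy_pick(masked)
```

In this toy batch the unconstrained choice would be tokens 0 and 1, but both are masked out, so the constrained choice falls to the best remaining allowed token in each row.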

By performing this operation directly on the device, we eliminate host-device transfers, reduce latency, and improve throughput for structured decoding workloads.



🛠️ Implementation Details

Guided decoding is enabled by setting "include_guided_decoding": True in qaic_config when loading the model. No changes to the model architecture are required.

from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

# Load model with On Device Sampler enabled
qeff_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
        "include_guided_decoding": True,
    },
)

# Compile as usual
qeff_model.compile(
    prefill_seq_length=128,
    ctx_len=256,
    full_batch_size=16,
    num_devices=4,
    num_speculative_tokens=0,
    mxint8_kv_cache=True,
    mxfp6_matmul=True,
)

To disable guided decoding, set "include_guided_decoding": False in qaic_config.
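A minimal sketch of toggling the flag, assuming the other qaic_config keys stay exactly as in the loading example above. make_qaic_config is a hypothetical helper for illustration and is not part of QEfficient.

```python
def make_qaic_config(guided_decoding: bool) -> dict:
    """Build a qaic_config dict (hypothetical helper, not a QEfficient API).

    All keys and values other than the guided decoding flag are copied
    from the loading example in this PR description.
    """
    return {
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
        "include_guided_decoding": guided_decoding,
    }


guided_config = make_qaic_config(True)   # sampler with guided decoding
plain_config = make_qaic_config(False)   # sampler without guided decoding
```

Either dict can then be passed as qaic_config to from_pretrained as shown in the loading example.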

quic-xiyushi and others added 13 commits November 10, 2025 09:16
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
Signed-off-by: sanising <sanising@qti.qualcomm.com>
@quic-sanising
Contributor Author

Depends on #597

sanising and others added 6 commits November 19, 2025 14:05
@quic-sanising quic-sanising marked this pull request as ready for review November 20, 2025 19:37
@quic-sanising quic-sanising changed the title from "Add Guided Decoding" to "Add Support for Guided Decoding to On Device Sampling" Nov 20, 2025
quic-xiyushi and others added 4 commits November 20, 2025 13:24
sanising and others added 5 commits November 20, 2025 17:48
@quic-sanising
Contributor Author

Ready for review

sanising and others added 6 commits November 21, 2025 15:07
Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
quic-mamta and others added 5 commits December 10, 2025 11:59
@quic-xiyushi quic-xiyushi force-pushed the guided_decoding_simple branch from ac48615 to 5457075 December 16, 2025 07:07
@quic-xiyushi quic-xiyushi force-pushed the guided_decoding_simple branch from b4ff8a0 to eaf21c0 December 16, 2025 19:31
@quic-xiyushi quic-xiyushi force-pushed the guided_decoding_simple branch from e9b2d4f to 5f716ef December 17, 2025 05:37
Contributor

@quic-hemagnih quic-hemagnih left a comment

LGTM

@quic-hemagnih quic-hemagnih merged commit 46ed92b into quic:main Dec 18, 2025
4 checks passed