
Conversation


@quic-xiyushi quic-xiyushi commented Oct 24, 2025

Overview

On-device sampling can significantly reduce host overhead and improve inference throughput; so far, however, it has only been implemented for QEffForCausalLM models. This PR extends on-device sampling support to the language decoder of dual-QPC vision-language models, QEffCausalLMForTextImageToTextModel. In addition, it fixes a bug in the Gumbel noise so that it correctly simulates a multinomial distribution for random sampling.
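The Gumbel-noise fix rests on the Gumbel-max trick: adding i.i.d. Gumbel(0, 1) noise to the logits and taking the argmax is equivalent to sampling from the softmax (multinomial) distribution. A minimal NumPy sketch of the idea, independent of the actual on-device kernel:

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Draw one token index from softmax(logits) via the Gumbel-max trick."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)  # avoid log(0)
    gumbel_noise = -np.log(-np.log(u))  # Gumbel(0, 1) samples
    return int(np.argmax(logits + gumbel_noise))

# Empirical check: sample frequencies should approach softmax(logits).
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0])
counts = np.zeros(3)
for _ in range(20000):
    counts[gumbel_max_sample(logits, rng)] += 1
probs = np.exp(logits) / np.exp(logits).sum()
```

With enough draws, `counts / 20000` converges to `probs`, which is exactly the multinomial behavior the fixed noise is meant to reproduce.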

Implementation details

class _QEffAutoModelForImageTextToTextDualQPC:
    def __init__(
        self,
        model: nn.Module,
        continuous_batching: bool = False,
        qaic_config: Optional[dict] = None,
        **kwargs,
    ):
        # Omitting unchanged parts
        self.lang_model = QEffCausalLMForTextImageToTextModel(model, qaic_config=qaic_config, **kwargs)
        # ---Sampling---
        # Note: SamplerTransform should be applied after all other transforms
        # are done. The role of the sampler is to just add nodes at the output of the
        # previous transform function.
        self.lang_model.model, _ = SamplerTransform.apply(self.lang_model.model, qaic_config, **kwargs)
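The pattern the note describes, a transform that only adds sampling nodes after the output of the already-transformed model, can be sketched in plain Python. This is an illustrative stand-in, not the actual SamplerTransform; `apply_sampler_transform` and `DummyDecoder` are hypothetical names:

```python
import numpy as np

class DummyDecoder:
    """Stand-in for a language decoder; returns logits for the last position."""
    def forward(self, input_ids):
        vocab_size = 8
        # Deterministic fake logits so the example is reproducible.
        return np.arange(vocab_size, dtype=float)

def apply_sampler_transform(model, qaic_config):
    """Wrap model.forward so a sampled token id is produced on top of the
    existing outputs; earlier transforms are left untouched."""
    if not (qaic_config or {}).get("include_sampler", False):
        return model, False  # transform not applied
    original_forward = model.forward
    rng = np.random.default_rng(0)
    def forward_with_sampler(input_ids):
        logits = original_forward(input_ids)
        gumbel = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=logits.shape)))
        next_token = int(np.argmax(logits + gumbel))  # Gumbel-max sampling
        return logits, next_token
    model.forward = forward_with_sampler
    return model, True

model, transformed = apply_sampler_transform(DummyDecoder(), {"include_sampler": True})
logits, next_token = model.forward([1, 2, 3])
```

The wrapped model keeps its original outputs and simply appends the sampled token, mirroring why the transform must run last.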

Usage

Usage is similar to enabling on-device sampling for QEffForCausalLM.

from QEfficient import QEFFAutoModelForImageTextToText

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

qeff_model = QEFFAutoModelForImageTextToText.from_pretrained(
    model_id,
    attn_implementation="eager",
    kv_offload=True,
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
    },
)
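For context on the qaic_config keys: max_top_k_ids bounds the top-k candidate set the on-device sampler considers, and the diff also adds top_ps/min_ps inputs. The following is a NumPy sketch of standard top-k plus top-p (nucleus) filtering, an illustration of the general technique rather than the actual on-device kernel; `filter_top_k_top_p` is a hypothetical name:

```python
import numpy as np

def filter_top_k_top_p(logits: np.ndarray, top_k: int, top_p: float) -> np.ndarray:
    """Return renormalized probabilities after top-k then top-p (nucleus)
    filtering; tokens outside the kept set get probability 0."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # tokens by descending probability
    keep = np.zeros_like(probs, dtype=bool)
    cumulative = 0.0
    for rank, idx in enumerate(order):
        if rank >= top_k:                  # top-k budget exhausted
            break
        keep[idx] = True
        cumulative += probs[idx]
        if cumulative >= top_p:            # nucleus mass reached
            break
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

p = filter_top_k_top_p(np.array([3.0, 2.0, 1.0, 0.0]), top_k=2, top_p=0.9)
```

Here only the two highest-probability tokens survive, and their probabilities are renormalized to sum to one.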

@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch 2 times, most recently from af8e673 to df3501a Compare October 30, 2025 07:13

@quic-hemagnih quic-hemagnih left a comment


Can you please add the CI test cases.

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from df3501a to d722a5a Compare November 10, 2025 17:22
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from d722a5a to e06e175 Compare November 10, 2025 17:25
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
Signed-off-by: sanising <sanising@qti.qualcomm.com>
quic-xiyushi and others added 2 commits November 10, 2025 16:35
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
Signed-off-by: sanising <sanising@qti.qualcomm.com>
QEffGPTJForCausalLM,
QEffGraniteForCausalLM,
QEffGraniteMoeForCausalLM,
QEffInternDecoderWrapper,
Contributor


Does this mean we are enabling sampling only for intern model?
Will other VLMs also be supported?

Contributor Author


Other VLMs are also intended to be supported, but currently only InternVL and Qwen 2.5 VL have been tested.

@ochougul ochougul added the enhancement New feature or request label Nov 12, 2025
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from 8d00cb1 to 5e2afb7 Compare November 21, 2025 02:15
@quic-xiyushi
Contributor Author

Can you please add the CI test cases.

@quic-hemagnih CI added. Please review this PR again. Thank you!

@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from 7d06a75 to a0716fa Compare November 25, 2025 22:19
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
top_ps: Optional[torch.Tensor] = None,
min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
Contributor


Please add these both vision_embeds and image_idx in docs Args list.

Contributor Author


Since vision_embeds and image_idx come from the original forward method for VLMs and are not specifically added for on-device sampling, and because the docstring is intended only for parameters introduced for on-device sampling support, no docstring entries were added for vision_embeds and image_idx. Instead, I added a note stating:
"The vision_embeds and image_idx parameters are optional and are used only for VLMs when supported by the original forward function."

In addition, to make this clearer, I reordered the arguments so that vision_embeds and image_idx appear right after num_logits_to_keep, before the on-device sampling arguments.

min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
image_idx: Optional[torch.Tensor] = None,
Contributor


Please keep the dtype of these 2 consistent as per lines 27-28. Also update the function docstring for these newly added args.

Contributor Author


I have updated the dtype of these two parameters so that they are consistent with lines 27-28. For the function docstring: since vision_embeds and image_idx come from the original forward method for VLMs and are not specifically added for on-device sampling, and because the docstring is intended only for parameters introduced for on-device sampling support, no docstring entries were added for them. Instead, I added a note stating:
"The vision_embeds and image_idx parameters are optional and are used only for VLMs when supported by the original forward function."

In addition, to make this clearer, I reordered the arguments so that vision_embeds and image_idx appear right after num_logits_to_keep, before the on-device sampling arguments.

@quic-mamta
Contributor

Please resolve the conflicts.


@ochougul ochougul left a comment


merge if CI is passing

quic-mamta and others added 4 commits December 10, 2025 11:59
Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-xiyushi
Contributor Author

Please resolve the conflicts.

Done.

@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from e5e509f to 2533262 Compare December 16, 2025 07:07
@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from 0d0d04d to 86aaad2 Compare December 16, 2025 07:33
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-hemagnih quic-hemagnih merged commit e5b4595 into quic:main Dec 17, 2025
4 of 5 checks passed