
Conversation


@quic-xiyushi quic-xiyushi commented Oct 24, 2025

Overview

On-device sampling can significantly reduce host overhead and improve inference throughput; so far, however, it has only been implemented for QEffForCausalLM models. This PR extends on-device sampling support to the language decoder of dual-QPC vision-language models, QEffCausalLMForTextImageToTextModel. In addition, it fixes a bug in the Gumbel noise so that it correctly simulates a multinomial distribution for random sampling.
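The Gumbel-noise fix rests on the Gumbel-max trick: adding i.i.d. Gumbel(0, 1) noise to the logits and taking the argmax is equivalent to sampling from the softmax (multinomial) distribution. A minimal NumPy sketch of the idea, independent of the actual on-device kernel:

```python
import numpy as np

def gumbel_max_sample(logits: np.ndarray, rng: np.random.Generator) -> int:
    """Draw one token index from softmax(logits) via the Gumbel-max trick."""
    u = rng.uniform(low=1e-12, high=1.0, size=logits.shape)  # avoid log(0)
    gumbel_noise = -np.log(-np.log(u))  # Gumbel(0, 1) samples
    return int(np.argmax(logits + gumbel_noise))

# Empirical check: sample frequencies should approach softmax(logits).
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.0])
counts = np.zeros(3)
for _ in range(20000):
    counts[gumbel_max_sample(logits, rng)] += 1
probs = np.exp(logits) / np.exp(logits).sum()
```

With enough draws, `counts / 20000` converges to `probs`, which is exactly the multinomial behavior the fixed noise is meant to reproduce.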

Implementation details

class _QEffAutoModelForImageTextToTextDualQPC:
    def __init__(
        self,
        model: nn.Module,
        continuous_batching: bool = False,
        qaic_config: Optional[dict] = None,
        **kwargs,
    ):
        # Omitting unchanged parts
        self.lang_model = QEffCausalLMForTextImageToTextModel(model, qaic_config=qaic_config, **kwargs)
        # ---Sampling---
        # Note: SamplerTransform should be applied after all other transforms
        # are done. The role of the sampler is to just add nodes at the output of the
        # previous transform function.
        self.lang_model.model, _ = SamplerTransform.apply(self.lang_model.model, qaic_config, **kwargs)
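The pattern the note describes, a transform that only adds sampling nodes after the output of the already-transformed model, can be sketched in plain Python. This is an illustrative stand-in, not the actual SamplerTransform; `apply_sampler_transform` and `DummyDecoder` are hypothetical names:

```python
import numpy as np

class DummyDecoder:
    """Stand-in for a language decoder; returns logits for the last position."""
    def forward(self, input_ids):
        vocab_size = 8
        # Deterministic fake logits so the example is reproducible.
        return np.arange(vocab_size, dtype=float)

def apply_sampler_transform(model, qaic_config):
    """Wrap model.forward so a sampled token id is produced on top of the
    existing outputs; earlier transforms are left untouched."""
    if not (qaic_config or {}).get("include_sampler", False):
        return model, False  # transform not applied
    original_forward = model.forward
    rng = np.random.default_rng(0)
    def forward_with_sampler(input_ids):
        logits = original_forward(input_ids)
        gumbel = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=logits.shape)))
        next_token = int(np.argmax(logits + gumbel))  # Gumbel-max sampling
        return logits, next_token
    model.forward = forward_with_sampler
    return model, True

model, transformed = apply_sampler_transform(DummyDecoder(), {"include_sampler": True})
logits, next_token = model.forward([1, 2, 3])
```

The wrapped model keeps its original outputs and simply appends the sampled token, mirroring why the transform must run last.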

Usage

Usage is similar to enabling on-device sampling for QEffForCausalLM.

from QEfficient import QEFFAutoModelForImageTextToText

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

qeff_model = QEFFAutoModelForImageTextToText.from_pretrained(
    model_id,
    attn_implementation="eager",
    kv_offload=True,
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
    },
)
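For context on the qaic_config keys: max_top_k_ids bounds the top-k candidate set the on-device sampler considers, and the diff also adds top_ps/min_ps inputs. The following is a NumPy sketch of standard top-k plus top-p (nucleus) filtering, an illustration of the general technique rather than the actual on-device kernel; `filter_top_k_top_p` is a hypothetical name:

```python
import numpy as np

def filter_top_k_top_p(logits: np.ndarray, top_k: int, top_p: float) -> np.ndarray:
    """Return renormalized probabilities after top-k then top-p (nucleus)
    filtering; tokens outside the kept set get probability 0."""
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # tokens by descending probability
    keep = np.zeros_like(probs, dtype=bool)
    cumulative = 0.0
    for rank, idx in enumerate(order):
        if rank >= top_k:                  # top-k budget exhausted
            break
        keep[idx] = True
        cumulative += probs[idx]
        if cumulative >= top_p:            # nucleus mass reached
            break
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

p = filter_top_k_top_p(np.array([3.0, 2.0, 1.0, 0.0]), top_k=2, top_p=0.9)
```

Here only the two highest-probability tokens survive, and their probabilities are renormalized to sum to one.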

@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch 2 times, most recently from af8e673 to df3501a Compare October 30, 2025 07:13

@quic-hemagnih quic-hemagnih left a comment


Can you please add the CI test cases.

Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from df3501a to d722a5a Compare November 10, 2025 17:22
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from d722a5a to e06e175 Compare November 10, 2025 17:25
Signed-off-by: quic-sanising <sanising@qti.qualcomm.com>
Signed-off-by: sanising <sanising@qti.qualcomm.com>
quic-xiyushi and others added 2 commits November 10, 2025 16:35
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
Signed-off-by: sanising <sanising@qti.qualcomm.com>
QEffGPTJForCausalLM,
QEffGraniteForCausalLM,
QEffGraniteMoeForCausalLM,
QEffInternDecoderWrapper,
Contributor


Does this mean we are enabling sampling only for intern model?
Will other VLMs also be supported?

Contributor Author


Other VLMs are also intended to be supported, but currently only InternVL and Qwen 2.5 VL have been tested.

@ochougul ochougul added the enhancement New feature or request label Nov 12, 2025
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from 8d00cb1 to 5e2afb7 Compare November 21, 2025 02:15
@quic-xiyushi
Contributor Author

Can you please add the CI test cases.

@quic-hemagnih CI added. Please review this PR again. Thank you!

@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from 7d06a75 to a0716fa Compare November 25, 2025 22:19
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
top_ps: Optional[torch.Tensor] = None,
min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
Contributor


Please add these both vision_embeds and image_idx in docs Args list.

Contributor Author


Since vision_embeds and image_idx come from the original forward method for VLMs and are not specifically added for on-device sampling, and because the docstring is intended only for parameters introduced for on-device sampling support, no docstring entries were added for vision_embeds and image_idx. Instead, I added a note stating:
"The vision_embeds and image_idx parameters are optional and are used only for VLMs when supported by the original forward function."

In addition, to make this clearer, I reordered the arguments so that vision_embeds and image_idx appear right after num_logits_to_keep, before the on-device sampling arguments.

min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
image_idx: Optional[torch.Tensor] = None,
Contributor


Please keep the dtype of these 2 consistent as per lines 27-28. Also update the function docstring for these newly added args.

Contributor Author


I have updated the dtype of these two parameters so that they are consistent with lines 27-28. For the function docstring: since vision_embeds and image_idx come from the original forward method for VLMs and are not specifically added for on-device sampling, and because the docstring is intended only for parameters introduced for on-device sampling support, no docstring entries were added for them. Instead, I added a note stating:
"The vision_embeds and image_idx parameters are optional and are used only for VLMs when supported by the original forward function."

In addition, to make this clearer, I reordered the arguments so that vision_embeds and image_idx appear right after num_logits_to_keep, before the on-device sampling arguments.

@quic-mamta
Contributor

Please resolve the conflicts.


@ochougul ochougul left a comment


merge if CI is passing

quic-mamta and others added 4 commits December 10, 2025 11:59
Signed-off-by: Mamta Singh <168400541+quic-mamta@users.noreply.github.com>
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-xiyushi
Contributor Author

Please resolve the conflicts.

Done.

@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from e5e509f to 2533262 Compare December 16, 2025 07:07
@quic-xiyushi quic-xiyushi force-pushed the on-device-sampling-vlm branch from 0d0d04d to 86aaad2 Compare December 16, 2025 07:33
Signed-off-by: quic-xiyushi <xiyushi@qti.qualcomm.com>
@quic-hemagnih quic-hemagnih merged commit e5b4595 into quic:main Dec 17, 2025
4 of 5 checks passed