How to analyze etdump results for QNN backend? #16285

liu-mengyang · 2025-12-17T01:41:21Z

liu-mengyang
Dec 17, 2025

I have obtained etdump results from a QNN‑LLM execution and used the Inspector API to generate the profiling table. However, I’m not sure how to interpret this table. Specifically, I have the following questions:

After sorting the results, I see Accelerator (execute) time (cycles), Method::execute, and DELEGATE_CALL at the top. Their latencies are quite similar. Are these essentially representing the same thing?
The remaining entries all start with aten_, which I assume correspond to different operators. How can I determine which of these are running on the CPU due to fallback, and which are actually running on the NPU?
Is there a way to know whether each operator uses HMX or HVX inside the NPU?
Finally, can I extract more detailed profiling information such as memory consumption, TCM usage, or overall NPU utilization?

Any guidance or documentation pointers would be greatly appreciated. Thank you!

Answered by shewu-quic

Dec 22, 2025

Hi @kimminsu38oo
Yes, the ExecuTorch QNN Intermediate Output Debugger is used to debug accuracy issues by comparing per-tensor outputs with CPU results. QHAS and optrace are used for performance analysis, using dumps from pte.

You can refer to the section about generate-optrace-and-qhas
Please note that the input order in the context binary may differ from the source model. You can check the input order in the JSON file using <QNN_SDK_ROOT>//bin/x86_64-linux-clang/qnn-context-binary-utility --context_binary $1 --json_file $2.

The following show how to generate optrace and QHAS in llama.py for stories260K.

Reproduce command

python examples/qualcomm/oss_scripts/llama/llama.py -b build-andro…

View full answer

yujiaoliang · 2025-12-18T02:01:30Z

yujiaoliang
Dec 18, 2025

For QNN profiling, I would mainly look at two aspects:

QNN runtime profiling itself already provides useful information about NPU execution and TCM usage. By enabling the built-in runtime profiling options, you can inspect per-op execution time and memory behavior, which is usually the primary source for understanding performance and resource utilization.
As an additional reference, this document might be helpful:
https://github.com/pytorch/executorch/blob/main/backends/qualcomm/debugger/README.md

It describes some related debugging and profiling workflows around the Qualcomm backend, though you may already be familiar with it.

0 replies

cccclai · 2025-12-18T19:00:23Z

cccclai
Dec 18, 2025
Collaborator

After sorting the results, I see Accelerator (execute) time (cycles), Method::execute, and DELEGATE_CALL at the top. Their latencies are quite similar. Are these essentially representing the same thing?

Delegate_CALL is inside Method::execute call. If the time is similar, that means the delegate call (the execution time in HTP) is dominant.

The remaining entries all start with aten_, which I assume correspond to different operators. How can I determine which of these are running on the CPU due to fallback, and which are actually running on the NPU?

DELEGATE_CALL is everything inside HTP, and the individual operator call (like aten_bmm) means they fall back to cpu.

Is there a way to know whether each operator uses HMX or HVX inside the NPU?

We can use optrace or the debugger https://github.com/pytorch/executorch/tree/main/backends/qualcomm/debugger#qairt-visualizer as shared by @yujiaoliang

Finally, can I extract more detailed profiling information such as memory consumption, TCM usage, or overall NPU utilization?

@haowhsu-quic @shewu-quic @winskuo-quic @DannyYuyang-quic do we have guidance on this?

0 replies

shewu-quic · 2025-12-19T01:35:32Z

shewu-quic
Dec 19, 2025
Collaborator

Hi @yujiaoliang

Finally, can I extract more detailed profiling information such as memory consumption, TCM usage, or overall NPU utilization?

Yes, you can dump QHAS and optrace to observe the NPU utilization and TCM usage.:
https://github.com/pytorch/executorch/tree/main/backends/qualcomm/debugger#2-generate-optrace-and-qhas

0 replies

kimminsu38oo · 2025-12-21T12:19:57Z

kimminsu38oo
Dec 21, 2025

Hello
@shewu-quic

https://github.com/pytorch/executorch/tree/main/backends/qualcomm/debugger#limitation
According to the Limitation, the current debugger does not support LLM models.

Is is currently possible to obtain the Qualcomm HTP Analysis Summary for LLM models, similar to the image you shared?

4 replies

shewu-quic Dec 22, 2025
Collaborator

Hi @kimminsu38oo
Yes, the ExecuTorch QNN Intermediate Output Debugger is used to debug accuracy issues by comparing per-tensor outputs with CPU results. QHAS and optrace are used for performance analysis, using dumps from pte.

You can refer to the section about generate-optrace-and-qhas
Please note that the input order in the context binary may differ from the source model. You can check the input order in the JSON file using <QNN_SDK_ROOT>//bin/x86_64-linux-clang/qnn-context-binary-utility --context_binary $1 --json_file $2.

The following show how to generate optrace and QHAS in llama.py for stories260K.

Reproduce command

python examples/qualcomm/oss_scripts/llama/llama.py -b build-android -H <HOST> -s <adb device> -m SM8750 --temperature 0 --checkpoint stories260K.pt --params params.json --tokenizer_model  tokenizer.model  --tokenizer_bin  tokenizer.bin --decoder_model stories260k --model_mode kv --max_seq_len 1024 --prompt "Once" --artifact test_llm_optrace

The following files will be generated.

# ls test_llm_optrace
backend_extension_config.json  optrace_0.json                           outputs
config.json                    optrace_0_qnn_htp_analysis_summary.html  qnn-profiling-data_0.log
forward_0.bin                  optrace_0_qnn_htp_analysis_summary.json  stories260k_kv_llama_qnn.pte
forward_0.json                 optrace_0_runtrace.json

generate_llm_optrace.patch

diff --git a/examples/qualcomm/oss_scripts/llama/llama.py b/examples/qualcomm/oss_scripts/llama/llama.py
index 70836f2374..5c6dc6d3e8 100755
--- a/examples/qualcomm/oss_scripts/llama/llama.py
+++ b/examples/qualcomm/oss_scripts/llama/llama.py
@@ -86,6 +86,7 @@ from executorch.examples.qualcomm.oss_scripts.llama.range_setting_pt2e import (
 from executorch.examples.qualcomm.oss_scripts.llama.static_llm_quant_recipe import (
     StaticLLMQuantRecipe,
 )
+from executorch.backends.qualcomm.debugger.utils import generate_optrace
 
 from executorch.examples.qualcomm.utils import (
     make_output_dir,
@@ -347,6 +348,7 @@ class SingleLlama:
 
     def lowering_modules(
         self,
+        args,
         work_space,
         use_fp16=False,
         soc_model=QcomChipset.SM8650,
@@ -375,6 +377,7 @@ class SingleLlama:
                 backend_options=backend_options,
                 shared_buffer=shared_buffer,
                 use_mha2sha=True,
+                optrace=True,
             )
             skip_node_op_set = {"llama.fallback.default"}
 
@@ -399,6 +402,34 @@ class SingleLlama:
             with open(f"{work_space}/{self.pte_filename}.pte", "wb") as file:
                 exec_prog_mgr.write_to_file(file)
 
+            # generate optrace and QHAS
+            adb = SimpleADB(
+                qnn_sdk=os.getenv("QNN_SDK_ROOT"),
+                build_path=f"{args.build_folder}",
+                pte_path=f"{work_space}/{self.pte_filename}.pte",
+                workspace=f"/data/local/tmp/executorch/{self.pte_filename}",
+                device_id=args.device,
+                host_id=args.host,
+                soc_model=args.model,
+                target=args.target,
+            )
+            optrace_inputs = []
+            for kv_input in self.inputs[3:]:
+                optrace_inputs.append(kv_input.to(dtype=torch.uint8))
+            token, mask, pos = self.inputs[:3]
+            optrace_inputs.append(mask.to(dtype=torch.uint16))
+            optrace_inputs.append(token)
+            optrace_inputs.append(pos)
+
+            binaries_trace = generate_optrace(
+                args.artifact,
+                get_soc_to_chipset_map()[args.model],
+                adb,
+                f"{work_space}/{self.pte_filename}.pte",
+                [tuple(optrace_inputs)],
+            )
+# binaries_trace = generate_optrace(args.artifact,get_soc_to_chipset_map()[args.model],adb,f"{work_space}/{self.pte_filename}.pte",[optrace_inputs],)
+
     def get_example_inputs(self, use_kv_cache=True):
         return self.decoder_model.get_example_inputs(use_kv_cache)
 
@@ -786,6 +817,7 @@ def compile(
 
     if args.model_mode in ["kv"]:
         llama_instance_list[0].lowering_modules(
+            args,
             args.artifact,
             use_fp16=use_fp16,
             soc_model=get_soc_to_chipset_map()[args.model],

Answer selected by liu-mengyang

liu-mengyang Dec 22, 2025
Author

Thank you so much for your helpful and detailed guidance! I have successfully viewed the HTP analysis results I wanted.

kimminsu38oo Dec 22, 2025

@shewu-quic
Thanks to your kind explanation, I was able to obtain the QHAS results.
Thank you.

Additionally, I would like to visualize the model graph by following the guide at
https://github.com/pytorch/executorch/blob/main/backends/qualcomm/debugger/README.md#3-visualizing-and-analyzing-optrace-and-qhas
According to the documentation, graph visualization requires a forward_0.dlc file rather than a .bin file.

For the LLaMA3.2-1B-Instruct,
https://github.com/pytorch/executorch/tree/main/examples/qualcomm/oss_scripts/llama#llama32-1b-instruct
or other LLM models, is there a way to obtain .dlc file?

When compiling, passing only the --online_prepare flag resulted in generating only .bin files for the LLaMA3.2-1B-Instruct model.

Thank you

shewu-quic Dec 22, 2025
Collaborator

Hi @kimminsu38oo,
I'm happy to hear your good news.

Yes, you can use the change below to obtain the dlc file. We haven't added --online_prepare support in llama.py because LLM is too large to prepare on the device.

diff --git a/examples/qualcomm/oss_scripts/llama/llama.py b/examples/qualcomm/oss_scripts/llama/llama.py
index 70836f2374..78b591c95b 100755
--- a/examples/qualcomm/oss_scripts/llama/llama.py
+++ b/examples/qualcomm/oss_scripts/llama/llama.py
@@ -86,6 +86,7 @@ from executorch.examples.qualcomm.oss_scripts.llama.range_setting_pt2e import (
 from executorch.examples.qualcomm.oss_scripts.llama.static_llm_quant_recipe import (
     StaticLLMQuantRecipe,
 )
+from executorch.backends.qualcomm.debugger.utils import generate_optrace
 
 from executorch.examples.qualcomm.utils import (
     make_output_dir,
@@ -347,6 +348,7 @@ class SingleLlama:
 
     def lowering_modules(
         self,
         args,
         work_space,
         use_fp16=False,
         soc_model=QcomChipset.SM8650,
@@ -375,6 +377,8 @@ class SingleLlama:
                 backend_options=backend_options,
                 shared_buffer=shared_buffer,
                 use_mha2sha=True,
                 optrace=True,
+                online_prepare=True
             )
             skip_node_op_set = {"llama.fallback.default"}

How to analyze etdump results for QNN backend? #16285

Uh oh!

liu-mengyang Dec 17, 2025

Reproduce command

Replies: 4 comments · 4 replies

Uh oh!

yujiaoliang Dec 18, 2025

Uh oh!

cccclai Dec 18, 2025 Collaborator

Uh oh!

shewu-quic Dec 19, 2025 Collaborator

Uh oh!

kimminsu38oo Dec 21, 2025

Uh oh!

Uh oh!

shewu-quic Dec 22, 2025 Collaborator

Reproduce command

Uh oh!

Uh oh!

liu-mengyang Dec 22, 2025 Author

Uh oh!

Uh oh!

kimminsu38oo Dec 22, 2025

Uh oh!

shewu-quic Dec 22, 2025 Collaborator

liu-mengyang
Dec 17, 2025

Replies: 4 comments 4 replies

yujiaoliang
Dec 18, 2025

cccclai
Dec 18, 2025
Collaborator

shewu-quic
Dec 19, 2025
Collaborator

kimminsu38oo
Dec 21, 2025

shewu-quic Dec 22, 2025
Collaborator

liu-mengyang Dec 22, 2025
Author

shewu-quic Dec 22, 2025
Collaborator