[TOSA] MultiheadAttention legalization #4382
Conversation
Force-pushed from a98526f to cf45a2e.
Lallapallooza left a comment:
Thanks for the patch, a few comments.
projects/pt1/test/python/scaled_dot_product_attention_lowering.py (outdated; resolved)
Force-pushed from cf45a2e to fd02d37.
- Legalize Torch scaled_dot_product_attention into TOSA by adding the necessary patterns in TorchToTosa.cpp plus backend type-conversion hooks.
- Introduce a detailed decomposition path for multi-head attention within DecomposeComplexOps.cpp, preparing inputs for TOSA lowering.
- Expand the PT1 e2e suite with a dedicated multi-head attention MLIR/Python test and drop the corresponding xfails now that the path works.

Signed-off-by: Cathal Corbett <cathal.corbett@arm.com>
Change-Id: I96c17aefd25b979f1cf6e897d91d5a29f0a2fa85
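For reference, scaled_dot_product_attention computes the standard scaled dot-product attention, which the decomposition introduced here expands into explicit transpose/matmul/softmax ops:

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + M\right) V$$

where $d_k$ is the head dimension and $M$ is the optional additive attention mask (omitted when no mask is supplied).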
Force-pushed from fd02d37 to 9ef3ed1.
legalOpsSet.clear();
legalOpsSet.insert(legalOps.begin(), legalOps.end());

patterns.add<DecomposeAtenScaledDotProductAttentionOp>(context);
Is this pattern needed anymore with the change in fx_decomp_util?
Please correct me if I misunderstand, but I believe we still need the MLIR-side pattern. The new entry in python/torch_mlir/extras/fx_decomp_util.py only affects the FX/ExportedProgram import path. Other frontends—TorchScript, AOTAutograd, or anyone who feeds raw Torch dialect into torch-mlir-opt—never touch that Python list, so they can still produce torch.aten.scaled_dot_product_attention. For those cases the rewrite in lib/Dialect/Torch/Transforms/DecomposeComplexOps.cpp is what lowers sdpa into the matmul/softmax pipeline so that downstream -convert-torch-to-tosa or -convert-torch-to-linalg keeps working.
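As a rough sketch of what that MLIR-side pattern looks like (hypothetical skeleton only; the class name matches the registration quoted above, but the body is illustrative and elides the actual op construction):

```cpp
#include "mlir/IR/PatternMatch.h"
#include "torch-mlir/Dialect/Torch/IR/TorchOps.h"

using namespace mlir;
using namespace mlir::torch::Torch;

namespace {
// Illustrative skeleton only; the real pattern in this PR may differ.
class DecomposeAtenScaledDotProductAttentionOp
    : public OpRewritePattern<AtenScaledDotProductAttentionOp> {
public:
  using OpRewritePattern::OpRewritePattern;

  LogicalResult matchAndRewrite(AtenScaledDotProductAttentionOp op,
                                PatternRewriter &rewriter) const override {
    // 1. Validate query/key/value ranks and dtypes; every bail-out happens
    //    here, before any new IR is created.
    // 2. Emit transpose(key), matmul(query, key^T), scaling, the optional
    //    mask add, softmax, and a final matmul with value as Torch ops.
    // 3. rewriter.replaceOp(op, attentionOutput);
    // Construction elided in this sketch.
    return failure();
  }
};
} // namespace
```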
Based on torch-mlir/docs/development.md (line 244 in 0844d4d):

> This path doesn't give access to the current generation work that is being driven via the fx_importer

the fx_importer path is the only maintained path. The rest have been deprecated, but the code still exists. Maybe @sjarus / @zjgarvey can confirm or correct that understanding, and we can discuss whether it's still valuable to have this decomposition pattern or whether we can rely on PyTorch's decomposition.
I don't have any objection to adding this; I just want to make sure it will actually be exercised.
Yes, the fx_importer is the only path we should expect to support.
I have found attention to be a bit frustrating, however. For example, running decompositions on an exported program with an sdpa op sometimes converts sdpa into a slightly different attention op, even when attention itself isn't being decomposed. Merely running decompositions at all actually retraces the graph with a different tool, and the ops it selects can further vary based on other factors like the torch device used by the inputs.
In any case, I don't mind adding a decomposition pattern. We have a bit more control with a pattern like this as opposed to fx decompositions.
@sahas3, based on the above comment, are you happy to keep the decomposition as is, or would you prefer to remove it?
Thanks for the reminder, I'm fine with adding the pattern.
The puzzle now is that if we add sdpa to the decomp list in fx_decomp_util, then although we have a LIT test locking down the decomposition pattern, it won't be exercised by the e2e test. However, the ops we are decomposing to with the C++ pattern are all locked down via e2e tests (I hope :D). So not being able to e2e test this C++ decomp pattern is probably fine? Thoughts @zjgarvey / @sjarus?
On that note, we probably need a way to control which ops get decomposed with PyTorch decompositions at the e2e test level, but that's out of scope for this PR.
sahas3 left a comment:
Thanks for the change. I'm fine with adding the decomposition pattern.
SmallVector<int64_t> transposedShape(rankedSelf.getRank(),
                                     ShapedType::kDynamic);
if (rankedSelf.hasStaticShape()) {
  auto staticShape =
llvm::to_vector is redundant since the return type of makeShapeTorchCompatible is already a SmallVector.
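In other words (a small sketch; it relies on makeShapeTorchCompatible returning SmallVector<int64_t>, per the comment above):

```cpp
// The llvm::to_vector wrapper can be dropped; the helper's SmallVector
// return value binds directly.
SmallVector<int64_t> staticShape =
    makeShapeTorchCompatible(rankedSelf.getShape());
```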
      llvm::to_vector(makeShapeTorchCompatible(rankedSelf.getShape()));
  auto dim0Index = static_cast<size_t>(dim0);
  auto dim1Index = static_cast<size_t>(dim1);
  if (dim0Index < staticShape.size() && dim1Index < staticShape.size())
Isn't this condition always guaranteed by this earlier check?

if (!isValidDim(dim0, selfRank) || !isValidDim(dim1, selfRank))
    for (size_t i = 0; i < staticShape.size(); ++i)
      transposedShape[i] = staticShape[i];
  }
  auto rankedResult = RankedTensorType::get(
IIUC, you are computing the transposed shape for statically shaped inputs and using that to construct tosa::TransposeOp. I think we can use tosa::CreateOpAndInfer with UnrankedTensorType::get(elemTy) and let the transpose op creation infer the result type instead of computing it here.
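A sketch of that suggestion, with the caveat that tosa::CreateOpAndInfer lives in TosaLegalizeUtils and that the exact form tosa.transpose expects for its permutation (const operand vs. dense-array attribute) depends on the TOSA dialect version in tree; selfTensor and perms are placeholder names:

```cpp
// Sketch: let TOSA shape inference produce the transposed result type
// instead of assembling it by hand. selfTensor is the already-converted
// tosa-typed input and perms is the dim0 <-> dim1 permutation (placeholders).
Type elemTy = rankedSelf.getElementType();
auto transposed = tosa::CreateOpAndInfer<tosa::TransposeOp>(
    rewriter, op->getLoc(),
    UnrankedTensorType::get(elemTy), // result type is inferred by the op
    selfTensor,
    rewriter.getDenseI32ArrayAttr(perms));
```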
target.addDynamicallyLegalOp<tensor::CastOp>([](tensor::CastOp op) -> bool {
  auto sourceType = dyn_cast<RankedTensorType>(op.getSource().getType());
  auto resultType = dyn_cast<RankedTensorType>(op.getType());
  if (!sourceType || !resultType)
    return true;
  if (sourceType.getElementType() != resultType.getElementType())
    return true;
  if (!sourceType.hasStaticShape())
    return true;
  if (!resultType.hasStaticShape())
    return true;
  if (sourceType == resultType)
    return true;
  return false;
});
I am guessing this is needed because of the targetMaterialization change?
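For context, a target materialization of the following general shape is the kind of thing that introduces those tensor.cast ops in the first place (a generic sketch of the mechanism, not necessarily the exact hook in this PR):

```cpp
// Generic sketch: bridge type mismatches left by the conversion with a
// tensor.cast. Casts created this way are why tensor::CastOp then needs
// the dynamic-legality rule quoted below.
typeConverter.addTargetMaterialization(
    [](OpBuilder &builder, Type resultType, ValueRange inputs,
       Location loc) -> Value {
      if (inputs.size() != 1 || !isa<TensorType>(inputs.front().getType()) ||
          !isa<TensorType>(resultType))
        return Value();
      return builder.create<tensor::CastOp>(loc, resultType, inputs.front());
    });
```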
legalOpsSet.clear();
legalOpsSet.insert(legalOps.begin(), legalOps.end());

patterns.add<DecomposeAtenScaledDotProductAttentionOp>(context);
Suggested change:
- patterns.add<DecomposeAtenScaledDotProductAttentionOp>(context);
+ addPatternIfTargetOpIsIllegal<DecomposeAtenScaledDotProductAttentionOp>(patterns);
if (static_cast<int64_t>(keySizes.size()) != queryRank ||
    static_cast<int64_t>(valueSizes.size()) != queryRank)
  return rewriter.notifyMatchFailure(
      op, "expected query, key, and value to share rank");
This failure check should happen before any IR modification starts; otherwise the IR will be left in a bad state, since we've already introduced new ops but the original op is not replaced, eventually leading to the pass failing.
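Concretely, the idea is to hoist every notifyMatchFailure exit above the first create call, so the pattern either rewrites completely or leaves the IR untouched. An illustrative shape, reusing the variable names from the quoted diff hunks:

```cpp
// Illustrative structure only, reusing names from the quoted diff.
LogicalResult matchAndRewrite(AtenScaledDotProductAttentionOp op,
                              PatternRewriter &rewriter) const override {
  // Phase 1: validation only; no ops have been created yet.
  if (static_cast<int64_t>(keySizes.size()) != queryRank ||
      static_cast<int64_t>(valueSizes.size()) != queryRank)
    return rewriter.notifyMatchFailure(
        op, "expected query, key, and value to share rank");
  if (keyTransposedSizes.size() < 2)
    return rewriter.notifyMatchFailure(
        op, "expected key tensor rank >= 2 for transpose");

  // Phase 2: only now start creating the transpose/matmul/softmax chain
  // and finish with rewriter.replaceOp(op, attentionOutput).
  // ...
  return success();
}
```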
if (keyTransposedSizes.size() < 2)
  return rewriter.notifyMatchFailure(
      op, "expected key tensor rank >= 2 for transpose");
This too has to be checked earlier.
if (!softmax)
  return rewriter.notifyMatchFailure(op,
                                     "failed to compute softmax scores");
This one is probably fine. Looking at getSoftmaxResult, I don't expect it to fail at this point. I can't think of a good way to guarantee that, though, short of replicating getSoftmaxResult's checks early here, which doesn't seem like a good idea either.
@@ -0,0 +1,29 @@
// RUN: torch-mlir-opt %s -torch-decompose-complex-ops -convert-torch-to-tosa -split-input-file | FileCheck %s

// Checks that scaled dot product attention (single-head configuration) lowers
I think a better place for this test is the existing file with this RUN line:

// RUN: torch-mlir-opt -torch-decompose-complex-ops -split-input-file %s | FileCheck %s