
Conversation

@SS-JIA (Contributor) commented Dec 22, 2025

Summary:

A use-after-free bug in find_compute_queues() was discovered using Valgrind. The function created a local queue_priorities vector inside a loop and stored its data pointer in VkDeviceQueueCreateInfo. When the vector went out of scope, its memory was freed, but vkCreateDevice() later accessed this freed memory. Fixed by adding a queue_priorities parameter to persist the data until after vkCreateDevice() completes.

Problem

A use-after-free bug was discovered in find_compute_queues() using Valgrind.
The function created a local std::vector<float> queue_priorities inside a
loop and stored its data pointer in VkDeviceQueueCreateInfo. When the vector
went out of scope at the end of each iteration, its memory was freed. Later,
when vkCreateDevice() accessed these queue priorities, it read from freed
memory.
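
For illustration, here is a minimal sketch of the problematic pattern, assuming a loop over compute queue families (the helper name build_queue_infos_buggy, the parameter names, and the priority value are placeholders, not the actual ExecuTorch code):

```cpp
#include <vulkan/vulkan.h>

#include <vector>

// Sketch of the bug: each iteration's queue_priorities is destroyed at the
// end of the loop body, but its data pointer escapes via pQueuePriorities.
std::vector<VkDeviceQueueCreateInfo> build_queue_infos_buggy(
    const std::vector<uint32_t>& compute_family_indices,
    uint32_t queues_per_family) {
  std::vector<VkDeviceQueueCreateInfo> queue_create_infos;
  for (const uint32_t family_idx : compute_family_indices) {
    // Local vector: its heap buffer is freed when this iteration ends.
    std::vector<float> queue_priorities(queues_per_family, 1.0f);

    VkDeviceQueueCreateInfo info{};
    info.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    info.queueFamilyIndex = family_idx;
    info.queueCount = queues_per_family;
    // Dangling pointer: this outlives queue_priorities.
    info.pQueuePriorities = queue_priorities.data();
    queue_create_infos.push_back(info);
  }
  // By the time vkCreateDevice() reads pQueuePriorities, the memory is freed.
  return queue_create_infos;
}
```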

Investigation

Valgrind reported:

  • Invalid read of size 4 at 0xb3bdd60 (freed memory)
  • Block was freed by operator delete in find_compute_queues()
  • Block was allocated by operator new in find_compute_queues()
  • Error occurred during vkCreateDevice() call

Fix

Modified find_compute_queues() to accept an additional parameter
std::vector<std::vector<float>>& queue_priorities that persists the
queue priority data until after vkCreateDevice() completes. This ensures
the memory remains valid when Vulkan needs to access it.
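
A minimal sketch of the fixed pattern, assuming the caller passes in a container that owns the priority data (function and parameter names are illustrative, not the exact source):

```cpp
#include <vulkan/vulkan.h>

#include <vector>

// Sketch of the fix: priorities are stored in a caller-owned vector, so the
// buffers stay alive until after vkCreateDevice() returns.
std::vector<VkDeviceQueueCreateInfo> build_queue_infos_fixed(
    const std::vector<uint32_t>& compute_family_indices,
    uint32_t queues_per_family,
    std::vector<std::vector<float>>& queue_priorities) {
  std::vector<VkDeviceQueueCreateInfo> queue_create_infos;
  for (const uint32_t family_idx : compute_family_indices) {
    // Construct the priority list in the caller-owned container.
    queue_priorities.emplace_back(queues_per_family, 1.0f);

    VkDeviceQueueCreateInfo info{};
    info.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    info.queueFamilyIndex = family_idx;
    info.queueCount = queues_per_family;
    info.pQueuePriorities = queue_priorities.back().data();
    queue_create_infos.push_back(info);
  }
  return queue_create_infos;
}
```

Note that even if the outer std::vector<std::vector<float>> reallocates, pointers obtained from the inner vectors' data() remain valid, since moving a std::vector preserves its heap buffer.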

Updated all call sites (see the caller-side sketch after this list):

  • create_logical_device()
  • Adapter constructor (external device variant)
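
Hypothetical caller-side sketch showing why the extra parameter fixes the lifetime (build_queue_infos_fixed above stands in for find_compute_queues; the actual signatures in create_logical_device() and the Adapter constructor differ):

```cpp
// queue_priorities is declared in the caller, so it outlives vkCreateDevice().
VkDevice create_device_sketch(
    VkPhysicalDevice physical_device,
    const std::vector<uint32_t>& compute_family_indices) {
  std::vector<std::vector<float>> queue_priorities;
  std::vector<VkDeviceQueueCreateInfo> queue_create_infos =
      build_queue_infos_fixed(compute_family_indices, 1u, queue_priorities);

  VkDeviceCreateInfo device_create_info{};
  device_create_info.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
  device_create_info.queueCreateInfoCount =
      static_cast<uint32_t>(queue_create_infos.size());
  device_create_info.pQueueCreateInfos = queue_create_infos.data();

  VkDevice device = VK_NULL_HANDLE;
  // The priority data is still valid here because queue_priorities is in scope.
  vkCreateDevice(physical_device, &device_create_info, nullptr, &device);
  return device;
}
```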

Verification

Valgrind results before fix: 296 errors from 13 contexts, 1 Invalid read
Valgrind results after fix: 295 errors from 12 contexts, 0 Invalid reads ✓

Remaining errors are in NVIDIA drivers and third-party libraries.

cc @manuelcandales @digantdesai @cbilgin

@pytorch-bot bot added the module: vulkan label Dec 22, 2025
@pytorch-bot bot commented Dec 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16367

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job, 1 Unrelated Failure

As of commit 128b9e8 with merge base 0ee2f49:

CANCELLED JOB - The following job was cancelled. Please retry:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Dec 22, 2025
@github-actions bot commented

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync bot commented Dec 22, 2025

@SS-JIA has imported this pull request. If you are a Meta employee, you can view this in D89687612.

@meta-codesync bot merged commit c59acfb into main Dec 23, 2025
166 of 171 checks passed
@meta-codesync bot deleted the pr16367 branch December 23, 2025 03:27