@chouxi chouxi commented Dec 17, 2025

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/2219

This change improves the performance of delta tracking in TBE, mainly by
replacing the DtoH copy ({F1984231816})
with a DtoD copy followed by an async DtoH copy under stream_callback ({F1984231839}).

To achieve this, the following is added:

  • A pre-registered UVA buffer, accessible from both GPU and CPU, is reused every iteration.
    • Giving the tensors the same lifetime as the TBE makes the async copy safe.
    • Reusing the same buffer avoids repeated allocation.
  • raw_embedding_streamer.stream() triggers the CPU thread to perform the async copy.
    • GPU ops don't wait on the D2H copy.
  • To keep the D2D copy from overlapping with the D2H copy:
    • A GPU event tracks completion of the D2D copy, and the CPU thread waits for the D2D copy to finish.
    • join_stream_tensor_copy_thread triggers a blocking wait for the copy in the next iteration, so that the pre-registered buffer is not overwritten when the CPU copy takes too long.
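
The ordering constraints above can be illustrated with a minimal CPU-only sketch. It is an analogy under stated assumptions, not the code in this PR: a plain Python list stands in for the pre-registered UVA buffer, a `threading.Event` stands in for the GPU event, and `StagedCopier`, `join_copy_thread`, and `results` are hypothetical names introduced only for illustration.

```python
import threading


class StagedCopier:
    """CPU-only stand-in for the staging pattern; all names are hypothetical."""

    def __init__(self, size):
        # Allocated once and reused every iteration
        # (stands in for the pre-registered UVA buffer).
        self.staging = [0] * size
        # Stands in for the host-side destination of the D2H copy.
        self.results = []
        self._worker = None

    def join_copy_thread(self):
        # Block until the previous background copy has finished, so the
        # staging buffer is never overwritten while it is still being read.
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def stream(self, source):
        self.join_copy_thread()
        staged = threading.Event()  # stands in for the GPU event

        def background_copy():
            staged.wait()  # do not read until staging is complete
            self.results.append(list(self.staging))

        self._worker = threading.Thread(target=background_copy)
        self._worker.start()

        # "D2D" step: fill the reused staging buffer, then signal completion.
        for i, value in enumerate(source):
            self.staging[i] = value
        staged.set()
        # Return without waiting for the background copy to finish.


copier = StagedCopier(size=3)
copier.stream([1, 2, 3])
copier.stream([4, 5, 6])
copier.join_copy_thread()
print(copier.results)  # [[1, 2, 3], [4, 5, 6]]
```

Without the join at the top of `stream`, the second call could overwrite the staging buffer before the first background copy had read it. That is the hazard the blocking wait in the next iteration guards against.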

Differential Revision: D86888586

@meta-cla meta-cla bot added the cla signed label Dec 17, 2025
meta-codesync bot commented Dec 17, 2025

@chouxi has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86888586.
