
pskiran1 (Member) commented Dec 1, 2025

This PR enhances the detection mechanism for unresponsive or crashed Python backend stub processes.

Key Changes:

  • Replaces a simple PID check with waitpid() to actively detect terminated stub processes on non-Windows platforms
  • Adds a new TRITONBACKEND_ModelInstanceReady function to verify the health of the stub process and accurately report the model's state to the core
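The non-blocking liveness check described in the first bullet can be sketched as follows. This is a minimal illustration, not the code in this PR: `ChildIsAlive` and `RunLivenessExample` are hypothetical names, and the example assumes a POSIX platform where the checked process is a direct child of the caller.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>

// Hypothetical helper, not the backend's actual code: reports whether the
// child process `pid` is still running, without blocking the caller.
bool
ChildIsAlive(pid_t pid)
{
  if (pid <= 0) {
    return false;  // No child was launched, or it was already cleaned up.
  }
  int status = 0;
  // WNOHANG: return immediately instead of waiting for the child to exit.
  const pid_t result = waitpid(pid, &status, WNOHANG);
  if (result == 0) {
    return true;  // Child exists and has not exited.
  }
  // result == pid: the child exited and this call reaped it.
  // result == -1:  error, e.g. ECHILD when the child was reaped earlier.
  return false;
}

// Example flow: start an idle child, confirm it is alive, kill it, then poll
// until the exit is observed. Returns true if every step behaved as expected.
bool
RunLivenessExample()
{
  const pid_t pid = fork();
  assert(pid >= 0 && "fork() failed");
  if (pid == 0) {
    pause();  // Child: block until a signal arrives.
    _exit(0);
  }
  if (!ChildIsAlive(pid)) {
    return false;
  }
  kill(pid, SIGKILL);
  while (ChildIsAlive(pid)) {
    usleep(1000);  // SIGKILL delivery is asynchronous; poll briefly.
  }
  return !ChildIsAlive(pid);
}
```

Because a successful `waitpid()` reaps the child, later calls on the same PID fail with `ECHILD`. The sketch treats both outcomes as "not alive".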

@pskiran1 pskiran1 requested a review from Copilot December 2, 2025 16:24

Copilot AI left a comment


Pull request overview

This PR enhances the detection mechanism for unresponsive or crashed Python backend stub processes by implementing proper process health checks on Unix-like systems.

Key Changes:

  • Replaces a simple PID check with waitpid() to actively detect terminated stub processes on non-Windows platforms
  • Adds a new TRITONBACKEND_ModelInstanceReady function to verify stub process health before execution
  • Ensures proper cleanup by guarding the kill operation with a PID check
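One plausible reading of the guarded kill in the last bullet is sketched below. The names `ChildHandle`, `TerminateChild`, and `RunTerminateExample` are hypothetical, and the exact condition used in the PR is not shown in this conversation, so `pid <= 0` is an assumption. The guard matters because POSIX `kill()` gives special meaning to non-positive PIDs.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>

// Hypothetical owner of a child PID; 0 means "no child".
struct ChildHandle {
  pid_t pid = 0;
};

// Sends SIGKILL only when a real child PID is recorded, then reaps the child
// and clears the PID. Returns true if a signal was sent.
bool
TerminateChild(ChildHandle& child)
{
  if (child.pid <= 0) {
    // kill(0, ...) would signal the caller's whole process group and
    // kill(-1, ...) every process the caller may signal, so bail out.
    return false;
  }
  kill(child.pid, SIGKILL);
  int status = 0;
  waitpid(child.pid, &status, 0);  // Blocking wait to reap the child.
  child.pid = 0;                   // Makes repeated calls a no-op.
  return true;
}

// Example flow: terminating with no child does nothing, terminating a live
// child succeeds once, and a second attempt is a no-op.
bool
RunTerminateExample()
{
  ChildHandle child;
  if (TerminateChild(child)) {
    return false;  // Nothing was launched, so nothing may be signalled.
  }
  child.pid = fork();
  assert(child.pid >= 0 && "fork() failed");
  if (child.pid == 0) {
    pause();  // Child: block until a signal arrives.
    _exit(0);
  }
  const bool first = TerminateChild(child);
  const bool second = TerminateChild(child);
  return first && !second && child.pid == 0;
}
```

Resetting the stored PID after reaping keeps a later cleanup pass from signalling an unrelated process that happens to reuse the same PID.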

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/stub_launcher.cc | Implements waitpid() with WNOHANG to detect process termination and adds guard for kill operations |
| src/python_be.cc | Adds health check function that validates stub process is both active and responsive |
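A health check that requires the stub to be both active and responsive could be composed along these lines. This is only a sketch: `ReadyResult`, `CheckStubReady`, the two probe callbacks, and the reason strings are hypothetical, and the real signature and logic of TRITONBACKEND_ModelInstanceReady are not shown in this conversation.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical result type for this sketch.
struct ReadyResult {
  bool ready;
  std::string reason;  // Empty when ready.
};

// Runs the cheap liveness probe first and only asks whether the process is
// responsive if it is still alive.
ReadyResult
CheckStubReady(
    const std::function<bool()>& is_alive,
    const std::function<bool()>& is_responsive)
{
  assert(is_alive && is_responsive && "both probes must be set");
  if (!is_alive()) {
    return {false, "stub process is not running"};
  }
  if (!is_responsive()) {
    return {false, "stub process is not responding"};
  }
  return {true, ""};
}
```

Checking liveness before responsiveness avoids waiting on a process that has already exited, and the separate reason strings let the caller report which condition failed.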


@pskiran1 pskiran1 requested a review from yinggeh December 8, 2025 04:33
    return false;
  } else if (return_pid == stub_pid_) {
    // Process has exited and has been reaped
    return false;
Contributor


Does this case indicate an error? Is it worth logging?

Member Author


Logging is beneficial, but we may not need to log here, since core already logs a message saying that the stub process is not healthy: https://github.com/triton-inference-server/core/pull/461/files#diff-2d67039ec4a39ebaa76ab0e875883d5a2ce230cc413d18e09cff4f413a3686ebR479-R481

@pskiran1 pskiran1 requested a review from yinggeh December 8, 2025 13:14
4 participants