fix: Enable detection of unresponsive or crashed Python backend stub process #423
Pull request overview
This PR enhances the detection mechanism for unresponsive or crashed Python backend stub processes by implementing proper process health checks on Unix-like systems.
Key Changes:
- Replaces a simple PID check with `waitpid()` to actively detect terminated stub processes on non-Windows platforms
- Adds a new `TRITONBACKEND_ModelInstanceReady` function to verify stub process health before execution
- Ensures proper cleanup by guarding the kill operation with a PID check
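The first and third changes can be illustrated with a minimal sketch. This is not the code from the PR: `IsChildAlive` and `KillChild` are hypothetical helper names, and the sketch assumes the stub is a direct child of the calling process on a POSIX system. It shows a non-blocking `waitpid()` call with `WNOHANG` used as a liveness probe, and a kill that is skipped when no valid PID is held.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not from the PR): returns true while the child process
// identified by `pid` is still running. `pid` is reset to 0 once the child
// has been reaped or can no longer be waited on.
bool IsChildAlive(pid_t& pid)
{
  if (pid <= 0) {
    return false;  // never launched, or already reaped
  }
  int status = 0;
  // WNOHANG makes waitpid() return immediately instead of blocking.
  pid_t ret = waitpid(pid, &status, WNOHANG);
  if (ret == 0) {
    return true;  // child exists and has not terminated
  }
  // ret == pid: the child exited and has now been reaped.
  // ret == -1:  waitpid() failed (for example ECHILD: not our child).
  pid = 0;
  return false;
}

// Hypothetical helper (not from the PR): terminates the child only when a
// valid PID is held, so kill() is never called with 0 or a negative value,
// which would signal a whole process group instead of a single process.
void KillChild(pid_t& pid)
{
  if (pid > 0) {
    kill(pid, SIGKILL);
    waitpid(pid, nullptr, 0);  // reap so no zombie is left behind
    pid = 0;
  }
}
```

Resetting the stored PID after reaping matters because the operating system may reuse the PID for an unrelated process. A later `kill()` on the stale value could then hit the wrong target.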
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/stub_launcher.cc | Implements waitpid() with WNOHANG to detect process termination and adds guard for kill operations |
| src/python_be.cc | Adds health check function that validates stub process is both active and responsive |
      return false;
    } else if (return_pid == stub_pid_) {
      // Process has exited and has been reaped
      return false;
Does this case indicate an error? Is it worth logging?
Logging is beneficial, but we may not need to log here, since core already logs a message that the stub process is not healthy: https://github.com/triton-inference-server/core/pull/461/files#diff-2d67039ec4a39ebaa76ab0e875883d5a2ce230cc413d18e09cff4f413a3686ebR479-R481
This PR enhances the detection mechanism for unresponsive or crashed Python backend stub processes.
Key Changes:
- Uses `waitpid()` to actively detect terminated stub processes on non-Windows platforms
- Adds a new function, `TRITONBACKEND_ModelInstanceReady`, to verify the health of the stub process and accurately report the model's state to the core.