Handle None inferences in eval results for issue #2729 (#3805)
Conversation
Summary of Changes

Hello @ShaharKatz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the evaluation service by implementing robust error handling for failed inference results. It ensures that the evaluation process can gracefully manage scenarios where inferences fail or are missing, preventing crashes and providing clear 'not evaluated' statuses in the results. This improves the reliability and completeness of evaluation reports, allowing users to understand why certain cases were not fully processed.
Code Review
This pull request effectively addresses the issue of handling None inferences in evaluation results by adding defensive checks in evaluate and _evaluate_single_inference_result. The changes are well-structured, with helper methods extracted to improve clarity and reuse. The addition of a unit test is also appreciated. My feedback includes a suggestion for a missing type hint and recommendations for expanding test coverage to ensure all new code paths are validated, which will improve the long-term robustness of these changes.
```python
for inference_result in evaluate_request.inference_results:
  if inference_result.status == InferenceStatus.FAILURE:
    logger.warning(
        'Skipping evaluation for eval case `%s` because inference failed'
        ' with status `%s`: %s',
        inference_result.eval_case_id,
        inference_result.status,
        inference_result.error_message,
    )
    eval_case = self._eval_sets_manager.get_eval_case(
        app_name=inference_result.app_name,
        eval_set_id=inference_result.eval_set_id,
        eval_case_id=inference_result.eval_case_id,
    )
    if eval_case is None:
      raise NotFoundError(
          f'Eval case with id {inference_result.eval_case_id} not found'
          f' for app {inference_result.app_name} and eval set'
          f' {inference_result.eval_set_id}.'
      )
    eval_case_result = await self._build_not_evaluated_eval_case_result(
        inference_result=inference_result,
        eval_case=eval_case,
        reason='Inference failed',
    )
    if self._eval_set_results_manager:
      self._eval_set_results_manager.save_eval_set_result(
          app_name=inference_result.app_name,
          eval_set_id=inference_result.eval_set_id,
          eval_case_results=[eval_case_result],
      )
    yield eval_case_result
    continue

  evaluation_tasks.append(run_evaluation(inference_result))
```
The new logic to handle failed inferences directly within the evaluate method is a good addition. I see you've performed manual E2E testing, which is great. To ensure long-term maintainability and prevent regressions, it would be highly beneficial to also add an automated unit test for this. The test could pass a list of InferenceResult objects with mixed success and failure statuses to evaluate and verify that the async generator yields the correct EvalCaseResult for each.
Extended coverage for the new failure-handling path in LocalEvalService.evaluate() by adding test_evaluate_skips_failed_inference_results. The test feeds a mixed set of InferenceResults (one FAILURE, one SUCCESS, one UNKNOWN) through the async generator and asserts:

- The failed inference yields an EvalCaseResult with final_eval_status == EvalStatus.NOT_EVALUATED and no metric data.
- Successful/unknown entries still run through the evaluator and produce passing metric results.
- EvalSetResultsManager.save_eval_set_result is invoked for each case, matching the runtime behavior.
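A condensed sketch of that test's shape, assuming the fixture names (eval_service, mock_eval_sets_manager, mock_eval_set_results_manager) and an EvaluateRequest shape matching the rest of test_local_eval_service.py; the actual test mixes three statuses, and only the failed entry is spelled out here:

```python
async def test_evaluate_skips_failed_inference_results(
    eval_service, mock_eval_sets_manager, mock_eval_set_results_manager, mocker
):
  failed = InferenceResult(
      app_name="test_app",
      eval_set_id="test_eval_set",
      eval_case_id="failed_case",
      inferences=None,
      session_id="session1",
      status=InferenceStatus.FAILURE,
      error_message="simulated inference failure",
  )
  mock_eval_case = mocker.MagicMock(spec=EvalCase)
  mock_eval_case.session_input = None
  mock_eval_sets_manager.get_eval_case.return_value = mock_eval_case

  evaluate_request = EvaluateRequest(
      inference_results=[failed],
      evaluate_config=EvaluateConfig(
          eval_metrics=[EvalMetric(metric_name="fake_metric", threshold=0.5)],
          parallelism=1,
      ),
  )
  # Drain the async generator and check the failed case's result.
  results = [r async for r in eval_service.evaluate(evaluate_request)]

  assert results[0].eval_id == "failed_case"
  assert results[0].final_eval_status == EvalStatus.NOT_EVALUATED
  assert results[0].overall_eval_metric_results == []
  # One save per yielded case, matching the runtime behavior.
  mock_eval_set_results_manager.save_eval_set_result.assert_called_once()
```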
```python
async def _safe_get_session_details(
    self, *, app_name: str, user_id: str, session_id: str
):
```
For better type safety and code clarity, please add a return type hint to the _safe_get_session_details method. Since Session is not imported in this file, you can use a forward reference string for the type hint.
```python
async def _safe_get_session_details(
    self, *, app_name: str, user_id: str, session_id: str
) -> Optional['Session']:
```
Added the type hint.
```python
async def test_evaluate_single_inference_result_handles_failed_inference(
    eval_service, mock_eval_sets_manager, mocker
):
  invocation = Invocation(
      user_content=genai_types.Content(
          parts=[genai_types.Part(text="test user content.")]
      ),
      final_response=genai_types.Content(
          parts=[genai_types.Part(text="test final response.")]
      ),
  )
  inference_result = InferenceResult(
      app_name="test_app",
      eval_set_id="test_eval_set",
      eval_case_id="case1",
      inferences=None,
      session_id="session1",
      status=InferenceStatus.FAILURE,
      error_message="simulated inference failure",
  )
  eval_metric = EvalMetric(metric_name="fake_metric", threshold=0.5)
  evaluate_config = EvaluateConfig(eval_metrics=[eval_metric], parallelism=1)

  mock_eval_case = mocker.MagicMock(spec=EvalCase)
  mock_eval_case.conversation = [invocation.model_copy(deep=True)]
  mock_eval_case.conversation_scenario = None
  mock_eval_case.session_input = None
  mock_eval_sets_manager.get_eval_case.return_value = mock_eval_case

  _, result = await eval_service._evaluate_single_inference_result(
      inference_result=inference_result, evaluate_config=evaluate_config
  )

  assert isinstance(result, EvalCaseResult)
  assert result.eval_id == "case1"
  assert result.final_eval_status == EvalStatus.NOT_EVALUATED
  assert result.overall_eval_metric_results == []
  assert result.eval_metric_result_per_invocation == []
  mock_eval_sets_manager.get_eval_case.assert_called_once_with(
      app_name="test_app", eval_set_id="test_eval_set", eval_case_id="case1"
  )
```
This is a great test for handling failed inferences! To improve test coverage for the new logic in _evaluate_single_inference_result, could you please add another test case? Specifically, one that covers the scenario where inference_result.status is not FAILURE (e.g., SUCCESS or UNKNOWN), but inference_result.inferences is None. This would ensure the second if block in _evaluate_single_inference_result is also tested.
Added another test, test_evaluate_single_inference_result_handles_missing_inferences, to cover this case.
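A sketch of what that companion test could look like, mirroring the fixtures of the failed-inference test above; the status value and exact assertions are assumptions, and the PR's actual test may differ:

```python
async def test_evaluate_single_inference_result_handles_missing_inferences(
    eval_service, mock_eval_sets_manager, mocker
):
  # Status is SUCCESS, not FAILURE, but the inference payload is missing.
  inference_result = InferenceResult(
      app_name="test_app",
      eval_set_id="test_eval_set",
      eval_case_id="case2",
      inferences=None,
      session_id="session2",
      status=InferenceStatus.SUCCESS,
  )
  evaluate_config = EvaluateConfig(
      eval_metrics=[EvalMetric(metric_name="fake_metric", threshold=0.5)],
      parallelism=1,
  )
  mock_eval_case = mocker.MagicMock(spec=EvalCase)
  mock_eval_case.session_input = None
  mock_eval_sets_manager.get_eval_case.return_value = mock_eval_case

  _, result = await eval_service._evaluate_single_inference_result(
      inference_result=inference_result, evaluate_config=evaluate_config
  )

  # The missing payload should be treated like a failure: no evaluation runs.
  assert result.final_eval_status == EvalStatus.NOT_EVALUATED
  assert result.overall_eval_metric_results == []
```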
```python
        eval_case=eval_case,
        reason='Inference failed',
    )
)
```
We don't need to do this, as we are already handling this case in the _evaluate_single_inference_result method. Please revert.
I've reverted.
```python
    yield eval_case_result
    continue

  evaluation_tasks.append(run_evaluation(inference_result))
```
Same, please revert to what we had earlier. This case is already handled by the _evaluate_single_inference_result method.
I've reverted.
```python
      user_id=user_id,
      reason='Missing inference payload',
  )
  return (inference_result, eval_case_result)
```
No need for a special case; add this to the existing `if inference_result.status == InferenceStatus.FAILURE:` check, i.e.:

```python
if inference_result.status == InferenceStatus.FAILURE or inference_result.inferences is None:
```
I've moved both cases under the same if statement.
```python
      inference_result=inference_result,
      eval_case=eval_case,
      user_id=user_id,
      reason='Inference status indicates failure',
```
Get the reason from inference_result.error_message.
The reason is now taken from the inference result: reason=inference_result.error_message.
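Putting the last two review comments together, the guard in _evaluate_single_inference_result presumably ends up looking something like this; the `or` fallback for an unset error_message is an assumption, not code shown in the PR:

```python
# Combined guard: a FAILURE status and a missing inference payload are
# handled the same way -- build a NOT_EVALUATED result instead of evaluating.
if (
    inference_result.status == InferenceStatus.FAILURE
    or inference_result.inferences is None
):
  eval_case_result = await self._build_not_evaluated_eval_case_result(
      inference_result=inference_result,
      eval_case=eval_case,
      user_id=user_id,
      # Reason comes from the inference result itself; the fallback is an
      # assumption for results that carry no error message.
      reason=inference_result.error_message or 'Inference not available',
  )
  return (inference_result, eval_case_result)
```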
```python
    *,
    inference_result: InferenceResult,
    eval_case: EvalCase,
    user_id: Optional[str] = None,
```
user_id should not be optional for this method.
I've made user_id mandatory.
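With user_id now required, the helper's signature would presumably read roughly as follows (the return type is an assumption):

```python
async def _build_not_evaluated_eval_case_result(
    self,
    *,
    inference_result: InferenceResult,
    eval_case: EvalCase,
    user_id: str,  # no longer Optional
    reason: str,
) -> EvalCaseResult:
  ...
```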
```python
def _resolve_user_id(self, eval_case: EvalCase) -> str:
  if eval_case.session_input and eval_case.session_input.user_id:
    return eval_case.session_input.user_id
  return 'test_user_id'
```
For now let's remove this method, as we won't need it.
I've removed the method.
```python
        app_name,
        user_id,
    )
    return None
```
We don't want to catch the NotFoundError actually; we want it to bubble up. Remove this method and call the following directly where you need it:

```python
return await self._session_service.get_session(
    app_name=app_name, user_id=user_id, session_id=session_id
)
```
I've stopped swallowing the raised error and now call the session service directly.
```python
    'Eval case `%s` marked as not evaluated: %s',
    inference_result.eval_case_id,
    reason,
)
```
Remove this log line. We would have already written a log line earlier anyway.
I've removed the log.
Thanks for the review @ankursharmas! I've fixed the code according to your comments; it is now a much cleaner and leaner PR.

Hey @ryanaiagent, let me know if you have any questions or need clarification, and I'd be glad to assist.
Link to Issue or Description of Change

Link to an existing issue: #2729.
Problem:

This fixes the issue of failed inferences when evaluating an eval set or test case. A failed inference would result in an InferenceResult with status FAILURE, but its inferences would remain None, which would raise an exception when trying to access the inferences themselves.

Solution:

As proposed by @surajksharma07, the approach is defense-in-depth:

1. Filter failed inferences in the evaluate() method. evaluate() now runs through all inference_results, and for every inference with status InferenceStatus.FAILURE it:
- logs the failure as a warning ('Skipping evaluation for eval case ...')
- tries to obtain the eval case and, if it is None, raises a NotFoundError
- builds a new templated eval_case_result for a case that hasn't been evaluated (due to the inference failure)
- saves the templated result through the existing mechanism (via save_eval_set_result)

Other inferences go through the existing happy-path code of evaluation_tasks.append(run_evaluation(inference_result)).

2. Add a safety check in _evaluate_single_inference_result(). Inferences are now protected by building an eval_case_result via a _build_not_evaluated_eval_case_result method. The reason populated in that function depends on the failure:
- If the inference resulted in InferenceStatus.FAILURE, the reason is 'Inference status indicates failure'.
- If the inferences are None, a warning is logged and the reason is 'Missing inference payload'.

Note: the _build_not_evaluated_eval_case_result method constructs an EvalCaseResult for cases that could not be evaluated, populating it with information about the user, session, and eval set/case, and setting EvalStatus.NOT_EVALUATED as the final_eval_status. Small utility methods _resolve_user_id and _safe_get_session_details try to gather this information, with fallbacks where it is unavailable. A minimal sketch follows.
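A minimal sketch of that helper as described above (before the review simplified it), assuming EvalCaseResult accepts the fields asserted by this PR's tests; the remaining field names, and where reason ends up, are guesses rather than the actual implementation:

```python
async def _build_not_evaluated_eval_case_result(
    self,
    *,
    inference_result: InferenceResult,
    eval_case: EvalCase,
    reason: str,
) -> EvalCaseResult:
  # Utility helpers with fallbacks, as described above (both were later
  # removed during review in favor of direct calls).
  user_id = self._resolve_user_id(eval_case)
  session_details = await self._safe_get_session_details(
      app_name=inference_result.app_name,
      user_id=user_id,
      session_id=inference_result.session_id,
  )
  # `reason` explains why the case was not evaluated; exactly where it is
  # recorded is not shown in the PR description.
  return EvalCaseResult(
      eval_set_id=inference_result.eval_set_id,
      eval_id=inference_result.eval_case_id,
      final_eval_status=EvalStatus.NOT_EVALUATED,
      overall_eval_metric_results=[],
      eval_metric_result_per_invocation=[],
      session_id=inference_result.session_id,
      session_details=session_details,
      user_id=user_id,
  )
```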
Testing Plan

I have added a test that recreates the error specified in the issue: test_evaluate_single_inference_result_handles_failed_inference under tests/unittests/evaluation/test_local_eval_service.py. This test creates an inference_result with None as inferences, calls _evaluate_single_inference_result, and asserts the results.

Unit Tests:

I have run the other unit tests as specified in CONTRIBUTING.md, and all the tests that pass on main pass here as well. The A2A unit tests do not pass (either on this branch or on main).

Manual End-to-End (E2E) Tests:

I have built the package via uv and tested it locally within the adk-quickstart repo on a small agent with several eval cases.