
Conversation

@ShaharKatz

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

2. Or, if no issue exists, describe the change:

An issue exists.

Problem:

This fixes failed inferences breaking the evaluation of an eval set or test case.
A failed inference produces an InferenceResult with status FAILURE while its inferences field remains None, which raised an exception when the evaluation code tried to access the inferences themselves.
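To illustrate the failure mode (a minimal, hypothetical repro; the exact call site lives in the evaluation code):

    # inference_result.status is InferenceStatus.FAILURE, but
    # inference_result.inferences was never populated and stays None,
    # so any length/iteration access blows up:
    len(inference_result.inferences)
    # TypeError: object of type 'NoneType' has no len()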

Solution:

As proposed by @surajksharma07, the approach is defense-in-depth:

  1. Filter failed inferences in evaluate() - evaluate() now iterates over all inference_results and, for every inference with status InferenceStatus.FAILURE, it:
    - logs the failure as a warning (skipping evaluation result for case ...)
    - tries to obtain the eval case and raises a NotFoundError if it is None
    - builds a new templated eval_case_result for the case that was not evaluated (due to the inference failure)
    - saves the templated evaluation result through the existing mechanism (via save_eval_set_result)
    All other inferences go through the existing 'happy path' of evaluation_tasks.append(run_evaluation(inference_result))

  2. Add a safety check in _evaluate_single_inference_result() - failed inferences are now handled by building an eval_case_result via a _build_not_evaluated_eval_case_result method. The reason populated by that function depends on the failure mode:
    - If the inference resulted in InferenceStatus.FAILURE, the reason is 'Inference status indicates failure'.
    - If inferences is None, a warning is logged and the reason is 'Missing inference payload'.

Note - The _build_not_evaluated_eval_case_result method constructs an EvalCaseResult for cases that could not be evaluated, populating it with information about the user, session, and eval set/case, and setting EvalStatus.NOT_EVALUATED as the final_eval_status. Small utility methods _resolve_user_id and _safe_get_session_details try to gather that information and fall back to defaults when it is unavailable.
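For illustration only, a minimal sketch of the helper's possible shape; the signature mirrors the call sites discussed in the review below, while the EvalCaseResult field names are assumed from this PR's test assertions and may not match the actual model:

    async def _build_not_evaluated_eval_case_result(
        self,
        *,
        inference_result: InferenceResult,
        eval_case: EvalCase,  # unused in this sketch; kept for parity with the call sites
        user_id: str,
        reason: Optional[str],
    ) -> EvalCaseResult:
      """Builds a placeholder result for an eval case that was not evaluated."""
      # How `reason` is surfaced (log line, dedicated field, ...) depends on the
      # EvalCaseResult schema; it is only carried through here.
      return EvalCaseResult(
          eval_set_id=inference_result.eval_set_id,
          eval_id=inference_result.eval_case_id,
          final_eval_status=EvalStatus.NOT_EVALUATED,
          overall_eval_metric_results=[],  # no metrics were computed
          eval_metric_result_per_invocation=[],
          user_id=user_id,
          session_id=inference_result.session_id,
      )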

Testing Plan

I have added a test that recreates the error described in the issue, test_evaluate_single_inference_result_handles_failed_inference, under tests/unittests/evaluation/test_local_eval_service.py.

This test creates an InferenceResult whose inferences field is None.
It calls _evaluate_single_inference_result and asserts on the returned result.

Furthermore, I have run the other unit tests as specified in CONTRIBUTING.md, and all tests that pass on main also pass on this branch.

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

As noted above, all tests that pass on main also pass on this branch.
The A2A unit tests fail on both this branch and main.

Manual End-to-End (E2E) Tests:

I built the package via uv and tested it locally in the adk-quickstart repo on a small agent with several eval cases.

[screenshot]

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.


@gemini-code-assist
Contributor

Summary of Changes

Hello @ShaharKatz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the evaluation service by implementing robust error handling for failed inference results. It ensures that the evaluation process can gracefully manage scenarios where inferences fail or are missing, preventing crashes and providing clear 'not evaluated' statuses in the results. This improves the reliability and completeness of evaluation reports, allowing users to understand why certain cases were not fully processed.

Highlights

  • Robust Error Handling for Failed Inferences: The pull request introduces a 'defense-in-depth' approach to gracefully handle InferenceResult objects with InferenceStatus.FAILURE or None inferences during evaluation, preventing exceptions and ensuring a complete evaluation report.
  • Filtering Failed Inferences in evaluate(): The evaluate() method now explicitly filters out inference results marked as FAILURE. Instead of attempting to evaluate them, it logs a warning, retrieves the corresponding eval_case, and generates an EvalCaseResult with EvalStatus.NOT_EVALUATED and a reason of 'Inference failed'.
  • Safety Checks in _evaluate_single_inference_result(): Additional safety checks have been added to _evaluate_single_inference_result(). If an InferenceResult has InferenceStatus.FAILURE or its inferences attribute is None, it now constructs an EvalCaseResult with EvalStatus.NOT_EVALUATED, providing specific reasons like 'Inference status indicates failure' or 'Missing inference payload'.
  • New Utility Methods for Not Evaluated Cases: Three new utility methods, _build_not_evaluated_eval_case_result, _resolve_user_id, and _safe_get_session_details, have been introduced to streamline the creation of EvalCaseResult objects for cases that could not be evaluated, including robust handling of user and session details.
  • New Unit Test Coverage: A new unit test, test_evaluate_single_inference_result_handles_failed_inference, has been added to specifically verify the correct behavior when _evaluate_single_inference_result encounters failed inferences.

@adk-bot added the eval [Component] This issue is related to evaluation label on Dec 3, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the issue of handling None inferences in evaluation results by adding defensive checks in evaluate and _evaluate_single_inference_result. The changes are well-structured, with helper methods extracted to improve clarity and reuse. The addition of a unit test is also appreciated. My feedback includes a suggestion for a missing type hint and recommendations for expanding test coverage to ensure all new code paths are validated, which will improve the long-term robustness of these changes.

Comment on lines 173 to 219
for inference_result in evaluate_request.inference_results:
  if inference_result.status == InferenceStatus.FAILURE:
    logger.warning(
        'Skipping evaluation for eval case `%s` because inference failed'
        ' with status `%s`: %s',
        inference_result.eval_case_id,
        inference_result.status,
        inference_result.error_message,
    )
    eval_case = self._eval_sets_manager.get_eval_case(
        app_name=inference_result.app_name,
        eval_set_id=inference_result.eval_set_id,
        eval_case_id=inference_result.eval_case_id,
    )
    if eval_case is None:
      raise NotFoundError(
          f'Eval case with id {inference_result.eval_case_id} not found'
          f' for app {inference_result.app_name} and eval set'
          f' {inference_result.eval_set_id}.'
      )
    eval_case_result = await self._build_not_evaluated_eval_case_result(
        inference_result=inference_result,
        eval_case=eval_case,
        reason='Inference failed',
    )
    if self._eval_set_results_manager:
      self._eval_set_results_manager.save_eval_set_result(
          app_name=inference_result.app_name,
          eval_set_id=inference_result.eval_set_id,
          eval_case_results=[eval_case_result],
      )
    yield eval_case_result
    continue

  evaluation_tasks.append(run_evaluation(inference_result))
Contributor

medium

The new logic to handle failed inferences directly within the evaluate method is a good addition. I see you've performed manual E2E testing which is great. To ensure long-term maintainability and prevent regressions, it would be highly beneficial to also add an automated unit test for this. The test could pass a list of InferenceResult objects with mixed success and failure statuses to evaluate and verify that the async generator yields the correct EvalCaseResult for each.

Author

Extended coverage for the new failure-handling path in LocalEvalService.evaluate() by adding test_evaluate_skips_failed_inference_results. The test feeds a mixed set of InferenceResults (one FAILURE, one SUCCESS, one UNKNOWN) through the async generator and asserts that:
  • the failed inference yields an EvalCaseResult with final_eval_status == EvalStatus.NOT_EVALUATED and no metric data;
  • successful/unknown entries still run through the evaluator and produce passing metric results;
  • EvalSetResultsManager.save_eval_set_result is invoked for each case, matching the runtime behavior.
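For illustration, a sketch of how such a mixed-status test might drive the async generator; the fixture names and exact request construction here are assumed, not copied from the added test:

    # Assumed fixtures: failed_result / success_result / unknown_result are
    # InferenceResult objects with the corresponding statuses, and
    # evaluate_config is an EvaluateConfig like in the other tests.
    evaluate_request = EvaluateRequest(
        inference_results=[failed_result, success_result, unknown_result],
        evaluate_config=evaluate_config,
    )
    results = [
        result
        async for result in eval_service.evaluate(evaluate_request=evaluate_request)
    ]

    not_evaluated = next(
        r for r in results if r.eval_id == failed_result.eval_case_id
    )
    assert not_evaluated.final_eval_status == EvalStatus.NOT_EVALUATED
    assert not_evaluated.overall_eval_metric_results == []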

Comment on lines 486 to 488
async def _safe_get_session_details(
    self, *, app_name: str, user_id: str, session_id: str
):
Contributor

medium

For better type safety and code clarity, please add a return type hint to the _safe_get_session_details method. Since Session is not imported in this file, you can use a forward reference string for the type hint.

Suggested change
-async def _safe_get_session_details(
-    self, *, app_name: str, user_id: str, session_id: str
-):
+async def _safe_get_session_details(
+    self, *, app_name: str, user_id: str, session_id: str
+) -> Optional['Session']:

Author

added the type hint

Comment on lines +411 to +527
async def test_evaluate_single_inference_result_handles_failed_inference(
    eval_service, mock_eval_sets_manager, mocker
):
  invocation = Invocation(
      user_content=genai_types.Content(
          parts=[genai_types.Part(text="test user content.")]
      ),
      final_response=genai_types.Content(
          parts=[genai_types.Part(text="test final response.")]
      ),
  )
  inference_result = InferenceResult(
      app_name="test_app",
      eval_set_id="test_eval_set",
      eval_case_id="case1",
      inferences=None,
      session_id="session1",
      status=InferenceStatus.FAILURE,
      error_message="simulated inference failure",
  )
  eval_metric = EvalMetric(metric_name="fake_metric", threshold=0.5)
  evaluate_config = EvaluateConfig(eval_metrics=[eval_metric], parallelism=1)

  mock_eval_case = mocker.MagicMock(spec=EvalCase)
  mock_eval_case.conversation = [invocation.model_copy(deep=True)]
  mock_eval_case.conversation_scenario = None
  mock_eval_case.session_input = None
  mock_eval_sets_manager.get_eval_case.return_value = mock_eval_case

  _, result = await eval_service._evaluate_single_inference_result(
      inference_result=inference_result, evaluate_config=evaluate_config
  )

  assert isinstance(result, EvalCaseResult)
  assert result.eval_id == "case1"
  assert result.final_eval_status == EvalStatus.NOT_EVALUATED
  assert result.overall_eval_metric_results == []
  assert result.eval_metric_result_per_invocation == []
  mock_eval_sets_manager.get_eval_case.assert_called_once_with(
      app_name="test_app", eval_set_id="test_eval_set", eval_case_id="case1"
  )
Contributor

medium

This is a great test for handling failed inferences! To improve test coverage for the new logic in _evaluate_single_inference_result, could you please add another test case? Specifically, one that covers the scenario where inference_result.status is not FAILURE (e.g., SUCCESS or UNKNOWN), but inference_result.inferences is None. This would ensure the second if block in _evaluate_single_inference_result is also tested.

Author

Added another test, test_evaluate_single_inference_result_handles_missing_inferences, to cover this case.

eval_case=eval_case,
reason='Inference failed',
)
)
Collaborator

We don't need to do this, as we are already handling this case in the _evaluate_single_inference_result method. Please revert.

Author

I've reverted.

yield eval_case_result
continue

evaluation_tasks.append(run_evaluation(inference_result))
Collaborator

Same, please revert to what we had earlier. This case is already handled by the _evaluate_single_inference_result method.

Author

I've reverted.

user_id=user_id,
reason='Missing inference payload',
)
return (inference_result, eval_case_result)
Collaborator

No need for a special case; add this to "if inference_result.status == InferenceStatus.FAILURE:", i.e.:

if inference_result.status == InferenceStatus.FAILURE or inference_result.inferences is None:

Author

I've moved both cases under the same if statement.

inference_result=inference_result,
eval_case=eval_case,
user_id=user_id,
reason='Inference status indicates failure',
Collaborator

Get reason from inference_result.error_message

Author

Inference reason is now set as reason=inference_result.error_message
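For reference, a sketch of how the merged guard might read after these two changes; this is assembled from the hunks above, not copied from the final diff:

    # One guard covering both failure modes; the reason comes from the
    # inference result's error message (which may be None for the
    # missing-payload case).
    if (
        inference_result.status == InferenceStatus.FAILURE
        or inference_result.inferences is None
    ):
      eval_case_result = await self._build_not_evaluated_eval_case_result(
          inference_result=inference_result,
          eval_case=eval_case,
          user_id=user_id,
          reason=inference_result.error_message,
      )
      return (inference_result, eval_case_result)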

*,
inference_result: InferenceResult,
eval_case: EvalCase,
user_id: Optional[str] = None,
Collaborator

user_id should not be optional for this method.

Author

I've made user_id mandatory.

def _resolve_user_id(self, eval_case: EvalCase) -> str:
  if eval_case.session_input and eval_case.session_input.user_id:
    return eval_case.session_input.user_id
  return 'test_user_id'
Collaborator

For now let's remove this method, as we won't need it.

Author

I've removed the method.

app_name,
user_id,
)
return None
Collaborator

We don't want to catch the NotFoundError, actually; we want it to bubble up.

Remove this method and make the following call directly where you need it:

return await self._session_service.get_session(
    app_name=app_name, user_id=user_id, session_id=session_id
)

Author

I've removed the error swallowing and now call the session_service directly.

'Eval case `%s` marked as not evaluated: %s',
inference_result.eval_case_id,
reason,
)
Collaborator

Remove this log line. We would have already written a log line earlier anyway.

Author

I've removed the log.

@ryanaiagent self-assigned this on Dec 4, 2025
@ShaharKatz force-pushed the fix-2729-none-inferences branch 2 times, most recently from ca62e21 to e408198, on December 4, 2025 at 21:17
@ryanaiagent added the request clarification [Status] The maintainer need clarification or more information from the author label on Dec 4, 2025
@ShaharKatz force-pushed the fix-2729-none-inferences branch from e408198 to a3529a8 on December 4, 2025 at 22:15
@ShaharKatz
Author

Thanks for the review @ankursharmas!

I've fixed the code according to your comments; it is now a much cleaner and leaner PR.

@ShaharKatz
Author

Hey @ryanaiagent,

Let me know if you have any questions or need clarification, and I’d be glad to assist.



Development

Successfully merging this pull request may close this issue:

during run evaluation of a test case inferences could be NoneType which has no len attribute and error is not caught
