Handle None inferences in eval results for issue #2729 (#3805)
Conversation
Summary of Changes

Hello @ShaharKatz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the evaluation service by implementing robust error handling for failed inference results. It ensures that the evaluation process can gracefully manage scenarios where inferences fail or are missing, preventing crashes and providing clear 'not evaluated' statuses in the results. This improves the reliability and completeness of evaluation reports, allowing users to understand why certain cases were not fully processed.
Code Review
This pull request effectively addresses the issue of handling None inferences in evaluation results by adding defensive checks in evaluate and _evaluate_single_inference_result. The changes are well-structured, with helper methods extracted to improve clarity and reuse. The addition of a unit test is also appreciated. My feedback includes a suggestion for a missing type hint and recommendations for expanding test coverage to ensure all new code paths are validated, which will improve the long-term robustness of these changes.
```python
for inference_result in evaluate_request.inference_results:
  if inference_result.status == InferenceStatus.FAILURE:
    logger.warning(
        'Skipping evaluation for eval case `%s` because inference failed'
        ' with status `%s`: %s',
        inference_result.eval_case_id,
        inference_result.status,
        inference_result.error_message,
    )
    eval_case = self._eval_sets_manager.get_eval_case(
        app_name=inference_result.app_name,
        eval_set_id=inference_result.eval_set_id,
        eval_case_id=inference_result.eval_case_id,
    )
    if eval_case is None:
      raise NotFoundError(
          f'Eval case with id {inference_result.eval_case_id} not found'
          f' for app {inference_result.app_name} and eval set'
          f' {inference_result.eval_set_id}.'
      )
    eval_case_result = await self._build_not_evaluated_eval_case_result(
        inference_result=inference_result,
        eval_case=eval_case,
        reason='Inference failed',
    )
    if self._eval_set_results_manager:
      self._eval_set_results_manager.save_eval_set_result(
          app_name=inference_result.app_name,
          eval_set_id=inference_result.eval_set_id,
          eval_case_results=[eval_case_result],
      )
    yield eval_case_result
    continue

  evaluation_tasks.append(run_evaluation(inference_result))
```
The new logic to handle failed inferences directly within the evaluate method is a good addition. I see you've performed manual E2E testing, which is great. To ensure long-term maintainability and prevent regressions, it would be highly beneficial to also add an automated unit test for this. The test could pass a list of InferenceResult objects with mixed success and failure statuses to evaluate and verify that the async generator yields the correct EvalCaseResult for each.
Extended coverage for the new failure-handling path in LocalEvalService.evaluate() by adding test_evaluate_skips_failed_inference_results. The test feeds a mixed set of InferenceResults (one FAILURE, one SUCCESS, one UNKNOWN) through the async generator and asserts:

- The failed inference yields an EvalCaseResult with final_eval_status == EvalStatus.NOT_EVALUATED and no metric data.
- Successful/unknown entries still run through the evaluator and produce passing metric results.
- EvalSetResultsManager.save_eval_set_result is invoked for each case, matching the runtime behavior.
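A condensed sketch of that test's shape, assuming the fixture names (eval_service, mock_eval_sets_manager, mock_eval_set_results_manager) and an EvaluateRequest shape matching the rest of test_local_eval_service.py; the actual test mixes three statuses, and only the failed entry is spelled out here:

```python
async def test_evaluate_skips_failed_inference_results(
    eval_service, mock_eval_sets_manager, mock_eval_set_results_manager, mocker
):
  failed = InferenceResult(
      app_name="test_app",
      eval_set_id="test_eval_set",
      eval_case_id="failed_case",
      inferences=None,
      session_id="session1",
      status=InferenceStatus.FAILURE,
      error_message="simulated inference failure",
  )
  mock_eval_case = mocker.MagicMock(spec=EvalCase)
  mock_eval_case.session_input = None
  mock_eval_sets_manager.get_eval_case.return_value = mock_eval_case

  evaluate_request = EvaluateRequest(
      inference_results=[failed],
      evaluate_config=EvaluateConfig(
          eval_metrics=[EvalMetric(metric_name="fake_metric", threshold=0.5)],
          parallelism=1,
      ),
  )
  # Drain the async generator and check the failed case's result.
  results = [r async for r in eval_service.evaluate(evaluate_request)]

  assert results[0].eval_id == "failed_case"
  assert results[0].final_eval_status == EvalStatus.NOT_EVALUATED
  assert results[0].overall_eval_metric_results == []
  # One save per yielded case, matching the runtime behavior.
  mock_eval_set_results_manager.save_eval_set_result.assert_called_once()
```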
```python
async def _safe_get_session_details(
    self, *, app_name: str, user_id: str, session_id: str
):
```
For better type safety and code clarity, please add a return type hint to the _safe_get_session_details method. Since Session is not imported in this file, you can use a forward reference string for the type hint.
```python
async def _safe_get_session_details(
    self, *, app_name: str, user_id: str, session_id: str
) -> Optional['Session']:
```
Added the type hint.
```python
async def test_evaluate_single_inference_result_handles_failed_inference(
    eval_service, mock_eval_sets_manager, mocker
):
  invocation = Invocation(
      user_content=genai_types.Content(
          parts=[genai_types.Part(text="test user content.")]
      ),
      final_response=genai_types.Content(
          parts=[genai_types.Part(text="test final response.")]
      ),
  )
  inference_result = InferenceResult(
      app_name="test_app",
      eval_set_id="test_eval_set",
      eval_case_id="case1",
      inferences=None,
      session_id="session1",
      status=InferenceStatus.FAILURE,
      error_message="simulated inference failure",
  )
  eval_metric = EvalMetric(metric_name="fake_metric", threshold=0.5)
  evaluate_config = EvaluateConfig(eval_metrics=[eval_metric], parallelism=1)

  mock_eval_case = mocker.MagicMock(spec=EvalCase)
  mock_eval_case.conversation = [invocation.model_copy(deep=True)]
  mock_eval_case.conversation_scenario = None
  mock_eval_case.session_input = None
  mock_eval_sets_manager.get_eval_case.return_value = mock_eval_case

  _, result = await eval_service._evaluate_single_inference_result(
      inference_result=inference_result, evaluate_config=evaluate_config
  )

  assert isinstance(result, EvalCaseResult)
  assert result.eval_id == "case1"
  assert result.final_eval_status == EvalStatus.NOT_EVALUATED
  assert result.overall_eval_metric_results == []
  assert result.eval_metric_result_per_invocation == []
  mock_eval_sets_manager.get_eval_case.assert_called_once_with(
      app_name="test_app", eval_set_id="test_eval_set", eval_case_id="case1"
  )
```
This is a great test for handling failed inferences! To improve test coverage for the new logic in _evaluate_single_inference_result, could you please add another test case? Specifically, one that covers the scenario where inference_result.status is not FAILURE (e.g., SUCCESS or UNKNOWN), but inference_result.inferences is None. This would ensure the second if block in _evaluate_single_inference_result is also tested.
Added another test, test_evaluate_single_inference_result_handles_missing_inferences, to cover this case.
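A sketch of what that companion test could look like, mirroring the fixtures of the failed-inference test above; the status value and exact assertions are assumptions, and the PR's actual test may differ:

```python
async def test_evaluate_single_inference_result_handles_missing_inferences(
    eval_service, mock_eval_sets_manager, mocker
):
  # Status is SUCCESS, not FAILURE, but the inference payload is missing.
  inference_result = InferenceResult(
      app_name="test_app",
      eval_set_id="test_eval_set",
      eval_case_id="case2",
      inferences=None,
      session_id="session2",
      status=InferenceStatus.SUCCESS,
  )
  evaluate_config = EvaluateConfig(
      eval_metrics=[EvalMetric(metric_name="fake_metric", threshold=0.5)],
      parallelism=1,
  )
  mock_eval_case = mocker.MagicMock(spec=EvalCase)
  mock_eval_case.session_input = None
  mock_eval_sets_manager.get_eval_case.return_value = mock_eval_case

  _, result = await eval_service._evaluate_single_inference_result(
      inference_result=inference_result, evaluate_config=evaluate_config
  )

  # The missing payload should be treated like a failure: no evaluation runs.
  assert result.final_eval_status == EvalStatus.NOT_EVALUATED
  assert result.overall_eval_metric_results == []
```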
```python
        eval_case=eval_case,
        reason='Inference failed',
    )
)
```
We don't need to do this, as we are already handling this case in the _evaluate_single_inference_result method. Please revert.
I've reverted.
```python
    yield eval_case_result
    continue

  evaluation_tasks.append(run_evaluation(inference_result))
```
Same, please revert to what we had earlier. This case is already handled by the _evaluate_single_inference_result method.
I've reverted.
```python
      user_id=user_id,
      reason='Missing inference payload',
  )
  return (inference_result, eval_case_result)
```
No need for a special case; add this to the existing `if inference_result.status == InferenceStatus.FAILURE:` check, i.e.:

```python
if inference_result.status == InferenceStatus.FAILURE or inference_result.inferences is None:
```
I've moved both cases under the same if statement.
```python
      inference_result=inference_result,
      eval_case=eval_case,
      user_id=user_id,
      reason='Inference status indicates failure',
```
Get the reason from inference_result.error_message.
The reason is now taken from the inference result: reason=inference_result.error_message.
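Putting the last two review comments together, the guard in _evaluate_single_inference_result presumably ends up looking something like this; the `or` fallback for an unset error_message is an assumption, not code shown in the PR:

```python
# Combined guard: a FAILURE status and a missing inference payload are
# handled the same way -- build a NOT_EVALUATED result instead of evaluating.
if (
    inference_result.status == InferenceStatus.FAILURE
    or inference_result.inferences is None
):
  eval_case_result = await self._build_not_evaluated_eval_case_result(
      inference_result=inference_result,
      eval_case=eval_case,
      user_id=user_id,
      # Reason comes from the inference result itself; the fallback is an
      # assumption for results that carry no error message.
      reason=inference_result.error_message or 'Inference not available',
  )
  return (inference_result, eval_case_result)
```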
```python
    *,
    inference_result: InferenceResult,
    eval_case: EvalCase,
    user_id: Optional[str] = None,
```
user_id should not be optional for this method.
I've made user_id mandatory.
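With user_id now required, the helper's signature would presumably read roughly as follows (the return type is an assumption):

```python
async def _build_not_evaluated_eval_case_result(
    self,
    *,
    inference_result: InferenceResult,
    eval_case: EvalCase,
    user_id: str,  # no longer Optional
    reason: str,
) -> EvalCaseResult:
  ...
```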
```python
def _resolve_user_id(self, eval_case: EvalCase) -> str:
  if eval_case.session_input and eval_case.session_input.user_id:
    return eval_case.session_input.user_id
  return 'test_user_id'
```
For now let's remove this method, as we won't need it.
I've removed the method.
```python
        app_name,
        user_id,
    )
    return None
```
We don't want to catch the NotFoundError actually; we want it to bubble up. Remove this method and call the following directly where you need it:

```python
return await self._session_service.get_session(
    app_name=app_name, user_id=user_id, session_id=session_id
)
```
I've stopped swallowing the raised error and now call the session service directly.
```python
    'Eval case `%s` marked as not evaluated: %s',
    inference_result.eval_case_id,
    reason,
)
```
Remove this log line. We would have already written a log line earlier anyway.
I've removed the log.
Thanks for the review @ankursharmas! I've fixed the code according to your comments; it is now a much cleaner and leaner PR.

Hey @ryanaiagent, let me know if you have any questions or need clarification, and I'd be glad to assist.
Link to Issue or Description of Change

Link to an existing issue: #2729.
Problem:

This fixes the issue of failed inferences when evaluating an eval set or test case. A failed inference would result in an InferenceResult with status FAILURE, but its inferences would remain None, which would raise an exception when trying to access the inferences themselves.

Solution:

As proposed by @surajksharma07, the approach is defense-in-depth:

1. Filter failed inferences in the evaluate() method. evaluate() now runs through all inference_results, and for every inference with status InferenceStatus.FAILURE it:
- logs the failure as a warning ('Skipping evaluation for eval case ...')
- tries to obtain the eval case and, if it is None, raises a NotFoundError
- builds a new templated eval_case_result for a case that hasn't been evaluated (due to the inference failure)
- saves the templated result through the existing mechanism (via save_eval_set_result)

Other inferences go through the existing happy-path code of evaluation_tasks.append(run_evaluation(inference_result)).

2. Add a safety check in _evaluate_single_inference_result(). Inferences are now protected by building an eval_case_result via a _build_not_evaluated_eval_case_result method. The reason populated in that function depends on the failure:
- If the inference resulted in InferenceStatus.FAILURE, the reason is 'Inference status indicates failure'.
- If the inferences are None, a warning is logged and the reason is 'Missing inference payload'.

Note: the _build_not_evaluated_eval_case_result method constructs an EvalCaseResult for cases that could not be evaluated, populating it with information about the user, session, and eval set/case, and setting EvalStatus.NOT_EVALUATED as the final_eval_status. Small utility methods _resolve_user_id and _safe_get_session_details try to gather this information, with fallbacks where it is unavailable. A minimal sketch follows.
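A minimal sketch of that helper as described above (before the review simplified it), assuming EvalCaseResult accepts the fields asserted by this PR's tests; the remaining field names, and where reason ends up, are guesses rather than the actual implementation:

```python
async def _build_not_evaluated_eval_case_result(
    self,
    *,
    inference_result: InferenceResult,
    eval_case: EvalCase,
    reason: str,
) -> EvalCaseResult:
  # Utility helpers with fallbacks, as described above (both were later
  # removed during review in favor of direct calls).
  user_id = self._resolve_user_id(eval_case)
  session_details = await self._safe_get_session_details(
      app_name=inference_result.app_name,
      user_id=user_id,
      session_id=inference_result.session_id,
  )
  # `reason` explains why the case was not evaluated; exactly where it is
  # recorded is not shown in the PR description.
  return EvalCaseResult(
      eval_set_id=inference_result.eval_set_id,
      eval_id=inference_result.eval_case_id,
      final_eval_status=EvalStatus.NOT_EVALUATED,
      overall_eval_metric_results=[],
      eval_metric_result_per_invocation=[],
      session_id=inference_result.session_id,
      session_details=session_details,
      user_id=user_id,
  )
```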
Testing Plan

I have added a test that recreates the error specified in the issue: test_evaluate_single_inference_result_handles_failed_inference under tests/unittests/evaluation/test_local_eval_service.py. This test creates an inference_result with None as inferences, calls _evaluate_single_inference_result, and asserts the results.

Unit Tests:

I have run the other unit tests as specified in CONTRIBUTING.md, and all the tests that pass on main pass here as well. The A2A unit tests do not pass (either on this branch or on main).

Manual End-to-End (E2E) Tests:

I have built the package via uv and tested it locally within the adk-quickstart repo on a small agent with several eval cases.