fix(eval): handle unevaluated final response v2 results #5728
pragnyanramtha wants to merge 12 commits into
Refreshed this branch with current Validation rerun:
Pushed The failure was from repository-wide hooks updating existing files outside this PR's evaluator patch:
Validation:
Hi @pragnyanramtha, thank you for your contribution! We appreciate you taking the time to submit this pull request. Your PR has been received by the team and is currently under review. We will provide feedback as soon as we have an update to share.
Hi @sasha-gitg, can you please review this?
I noticed the
Summary
Fixes a small aggregation edge case in FinalResponseMatchV2Evaluator: when every per-invocation result is skipped or not evaluated, the evaluator currently divides by zero while computing the overall score.

Root Cause
aggregate_invocation_results() filters out results whose score is None or whose eval_status is NOT_EVALUATED, but it then computes the overall score unconditionally, dividing by num_evaluated. If all judge samples fail to produce a usable score, num_evaluated remains 0 and evaluation crashes instead of returning a not-evaluated aggregate result. Other ADK evaluators handle this condition by returning overall_score=None and overall_eval_status=NOT_EVALUATED.

Change
Return an EvaluationResult with overall_score=None and overall_eval_status=NOT_EVALUATED when no FinalResponseMatchV2 invocation results are evaluable.

Validation
uv sync --extra test
uv run pytest tests/unittests/evaluation/test_final_response_match_v2.py

Result: 18 passed, 20 warnings.

Full unit suite was not run; this patch is limited to FinalResponseMatchV2 aggregation and its targeted unit test file.
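
For readers who want to see the shape of the guard, here is a minimal, self-contained sketch. It is not the code from this PR: the `EvalStatus`, `PerInvocationResult`, and `EvaluationResult` classes below are simplified stand-ins for the ADK types, the mean-of-scores aggregation and the `threshold` parameter are assumptions made for illustration, and only the early return on `num_evaluated == 0` reflects the behavior described above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EvalStatus(Enum):
    # Simplified stand-in for the ADK status enum (assumption).
    PASSED = 1
    FAILED = 2
    NOT_EVALUATED = 3


@dataclass
class PerInvocationResult:
    # Stand-in carrying only the two fields the aggregation inspects.
    score: Optional[float]
    eval_status: EvalStatus


@dataclass
class EvaluationResult:
    # Defaults represent a "not evaluated" aggregate result.
    overall_score: Optional[float] = None
    overall_eval_status: EvalStatus = EvalStatus.NOT_EVALUATED
    per_invocation_results: List[PerInvocationResult] = field(default_factory=list)


def aggregate_invocation_results(
    results: List[PerInvocationResult],
    threshold: float = 0.5,  # hypothetical pass threshold, for illustration only
) -> EvaluationResult:
    total = 0.0
    num_evaluated = 0
    for result in results:
        # Skip results that carry no usable score.
        if result.score is None or result.eval_status == EvalStatus.NOT_EVALUATED:
            continue
        total += result.score
        num_evaluated += 1

    # Guard: nothing was evaluable, so return a not-evaluated aggregate
    # instead of dividing by zero.
    if num_evaluated == 0:
        return EvaluationResult(per_invocation_results=list(results))

    overall = total / num_evaluated
    status = EvalStatus.PASSED if overall >= threshold else EvalStatus.FAILED
    return EvaluationResult(overall, status, list(results))


if __name__ == "__main__":
    skipped = [PerInvocationResult(None, EvalStatus.NOT_EVALUATED)]
    print(aggregate_invocation_results(skipped).overall_eval_status)
```

Without the guard, the all-skipped input in the usage lines would raise `ZeroDivisionError` at the division step; with it, the caller receives an aggregate whose score is `None` and whose status is `NOT_EVALUATED`.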