diff --git a/docs/sdk-evolution-agent-design.md b/docs/sdk-evolution-agent-design.md new file mode 100644 index 0000000..2b1bfb6 --- /dev/null +++ b/docs/sdk-evolution-agent-design.md @@ -0,0 +1,622 @@ +# SDK Evolution Agent Design + +This document describes how the SDK evolution example should work before adding +more implementation. It is intentionally more detailed than the user-facing run +guide in `docs/sdk-evolution-agent.md`. + +The core idea is that a dependency update is not enough evidence. The agent +must combine resolver facts, release notes, API shape, adapter behavior probes, +and real-runtime review before it recommends a lockfile change, adapter change, +or manual design stop. + +## Goals + +The SDK evolution agent should answer these questions for every run: + +- What package versions are installed, locked, and available upstream? +- Which packages does the resolver actually want to update? +- What changed in public API shape? +- What changed in documented behavior or product direction? +- Which adapter behavior contracts still pass on the candidate versions? +- Does the current `agent-runtime-kit` abstraction still preserve vendor + behavior? +- Is the safe next action a lock update, adapter update, docs/test update, + provider-specific extension, public API evolution, or manual design review? + +The agent must dogfood `agent-runtime-kit`: all AI reasoning stages run through +`AgentTask`, `RuntimeRegistry`, runtime adapters, output schemas, event sinks, +permission profiles, and `AgentResult`. Local shell, filesystem, package +manager, Git, and GitHub operations are allowed only for deterministic evidence +collection and mechanical changes. + +## Non-Goals + +The example should not become a generic dependency update bot. A generic bot +can answer "can the lockfile move?" This agent must answer "does the runtime +adapter contract still hold, and does the public SDK architecture still make +sense?" + +It should not hide vendor differences. 
If Claude adds task status events, Codex +changes sandbox semantics, or Antigravity changes model endpoint configuration, +the right output is explicit provider-specific evidence and possibly a +provider-specific extension, not a flattened common denominator. + +It should not require all vendor SDKs for normal package users. The example can +use `agent-runtime-kit[all]` for local research, but the package itself must keep +optional extras. + +## High-Level Flow + +```mermaid +flowchart TD + A["Start local command"] --> B["Collect deterministic evidence"] + B --> C["Resolve update candidates"] + C --> D["Inspect current and candidate APIs"] + D --> E["Collect changelog and release-note evidence"] + E --> F["Run adapter behavior probes"] + F --> G["Build evidence bundle"] + G --> H["Direction analysis through agent-runtime-kit"] + H --> I["Architecture decision and update plan through agent-runtime-kit"] + I --> J["Independent review through agent-runtime-kit"] + J --> K{"Gates pass?"} + K -- "No" --> L["Write report with manual review checklist"] + K -- "Yes" --> M["Apply safe implementation"] + M --> N["Run verification"] + N --> O["Promote updated state to current baseline"] + O --> P["Write report and optional draft PR"] +``` + +Step responsibilities: + +- **Start local command**: Parse the selected runtime, package filters, report + directory, refresh options, implementation flag, branch option, and draft PR + option. This step also establishes the run ID and local report directory. +- **Collect deterministic evidence**: Read local project state without using AI: + `pyproject.toml`, `uv.lock`, installed distributions, package metadata, + configured source hints, local environment facts, and supported auth + availability. This produces raw facts, not recommendations. +- **Resolve update candidates**: Run the targeted resolver preview with + freshness cutoffs removed. This step decides which packages are real update + candidates for the run. 
It should use resolver output rather than only PyPI + `latest` metadata, especially for prerelease packages. +- **Inspect current and candidate APIs**: Load API snapshot and diff artifacts + from the last update run, then focus new inspection on packages that the + resolver selected for update or packages whose evidence is missing, stale, or + incompatible with the current evidence schema. This step owns API snapshot and + API diff artifacts. If the evidence signature changes, the agent may need to + refresh the current-state snapshot or gather more current-state data before + comparing candidates. If an update candidate has no candidate API diff, the + run should not proceed to implementation. +- **Collect changelog and release-note evidence**: Fetch or read official + changelogs, release pages, docs changelogs, repository releases, and package + metadata links. This step records what changed according to the vendor and + explicitly marks missing or incomplete release-note coverage. +- **Run adapter behavior probes**: Execute deterministic unit probes, installed + SDK contract probes, and optional live probes. This step answers whether the + adapter behavior still holds, including permissions, sandbox/workspace + handling, streaming, structured output, MCP/tool support, auth discovery, and + session/resume behavior. +- **Build evidence bundle**: Normalize package facts, resolver facts, API + snapshots, API diffs, release-note evidence, behavior probe results, source + references, and uncertainty into a compact bundle for the AI stages. This step + should preserve provenance so later reasoning can be traced back to evidence. +- **Direction analysis through agent-runtime-kit**: Ask a runtime, via + `AgentTask`, to infer direction-of-travel themes from the evidence. This step + identifies whether changes look isolated or part of a broader SDK direction, + but it does not own the concrete implementation plan. 
+- **Architecture decision and update plan through agent-runtime-kit**: Ask a + runtime, via `AgentTask`, to turn direction analysis into the concrete plan: + adapter-only, test-only, docs-only, capability metadata change, + provider-specific extension, public API evolution, compatibility shim, + deprecation/migration, architectural rework, or `manual_design_required`. + This is the step responsible for saying what should be updated. +- **Independent review through agent-runtime-kit**: Run a separate reviewer task + through the runtime. The reviewer challenges evidence sufficiency, direction + inference, plan scope, vendor-specific capability preservation, and whether + tests, docs, and migration notes match the proposed change. +- **Gates pass?**: Apply deterministic pass/fail rules. The gates block + implementation when required API diffs are missing, release-note coverage is + missing, behavior probes fail or are skipped for required contracts, the + reviewer rejects the plan, recursive self-adaptation is unresolved, or manual + design is required. +- **Write report with manual review checklist**: If gates fail, write the local + report with the evidence bundle, analysis, decision, reviewer output, + uncertainty, blocked reasons, and the exact manual review questions. This is a + valid end state, not a failed run. +- **Apply safe implementation**: Apply only the changes allowed by the accepted + architecture decision and deterministic gates. This may include lockfile + updates, adapter changes, tests, docs, examples, compatibility shims, or report + changes. It must not implement changes that were classified as + `manual_design_required`. +- **Run verification**: Run the verification commands required by the + architecture decision. At minimum, this should cover formatting/linting, + typing, unit tests, lock checks, report generation checks, and any available + live smoke needed for the affected runtime behavior. 
+- **Promote updated state to current baseline**: After implementation and + verification pass, save the updated lock/package/API/release-note/probe state + as the new current-state baseline for the next run. This promotion should be + explicit, atomic, and tied to the verified commit or workspace state. Failed, + blocked, or manual-design-required runs must not replace the current baseline. +- **Write report and optional draft PR**: Write the final local report with + evidence, decisions, implementation summary, baseline-promotion result, test + results, uncertainty, and manual checklist. If explicitly configured and + authenticated, create or update a draft PR. This step must never auto-merge. + +Every box before direction analysis is deterministic. AI stages may interpret +evidence, but they should not invent evidence that was not collected. + +## Operating Modes + +The default command should be report-only: + +```bash +python -m examples.sdk_evolution_agent --runtime fake --refresh-preview +``` + +This mode collects evidence, writes artifacts, runs the analysis stages through +the selected runtime, and stops before editing the workspace. The fake runtime is +allowed only as a deterministic development harness. It proves the pipeline and +schemas, not the quality of AI reasoning. + +A real analysis run should select one configured runtime: + +```bash +python -m examples.sdk_evolution_agent --runtime claude-agent-sdk --refresh-preview +python -m examples.sdk_evolution_agent --runtime codex-agent-sdk --refresh-preview +python -m examples.sdk_evolution_agent --runtime antigravity-agent-sdk --refresh-preview +``` + +When `codex-agent-sdk` is selected for SDK update work, every AI-backed stage +should run on `gpt-5.5` with `reasoning_effort=xhigh`. This is a Codex runtime +policy, not a portable metadata field: Claude and Antigravity runs should not +receive a `gpt-5.5` model override. 
+ +Package filters narrow evidence collection for debugging, but normal evolution +runs should inspect all tracked packages: + +```bash +python -m examples.sdk_evolution_agent \ + --runtime antigravity-agent-sdk \ + --refresh-preview \ + --package claude-agent-sdk \ + --package openai-codex \ + --package openai-codex-cli-bin \ + --package google-antigravity +``` + +`--inspect-candidates` should be effectively always on. The CLI can keep the +flag for compatibility, but update candidates without candidate API snapshots +are not actionable. + +Implementation mode should remain explicitly gated: + +```bash +python -m examples.sdk_evolution_agent \ + --runtime antigravity-agent-sdk \ + --refresh-preview \ + --implementation-enabled +``` + +Even in implementation mode, deterministic gates decide whether edits are +allowed. Draft PR creation is separate and should only happen when the local Git +and GitHub environment is authenticated and explicitly configured with +`--draft-pr`. + +## Evidence Layers + +The report should clearly separate evidence layers. Mixing them together is how +bad conclusions slip in. + +### 1. Package and Resolver Evidence + +The agent checks: + +- `pyproject.toml` dependency declarations. +- `uv.lock` versions. +- Installed distributions in the local environment. +- PyPI metadata and recent releases. +- `uv lock --dry-run -P ...` output with freshness cutoffs removed. + +`uv lock --dry-run` is the source of truth for update candidates when it is +available. PyPI `latest` metadata is useful context, but it can be misleading +for prerelease packages. For example, a locked prerelease can be newer than the +stable value reported by package metadata. + +### 2. API Shape Evidence + +The agent should treat the lockfile as the current SDK baseline. If the active +Python environment has drifted from `uv.lock`, the agent inspects the locked +baseline in an isolated virtualenv instead of using the installed package. 
API +inspection artifacts are reusable evidence from the last update run when their +schema, lockfile version, and artifact hashes still match. A normal run starts +by loading the prior `api_snapshots/` and `api_diffs.json` artifacts, then +inspects only the packages that need fresh facts: + +- packages selected by the resolver for update, +- packages whose prior artifacts are missing, +- packages whose prior artifacts were produced by an older evidence schema, +- packages whose current locked or installed version no longer matches the + artifact baseline, +- packages needed to answer a specific adapter-compatibility question. + +For importable packages, snapshots record: + +- public member names, +- member kind, +- signature where Python introspection can provide one, +- defining module, +- import errors. + +This catches obvious adapter risks: + +- removed classes or functions, +- changed constructor signatures, +- changed enum or model surfaces, +- new provider-specific capabilities worth exposing. + +API shape is necessary but insufficient. It does not prove behavior. + +After a successful implementation, the candidate API snapshots and diffs that +were verified must be promoted to the current-state baseline. That ensures the +next run compares new upstream candidates against the SDK state that was +actually accepted, not against stale pre-update artifacts. + +If the evidence schema changes, promotion should include a schema refresh of the +current package state even when the package version did not change. Otherwise +future runs may compare candidate evidence against artifacts that no longer mean +the same thing. + +### 3. Changelog and Release-Note Evidence + +The agent should collect release-note context when a vendor publishes it. 
+ +| Package | Preferred source | Why it matters | +| --- | --- | --- | +| `claude-agent-sdk` | Python SDK `CHANGELOG.md` and Claude Agent SDK docs | Claude often ships behavioral changes around task progress, sessions, tools, permissions, and model support. | +| `openai-codex` | Codex SDK docs, Codex changelog, and `openai/codex` releases | Codex changes can involve sandboxing, working directories, remote execution, app-server behavior, and SDK maturity. | +| `openai-codex-cli-bin` | `openai/codex` releases and package metadata | The binary package is runtime infrastructure, so behavior can change even when the Python SDK surface does not. | +| `google-antigravity` | Antigravity changelog, repository, package metadata, examples, and public API snapshots | Antigravity release context may be product-level instead of package-version-specific, so the agent must preserve source coverage and uncertainty separately. | + +The report should preserve source references and a short excerpt or summary. If +release notes are unavailable, that absence is evidence and should increase +uncertainty. + +Primary sources should be recorded with URLs in `release_notes.json`: + +- `claude-agent-sdk`: `https://github.com/anthropics/claude-agent-sdk-python/blob/main/CHANGELOG.md` +- Claude Agent SDK docs: `https://code.claude.com/docs/en/agent-sdk/overview` +- Codex SDK docs: `https://developers.openai.com/codex/sdk` +- Codex changelog: `https://developers.openai.com/codex/changelog` +- Codex repository releases: `https://github.com/openai/codex/releases` +- Antigravity changelog: `https://antigravity.google/changelog` +- Antigravity repository: `https://github.com/google-antigravity/antigravity-sdk-python` + +If a package has no release-note source for the exact version interval, the +agent should still record what it checked and why the source was insufficient. 
+Fetched official sources with no package-version-specific entry are evidence +with explicit uncertainty; they are not the same as a collection failure. + +### 4. Behavior Probe Evidence + +Behavior probes test what signatures cannot show. They should be deterministic +where possible and optional-live where credentials are required. + +```mermaid +flowchart LR + A["Candidate versions installed"] --> B["Contract tests"] + A --> C["Adapter unit probes"] + A --> D["Optional live smoke"] + B --> E["behavior_probes.json"] + C --> E + D --> E + E --> F["Architecture decision gates"] +``` + +Behavior probes should cover these contracts: + +| Contract | Why API diffs are not enough | Example probe | +| --- | --- | --- | +| Request construction | Constructor signatures can stay stable while fields change meaning. | Assert adapter builds expected SDK options/config objects. | +| Permission mapping | Permission mode names can stay present while policy behavior changes. | Strict/default/permissive tests for each adapter. | +| Sandbox and workspace semantics | Behavior can shift across SDK or CLI layers without a Python signature change. | Codex sandbox enum and run argument contract tests, plus smoke where possible. | +| Streaming and event order | New message types may not break imports but may be dropped. | Feed fake vendor messages and assert emitted event order. | +| Structured output | Schema fields can exist but runtime may return prose or tool calls. | Live or fake structured-output task with schema validation. | +| Session/resume | Resume options can exist but behavior may change. | Fake SDK request shape plus optional live resume smoke. | +| MCP/tool support | MCP config may move from one module to another without a simple signature break. | Adapter MCP config tests and unsupported-feature assertions. | +| Auth discovery | Supported auth sources differ by vendor and may change independently. | Availability probes that report source without scraping credentials. 
| + +Behavior probe output should be a first-class report artifact, for example: + +```text +behavior_probes.json +behavior_diffs.json +``` + +Each probe result should include: + +- probe name, +- relevant package or adapter, +- command or test function, +- pass/fail/skip status, +- stdout/stderr summary, +- skipped reason when optional credentials are missing. + +`behavior_diffs.json` compares current-environment probes against +candidate-version probes for resolver-selected updates. Breaking candidate probe +changes block implementation deterministically before any local lock update. + +`behavior_probes.json` may include observed SDK fields or parameters that are +not part of the adapter contract. `behavior_diffs.json` compares the required +adapter contract, not every optional field. Public API and signature churn +remains visible in `api_diffs.json` and probe details, but it should only block +implementation when the required behavior contract fails or becomes ambiguous. + +### 5. Runtime-Generated Analysis + +After deterministic evidence is collected, the AI stages can interpret it: + +```mermaid +sequenceDiagram + participant CLI as Local CLI + participant Registry as RuntimeRegistry + participant Runtime as Selected runtime adapter + participant Model as Vendor agent runtime + + CLI->>Registry: resolve(runtime kind) + Registry->>Runtime: create adapter + CLI->>Runtime: AgentTask(direction-analysis) + Runtime->>Model: supported SDK call + Model-->>Runtime: structured AgentResult + Runtime-->>CLI: validated JSON + CLI->>Runtime: AgentTask(architecture-decision) + Runtime-->>CLI: validated JSON + CLI->>Runtime: AgentTask(review) + Runtime-->>CLI: validated JSON +``` + +The AI stages should receive compacted, source-referenced evidence. They should +not be asked to inspect the filesystem directly during report-only analysis. + +## Decision Gates + +The agent should fail closed. 
Implementation is blocked when: + +- the resolver reports an update but candidate API diffs are missing, +- release notes exist but were not collected, +- release notes are unavailable and the API or behavior evidence is ambiguous, +- behavior probes fail, +- behavior probes are skipped for a contract that is required for the proposed + implementation, +- the reviewer rejects the evidence or architecture decision, +- `manual_design_required` is true, +- recursive self-adaptation is required but no migration plan exists. + +An empty API diff can be valid. A missing API diff for an update candidate is +not valid. + +## Recursive Self-Adaptation + +The SDK evolution agent uses `agent-runtime-kit` to update `agent-runtime-kit`. +That makes runtime-layer changes recursive. + +```mermaid +flowchart TD + A["Upstream SDK change"] --> B["agent-runtime-kit adapter/public API change"] + B --> C{"Does SDK evolution agent use the changed contract?"} + C -- "No" --> D["Normal adapter/public API change"] + C -- "Yes" --> E["Self-adaptation required"] + E --> F["Update example runtime usage"] + E --> G["Update schemas and prompts"] + E --> H["Update behavior probes"] + E --> I["Run reviewer through updated runtime"] +``` + +If a change affects `AgentTask`, `AgentResult`, `RuntimeRegistry`, runtime +adapters, output schemas, event sinks, permission profiles, or typed unsupported +feature errors, the report must call this out explicitly. + +## Changelog Source Strategy + +The agent should prefer official and primary sources: + +- package repository changelog files, +- official release pages, +- official docs changelog pages, +- package metadata links, +- repository releases. + +It should not scrape private credentials or authenticated browser sessions to +obtain changelogs. If a source requires authentication, the report should mark +that source unavailable and explain the limitation. + +For `claude-agent-sdk`, the Python changelog should be checked first. 
Claude +Code and Agent SDK docs are useful supplemental direction-of-travel sources. + +For `openai-codex`, the Codex SDK docs and Codex changelog should be checked. +The `openai/codex` release page is also relevant because the Python SDK depends +on a bundled or pinned runtime. + +For `google-antigravity`, if the official changelog or repository does not have +a package-version-specific entry, the agent should not pretend the source is +complete. It should compensate with package metadata, examples, API snapshots, +adapter contract tests, and live smoke where credentials are available. + +## Behavior Probe Strategy + +Behavior probes should be split into three tiers. + +### Tier 1: Always-On Unit Probes + +These use fake SDK objects and do not require credentials. They should run in +normal CI. + +Examples: + +- Claude request shape and stream translation tests. +- Codex approval mode, sandbox, thread item, and tool audit tests. +- Antigravity permission/tool/MCP config tests. +- unsupported-feature errors for non-portable options. + +### Tier 2: Installed SDK Contract Probes + +These introspect real installed SDK packages but do not call models. + +Examples: + +- `ClaudeAgentOptions` still accepts fields the adapter builds. +- `openai_codex.AsyncThread.run` still exposes expected parameters. +- `google.antigravity.LocalAgentConfig` still exposes expected config fields. + +These are stronger than raw public snapshots because they encode adapter +assumptions. + +### Tier 3: Optional Live Probes + +These use local supported credentials and must never scrape credentials. + +Examples: + +- Claude one-turn smoke if Claude auth is configured. +- Codex one-turn smoke using provider-owned local auth. +- Antigravity structured-output smoke using API key or Google Application + Default Credentials. + +Live probes should be reported as pass/fail/skip. 
A skipped live probe should +not automatically block a docs-only or test-only change, but it should increase +uncertainty for runtime behavior changes. + +## Report Shape + +The report directory should include: + +```text +config.json +evidence.json +release_notes.json +api_snapshots/ +api_diffs.json +behavior_probes.json +behavior_diffs.json +current_state.json +direction_analysis.json +architecture_decision.json +implementation_summary.json +review.json +events.jsonl +report.md +``` + +`report.md` should summarize: + +- package and resolver status, +- release-note coverage, +- API diff count and affected packages, +- behavior probe status, +- current-state baseline promotion status, +- direction-of-travel themes, +- architecture decision, +- reviewer status, +- implementation result, +- uncertainty and manual review checklist. + +`current_state.json` should be the manifest that makes the next run +artifact-aware. It should record: + +- evidence schema version, +- generated timestamp, +- source run ID, +- commit SHA or explicit dirty-worktree marker, +- lockfile hash, +- package names and accepted current versions, +- paths or content hashes for current API snapshots, +- paths or content hashes for release-note evidence, +- paths or content hashes for behavior probe results, +- whether the baseline was promoted, refreshed, skipped, or blocked. + +Promotion rules should be conservative: + +- promote only after implementation and verification pass, +- do not promote failed, blocked, report-only, or manual-design-required runs as + the new current state, +- preserve the previous baseline so a bad promotion can be inspected, +- refresh the current-state baseline when the evidence schema changes, even if + package versions did not change, +- make the final report say exactly which artifacts became the new baseline. + +## Caveats and Concerns + +Changelogs are incomplete. They often omit small behavior changes and may lag +package releases. + +API snapshots are shallow. 
Python introspection can miss behavior encoded in +runtime binaries, generated models, callbacks, subprocesses, environment +variables, or remote services. + +Live probes are environment-sensitive. They prove that one local credential and +runtime setup worked at one time. They do not replace unit or contract probes. + +AI review can be overconfident or overcautious. The reviewer should challenge +evidence quality, but deterministic gates should own pass/fail decisions for +missing diffs, failed probes, and missing required release-note evidence. + +Provider release cadence differs. Claude may expose rich changelogs. Codex may +split behavior between SDK docs, changelog, GitHub releases, and CLI runtime. +Antigravity may expose less written release context. + +Prerelease handling matters. Resolver output should drive update candidates +because package metadata `latest` can point to a stable release while the lock +already contains a newer prerelease. + +## Alternatives Considered + +### API Diffs Only + +Rejected. API diffs catch import and signature drift, but they do not prove +behavioral compatibility. This is the current weak point. + +### Changelogs Only + +Rejected. Changelogs are useful direction evidence, but they are not complete +and cannot prove local adapter behavior. + +### Run Full Live Agents For Every Provider Every Time + +Rejected as the default. It is too credential-dependent and would make local +runs brittle. Live probes should be optional and reported clearly. + +### Dependabot-Style Lock Updates + +Rejected. The goal is architectural evolution, not generic dependency freshness. +The agent must reason about provider-specific runtime capabilities and adapter +contracts. + +### Lowest-Common-Denominator Runtime Abstraction + +Rejected. The package exists to provide a clean Python API while preserving +vendor-specific capabilities, not to erase them. + +### Separate Agents Per Provider Only + +Partially useful but not sufficient. 
Provider-specific probes are valuable, but +the top-level agent still needs a cross-provider architecture view so public API +changes do not accidentally favor one runtime and flatten another. + +## Implemented Artifact Contract + +The example implements the deterministic evidence artifacts described above: + +- `release_notes.json` records official source checks and whether matching + version evidence was found, missing, or unavailable. +- `behavior_probes.json` records current and candidate adapter-contract probes. +- `behavior_diffs.json` records behavior differences between current and + candidate probes. +- `current_state.json` records the run baseline, lockfile hash, accepted + package versions, artifact hashes, and promotion status. + +The implementation path is gated by deterministic checks before the local +lockfile update runs. Missing candidate API diffs, unavailable required +release-note evidence, breaking behavior diffs, reviewer rejection, +`manual_design_required`, and unresolved recursive self-adaptation all block +implementation. When implementation is allowed, the example applies the +resolver-selected SDK lock update locally, runs verification, writes the report +artifacts, commits them, pushes the branch, and opens a draft PR when configured. diff --git a/docs/sdk-evolution-agent.md b/docs/sdk-evolution-agent.md index 263b439..950e00e 100644 --- a/docs/sdk-evolution-agent.md +++ b/docs/sdk-evolution-agent.md @@ -4,6 +4,10 @@ The SDK evolution agent is a local dogfood workflow for keeping agent-runtime-kit aligned with Claude Agent SDK, OpenAI Codex SDK, and Google Antigravity SDK as those upstream packages evolve. +For the intended architecture, evidence contract, behavior probe strategy, +changelog strategy, caveats, and alternatives, see +[`docs/sdk-evolution-agent-design.md`](sdk-evolution-agent-design.md). 
+ Run it from the repository: ```bash @@ -32,6 +36,13 @@ directory is created with private permissions before the Codex runtime starts; authenticate that Codex home through supported Codex login/API-key/access-token flows before using it for real Codex-backed runs. +Codex-backed SDK evolution runs explicitly choose `gpt-5.5` with +`reasoning_effort=xhigh` for the AI stages that analyze direction, decide the +update plan, implement allowed changes, and review the result. This model policy +is applied only to `codex-agent-sdk`; Claude and Antigravity runs keep their +provider-native model selection because `gpt-5.5` is not a valid model override +for those adapters. + For Antigravity, local auth can use `GEMINI_API_KEY` / `GOOGLE_API_KEY` or Google Application Default Credentials. ADC runs use Vertex AI config; provide a project through ADC, `GOOGLE_CLOUD_PROJECT`, or `GCLOUD_PROJECT`, and optionally @@ -46,8 +57,12 @@ Each run writes a timestamped directory under `reports/sdk-evolution/` with: - `config.json` - `evidence.json` +- `release_notes.json` - `api_snapshots/` - `api_diffs.json` +- `behavior_probes.json` +- `behavior_diffs.json` +- `current_state.json` - `direction_analysis.json` - `architecture_decision.json` - `implementation_summary.json` @@ -56,7 +71,8 @@ Each run writes a timestamped directory under `reports/sdk-evolution/` with: - `report.md` The report separates deterministic facts from runtime-generated analysis and -calls out uncertainty, recursive self-adaptation impact, implementation status, +calls out uncertainty, release-note coverage, API diffs, behavior diffs, +baseline promotion, recursive self-adaptation impact, implementation status, test results, reviewer output, and manual review items. ## Upstream Freshness @@ -76,10 +92,28 @@ cutoff variables must not hide candidate releases. ## Candidate API Inspection -By default, the command snapshots SDK APIs importable in the current -environment. 
Use `--inspect-candidates` to install latest candidate SDK versions -in temporary isolated virtualenvs for API snapshots and diffs. This avoids -mutating the project lockfile or working environment. +The command treats `uv.lock` as the current baseline. If the active `.venv` +contains a different installed version, the agent inspects the locked baseline +in a temporary isolated virtualenv instead of trusting the drifted environment. +When a refresh preview is available, package update candidates come from the +resolver's `uv lock --dry-run -P ...` output, not only from PyPI's `latest` +metadata. For each resolver update candidate, the agent installs the target +version in a temporary isolated virtualenv and writes an API snapshot plus +`api_diffs.json` entry. This avoids false downgrade diffs for packages whose +locked prerelease is newer than PyPI's stable latest field. Candidate inspection +is always enabled for update candidates; `--inspect-candidates` remains accepted +only for CLI compatibility. + +If `uv lock --dry-run -P ...` reports an SDK update but the run cannot produce a +candidate-version API diff for that package, implementation is blocked and the +architecture decision is marked `manual_design_required`. An empty added / +removed / changed diff is valid; a missing diff object is not. + +Behavior probes intentionally separate observed SDK surface churn from adapter +contract breakage. `behavior_probes.json` records fields and parameters seen in +current and candidate packages, while `behavior_diffs.json` compares the +required adapter contract. Optional field changes remain visible in the report +and API diffs, but only breaking adapter-contract diffs block implementation. 
## Implementation Gates @@ -93,6 +127,9 @@ Implementation is still blocked when: - the architecture decision sets `manual_design_required`, - the reviewer rejects the evidence or design, +- a resolver-selected update lacks a candidate API diff, +- required release-note evidence could not be collected, +- candidate behavior probes show a breaking adapter-contract difference, - required structured output or permission behavior is unsupported by the selected runtime, - recursive self-adaptation is required but no safe migration plan exists. @@ -114,8 +151,13 @@ python -m examples.sdk_evolution_agent \ --implementation-enabled \ --create-branch \ --branch-name sdk-evolution-update \ + --pr-base main \ --draft-pr ``` +When `--draft-pr` is set, the agent stages `uv.lock` and the run report +directory, commits them with `--commit-message`, pushes the branch, and opens a +draft PR with `gh`. It never auto-merges. + The command uses local Git and `gh` authentication. It never auto-merges, auto-publishes, or scrapes unsupported credentials. 
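The draft PR sequence described above (stage, commit, push, open a draft PR) can be sketched as an ordered command plan. This is an illustration only, not the contents of `pr.py`: the `draft_pr_commands` helper is hypothetical, the remote name `origin` is an assumption, and the exact arguments the example passes to `git` and `gh` may differ. The sketch only builds the commands and does not execute them.

```python
from __future__ import annotations


def draft_pr_commands(
    *,
    report_dir: str,
    branch: str,
    commit_message: str,
    title: str,
    body: str,
    base: str | None = None,
    remote: str = "origin",  # assumption: the push remote is named "origin"
) -> list[tuple[str, ...]]:
    """Return the git/gh commands for a draft PR, in order, without running them."""
    create = [
        "gh", "pr", "create", "--draft",
        "--title", title,
        "--body", body,
        "--head", branch,
    ]
    if base:
        # Only pass --base when the caller chose one; gh otherwise uses the
        # repository default branch.
        create += ["--base", base]
    return [
        # Stage only the lockfile and the run report directory.
        ("git", "add", "--", "uv.lock", report_dir),
        ("git", "commit", "-m", commit_message),
        ("git", "push", "--set-upstream", remote, branch),
        tuple(create),
    ]
```

Returning the plan as data keeps the step order easy to review and test, and makes it straightforward to confirm that nothing in the sequence merges or publishes.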
diff --git a/examples/sdk_evolution_agent/behavior.py b/examples/sdk_evolution_agent/behavior.py new file mode 100644 index 0000000..2847111 --- /dev/null +++ b/examples/sdk_evolution_agent/behavior.py @@ -0,0 +1,484 @@ +"""Behavior and adapter-contract probes for SDK evolution runs.""" + +from __future__ import annotations + +import importlib +import importlib.metadata +import inspect +import json +import subprocess +import sys +import tempfile +import textwrap +from collections.abc import Mapping, Sequence +from pathlib import Path +from typing import Any + +from examples.sdk_evolution_agent.models import BehaviorDiff, BehaviorProbeResult +from examples.sdk_evolution_agent.snapshots import DEFAULT_MODULES + + +def collect_behavior_evidence( + packages: Sequence[Mapping[str, object]], + update_versions: Mapping[str, str], +) -> dict[str, Any]: + """Collect current/candidate behavior probes and compare them.""" + + results: list[BehaviorProbeResult] = [] + for package in packages: + name = str(package.get("name") or "") + if not name: + continue + locked_version = _string_or_none(package.get("locked_version")) + installed_version = _string_or_none(package.get("installed_version")) + current_version = locked_version or installed_version + if locked_version and installed_version and locked_version != installed_version: + results.extend(probe_candidate_in_venv(name, locked_version, scope="current-baseline")) + else: + results.extend(probe_current_package(name, version=current_version)) + candidate = update_versions.get(name) + if candidate: + results.extend(probe_candidate_in_venv(name, candidate, scope="candidate")) + diffs = diff_behavior_results(results) + return { + "results": [result for result in results], + "diffs": [diff for diff in diffs], + "summary": summarize_behavior(diffs), + } + + +def probe_current_package( + package: str, + *, + version: str | None = None, +) -> tuple[BehaviorProbeResult, ...]: + """Run behavior probes against the current Python 
environment.""" + + return tuple(_probe_package(package, version=version, scope="current-environment")) + + +def probe_candidate_in_venv( + package: str, + version: str, + *, + scope: str = "candidate", + python: str = sys.executable, + timeout: int = 300, +) -> tuple[BehaviorProbeResult, ...]: + """Run behavior probes against a candidate package in an isolated virtualenv.""" + + with tempfile.TemporaryDirectory(prefix="ark-sdk-behavior-") as directory: + venv = Path(directory) / ".venv" + subprocess.run((python, "-m", "venv", str(venv)), check=True, timeout=timeout) + bin_dir = "Scripts" if sys.platform == "win32" else "bin" + venv_python = venv / bin_dir / "python" + subprocess.run( + (str(venv_python), "-m", "pip", "install", f"{package}=={version}"), + check=True, + text=True, + capture_output=True, + timeout=timeout, + ) + completed = subprocess.run( + (str(venv_python), "-c", _PROBE_SCRIPT, package, version, scope), + check=True, + text=True, + capture_output=True, + timeout=timeout, + ) + raw = json.loads(completed.stdout) + return tuple(BehaviorProbeResult(**item) for item in raw) + + +def diff_behavior_results(results: Sequence[BehaviorProbeResult]) -> tuple[BehaviorDiff, ...]: + """Compare current and candidate behavior probes for each package/probe.""" + + grouped: dict[tuple[str, str], dict[str, BehaviorProbeResult]] = {} + for result in results: + grouped.setdefault((result.package, result.probe), {})[result.scope] = result + diffs: list[BehaviorDiff] = [] + for (package, probe), scopes in sorted(grouped.items()): + before = scopes.get("current-baseline") or scopes.get("current-environment") + after = scopes.get("candidate") or scopes.get("isolated-venv") + if before is None or after is None: + continue + if before.status == after.status and _contract_details(before) == _contract_details(after): + severity = "none" + summary = "No behavior contract difference detected." 
+ elif before.status == "pass" and after.status != "pass": + severity = "breaking" + summary = f"Candidate probe changed from pass to {after.status}." + elif before.status != after.status: + severity = "changed" + summary = f"Probe status changed from {before.status} to {after.status}." + else: + severity = "changed" + summary = "Probe details changed while status stayed the same." + diffs.append( + BehaviorDiff( + package=package, + from_version=before.version, + to_version=after.version, + probe=probe, + severity=severity, + summary=summary, + before_status=before.status, + after_status=after.status, + ) + ) + return tuple(diffs) + + +def summarize_behavior(diffs: Sequence[BehaviorDiff]) -> dict[str, Any]: + """Return a compact behavior summary for reports and gates.""" + + breaking = [diff for diff in diffs if diff.severity == "breaking"] + changed = [diff for diff in diffs if diff.severity == "changed"] + return { + "breaking_count": len(breaking), + "changed_count": len(changed), + "unchanged_count": len([diff for diff in diffs if diff.severity == "none"]), + "status": "fail" if breaking else "changed" if changed else "pass", + } + + +def _probe_package( + package: str, + *, + version: str | None, + scope: str, +) -> tuple[BehaviorProbeResult, ...]: + if package == "claude-agent-sdk": + return (_probe_claude(version=version, scope=scope),) + if package == "openai-codex": + return (_probe_codex(version=version, scope=scope),) + if package == "openai-codex-cli-bin": + return (_probe_codex_cli_bin(version=version, scope=scope),) + if package == "google-antigravity": + return (_probe_antigravity(version=version, scope=scope),) + return ( + BehaviorProbeResult( + package=package, + version=version, + scope=scope, + probe="package-import", + status="skip", + summary="No behavior probe is defined for this package.", + ), + ) + + +def _probe_claude(*, version: str | None, scope: str) -> BehaviorProbeResult: + package = "claude-agent-sdk" + try: + module = 
importlib.import_module("claude_agent_sdk") + options_cls = module.ClaudeAgentOptions + except Exception as exc: + return _failed(package, version, scope, "adapter-contract", exc) + fields = _fields(options_cls) + expected = { + "model", + "allowed_tools", + "disallowed_tools", + "permission_mode", + "system_prompt", + "cwd", + "mcp_servers", + "resume", + "env", + "max_budget_usd", + "output_format", + } + missing = sorted(expected - fields) + return BehaviorProbeResult( + package=package, + version=version, + scope=scope, + probe="adapter-contract", + status="fail" if missing else "pass", + summary=( + "ClaudeAgentOptions exposes required adapter fields." + if not missing + else "ClaudeAgentOptions is missing required adapter fields." + ), + details={"fields": sorted(fields), "required_fields": sorted(expected), "missing": missing}, + ) + + +def _probe_codex(*, version: str | None, scope: str) -> BehaviorProbeResult: + package = "openai-codex" + try: + module = importlib.import_module("openai_codex") + run_params = set(inspect.signature(module.AsyncThread.run).parameters) + start_params = set(inspect.signature(module.AsyncCodex.thread_start).parameters) + except Exception as exc: + return _failed(package, version, scope, "adapter-contract", exc) + expected_run = {"cwd", "model", "approval_mode", "sandbox", "output_schema", "effort"} + expected_start = {"developer_instructions", "cwd", "model", "approval_mode", "sandbox"} + missing_run = sorted(expected_run - run_params) + missing_start = sorted(expected_start - start_params) + missing = missing_run + [f"thread_start.{item}" for item in missing_start] + return BehaviorProbeResult( + package=package, + version=version, + scope=scope, + probe="adapter-contract", + status="fail" if missing else "pass", + summary=( + "Codex thread APIs expose required adapter parameters." + if not missing + else "Codex thread APIs are missing required adapter parameters." 
+ ), + details={ + "run_params": sorted(run_params), + "start_params": sorted(start_params), + "required_run_params": sorted(expected_run), + "required_start_params": sorted(expected_start), + "missing": missing, + }, + ) + + +def _probe_codex_cli_bin(*, version: str | None, scope: str) -> BehaviorProbeResult: + package = "openai-codex-cli-bin" + try: + installed = importlib.metadata.version(package) + except Exception as exc: + return _failed(package, version, scope, "binary-distribution", exc) + return BehaviorProbeResult( + package=package, + version=version or installed, + scope=scope, + probe="binary-distribution", + status="pass", + summary="Codex CLI binary distribution metadata is available.", + details={"installed_version": installed}, + ) + + +def _probe_antigravity(*, version: str | None, scope: str) -> BehaviorProbeResult: + package = "google-antigravity" + try: + importlib.import_module(DEFAULT_MODULES[package]) + importlib.import_module("google.antigravity.types") + importlib.import_module("google.antigravity.agent") + importlib.import_module("google.antigravity.hooks.policy") + config_module = importlib.import_module( + "google.antigravity.connections.local.local_connection_config" + ) + config_cls = config_module.LocalAgentConfig + except Exception as exc: + return _failed(package, version, scope, "adapter-contract", exc) + fields = _fields(config_cls) + expected = { + "model", + "api_key", + "vertex", + "project", + "location", + "system_instructions", + "capabilities", + "policies", + "workspaces", + "conversation_id", + "save_dir", + "app_data_dir", + "response_schema", + "mcp_servers", + } + missing = sorted(expected - fields) + return BehaviorProbeResult( + package=package, + version=version, + scope=scope, + probe="adapter-contract", + status="fail" if missing else "pass", + summary=( + "Antigravity LocalAgentConfig exposes required adapter fields." + if not missing + else "Antigravity LocalAgentConfig is missing required adapter fields." 
+ ), + details={"fields": sorted(fields), "required_fields": sorted(expected), "missing": missing}, + ) + + +def _fields(cls: Any) -> set[str]: + if hasattr(cls, "model_fields"): + return set(cls.model_fields) + if hasattr(cls, "__dataclass_fields__"): + return set(cls.__dataclass_fields__) + try: + return set(inspect.signature(cls).parameters) + except (TypeError, ValueError): + return set() + + +def _contract_details(result: BehaviorProbeResult) -> dict[str, Any]: + if result.probe != "adapter-contract": + return result.details + details = result.details + if "missing" not in details: + return details + contract: dict[str, Any] = {"missing": sorted(details.get("missing") or [])} + if "required_fields" in details: + contract["required_fields"] = sorted(details.get("required_fields") or []) + if "required_run_params" in details: + contract["required_run_params"] = sorted(details.get("required_run_params") or []) + if "required_start_params" in details: + contract["required_start_params"] = sorted(details.get("required_start_params") or []) + return contract + + +def _failed( + package: str, + version: str | None, + scope: str, + probe: str, + exc: Exception, +) -> BehaviorProbeResult: + return BehaviorProbeResult( + package=package, + version=version, + scope=scope, + probe=probe, + status="fail", + summary=str(exc), + details={"error": str(exc)}, + ) + + +def _string_or_none(value: object) -> str | None: + if value is None: + return None + text = str(value) + return text or None + + +_PROBE_SCRIPT = textwrap.dedent( + """ + import importlib + import importlib.metadata + import inspect + import json + import sys + + package, version, scope = sys.argv[1:4] + + def fields(cls): + if hasattr(cls, "model_fields"): + return set(cls.model_fields) + if hasattr(cls, "__dataclass_fields__"): + return set(cls.__dataclass_fields__) + try: + return set(inspect.signature(cls).parameters) + except (TypeError, ValueError): + return set() + + def failed(probe, exc): + return { + 
"package": package, + "version": version, + "scope": scope, + "probe": probe, + "status": "fail", + "summary": str(exc), + "details": {"error": str(exc)}, + } + + def result(probe, status, summary, details): + return { + "package": package, + "version": version, + "scope": scope, + "probe": probe, + "status": status, + "summary": summary, + "details": details, + } + + try: + if package == "claude-agent-sdk": + module = importlib.import_module("claude_agent_sdk") + option_fields = fields(getattr(module, "ClaudeAgentOptions")) + expected = { + "model", "allowed_tools", "disallowed_tools", "permission_mode", + "system_prompt", "cwd", "mcp_servers", "resume", "env", + "max_budget_usd", "output_format", + } + missing = sorted(expected - option_fields) + payload = [result( + "adapter-contract", + "fail" if missing else "pass", + "ClaudeAgentOptions exposes required adapter fields." if not missing + else "ClaudeAgentOptions is missing required adapter fields.", + { + "fields": sorted(option_fields), + "required_fields": sorted(expected), + "missing": missing, + }, + )] + elif package == "openai-codex": + module = importlib.import_module("openai_codex") + run_params = set(inspect.signature(module.AsyncThread.run).parameters) + start_params = set(inspect.signature(module.AsyncCodex.thread_start).parameters) + expected_run = {"cwd", "model", "approval_mode", "sandbox", "output_schema", "effort"} + expected_start = {"developer_instructions", "cwd", "model", "approval_mode", "sandbox"} + missing_run = sorted(expected_run - run_params) + missing_start = sorted(expected_start - start_params) + missing = missing_run + [f"thread_start.{item}" for item in missing_start] + payload = [result( + "adapter-contract", + "fail" if missing else "pass", + "Codex thread APIs expose required adapter parameters." 
if not missing + else "Codex thread APIs are missing required adapter parameters.", + { + "run_params": sorted(run_params), + "start_params": sorted(start_params), + "required_run_params": sorted(expected_run), + "required_start_params": sorted(expected_start), + "missing": missing, + }, + )] + elif package == "openai-codex-cli-bin": + installed = importlib.metadata.version(package) + payload = [result( + "binary-distribution", + "pass", + "Codex CLI binary distribution metadata is available.", + {"installed_version": installed}, + )] + elif package == "google-antigravity": + importlib.import_module("google.antigravity") + importlib.import_module("google.antigravity.types") + importlib.import_module("google.antigravity.agent") + importlib.import_module("google.antigravity.hooks.policy") + config_module = importlib.import_module( + "google.antigravity.connections.local.local_connection_config" + ) + config_fields = fields(getattr(config_module, "LocalAgentConfig")) + expected = { + "model", "api_key", "vertex", "project", "location", + "system_instructions", "capabilities", "policies", "workspaces", + "conversation_id", "save_dir", "app_data_dir", "response_schema", + "mcp_servers", + } + missing = sorted(expected - config_fields) + payload = [result( + "adapter-contract", + "fail" if missing else "pass", + "Antigravity LocalAgentConfig exposes required adapter fields." 
if not missing + else "Antigravity LocalAgentConfig is missing required adapter fields.", + { + "fields": sorted(config_fields), + "required_fields": sorted(expected), + "missing": missing, + }, + )] + else: + payload = [result("package-import", "skip", "No behavior probe is defined.", {})] + except Exception as exc: + payload = [failed("adapter-contract", exc)] + + print(json.dumps(payload, sort_keys=True)) + """ +).strip() diff --git a/examples/sdk_evolution_agent/cli.py b/examples/sdk_evolution_agent/cli.py index 14e57d7..d0a6e1c 100644 --- a/examples/sdk_evolution_agent/cli.py +++ b/examples/sdk_evolution_agent/cli.py @@ -3,17 +3,22 @@ from __future__ import annotations import argparse +import re +from dataclasses import replace from datetime import datetime, timezone from pathlib import Path from typing import Any from agent_runtime_kit import AgentRuntime, RuntimeRegistry +from examples.sdk_evolution_agent.behavior import collect_behavior_evidence from examples.sdk_evolution_agent.collectors import ( CommandRunner, PypiClient, collect_evidence, + run_lock_update, run_verification_commands, ) +from examples.sdk_evolution_agent.current_state import build_current_state from examples.sdk_evolution_agent.events import JsonlEventSink from examples.sdk_evolution_agent.models import ( DEFAULT_PACKAGES, @@ -21,7 +26,15 @@ RunOptions, to_jsonable, ) -from examples.sdk_evolution_agent.pr import build_draft_pr_body, create_branch, create_draft_pr +from examples.sdk_evolution_agent.pr import ( + build_draft_pr_body, + commit_staged, + create_branch, + create_draft_pr, + push_branch, + stage_paths, +) +from examples.sdk_evolution_agent.release_notes import collect_release_notes from examples.sdk_evolution_agent.report import write_run_report from examples.sdk_evolution_agent.snapshots import ( diff_snapshot_groups, @@ -34,6 +47,13 @@ run_analysis_pipeline, ) +DEFAULT_VERIFICATION_COMMANDS = ( + "uv run ruff check .", + "uv run mypy", + "uv run pytest", + "uv lock --check", 
+) + async def main(argv: list[str] | None = None) -> int: """Parse CLI args and run the agent.""" @@ -78,11 +98,21 @@ def parse_args(argv: list[str] | None = None) -> RunOptions: parser.add_argument( "--inspect-candidates", action="store_true", - help="Inspect latest candidate SDK versions in temporary virtualenvs.", + default=True, + help=( + "Inspect latest candidate SDK versions in temporary virtualenvs. " + "Always enabled for update candidates; accepted for compatibility." + ), ) parser.add_argument("--create-branch", action="store_true", help="Create a local branch first.") parser.add_argument("--branch-name", help="Branch name for optional branch creation.") parser.add_argument("--draft-pr", action="store_true", help="Create a draft PR with gh.") + parser.add_argument("--pr-base", help="Base branch for optional draft PR creation.") + parser.add_argument( + "--commit-message", + default="Run SDK evolution update", + help="Commit message for optional autonomous SDK update PR.", + ) parser.add_argument( "--pr-title", default="Adapt agent-runtime-kit to upstream SDK evolution", @@ -100,6 +130,8 @@ def parse_args(argv: list[str] | None = None) -> RunOptions: create_branch=args.create_branch, branch_name=args.branch_name, draft_pr=args.draft_pr, + pr_base=args.pr_base, + commit_message=args.commit_message, pr_title=args.pr_title, ) @@ -114,6 +146,7 @@ async def run_agent( ) -> Path: """Run the full local SDK evolution workflow.""" + options = replace(options, inspect_candidates=True) run_id = datetime.now(tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ") report_root = (options.workspace / options.report_dir / run_id).resolve() event_log_path = report_root / "events.jsonl" @@ -129,6 +162,18 @@ async def run_agent( event_sink=event_sink, ) selected_runtime = runtime or resolve_runtime(options.runtime, registry=registry) + pre_run_results: list[dict[str, Any]] = [] + if options.create_branch and options.branch_name: + branch_result = create_branch( + options.workspace, + 
options.branch_name, + command_runner=command_runner, + ) + pre_run_results.append(to_jsonable(branch_result)) + if branch_result.returncode != 0: + raise RuntimeError( + f"failed to create branch {options.branch_name}: {branch_result.stderr}" + ) evidence = collect_evidence( options.workspace, packages=options.packages, @@ -136,12 +181,20 @@ async def run_agent( pypi_client=pypi_client, command_runner=command_runner, ) - snapshots = _collect_snapshots(evidence, inspect_candidates=options.inspect_candidates) + update_versions = _refresh_update_versions(evidence) + snapshots = _collect_snapshots(evidence) api_diffs = [to_jsonable(diff) for diff in diff_snapshot_groups(snapshots)] + release_notes = [ + to_jsonable(item) + for item in collect_release_notes(evidence.get("packages", []), update_versions) + ] + behavior = to_jsonable(collect_behavior_evidence(evidence.get("packages", []), update_versions)) direction, architecture, review = await run_analysis_pipeline( selected_runtime, evidence=evidence, api_diffs=api_diffs, + release_notes=release_notes, + behavior=behavior, context=RunContext( run_id=context.run_id, workspace=context.workspace, @@ -161,76 +214,293 @@ async def run_agent( review=review, context=context, ) + implementation.setdefault("verification_results", []).extend(pre_run_results) config = to_jsonable(options) config["run_id"] = run_id config["event_log_path"] = str(context.event_log_path) - report_path = write_run_report( + + if options.implementation_enabled and implementation.get("allowed"): + implementation = _run_local_sdk_update( + options, + update_versions=update_versions, + implementation=implementation, + command_runner=command_runner, + ) + + promoted = bool(implementation.get("applied")) and _verification_passed(implementation) + current_state: dict[str, Any] = { + "promotion": { + "promoted": False, + "status": "pending-report-write", + } + } + report_path = _write_full_report( context, config=config, evidence=evidence, 
snapshots=[to_jsonable(snapshot) for snapshot in snapshots], api_diffs=api_diffs, + release_notes=release_notes, + behavior=behavior, + current_state=current_state, direction=direction, architecture=architecture, implementation=implementation, review=review, ) - optional_results_changed = False - if options.implementation_enabled and implementation.get("applied"): - verification_results = run_verification_commands( - options.workspace, - tuple(str(item) for item in architecture.get("verification_commands", [])), - command_runner=command_runner, - ) - implementation.setdefault("verification_results", []).extend( - to_jsonable(verification_results) - ) - optional_results_changed = True - if options.create_branch and options.branch_name: - branch_result = create_branch( - options.workspace, - options.branch_name, - command_runner=command_runner, - ) - implementation.setdefault("verification_results", []).append(to_jsonable(branch_result)) - optional_results_changed = True + current_state = build_current_state( + context, + promoted=promoted, + status="promoted" if promoted else str(implementation.get("blocked_reason") or "skipped"), + implementation=implementation, + ) + report_path = _write_full_report( + context, + config=config, + evidence=evidence, + snapshots=[to_jsonable(snapshot) for snapshot in snapshots], + api_diffs=api_diffs, + release_notes=release_notes, + behavior=behavior, + current_state=current_state, + direction=direction, + architecture=architecture, + implementation=implementation, + review=review, + ) + if options.draft_pr: - body = build_draft_pr_body(report_path.read_text(encoding="utf-8")) - pr_result = create_draft_pr( + git_results = _create_autonomous_pr( options.workspace, - title=options.pr_title, - body=body, + report_path=report_path, + options=options, command_runner=command_runner, ) - implementation.setdefault("verification_results", []).append(to_jsonable(pr_result)) - optional_results_changed = True - else: - body = None - if 
optional_results_changed: - report_path = write_run_report( + implementation.setdefault("verification_results", []).extend(git_results) + report_path = _write_full_report( context, config=config, evidence=evidence, snapshots=[to_jsonable(snapshot) for snapshot in snapshots], api_diffs=api_diffs, + release_notes=release_notes, + behavior=behavior, + current_state=current_state, direction=direction, architecture=architecture, implementation=implementation, review=review, - pr_body=body, + ) + _commit_final_autonomous_pr_report( + options.workspace, + report_path=report_path, + options=options, + command_runner=command_runner, ) return report_path -def _collect_snapshots(evidence: dict[str, Any], *, inspect_candidates: bool) -> list[Any]: +def _run_local_sdk_update( + options: RunOptions, + *, + update_versions: dict[str, str], + implementation: dict[str, Any], + command_runner: CommandRunner | None, +) -> dict[str, Any]: + packages = tuple(sorted(update_versions)) + if not packages: + return { + **implementation, + "applied": False, + "blocked_reason": "no resolver-selected SDK updates", + } + update_result = run_lock_update( + options.workspace, + packages, + command_runner=command_runner, + ) + results = list(implementation.get("verification_results") or []) + results.append(to_jsonable(update_result)) + applied = update_result.returncode == 0 + changes = list(implementation.get("changes") or []) + if applied: + changes.append("Updated uv.lock for resolver-selected SDK packages: " + ", ".join(packages)) + verification_commands = tuple(DEFAULT_VERIFICATION_COMMANDS) + verification_results = run_verification_commands( + options.workspace, + verification_commands, + command_runner=command_runner, + ) + results.extend(to_jsonable(verification_results)) + return { + **implementation, + "applied": applied, + "changes": changes, + "verification_results": results, + "blocked_reason": "" if applied else update_result.stderr or update_result.stdout, + } + + +def 
_write_full_report( + context: RunContext, + *, + config: dict[str, Any], + evidence: dict[str, Any], + snapshots: list[dict[str, Any]], + api_diffs: list[dict[str, Any]], + release_notes: list[dict[str, Any]], + behavior: dict[str, Any], + current_state: dict[str, Any], + direction: dict[str, Any], + architecture: dict[str, Any], + implementation: dict[str, Any], + review: dict[str, Any], + pr_body: str | None = None, +) -> Path: + return write_run_report( + context, + config=config, + evidence=evidence, + snapshots=snapshots, + api_diffs=api_diffs, + release_notes=release_notes, + behavior=behavior, + current_state=current_state, + direction=direction, + architecture=architecture, + implementation=implementation, + review=review, + pr_body=pr_body, + ) + + +def _verification_passed(implementation: dict[str, Any]) -> bool: + results = implementation.get("verification_results") + if not isinstance(results, list): + return False + command_results = [item for item in results if isinstance(item, dict) and "returncode" in item] + return bool(command_results) and all( + int(item.get("returncode", 1)) == 0 for item in command_results + ) + + +def _create_autonomous_pr( + root: Path, + *, + report_path: Path, + options: RunOptions, + command_runner: CommandRunner | None, +) -> list[dict[str, Any]]: + branch_name = options.branch_name or _current_branch(root, command_runner=command_runner) + body = build_draft_pr_body(report_path.read_text(encoding="utf-8")) + relative_report = _relative_path(root, report_path.parent) + paths = ("uv.lock", relative_report) + results = [ + to_jsonable(stage_paths(root, paths, command_runner=command_runner)), + to_jsonable( + commit_staged( + root, + message=options.commit_message, + command_runner=command_runner, + ) + ), + ] + if branch_name: + results.append( + to_jsonable(push_branch(root, branch_name=branch_name, command_runner=command_runner)) + ) + results.append( + to_jsonable( + create_draft_pr( + root, + title=options.pr_title, + 
body=body, + base=options.pr_base, + head=branch_name, + command_runner=command_runner, + ) + ) + ) + return results + + +def _commit_final_autonomous_pr_report( + root: Path, + *, + report_path: Path, + options: RunOptions, + command_runner: CommandRunner | None, +) -> None: + branch_name = options.branch_name or _current_branch(root, command_runner=command_runner) + relative_report = _relative_path(root, report_path.parent) + results = [ + stage_paths(root, (relative_report,), command_runner=command_runner), + commit_staged( + root, + message="Finalize SDK evolution report", + command_runner=command_runner, + ), + ] + if branch_name: + results.append(push_branch(root, branch_name=branch_name, command_runner=command_runner)) + failed = [result for result in results if result.returncode != 0] + if failed: + detail = failed[0].stderr or failed[0].stdout + raise RuntimeError(f"failed to commit final autonomous PR report: {detail}") + + +def _current_branch(root: Path, *, command_runner: CommandRunner | None) -> str: + runner = command_runner or None + if runner is None: + from examples.sdk_evolution_agent.collectors import run_command + + runner = run_command + result = runner(("git", "branch", "--show-current"), cwd=root) + return result.stdout.strip() if result.returncode == 0 else "" + + +def _relative_path(root: Path, path: Path) -> str: + try: + return str(path.resolve().relative_to(root.resolve())) + except ValueError: + return str(path) + + +def _collect_snapshots(evidence: dict[str, Any], *, inspect_candidates: bool = True) -> list[Any]: + del inspect_candidates # Candidate inspection is mandatory for update candidates. 
snapshots = [] + update_versions = _refresh_update_versions(evidence) + refresh_preview_seen = evidence.get("refresh_preview") is not None for package in evidence.get("packages", []): if not isinstance(package, dict): continue name = str(package.get("name")) - snapshots.append(snapshot_current_api(name, version=package.get("installed_version"))) - latest = package.get("latest_version") - installed = package.get("installed_version") or package.get("locked_version") - if inspect_candidates and latest and latest != installed: - snapshots.append(snapshot_candidate_in_venv(name, str(latest))) + locked = package.get("locked_version") + installed = package.get("installed_version") + baseline = locked or installed + if locked and installed and locked != installed: + snapshots.append(snapshot_candidate_in_venv(name, str(locked))) + else: + snapshots.append(snapshot_current_api(name, version=baseline)) + candidate = update_versions.get(name) + if candidate is None and not refresh_preview_seen: + latest = package.get("latest_version") + if latest and latest != baseline: + candidate = str(latest) + if candidate: + snapshots.append(snapshot_candidate_in_venv(name, candidate)) return snapshots + + +def _refresh_update_versions(evidence: dict[str, Any]) -> dict[str, str]: + preview = evidence.get("refresh_preview") + if not isinstance(preview, dict): + return {} + text = f"{preview.get('stdout') or ''}\n{preview.get('stderr') or ''}" + return { + package: version + for package, version in re.findall( + r"Update\s+([A-Za-z0-9_.-]+)\s+v\S+\s+->\s+v(\S+)", + text, + ) + } diff --git a/examples/sdk_evolution_agent/collectors.py b/examples/sdk_evolution_agent/collectors.py index 93417fb..8dc135c 100644 --- a/examples/sdk_evolution_agent/collectors.py +++ b/examples/sdk_evolution_agent/collectors.py @@ -268,6 +268,29 @@ def run_refresh_preview( ) +def run_lock_update( + root: Path, + packages: Sequence[str], + *, + command_runner: CommandRunner | None = None, +) -> CommandResult: + 
"""Apply a targeted uv lock update with freshness cutoffs removed.""" + + command_runner = command_runner or run_command + env, removed = cutoff_free_env() + command = ["uv", "lock"] + for package in packages: + command.extend(("-P", package)) + result = command_runner(tuple(command), cwd=root, env=env) + return CommandResult( + command=result.command, + returncode=result.returncode, + stdout=result.stdout, + stderr=result.stderr, + removed_env=removed, + ) + + def run_verification_commands( root: Path, commands: Sequence[str], diff --git a/examples/sdk_evolution_agent/current_state.py b/examples/sdk_evolution_agent/current_state.py new file mode 100644 index 0000000..9f37bae --- /dev/null +++ b/examples/sdk_evolution_agent/current_state.py @@ -0,0 +1,106 @@ +"""Current-state baseline manifest helpers for SDK evolution runs.""" + +from __future__ import annotations + +import hashlib +import subprocess +from pathlib import Path +from typing import Any + +from examples.sdk_evolution_agent.collectors import read_uv_lock_versions +from examples.sdk_evolution_agent.models import RunContext + +CURRENT_STATE_SCHEMA_VERSION = "1" + + +def build_current_state( + context: RunContext, + *, + promoted: bool, + status: str, + implementation: dict[str, Any], +) -> dict[str, Any]: + """Build the baseline manifest for a run.""" + + lockfile = context.workspace / "uv.lock" + return { + "schema_version": CURRENT_STATE_SCHEMA_VERSION, + "generated_at_run_id": context.run_id, + "source_run_id": context.run_id, + "commit": _git_output(context.workspace, ("git", "rev-parse", "HEAD")), + "dirty_worktree": bool(_git_output(context.workspace, ("git", "status", "--short"))), + "lockfile_hash": _sha256(lockfile), + "packages": read_uv_lock_versions(lockfile), + "artifacts": _artifact_refs(context.report_root, workspace=context.workspace), + "promotion": { + "promoted": promoted, + "status": status, + "implementation_applied": bool(implementation.get("applied")), + "blocked_reason": 
str(implementation.get("blocked_reason") or ""), + }, + } + + +def _artifact_refs(report_root: Path, *, workspace: Path) -> dict[str, dict[str, str]]: + names = ( + "evidence.json", + "release_notes.json", + "api_diffs.json", + "behavior_probes.json", + "behavior_diffs.json", + "direction_analysis.json", + "architecture_decision.json", + "implementation_summary.json", + "review.json", + "report.md", + ) + refs: dict[str, dict[str, str]] = {} + for name in names: + path = report_root / name + if path.exists(): + refs[name] = { + "path": _portable_path(path, workspace=workspace), + "sha256": _sha256(path), + } + snapshots_dir = report_root / "api_snapshots" + if snapshots_dir.exists(): + for path in sorted(snapshots_dir.glob("*.json")): + refs[f"api_snapshots/{path.name}"] = { + "path": _portable_path(path, workspace=workspace), + "sha256": _sha256(path), + } + return refs + + +def _portable_path(path: Path, *, workspace: Path) -> str: + try: + return str(path.resolve().relative_to(workspace.resolve())) + except ValueError: + return str(path) + + +def _sha256(path: Path) -> str: + if not path.exists(): + return "" + digest = hashlib.sha256() + with path.open("rb") as handle: + for chunk in iter(lambda: handle.read(65536), b""): + digest.update(chunk) + return digest.hexdigest() + + +def _git_output(root: Path, command: tuple[str, ...]) -> str: + try: + completed = subprocess.run( + command, + cwd=root, + text=True, + capture_output=True, + timeout=30, + check=False, + ) + except Exception: + return "" + if completed.returncode != 0: + return "" + return completed.stdout.strip() diff --git a/examples/sdk_evolution_agent/models.py b/examples/sdk_evolution_agent/models.py index 7ccacca..8878709 100644 --- a/examples/sdk_evolution_agent/models.py +++ b/examples/sdk_evolution_agent/models.py @@ -2,7 +2,7 @@ from __future__ import annotations -from dataclasses import asdict, dataclass, is_dataclass +from dataclasses import asdict, dataclass, field, is_dataclass from 
pathlib import Path from typing import Any @@ -97,6 +97,47 @@ class ApiDiff: changed: tuple[str, ...] = () +@dataclass(frozen=True) +class ReleaseNoteEvidence: + """Release-note evidence collected for one package interval.""" + + package: str + from_version: str | None + to_version: str | None + status: str + sources: tuple[SourceRef, ...] = () + summaries: tuple[str, ...] = () + checked_urls: tuple[str, ...] = () + unavailable_reason: str = "" + + +@dataclass(frozen=True) +class BehaviorProbeResult: + """One deterministic behavior/contract probe result.""" + + package: str + version: str | None + scope: str + probe: str + status: str + summary: str + details: dict[str, Any] = field(default_factory=dict) + + +@dataclass(frozen=True) +class BehaviorDiff: + """Observed behavior difference between current and candidate probes.""" + + package: str + from_version: str | None + to_version: str | None + probe: str + severity: str + summary: str + before_status: str + after_status: str + + @dataclass(frozen=True) class RunOptions: """Configuration for one local agent run.""" @@ -107,10 +148,12 @@ class RunOptions: report_dir: Path = Path("reports/sdk-evolution") implementation_enabled: bool = False refresh_preview: bool = False - inspect_candidates: bool = False + inspect_candidates: bool = True create_branch: bool = False branch_name: str | None = None draft_pr: bool = False + pr_base: str | None = None + commit_message: str = "Run SDK evolution update" pr_title: str = "Adapt agent-runtime-kit to upstream SDK evolution" diff --git a/examples/sdk_evolution_agent/pr.py b/examples/sdk_evolution_agent/pr.py index 8fe1d3b..6d94100 100644 --- a/examples/sdk_evolution_agent/pr.py +++ b/examples/sdk_evolution_agent/pr.py @@ -38,17 +38,57 @@ def create_branch( return command_runner(("git", "switch", "-c", branch_name), cwd=root) +def stage_paths( + root: Path, + paths: tuple[str, ...], + *, + command_runner: CommandRunner | None = None, +) -> CommandResult: + """Stage paths for an 
autonomous SDK update PR.""" + + command_runner = command_runner or run_command + return command_runner(("git", "add", *paths), cwd=root) + + +def commit_staged( + root: Path, + *, + message: str, + command_runner: CommandRunner | None = None, +) -> CommandResult: + """Commit staged SDK update artifacts.""" + + command_runner = command_runner or run_command + return command_runner(("git", "commit", "-m", message), cwd=root) + + +def push_branch( + root: Path, + *, + branch_name: str, + command_runner: CommandRunner | None = None, +) -> CommandResult: + """Push the current SDK update branch.""" + + command_runner = command_runner or run_command + return command_runner(("git", "push", "-u", "origin", branch_name), cwd=root) + + def create_draft_pr( root: Path, *, title: str, body: str, + base: str | None = None, + head: str | None = None, command_runner: CommandRunner | None = None, ) -> CommandResult: """Open a draft PR with gh when authenticated.""" command_runner = command_runner or run_command - return command_runner( - ("gh", "pr", "create", "--draft", "--title", title, "--body", body), - cwd=root, - ) + command = ["gh", "pr", "create", "--draft", "--title", title, "--body", body] + if base: + command.extend(("--base", base)) + if head: + command.extend(("--head", head)) + return command_runner(tuple(command), cwd=root) diff --git a/examples/sdk_evolution_agent/release_notes.py b/examples/sdk_evolution_agent/release_notes.py new file mode 100644 index 0000000..078898c --- /dev/null +++ b/examples/sdk_evolution_agent/release_notes.py @@ -0,0 +1,248 @@ +"""Release-note evidence collection for SDK evolution runs.""" + +from __future__ import annotations + +import gzip +import re +import urllib.request +from collections.abc import Callable, Mapping, Sequence + +from examples.sdk_evolution_agent.models import ReleaseNoteEvidence, SourceRef + +ReleaseNoteFetcher = Callable[[str], str] + +RELEASE_NOTE_SOURCES: dict[str, tuple[SourceRef, ...]] = { + "claude-agent-sdk": 
( + SourceRef( + kind="changelog", + label="Claude Agent SDK Python changelog", + url=( + "https://raw.githubusercontent.com/anthropics/" + "claude-agent-sdk-python/main/CHANGELOG.md" + ), + ), + SourceRef( + kind="docs", + label="Claude Agent SDK overview", + url="https://code.claude.com/docs/en/agent-sdk/overview", + ), + ), + "openai-codex": ( + SourceRef( + kind="docs", + label="Codex SDK docs", + url="https://developers.openai.com/codex/sdk", + ), + SourceRef( + kind="changelog", + label="Codex changelog", + url="https://developers.openai.com/codex/changelog", + ), + SourceRef( + kind="release", + label="Codex repository releases", + url="https://github.com/openai/codex/releases", + ), + ), + "openai-codex-cli-bin": ( + SourceRef( + kind="release", + label="Codex repository releases", + url="https://github.com/openai/codex/releases", + ), + SourceRef( + kind="package-metadata", + label="Codex CLI binary package metadata", + url="https://pypi.org/project/openai-codex-cli-bin/", + ), + ), + "google-antigravity": ( + SourceRef( + kind="changelog", + label="Google Antigravity changelog", + url="https://antigravity.google/changelog", + ), + SourceRef( + kind="repository", + label="Antigravity SDK repository", + url="https://github.com/google-antigravity/antigravity-sdk-python", + ), + SourceRef( + kind="package-metadata", + label="Antigravity package metadata", + url="https://pypi.org/project/google-antigravity/", + ), + ), +} + + +def collect_release_notes( + packages: Sequence[Mapping[str, object]], + update_versions: Mapping[str, str], + *, + fetcher: ReleaseNoteFetcher | None = None, +) -> tuple[ReleaseNoteEvidence, ...]: + """Collect primary-source release-note evidence for update candidates.""" + + fetcher = fetcher or fetch_url_text + evidence: list[ReleaseNoteEvidence] = [] + for package in packages: + name = str(package.get("name") or "") + if not name: + continue + from_version = _string_or_none(package.get("locked_version")) or _string_or_none( + 
package.get("installed_version") + ) + to_version = update_versions.get(name) + if not to_version: + evidence.append( + ReleaseNoteEvidence( + package=name, + from_version=from_version, + to_version=None, + status="not-needed", + sources=RELEASE_NOTE_SOURCES.get(name, ()), + unavailable_reason="no resolver-selected update", + ) + ) + continue + + sources = RELEASE_NOTE_SOURCES.get(name, ()) + summaries: list[str] = [] + checked_urls: list[str] = [] + source_results: list[SourceRef] = [] + failures: list[str] = [] + for source in sources: + if not source.url: + source_results.append(source) + continue + checked_urls.append(source.url) + try: + text = fetcher(source.url) + except Exception as exc: + failures.append(f"{source.label}: {exc}") + source_results.append( + SourceRef( + kind=source.kind, + label=source.label, + url=source.url, + version=to_version, + available=False, + note=str(exc), + ) + ) + continue + source_results.append( + SourceRef( + kind=source.kind, + label=source.label, + url=source.url, + version=to_version, + available=True, + ) + ) + summaries.extend( + _summaries_for_interval( + text, + from_version=from_version, + to_version=to_version, + ) + ) + + if summaries: + status = "found" + unavailable_reason = "" + elif checked_urls and len(failures) < len(checked_urls): + status = "found" if name == "google-antigravity" else "no-matching-version" + summaries.append( + "Official sources were fetched, but no package-version-specific " + f"entry for {to_version} was found." 
+ ) + unavailable_reason = "sources fetched but no matching version text was found" + elif checked_urls: + status = "unavailable" + unavailable_reason = "; ".join(failures) + else: + status = "unavailable" + unavailable_reason = "no release-note source configured" + + evidence.append( + ReleaseNoteEvidence( + package=name, + from_version=from_version, + to_version=to_version, + status=status, + sources=tuple(source_results or sources), + summaries=tuple(_dedupe(summaries)[:8]), + checked_urls=tuple(checked_urls), + unavailable_reason=unavailable_reason, + ) + ) + return tuple(evidence) + + +def fetch_url_text(url: str) -> str: + """Fetch a release-note source as text.""" + + request = urllib.request.Request(url, headers={"User-Agent": "agent-runtime-kit-sdk-evolution"}) + with urllib.request.urlopen(request, timeout=20) as response: + raw = response.read() + if raw.startswith(b"\x1f\x8b"): + raw = gzip.decompress(raw) + return raw.decode("utf-8", errors="replace") + + +def _summaries_for_interval( + text: str, + *, + from_version: str | None, + to_version: str, +) -> list[str]: + lines = [line.strip() for line in text.splitlines()] + version_patterns = [to_version] + if from_version: + version_patterns.append(from_version) + matches: list[str] = [] + for index, line in enumerate(lines): + if not line: + continue + if any(pattern and pattern in line for pattern in version_patterns): + matches.append(_clean_summary(line)) + for nearby in lines[index + 1 : index + 4]: + cleaned = _clean_summary(nearby) + if cleaned: + matches.append(cleaned) + if not matches and to_version: + compact = re.sub(r"\s+", " ", text) + version_index = compact.find(to_version) + if version_index >= 0: + start = max(0, version_index - 160) + end = min(len(compact), version_index + 320) + matches.append(_clean_summary(compact[start:end])) + return [match for match in matches if match] + + +def _clean_summary(value: str, *, limit: int = 280) -> str: + cleaned = re.sub(r"<[^>]+>", "", value) + 
cleaned = re.sub(r"\s+", " ", cleaned).strip(" -*#\t") + if len(cleaned) <= limit: + return cleaned + return cleaned[: limit - 14].rstrip() + " [truncated]" + + +def _dedupe(values: Sequence[str]) -> list[str]: + seen: set[str] = set() + result: list[str] = [] + for value in values: + if value in seen: + continue + seen.add(value) + result.append(value) + return result + + +def _string_or_none(value: object) -> str | None: + if value is None: + return None + text = str(value) + return text or None diff --git a/examples/sdk_evolution_agent/report.py b/examples/sdk_evolution_agent/report.py index 90386e3..ec26afa 100644 --- a/examples/sdk_evolution_agent/report.py +++ b/examples/sdk_evolution_agent/report.py @@ -26,6 +26,9 @@ def write_run_report( evidence: dict[str, Any], snapshots: list[dict[str, Any]], api_diffs: list[dict[str, Any]], + release_notes: list[dict[str, Any]], + behavior: dict[str, Any], + current_state: dict[str, Any], direction: dict[str, Any], architecture: dict[str, Any], implementation: dict[str, Any], @@ -37,11 +40,15 @@ def write_run_report( context.report_root.mkdir(parents=True, exist_ok=True) write_json(context.report_root / "config.json", config) write_json(context.report_root / "evidence.json", evidence) + write_json(context.report_root / "release_notes.json", release_notes) write_json(context.report_root / "api_diffs.json", api_diffs) + write_json(context.report_root / "behavior_probes.json", behavior.get("results", [])) + write_json(context.report_root / "behavior_diffs.json", behavior.get("diffs", [])) write_json(context.report_root / "direction_analysis.json", direction) write_json(context.report_root / "architecture_decision.json", architecture) write_json(context.report_root / "implementation_summary.json", implementation) write_json(context.report_root / "review.json", review) + write_json(context.report_root / "current_state.json", current_state) snapshots_dir = context.report_root / "api_snapshots" 
snapshots_dir.mkdir(exist_ok=True) for index, snapshot in enumerate(snapshots, start=1): @@ -55,6 +62,9 @@ def write_run_report( config=config, evidence=evidence, api_diffs=api_diffs, + release_notes=release_notes, + behavior=behavior, + current_state=current_state, direction=direction, architecture=architecture, implementation=implementation, @@ -70,6 +80,9 @@ def render_markdown_report( config: dict[str, Any], evidence: dict[str, Any], api_diffs: list[dict[str, Any]], + release_notes: list[dict[str, Any]], + behavior: dict[str, Any], + current_state: dict[str, Any], direction: dict[str, Any], architecture: dict[str, Any], implementation: dict[str, Any], @@ -92,6 +105,19 @@ def render_markdown_report( ) manual = architecture.get("manual_design_required") recursive = architecture.get("recursive_self_adaptation_impact") + release_lines = [ + "- {package}: {status} ({from_version} -> {to_version})".format( + package=item.get("package"), + status=item.get("status"), + from_version=item.get("from_version"), + to_version=item.get("to_version"), + ) + for item in release_notes + if isinstance(item, dict) and item.get("to_version") + ] + behavior_summary = behavior.get("summary") if isinstance(behavior, dict) else {} + behavior_diffs = behavior.get("diffs", []) if isinstance(behavior, dict) else [] + promotion = current_state.get("promotion", {}) if isinstance(current_state, dict) else {} return "\n".join( [ "# SDK Evolution Agent Report", @@ -110,6 +136,17 @@ def render_markdown_report( "", f"- Diff count: `{len(api_diffs)}`", "", + "## Release Notes", + "", + *(release_lines or ["- No SDK update release-note evidence required."]), + "", + "## Behavior Probes", + "", + f"- Status: `{behavior_summary.get('status')}`", + f"- Changed contracts: `{behavior_summary.get('changed_count')}`", + f"- Breaking contracts: `{behavior_summary.get('breaking_count')}`", + f"- Diff count: `{len(behavior_diffs)}`", + "", "## Direction Of Travel", "", "```json", @@ -132,6 +169,11 @@ def 
render_markdown_report( json.dumps(implementation, indent=2, sort_keys=True, default=str), "```", "", + "## Current State Baseline", + "", + f"- Promotion status: `{promotion.get('status')}`", + f"- Promoted: `{promotion.get('promoted')}`", + "", "## Reviewer Output", "", "```json", diff --git a/examples/sdk_evolution_agent/schemas.py b/examples/sdk_evolution_agent/schemas.py index 80668d2..1c73b83 100644 --- a/examples/sdk_evolution_agent/schemas.py +++ b/examples/sdk_evolution_agent/schemas.py @@ -98,7 +98,7 @@ class SchemaValidationError(ValueError): "required": ["status", "reasons", "required_changes"], "additionalProperties": False, "properties": { - "status": {"type": "string"}, + "status": {"type": "string", "enum": ["pass", "reject"]}, "reasons": {"type": "array", "items": {"type": "string"}}, "required_changes": {"type": "array", "items": {"type": "string"}}, }, diff --git a/examples/sdk_evolution_agent/stages.py b/examples/sdk_evolution_agent/stages.py index 404058e..e272faa 100644 --- a/examples/sdk_evolution_agent/stages.py +++ b/examples/sdk_evolution_agent/stages.py @@ -3,6 +3,7 @@ from __future__ import annotations import json +import re from collections.abc import Mapping, Sequence from pathlib import Path from typing import Any @@ -31,7 +32,6 @@ from examples.sdk_evolution_agent.schemas import ( ARCHITECTURE_DECISION_SCHEMA, DIRECTION_ANALYSIS_SCHEMA, - IMPLEMENTATION_SUMMARY_SCHEMA, REVIEWER_OUTPUT_SCHEMA, JsonSchema, SchemaValidationError, @@ -44,6 +44,8 @@ class StageExecutionError(RuntimeError): SDK_EVOLUTION_CODEX_HOME = Path("~/.codex_agent_runtime_sdk").expanduser() +SDK_EVOLUTION_CODEX_MODEL = "gpt-5.5" +SDK_EVOLUTION_CODEX_REASONING_EFFORT = "xhigh" class FixtureEvolutionRuntime: @@ -105,6 +107,7 @@ def _codex_evolution_runtime(**kwargs: Any) -> CodexAgentRuntime: SDK_EVOLUTION_CODEX_HOME.chmod(0o700) env = dict(kwargs.pop("env", {}) or {}) env.setdefault("CODEX_HOME", str(SDK_EVOLUTION_CODEX_HOME)) + kwargs.setdefault("default_model", 
SDK_EVOLUTION_CODEX_MODEL) return CodexAgentRuntime(env=env, **kwargs) @@ -137,12 +140,12 @@ async def run_stage( permissions = _stage_permissions(runtime, write_enabled=write_enabled) task = AgentTask( goal=json.dumps(payload, sort_keys=True, default=str), - system=_stage_system_prompt(stage), + system=_stage_system_prompt(stage, schema), working_directory=context.workspace, permissions=permissions, event_sink=context.event_sink, output_schema=schema, - metadata={"stage": stage, "run_id": context.run_id}, + metadata=_stage_metadata(runtime, stage=stage, context=context), ) try: result = await runtime.run(task) @@ -165,11 +168,18 @@ async def run_analysis_pipeline( *, evidence: Mapping[str, Any], api_diffs: Sequence[Mapping[str, Any]], + release_notes: Sequence[Mapping[str, Any]], + behavior: Mapping[str, Any], context: RunContext, ) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]: """Run direction, architecture, and reviewer stages.""" - stage_payload = {"evidence": evidence, "api_diffs": list(api_diffs)} + stage_payload = { + "evidence": evidence, + "api_diffs": list(api_diffs), + "release_notes": list(release_notes), + "behavior": behavior, + } direction = await run_stage( runtime, stage="direction-analysis", @@ -177,24 +187,34 @@ async def run_analysis_pipeline( schema=DIRECTION_ANALYSIS_SCHEMA, context=context, ) + direction = _compact_stage_output(direction) architecture = await run_stage( runtime, stage="architecture-decision", payload={ "evidence": evidence, "api_diffs": list(api_diffs), + "release_notes": list(release_notes), + "behavior": behavior, "direction_analysis": direction, }, schema=ARCHITECTURE_DECISION_SCHEMA, context=context, ) architecture = with_recursive_impact(architecture, api_diffs) + architecture = with_candidate_api_diff_guard(architecture, evidence, api_diffs) + architecture = with_release_note_guard(architecture, release_notes) + architecture = with_behavior_probe_guard(architecture, behavior) + architecture = 
with_manual_design_gate(architecture) + architecture = _compact_stage_output(architecture) review = await run_stage( runtime, stage="review", payload={ "evidence": evidence, "api_diffs": list(api_diffs), + "release_notes": list(release_notes), + "behavior": behavior, "direction_analysis": direction, "architecture_decision": architecture, }, @@ -223,23 +243,20 @@ async def maybe_run_implementation( if not gate.allowed: return { "applied": False, + "allowed": False, "changes": [], "verification_results": [], "blocked_reason": gate.reason, } - return await run_stage( - runtime, - stage="implementation", - payload={ - "evidence": evidence, - "direction_analysis": direction, - "architecture_decision": architecture, - "review": review, - }, - schema=IMPLEMENTATION_SUMMARY_SCHEMA, - context=context, - write_enabled=True, - ) + del runtime, evidence, direction, review + return { + "applied": False, + "allowed": True, + "changes": [], + "verification_results": [], + "blocked_reason": "", + "planned_changes": list(architecture.get("self_adaptation_plan") or []), + } def evaluate_implementation_gate( @@ -258,13 +275,18 @@ def evaluate_implementation_gate( "self_adaptation_plan" ): return GateResult(False, "recursive self-adaptation requires a migration plan") - if str(review.get("status", "")).lower() != "pass": + if not _review_passed(review): return GateResult(False, "reviewer did not pass the proposal") if not architecture.get("safe_to_implement"): return GateResult(False, "architecture decision is not safe to implement") return GateResult(True, "implementation enabled and gates passed") +def _review_passed(review: Mapping[str, Any]) -> bool: + status = str(review.get("status", "")).strip().lower() + return status in {"pass", "passed", "approve", "approved", "accepted"} + + def detects_recursive_impact(api_diffs: Sequence[Mapping[str, Any] | ApiDiff]) -> bool: """Detect whether API diffs touch the agent's own runtime contract.""" @@ -313,14 +335,197 @@ def 
with_recursive_impact( return result -def _stage_system_prompt(stage: str) -> str: - return ( +def with_candidate_api_diff_guard( + architecture: Mapping[str, Any], + evidence: Mapping[str, Any], + api_diffs: Sequence[Mapping[str, Any] | ApiDiff], +) -> dict[str, Any]: + """Block SDK update implementation when candidate API evidence is missing.""" + + update_packages = _refresh_update_packages(evidence) + if not update_packages: + return dict(architecture) + diff_packages = { + diff.package if isinstance(diff, ApiDiff) else str(diff.get("package") or "") + for diff in api_diffs + } + missing = tuple(sorted(package for package in update_packages if package not in diff_packages)) + if not missing: + return dict(architecture) + + result = dict(architecture) + result["safe_to_implement"] = False + result["manual_design_required"] = True + findings = list(result.get("findings") or []) + findings.append( + { + "classification": "manual-design-required", + "summary": ( + "SDK update candidates require candidate-version API snapshot diffs " + "before implementation can be considered safe." + ), + "evidence": [f"missing api_diffs for {package}" for package in missing], + } + ) + result["findings"] = findings + uncertainty = list(result.get("uncertainty") or []) + uncertainty.append( + "Candidate API diffs were not available for update candidate(s): " + + ", ".join(missing) + ) + result["uncertainty"] = uncertainty + plan = list(result.get("self_adaptation_plan") or []) + plan.append( + "Rerun with candidate API inspection and review the generated api_diffs before " + "changing adapters or dependency locks." 
+ ) + result["self_adaptation_plan"] = plan + return result + + +def with_release_note_guard( + architecture: Mapping[str, Any], + release_notes: Sequence[Mapping[str, Any]], +) -> dict[str, Any]: + """Block implementation when release-note collection itself failed.""" + + failed = [ + str(item.get("package")) + for item in release_notes + if item.get("to_version") and item.get("status") == "unavailable" + ] + if not failed: + return dict(architecture) + result = dict(architecture) + result["safe_to_implement"] = False + result["manual_design_required"] = True + findings = list(result.get("findings") or []) + findings.append( + { + "classification": "manual-design-required", + "summary": "Release-note evidence could not be collected for update candidates.", + "evidence": [f"release notes unavailable for {package}" for package in failed], + } + ) + result["findings"] = findings + uncertainty = list(result.get("uncertainty") or []) + uncertainty.append("Missing release-note evidence for: " + ", ".join(sorted(failed))) + result["uncertainty"] = uncertainty + return result + + +def with_behavior_probe_guard( + architecture: Mapping[str, Any], + behavior: Mapping[str, Any], +) -> dict[str, Any]: + """Block implementation when candidate behavior probes fail.""" + + diffs = behavior.get("diffs") + if not isinstance(diffs, list): + return dict(architecture) + breaking = [ + diff + for diff in diffs + if isinstance(diff, Mapping) and str(diff.get("severity")) == "breaking" + ] + if not breaking: + return dict(architecture) + result = dict(architecture) + result["safe_to_implement"] = False + result["manual_design_required"] = True + findings = list(result.get("findings") or []) + findings.append( + { + "classification": "manual-design-required", + "summary": "Candidate SDK behavior probes detected breaking adapter-contract drift.", + "evidence": [ + f"{diff.get('package')}:{diff.get('probe')} {diff.get('summary')}" + for diff in breaking + ], + } + ) + result["findings"] = 
findings + uncertainty = list(result.get("uncertainty") or []) + uncertainty.append("Breaking behavior probes require manual adapter design review.") + result["uncertainty"] = uncertainty + return result + + +def with_manual_design_gate(architecture: Mapping[str, Any]) -> dict[str, Any]: + """Make manual design decisions block implementation unambiguously.""" + + result = dict(architecture) + if result.get("manual_design_required"): + result["safe_to_implement"] = False + return result + + +def _refresh_update_packages(evidence: Mapping[str, Any]) -> tuple[str, ...]: + preview = evidence.get("refresh_preview") + if not isinstance(preview, Mapping): + return () + text = f"{preview.get('stdout') or ''}\n{preview.get('stderr') or ''}" + return tuple( + sorted(set(re.findall(r"Update\s+([A-Za-z0-9_.-]+)\s+v\S+\s+->\s+v\S+", text))) + ) + + +def _compact_stage_output(value: Mapping[str, Any]) -> dict[str, Any]: + return {key: _compact_stage_value(item) for key, item in value.items()} + + +def _compact_stage_value(value: Any, *, string_limit: int = 800, list_limit: int = 8) -> Any: + if isinstance(value, str): + if len(value) <= string_limit: + return value + return value[: string_limit - 16].rstrip() + " [truncated]" + if isinstance(value, list): + return [ + _compact_stage_value(item, string_limit=string_limit, list_limit=list_limit) + for item in value[:list_limit] + ] + if isinstance(value, dict): + return { + key: _compact_stage_value(item, string_limit=string_limit, list_limit=list_limit) + for key, item in value.items() + } + return value + + +def _stage_system_prompt(stage: str, schema: JsonSchema) -> str: + prompt = ( "You are running inside the local SDK evolution agent. " "Use only the provided evidence. Preserve vendor-specific behavior, " "state uncertainty explicitly, and never claim implementation occurred " "unless it is reflected in the provided artifacts. " - f"Current stage: {stage}." 
+ "Return only one JSON object that validates against the provided schema. " + "Do not include Markdown, code fences, file links, or prose outside JSON. " + "Do not call shell, command, file, or workspace tools; the deterministic " + "evidence bundle already contains the inspected data. " + "Keep each array to at most five high-signal items and each string concise. " + f"Current stage: {stage}. " + f"Output schema: {json.dumps(schema, sort_keys=True)}" ) + if stage in {"architecture-decision", "review"}: + prompt += ( + " Deterministic gate policy: candidate API diffs prove API shape drift, " + "while behavior_diffs prove whether the adapter contract still holds. " + "For adapter-contract probes, severity none means the required adapter " + "contract is compatible even when probe details or public API snapshots " + "show optional field churn. " + "Do not mark manual_design_required, unsafe, or review rejection solely " + "because public top-level symbols were added or removed when behavior " + "probes pass before and after and there is no adapter-source evidence " + "that the removed symbols are used. Breaking behavior_diffs, missing " + "candidate API diffs, unavailable required release-note evidence, " + "reviewer-identified unsupported vendor behavior, or recursive " + "runtime-contract impact remain hard blockers. Release-note status found " + "is collected evidence, not unavailable evidence, even when the summary " + "states that no package-version-specific entry was found." + ) + if stage == "review": + prompt += " The review status must be exactly pass or reject." 
+ return prompt def _stage_permissions(runtime: AgentRuntime, *, write_enabled: bool) -> PermissionProfile: @@ -339,6 +544,19 @@ def _stage_permissions(runtime: AgentRuntime, *, write_enabled: bool) -> Permiss ) +def _stage_metadata( + runtime: AgentRuntime, + *, + stage: str, + context: RunContext, +) -> dict[str, Any]: + metadata: dict[str, Any] = {"stage": stage, "run_id": context.run_id} + if runtime.kind is AgentRuntimeKind.CODEX_AGENT_SDK: + metadata["model"] = SDK_EVOLUTION_CODEX_MODEL + metadata["reasoning_effort"] = SDK_EVOLUTION_CODEX_REASONING_EFFORT + return metadata + + def _fixture_payload(stage: str, task: AgentTask) -> dict[str, Any]: try: source = json.loads(task.goal) diff --git a/tests/test_sdk_evolution_agent.py b/tests/test_sdk_evolution_agent.py index 6409346..9bfaaa6 100644 --- a/tests/test_sdk_evolution_agent.py +++ b/tests/test_sdk_evolution_agent.py @@ -18,15 +18,22 @@ RuntimeAvailability, ) from agent_runtime_kit.adapters import CodexAgentRuntime -from examples.sdk_evolution_agent.cli import RunOptions, parse_args, run_agent +from examples.sdk_evolution_agent.behavior import ( + collect_behavior_evidence, + diff_behavior_results, +) +from examples.sdk_evolution_agent.cli import RunOptions, _collect_snapshots, parse_args, run_agent from examples.sdk_evolution_agent.collectors import ( build_refresh_preview_command, collect_evidence, cutoff_free_env, + run_lock_update, run_refresh_preview, ) -from examples.sdk_evolution_agent.models import CommandResult, RunContext +from examples.sdk_evolution_agent.current_state import build_current_state +from examples.sdk_evolution_agent.models import ApiSnapshot, CommandResult, RunContext from examples.sdk_evolution_agent.pr import build_draft_pr_body +from examples.sdk_evolution_agent.release_notes import collect_release_notes from examples.sdk_evolution_agent.schemas import ( DIRECTION_ANALYSIS_SCHEMA, SchemaValidationError, @@ -35,13 +42,19 @@ from examples.sdk_evolution_agent.snapshots import 
diff_snapshots, snapshot_current_api from examples.sdk_evolution_agent.stages import ( SDK_EVOLUTION_CODEX_HOME, + SDK_EVOLUTION_CODEX_MODEL, + SDK_EVOLUTION_CODEX_REASONING_EFFORT, FixtureEvolutionRuntime, StageExecutionError, build_registry, detects_recursive_impact, evaluate_implementation_gate, run_stage, + with_behavior_probe_guard, + with_candidate_api_diff_guard, + with_manual_design_gate, with_recursive_impact, + with_release_note_guard, ) @@ -97,6 +110,42 @@ def runner( assert result.removed_env == ("UV_EXCLUDE_NEWER",) +def test_lock_update_uses_targeted_packages_and_clean_env( + tmp_path: Path, + monkeypatch: pytest.MonkeyPatch, +) -> None: + seen: dict[str, Any] = {} + monkeypatch.setenv("UV_EXCLUDE_NEWER", "2026-01-01") + + def runner( + command: tuple[str, ...], + *, + cwd: Path | None = None, + env: dict[str, str], + ) -> CommandResult: + seen["command"] = command + seen["cwd"] = cwd + seen["env"] = env + return CommandResult(command=command, returncode=0, stdout="ok") + + result = run_lock_update( + tmp_path, + ("claude-agent-sdk", "google-antigravity"), + command_runner=runner, + ) + + assert seen["command"] == ( + "uv", + "lock", + "-P", + "claude-agent-sdk", + "-P", + "google-antigravity", + ) + assert "UV_EXCLUDE_NEWER" not in seen["env"] + assert result.removed_env == ("UV_EXCLUDE_NEWER",) + + def test_collect_evidence_records_versions_and_sources(tmp_path: Path) -> None: (tmp_path / "pyproject.toml").write_text( """ @@ -130,6 +179,213 @@ def test_collect_evidence_records_versions_and_sources(tmp_path: Path) -> None: assert evidence["adapter_sources"] +def test_release_notes_collects_matching_update_source() -> None: + notes = collect_release_notes( + [ + { + "name": "claude-agent-sdk", + "locked_version": "0.2.96", + "installed_version": "0.2.96", + } + ], + {"claude-agent-sdk": "0.2.106"}, + fetcher=lambda url: "## 0.2.106\n- Added TaskUpdatedMessage\n", + ) + + assert notes[0].status == "found" + assert notes[0].to_version == "0.2.106" + 
assert any("TaskUpdatedMessage" in summary for summary in notes[0].summaries) + + +def test_antigravity_release_notes_record_source_coverage_without_version_match() -> None: + notes = collect_release_notes( + [ + { + "name": "google-antigravity", + "locked_version": "0.1.2", + "installed_version": "0.1.2", + } + ], + {"google-antigravity": "0.1.4"}, + fetcher=lambda url: "Google Antigravity product changelog", + ) + + assert notes[0].status == "found" + assert "no package-version-specific" in notes[0].summaries[0] + + +def test_release_note_guard_blocks_unavailable_update_source() -> None: + guarded = with_release_note_guard( + { + "findings": [], + "safe_to_implement": True, + "manual_design_required": False, + "uncertainty": [], + }, + [ + { + "package": "claude-agent-sdk", + "to_version": "0.2.106", + "status": "unavailable", + } + ], + ) + + assert guarded["safe_to_implement"] is False + assert guarded["manual_design_required"] is True + + +def test_behavior_diffs_track_candidate_contract_changes() -> None: + behavior = collect_behavior_evidence( + [ + { + "name": "fake-sdk", + "locked_version": "1.0.0", + "installed_version": "1.0.0", + } + ], + {}, + ) + assert behavior["summary"]["status"] == "pass" + + diffs = diff_behavior_results( + [ + _probe("claude-agent-sdk", "0.2.96", "current-environment", "pass", {"fields": ["a"]}), + _probe("claude-agent-sdk", "0.2.106", "isolated-venv", "fail", {"fields": []}), + ] + ) + + assert diffs[0].severity == "breaking" + + +def test_behavior_diffs_ignore_optional_field_churn_when_contract_holds() -> None: + required = ["api_key", "mcp_servers", "model"] + diffs = diff_behavior_results( + [ + _probe( + "google-antigravity", + "0.1.2", + "current-baseline", + "pass", + { + "fields": ["api_key", "gemini_config", "mcp_servers", "model"], + "required_fields": required, + "missing": [], + }, + ), + _probe( + "google-antigravity", + "0.1.4", + "candidate", + "pass", + { + "fields": ["api_key", "mcp_servers", "model", "models"], 
+ "required_fields": required, + "missing": [], + }, + ), + ] + ) + + assert diffs[0].severity == "none" + assert diffs[0].summary == "No behavior contract difference detected." + + +def test_behavior_evidence_uses_locked_baseline_when_environment_drifted( + monkeypatch: pytest.MonkeyPatch, +) -> None: + calls: list[tuple[str, str, str]] = [] + + def isolated(package: str, version: str, *, scope: str = "candidate"): + calls.append((package, version, scope)) + return (_probe(package, version, scope, "pass", {"scope": scope}),) + + monkeypatch.setattr( + "examples.sdk_evolution_agent.behavior.probe_candidate_in_venv", + isolated, + ) + + behavior = collect_behavior_evidence( + [ + { + "name": "claude-agent-sdk", + "locked_version": "0.2.96", + "installed_version": "0.2.106", + } + ], + {"claude-agent-sdk": "0.2.106"}, + ) + + assert calls == [ + ("claude-agent-sdk", "0.2.96", "current-baseline"), + ("claude-agent-sdk", "0.2.106", "candidate"), + ] + assert behavior["diffs"][0].severity == "changed" + + +def test_behavior_probe_guard_blocks_breaking_candidate_diff() -> None: + guarded = with_behavior_probe_guard( + { + "findings": [], + "safe_to_implement": True, + "manual_design_required": False, + "uncertainty": [], + }, + { + "diffs": [ + { + "package": "google-antigravity", + "probe": "adapter-contract", + "severity": "breaking", + "summary": "Candidate probe changed from pass to fail.", + } + ] + }, + ) + + assert guarded["safe_to_implement"] is False + assert guarded["manual_design_required"] is True + + +def test_current_state_artifact_paths_are_repo_relative(tmp_path: Path) -> None: + (tmp_path / "uv.lock").write_text( + """ +[[package]] +name = "claude-agent-sdk" +version = "0.2.106" +""", + encoding="utf-8", + ) + report_root = tmp_path / "reports" / "sdk-evolution" / "run-1" + report_root.mkdir(parents=True) + (report_root / "evidence.json").write_text("{}", encoding="utf-8") + snapshots = report_root / "api_snapshots" + snapshots.mkdir() + (snapshots / 
"01-claude-agent-sdk.json").write_text("{}", encoding="utf-8") + context = RunContext( + run_id="run-1", + workspace=tmp_path, + report_root=report_root, + runtime="fake", + event_log_path=report_root / "events.jsonl", + implementation_enabled=True, + draft_pr=False, + ) + + state = build_current_state( + context, + promoted=True, + status="promoted", + implementation={"applied": True}, + ) + + paths = [artifact["path"] for artifact in state["artifacts"].values()] + assert "reports/sdk-evolution/run-1/evidence.json" in paths + assert "reports/sdk-evolution/run-1/api_snapshots/01-claude-agent-sdk.json" in paths + assert all(not path.startswith("/") for path in paths) + assert all("/private/tmp" not in path and "/tmp/" not in path for path in paths) + + def test_snapshot_and_diff_public_api(monkeypatch: pytest.MonkeyPatch) -> None: module = types.ModuleType("fake_sdk") @@ -154,6 +410,224 @@ def run_new(value: str, *, verbose: bool = False) -> str: assert diff.changed == ("run",) +def test_parse_args_inspects_candidates_by_default() -> None: + options = parse_args(["--runtime", "fake"]) + + assert options.inspect_candidates is True + + +def test_collect_snapshots_uses_lockfile_baseline_for_candidates( + monkeypatch: pytest.MonkeyPatch, +) -> None: + calls: list[tuple[str, str | None]] = [] + + def current_snapshot(package: str, *, version: str | None = None) -> ApiSnapshot: + calls.append(("current", version)) + return ApiSnapshot(package=package, version=version, module="google.antigravity") + + def candidate_snapshot(package: str, version: str) -> ApiSnapshot: + calls.append(("candidate", version)) + return ApiSnapshot( + package=package, + version=version, + module="google.antigravity", + source="isolated-venv", + ) + + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_current_api", + current_snapshot, + ) + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv", + candidate_snapshot, + ) + + snapshots = 
_collect_snapshots( + { + "packages": [ + { + "name": "google-antigravity", + "locked_version": "0.1.2", + "installed_version": "0.1.4", + "latest_version": "0.1.4", + } + ] + }, + inspect_candidates=False, + ) + + assert len(snapshots) == 2 + assert calls == [("candidate", "0.1.2"), ("candidate", "0.1.4")] + + +def test_collect_snapshots_uses_refresh_preview_update_targets( + monkeypatch: pytest.MonkeyPatch, +) -> None: + calls: list[tuple[str, str, str | None]] = [] + + def current_snapshot(package: str, *, version: str | None = None) -> ApiSnapshot: + calls.append(("current", package, version)) + return ApiSnapshot(package=package, version=version, module=package.replace("-", "_")) + + def candidate_snapshot(package: str, version: str) -> ApiSnapshot: + calls.append(("candidate", package, version)) + return ApiSnapshot( + package=package, + version=version, + module=package.replace("-", "_"), + source="isolated-venv", + ) + + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_current_api", + current_snapshot, + ) + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv", + candidate_snapshot, + ) + + snapshots = _collect_snapshots( + { + "packages": [ + { + "name": "claude-agent-sdk", + "locked_version": "0.2.96", + "installed_version": "0.2.96", + "latest_version": "0.2.106", + }, + { + "name": "openai-codex-cli-bin", + "locked_version": "0.137.0a4", + "installed_version": "0.137.0a4", + "latest_version": "0.136.0", + }, + ], + "refresh_preview": { + "stdout": "", + "stderr": "Update claude-agent-sdk v0.2.96 -> v0.2.106\n", + }, + }, + ) + + assert len(snapshots) == 3 + assert calls == [ + ("current", "claude-agent-sdk", "0.2.96"), + ("candidate", "claude-agent-sdk", "0.2.106"), + ("current", "openai-codex-cli-bin", "0.137.0a4"), + ] + + +def test_collect_snapshots_uses_locked_baseline_when_environment_drifted( + monkeypatch: pytest.MonkeyPatch, +) -> None: + calls: list[tuple[str, str, str | None]] = [] + + def 
current_snapshot(package: str, *, version: str | None = None) -> ApiSnapshot: + calls.append(("current", package, version)) + return ApiSnapshot(package=package, version=version, module=package.replace("-", "_")) + + def isolated_snapshot(package: str, version: str) -> ApiSnapshot: + calls.append(("isolated", package, version)) + return ApiSnapshot(package=package, version=version, module=package.replace("-", "_")) + + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_current_api", + current_snapshot, + ) + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv", + isolated_snapshot, + ) + + _collect_snapshots( + { + "packages": [ + { + "name": "claude-agent-sdk", + "locked_version": "0.2.96", + "installed_version": "0.2.106", + "latest_version": "0.2.106", + }, + ], + "refresh_preview": { + "stdout": "", + "stderr": "Update claude-agent-sdk v0.2.96 -> v0.2.106\n", + }, + }, + ) + + assert calls == [ + ("isolated", "claude-agent-sdk", "0.2.96"), + ("isolated", "claude-agent-sdk", "0.2.106"), + ] + + +def test_candidate_api_diff_guard_blocks_missing_update_diff() -> None: + guarded = with_candidate_api_diff_guard( + { + "findings": [], + "safe_to_implement": True, + "manual_design_required": False, + "uncertainty": [], + "self_adaptation_plan": [], + }, + { + "refresh_preview": { + "stdout": "", + "stderr": "Update google-antigravity v0.1.2 -> v0.1.4\n", + } + }, + [], + ) + + assert guarded["safe_to_implement"] is False + assert guarded["manual_design_required"] is True + assert "missing api_diffs for google-antigravity" in guarded["findings"][-1]["evidence"] + + +def test_candidate_api_diff_guard_accepts_empty_update_diff() -> None: + guarded = with_candidate_api_diff_guard( + { + "findings": [], + "safe_to_implement": True, + "manual_design_required": False, + }, + { + "refresh_preview": { + "stdout": "", + "stderr": "Update google-antigravity v0.1.2 -> v0.1.4\n", + } + }, + [ + { + "package": "google-antigravity", + 
"from_version": "0.1.2", + "to_version": "0.1.4", + "added": [], + "removed": [], + "changed": [], + } + ], + ) + + assert guarded["safe_to_implement"] is True + assert guarded["manual_design_required"] is False + + +def test_manual_design_gate_forces_safe_to_implement_false() -> None: + architecture = with_manual_design_gate( + { + "findings": [], + "safe_to_implement": True, + "manual_design_required": True, + } + ) + + assert architecture["safe_to_implement"] is False + + def test_schema_validation_rejects_missing_required_field() -> None: with pytest.raises(SchemaValidationError): validate_mapping({"packages": [], "themes": []}, DIRECTION_ANALYSIS_SCHEMA, name="stage") @@ -186,6 +660,34 @@ async def test_stage_execution_uses_agent_task_runtime_primitives(tmp_path: Path assert runtime.task.working_directory == tmp_path assert runtime.task.permissions.filesystem is FilesystemAccess.READ_ONLY assert runtime.task.metadata["stage"] == "direction-analysis" + assert "model" not in runtime.task.metadata + assert "reasoning_effort" not in runtime.task.metadata + + +@pytest.mark.asyncio +async def test_codex_stage_execution_uses_gpt_55_xhigh_thinking(tmp_path: Path) -> None: + runtime = RecordingRuntime(kind=AgentRuntimeKind.CODEX_AGENT_SDK) + context = RunContext( + run_id="run-1", + workspace=tmp_path, + report_root=tmp_path / "reports", + runtime="codex-agent-sdk", + event_log_path=tmp_path / "events.jsonl", + implementation_enabled=False, + draft_pr=False, + ) + + await run_stage( + runtime, + stage="direction-analysis", + payload={"evidence": {}, "api_diffs": []}, + schema=DIRECTION_ANALYSIS_SCHEMA, + context=context, + ) + + assert runtime.task is not None + assert runtime.task.metadata["model"] == SDK_EVOLUTION_CODEX_MODEL + assert runtime.task.metadata["reasoning_effort"] == SDK_EVOLUTION_CODEX_REASONING_EFFORT @pytest.mark.asyncio @@ -289,8 +791,25 @@ def test_reviewer_rejection_blocks_implementation() -> None: assert "reviewer" in gate.reason +def 
test_reviewer_approved_status_allows_implementation() -> None: + gate = evaluate_implementation_gate( + { + "safe_to_implement": True, + "manual_design_required": False, + "recursive_self_adaptation_impact": False, + }, + {"status": "approved"}, + implementation_enabled=True, + ) + + assert gate.allowed is True + + @pytest.mark.asyncio -async def test_run_agent_report_only_generates_artifacts(tmp_path: Path) -> None: +async def test_run_agent_report_only_generates_artifacts( + tmp_path: Path, + monkeypatch: pytest.MonkeyPatch, +) -> None: (tmp_path / "pyproject.toml").write_text( """ [project.optional-dependencies] @@ -299,6 +818,23 @@ async def test_run_agent_report_only_generates_artifacts(tmp_path: Path) -> None encoding="utf-8", ) (tmp_path / "uv.lock").write_text("", encoding="utf-8") + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_current_api", + lambda package, *, version=None: ApiSnapshot( + package=package, + version=version, + module=package.replace("-", "_"), + ), + ) + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv", + lambda package, version: ApiSnapshot( + package=package, + version=version, + module=package.replace("-", "_"), + source="isolated-venv", + ), + ) report_path = await run_agent( RunOptions( @@ -307,6 +843,7 @@ async def test_run_agent_report_only_generates_artifacts(tmp_path: Path) -> None packages=("claude-agent-sdk",), report_dir=Path("reports"), implementation_enabled=False, + inspect_candidates=False, ), pypi_client=_fake_pypi, runtime=FixtureEvolutionRuntime(), @@ -314,15 +851,119 @@ async def test_run_agent_report_only_generates_artifacts(tmp_path: Path) -> None assert report_path.exists() assert (report_path.parent / "evidence.json").exists() + assert (report_path.parent / "release_notes.json").exists() assert (report_path.parent / "api_diffs.json").exists() + assert (report_path.parent / "behavior_probes.json").exists() + assert (report_path.parent / 
"behavior_diffs.json").exists() + assert (report_path.parent / "current_state.json").exists() assert (report_path.parent / "direction_analysis.json").exists() assert (report_path.parent / "architecture_decision.json").exists() assert (report_path.parent / "implementation_summary.json").exists() assert (report_path.parent / "review.json").exists() assert (report_path.parent / "events.jsonl").exists() + assert '"package": "claude-agent-sdk"' in (report_path.parent / "api_diffs.json").read_text( + encoding="utf-8" + ) assert "Recursive self-adaptation impact" in report_path.read_text(encoding="utf-8") +@pytest.mark.asyncio +async def test_run_agent_autonomous_pr_path( + tmp_path: Path, + monkeypatch: pytest.MonkeyPatch, +) -> None: + (tmp_path / "pyproject.toml").write_text( + """ +[project.optional-dependencies] +claude = ["claude-agent-sdk>=0.2"] +""", + encoding="utf-8", + ) + (tmp_path / "uv.lock").write_text( + """ +[[package]] +name = "claude-agent-sdk" +version = "0.2.1" +""", + encoding="utf-8", + ) + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_current_api", + lambda package, *, version=None: ApiSnapshot( + package=package, + version=version, + module=package.replace("-", "_"), + ), + ) + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv", + lambda package, version: ApiSnapshot( + package=package, + version=version, + module=package.replace("-", "_"), + source="isolated-venv", + ), + ) + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.collect_release_notes", + lambda packages, updates: [], + ) + monkeypatch.setattr( + "examples.sdk_evolution_agent.cli.collect_behavior_evidence", + lambda packages, updates: {"results": [], "diffs": [], "summary": {"status": "pass"}}, + ) + commands: list[tuple[str, ...]] = [] + + def runner( + command: tuple[str, ...], + *, + cwd: Path | None = None, + env: dict[str, str] | None = None, + ) -> CommandResult: + del cwd, env + commands.append(command) + if 
command[:3] == ("uv", "lock", "--dry-run"): + return CommandResult( + command=command, + returncode=0, + stderr="Update claude-agent-sdk v0.2.1 -> v0.3.0\n", + ) + if command[:2] == ("uv", "lock"): + return CommandResult(command=command, returncode=0, stdout="updated") + return CommandResult(command=command, returncode=0, stdout="ok") + + report_path = await run_agent( + RunOptions( + workspace=tmp_path, + runtime="fake", + packages=("claude-agent-sdk",), + report_dir=Path("reports"), + implementation_enabled=True, + refresh_preview=True, + create_branch=True, + branch_name="sdk-update-test", + draft_pr=True, + pr_base="main", + ), + pypi_client=_fake_pypi, + command_runner=runner, + runtime=PermissiveRuntime(), + ) + + assert report_path.exists() + assert ("git", "switch", "-c", "sdk-update-test") in commands + assert ("uv", "lock", "-P", "claude-agent-sdk") in commands + assert any(command[:3] == ("git", "commit", "-m") for command in commands) + assert any(command[:4] == ("gh", "pr", "create", "--draft") for command in commands) + assert ("git", "commit", "-m", "Finalize SDK evolution report") in commands + assert commands.count(("git", "push", "-u", "origin", "sdk-update-test")) == 2 + pr_index = next( + i for i, command in enumerate(commands) if command[:3] == ("gh", "pr", "create") + ) + finalize_index = commands.index(("git", "commit", "-m", "Finalize SDK evolution report")) + assert finalize_index > pr_index + + def test_parse_args_and_pr_body() -> None: options = parse_args( [ @@ -338,6 +979,7 @@ def test_parse_args_and_pr_body() -> None: assert options.runtime == "claude-agent-sdk" assert options.packages == ("claude-agent-sdk",) + assert options.inspect_candidates is True assert options.implementation_enabled is True assert options.draft_pr is True assert "No auto-merge" in body @@ -347,6 +989,7 @@ def test_build_registry_injects_isolated_codex_home() -> None: runtime = build_registry().resolve(AgentRuntimeKind.CODEX_AGENT_SDK) assert isinstance(runtime, 
CodexAgentRuntime) + assert runtime._default_model == SDK_EVOLUTION_CODEX_MODEL assert runtime._env is not None assert runtime._env["CODEX_HOME"] == str(SDK_EVOLUTION_CODEX_HOME) @@ -378,6 +1021,54 @@ async def cancel(self, task_id: str) -> None: del task_id +class PermissiveRuntime(RecordingRuntime): + async def run(self, task: AgentTask) -> AgentResult: + self.task = task + stage = task.metadata["stage"] + if stage == "direction-analysis": + payload = {"packages": [], "themes": [], "uncertainty": []} + elif stage == "architecture-decision": + payload = { + "findings": [], + "safe_to_implement": True, + "manual_design_required": False, + "recursive_self_adaptation_impact": False, + "self_adaptation_plan": ["Update SDK lockfile."], + "verification_commands": [], + "uncertainty": [], + } + elif stage == "review": + payload = {"status": "pass", "reasons": [], "required_changes": []} + else: + payload = { + "applied": False, + "changes": [], + "verification_results": [], + "blocked_reason": "", + } + return AgentResult(output="{}", parsed_output=payload) + + +def _probe( + package: str, + version: str, + scope: str, + status: str, + details: dict[str, Any], +): + from examples.sdk_evolution_agent.models import BehaviorProbeResult + + return BehaviorProbeResult( + package=package, + version=version, + scope=scope, + probe="adapter-contract", + status=status, + summary=status, + details=details, + ) + + def _fake_pypi(package: str) -> dict[str, Any]: assert package == "claude-agent-sdk" return {