diff --git a/docs/sdk-evolution-agent-design.md b/docs/sdk-evolution-agent-design.md
new file mode 100644
index 0000000..2b1bfb6
--- /dev/null
+++ b/docs/sdk-evolution-agent-design.md
@@ -0,0 +1,622 @@
+# SDK Evolution Agent Design
+
+This document describes how the SDK evolution example should work before adding
+more implementation. It is intentionally more detailed than the user-facing run
+guide in `docs/sdk-evolution-agent.md`.
+
+The core idea is that a resolver-approved dependency update is not, on its own,
+enough evidence. The agent must combine resolver facts, release notes, API
+shape, adapter behavior probes, and real-runtime review before it recommends a
+lockfile change, adapter change, or manual design stop.
+
+## Goals
+
+The SDK evolution agent should answer these questions for every run:
+
+- What package versions are installed, locked, and available upstream?
+- Which packages does the resolver actually want to update?
+- What changed in public API shape?
+- What changed in documented behavior or product direction?
+- Which adapter behavior contracts still pass on the candidate versions?
+- Does the current `agent-runtime-kit` abstraction still preserve vendor
+  behavior?
+- Is the safe next action a lock update, adapter update, docs/test update,
+  provider-specific extension, public API evolution, or manual design review?
+
+The agent must dogfood `agent-runtime-kit`: all AI reasoning stages run through
+`AgentTask`, `RuntimeRegistry`, runtime adapters, output schemas, event sinks,
+permission profiles, and `AgentResult`. Local shell, filesystem, package
+manager, Git, and GitHub operations are allowed only for deterministic evidence
+collection and mechanical changes.
+
+## Non-Goals
+
+The example should not become a generic dependency update bot. A generic bot
+can answer "can the lockfile move?" This agent must answer "does the runtime
+adapter contract still hold, and does the public SDK architecture still make
+sense?"
+
+It should not hide vendor differences. If Claude adds task status events, Codex
+changes sandbox semantics, or Antigravity changes model endpoint configuration,
+the right output is explicit provider-specific evidence and possibly a
+provider-specific extension, not a flattened common denominator.
+
+It should not require all vendor SDKs for normal package users. The example can
+use `agent-runtime-kit[all]` for local research, but the package itself must keep
+optional extras.
+
+## High-Level Flow
+
+```mermaid
+flowchart TD
+    A["Start local command"] --> B["Collect deterministic evidence"]
+    B --> C["Resolve update candidates"]
+    C --> D["Inspect current and candidate APIs"]
+    D --> E["Collect changelog and release-note evidence"]
+    E --> F["Run adapter behavior probes"]
+    F --> G["Build evidence bundle"]
+    G --> H["Direction analysis through agent-runtime-kit"]
+    H --> I["Architecture decision and update plan through agent-runtime-kit"]
+    I --> J["Independent review through agent-runtime-kit"]
+    J --> K{"Gates pass?"}
+    K -- "No" --> L["Write report with manual review checklist"]
+    K -- "Yes" --> M["Apply safe implementation"]
+    M --> N["Run verification"]
+    N --> O["Promote updated state to current baseline"]
+    O --> P["Write report and optional draft PR"]
+```
+
+Step responsibilities:
+
+- **Start local command**: Parse the selected runtime, package filters, report
+  directory, refresh options, implementation flag, branch option, and draft PR
+  option. This step also establishes the run ID and local report directory.
+- **Collect deterministic evidence**: Read local project state without using AI:
+  `pyproject.toml`, `uv.lock`, installed distributions, package metadata,
+  configured source hints, local environment facts, and supported auth
+  availability. This produces raw facts, not recommendations.
+- **Resolve update candidates**: Run the targeted resolver preview with
+  freshness cutoffs removed. This step decides which packages are real update
+  candidates for the run. It should use resolver output rather than only PyPI
+  `latest` metadata, especially for prerelease packages.
+- **Inspect current and candidate APIs**: Load API snapshot and diff artifacts
+  from the last update run, then focus new inspection on packages that the
+  resolver selected for update or packages whose evidence is missing, stale, or
+  incompatible with the current evidence schema. This step owns API snapshot and
+  API diff artifacts. If the evidence schema changes, the agent may need to
+  refresh the current-state snapshot or gather more current-state data before
+  comparing candidates. If an update candidate has no candidate API diff, the
+  run should not proceed to implementation.
+- **Collect changelog and release-note evidence**: Fetch or read official
+  changelogs, release pages, docs changelogs, repository releases, and package
+  metadata links. This step records what changed according to the vendor and
+  explicitly marks missing or incomplete release-note coverage.
+- **Run adapter behavior probes**: Execute deterministic unit probes, installed
+  SDK contract probes, and optional live probes. This step answers whether the
+  adapter behavior still holds, including permissions, sandbox/workspace
+  handling, streaming, structured output, MCP/tool support, auth discovery, and
+  session/resume behavior.
+- **Build evidence bundle**: Normalize package facts, resolver facts, API
+  snapshots, API diffs, release-note evidence, behavior probe results, source
+  references, and uncertainty into a compact bundle for the AI stages. This step
+  should preserve provenance so later reasoning can be traced back to evidence.
+- **Direction analysis through agent-runtime-kit**: Ask a runtime, via
+  `AgentTask`, to infer direction-of-travel themes from the evidence. This step
+  identifies whether changes look isolated or part of a broader SDK direction,
+  but it does not own the concrete implementation plan.
+- **Architecture decision and update plan through agent-runtime-kit**: Ask a
+  runtime, via `AgentTask`, to turn direction analysis into the concrete plan:
+  adapter-only, test-only, docs-only, capability metadata change,
+  provider-specific extension, public API evolution, compatibility shim,
+  deprecation/migration, architectural rework, or `manual_design_required`.
+  This is the step responsible for saying what should be updated.
+- **Independent review through agent-runtime-kit**: Run a separate reviewer task
+  through the runtime. The reviewer challenges evidence sufficiency, direction
+  inference, plan scope, vendor-specific capability preservation, and whether
+  tests, docs, and migration notes match the proposed change.
+- **Gates pass?**: Apply deterministic pass/fail rules. The gates block
+  implementation when required API diffs are missing, release-note coverage is
+  missing, behavior probes fail or are skipped for required contracts, the
+  reviewer rejects the plan, recursive self-adaptation is unresolved, or manual
+  design is required.
+- **Write report with manual review checklist**: If gates fail, write the local
+  report with the evidence bundle, analysis, decision, reviewer output,
+  uncertainty, blocked reasons, and the exact manual review questions. This is a
+  valid end state, not a failed run.
+- **Apply safe implementation**: Apply only the changes allowed by the accepted
+  architecture decision and deterministic gates. This may include lockfile
+  updates, adapter changes, tests, docs, examples, compatibility shims, or report
+  changes. It must not implement changes that were classified as
+  `manual_design_required`.
+- **Run verification**: Run the verification commands required by the
+  architecture decision. At minimum, this should cover formatting/linting,
+  typing, unit tests, lock checks, report generation checks, and any available
+  live smoke needed for the affected runtime behavior.
+- **Promote updated state to current baseline**: After implementation and
+  verification pass, save the updated lock/package/API/release-note/probe state
+  as the new current-state baseline for the next run. This promotion should be
+  explicit, atomic, and tied to the verified commit or workspace state. Failed,
+  blocked, or manual-design-required runs must not replace the current baseline.
+- **Write report and optional draft PR**: Write the final local report with
+  evidence, decisions, implementation summary, baseline-promotion result, test
+  results, uncertainty, and manual checklist. If explicitly configured and
+  authenticated, create or update a draft PR. This step must never auto-merge.
+
+Every box before direction analysis is deterministic. AI stages may interpret
+evidence, but they should not invent evidence that was not collected.
+
+## Operating Modes
+
+The default command should be report-only:
+
+```bash
+python -m examples.sdk_evolution_agent --runtime fake --refresh-preview
+```
+
+This mode collects evidence, writes artifacts, runs the analysis stages through
+the selected runtime, and stops before editing the workspace. The fake runtime is
+allowed only as a deterministic development harness. It proves the pipeline and
+schemas, not the quality of AI reasoning.
+
+A real analysis run should select one configured runtime:
+
+```bash
+python -m examples.sdk_evolution_agent --runtime claude-agent-sdk --refresh-preview
+python -m examples.sdk_evolution_agent --runtime codex-agent-sdk --refresh-preview
+python -m examples.sdk_evolution_agent --runtime antigravity-agent-sdk --refresh-preview
+```
+
+When `codex-agent-sdk` is selected for SDK update work, every AI-backed stage
+should run on `gpt-5.5` with `reasoning_effort=xhigh`. This is a Codex runtime
+policy, not a portable metadata field: Claude and Antigravity runs should not
+receive a `gpt-5.5` model override.
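+
+As a minimal sketch of that policy, the override could be keyed on the runtime
+name. `ModelPolicy` and `model_policy_for` are hypothetical names used only for
+illustration; they are not part of `agent-runtime-kit`:
+
+```python
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass(frozen=True)
+class ModelPolicy:
+    """Hypothetical container for runtime-specific model settings."""
+
+    model: Optional[str] = None
+    reasoning_effort: Optional[str] = None
+
+
+def model_policy_for(runtime: str) -> ModelPolicy:
+    # Only the Codex runtime receives the override. Other runtimes get an
+    # empty policy and keep their provider-native model selection.
+    if runtime == "codex-agent-sdk":
+        return ModelPolicy(model="gpt-5.5", reasoning_effort="xhigh")
+    return ModelPolicy()
+```
+
+Returning an empty policy for other runtimes, rather than a shared default,
+keeps the Codex setting from leaking into portable task metadata.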
+
+Package filters narrow evidence collection for debugging, but normal evolution
+runs should inspect all tracked packages:
+
+```bash
+python -m examples.sdk_evolution_agent \
+  --runtime antigravity-agent-sdk \
+  --refresh-preview \
+  --package claude-agent-sdk \
+  --package openai-codex \
+  --package openai-codex-cli-bin \
+  --package google-antigravity
+```
+
+`--inspect-candidates` should be effectively always on. The CLI can keep the
+flag for compatibility, but update candidates without candidate API snapshots
+are not actionable.
+
+Implementation mode should remain explicitly gated:
+
+```bash
+python -m examples.sdk_evolution_agent \
+  --runtime antigravity-agent-sdk \
+  --refresh-preview \
+  --implementation-enabled
+```
+
+Even in implementation mode, deterministic gates decide whether edits are
+allowed. Draft PR creation is separate and should only happen when the local Git
+and GitHub environment is authenticated and explicitly configured with
+`--draft-pr`.
+
+## Evidence Layers
+
+The report should clearly separate evidence layers. Mixing them together is how
+bad conclusions slip in.
+
+### 1. Package and Resolver Evidence
+
+The agent checks:
+
+- `pyproject.toml` dependency declarations.
+- `uv.lock` versions.
+- Installed distributions in the local environment.
+- PyPI metadata and recent releases.
+- `uv lock --dry-run -P ...` output with freshness cutoffs removed.
+
+`uv lock --dry-run` is the source of truth for update candidates when it is
+available. PyPI `latest` metadata is useful context, but it can be misleading
+for prerelease packages. For example, a locked prerelease can be newer than the
+stable value reported by package metadata.
+
+### 2. API Shape Evidence
+
+The agent should treat the lockfile as the current SDK baseline. If the active
+Python environment has drifted from `uv.lock`, the agent inspects the locked
+baseline in an isolated virtualenv instead of using the installed package. API
+inspection artifacts are reusable evidence from the last update run when their
+schema, lockfile version, and artifact hashes still match. A normal run starts
+by loading the prior `api_snapshots/` and `api_diffs.json` artifacts, then
+inspects only the packages that need fresh facts:
+
+- packages selected by the resolver for update,
+- packages whose prior artifacts are missing,
+- packages whose prior artifacts were produced by an older evidence schema,
+- packages whose current locked or installed version no longer matches the
+  artifact baseline,
+- packages needed to answer a specific adapter-compatibility question.
+
+For importable packages, snapshots record:
+
+- public member names,
+- member kind,
+- signature where Python introspection can provide one,
+- defining module,
+- import errors.
+
+This catches obvious adapter risks:
+
+- removed classes or functions,
+- changed constructor signatures,
+- changed enum or model surfaces,
+- new provider-specific capabilities worth exposing.
+
+API shape is necessary but insufficient. It does not prove behavior.
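+
+A minimal sketch of the snapshot and diff steps, using only the standard
+library, might look like the following. The function names and the output
+layout are assumptions for illustration, not the example's actual artifact
+schema:
+
+```python
+import importlib
+import inspect
+from typing import Any
+
+
+def snapshot_module(module_name: str) -> dict[str, Any]:
+    """Record the public surface of one importable module."""
+    try:
+        module = importlib.import_module(module_name)
+    except Exception as exc:  # import errors are evidence, not crashes
+        return {"module": module_name, "import_error": repr(exc), "members": {}}
+
+    names = getattr(module, "__all__", None) or [
+        name for name in dir(module) if not name.startswith("_")
+    ]
+    members: dict[str, Any] = {}
+    for name in sorted(names):
+        obj = getattr(module, name, None)
+        if inspect.isclass(obj):
+            kind = "class"
+        elif inspect.isroutine(obj):
+            kind = "function"
+        elif inspect.ismodule(obj):
+            kind = "module"
+        else:
+            kind = "value"
+        try:
+            signature = str(inspect.signature(obj)) if callable(obj) else None
+        except (TypeError, ValueError):
+            signature = None  # some builtins and C callables expose none
+        members[name] = {
+            "kind": kind,
+            "signature": signature,
+            "defining_module": getattr(obj, "__module__", None),
+        }
+    return {"module": module_name, "import_error": None, "members": members}
+
+
+def diff_members(
+    current: dict[str, Any], candidate: dict[str, Any]
+) -> dict[str, list[str]]:
+    """Compare two member maps keyed by public name."""
+    shared = current.keys() & candidate.keys()
+    return {
+        "added": sorted(candidate.keys() - current.keys()),
+        "removed": sorted(current.keys() - candidate.keys()),
+        "changed": sorted(n for n in shared if current[n] != candidate[n]),
+    }
+```
+
+A diff with three empty lists is still a diff object, which matters for the
+gates: an empty diff is valid evidence, while a missing one is not.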
+
+After a successful implementation, the candidate API snapshots and diffs that
+were verified must be promoted to the current-state baseline. That ensures the
+next run compares new upstream candidates against the SDK state that was
+actually accepted, not against stale pre-update artifacts.
+
+If the evidence schema changes, promotion should include a schema refresh of the
+current package state even when the package version did not change. Otherwise
+future runs may compare candidate evidence against artifacts that no longer mean
+the same thing.
+
+### 3. Changelog and Release-Note Evidence
+
+The agent should collect release-note context when a vendor publishes it.
+
+| Package | Preferred source | Why it matters |
+| --- | --- | --- |
+| `claude-agent-sdk` | Python SDK `CHANGELOG.md` and Claude Agent SDK docs | Claude often ships behavioral changes around task progress, sessions, tools, permissions, and model support. |
+| `openai-codex` | Codex SDK docs, Codex changelog, and `openai/codex` releases | Codex changes can involve sandboxing, working directories, remote execution, app-server behavior, and SDK maturity. |
+| `openai-codex-cli-bin` | `openai/codex` releases and package metadata | The binary package is runtime infrastructure, so behavior can change even when the Python SDK surface does not. |
+| `google-antigravity` | Antigravity changelog, repository, package metadata, examples, and public API snapshots | Antigravity release context may be product-level instead of package-version-specific, so the agent must preserve source coverage and uncertainty separately. |
+
+The report should preserve source references and a short excerpt or summary. If
+release notes are unavailable, that absence is evidence and should increase
+uncertainty.
+
+Primary sources should be recorded with URLs in `release_notes.json`:
+
+- Claude Agent SDK Python changelog: `https://github.com/anthropics/claude-agent-sdk-python/blob/main/CHANGELOG.md`
+- Claude Agent SDK docs: `https://code.claude.com/docs/en/agent-sdk/overview`
+- Codex SDK docs: `https://developers.openai.com/codex/sdk`
+- Codex changelog: `https://developers.openai.com/codex/changelog`
+- Codex repository releases: `https://github.com/openai/codex/releases`
+- Antigravity changelog: `https://antigravity.google/changelog`
+- Antigravity repository: `https://github.com/google-antigravity/antigravity-sdk-python`
+
+If a package has no release-note source for the exact version interval, the
+agent should still record what it checked and why the source was insufficient.
+Fetched official sources with no package-version-specific entry are evidence
+with explicit uncertainty; they are not the same as a collection failure.
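+
+The distinction between "fetched but no matching entry" and "could not be
+collected" can be sketched as a small classifier. `SourceCheck`, `Coverage`,
+and the substring match are illustrative assumptions; a real matcher would
+need proper version parsing:
+
+```python
+from dataclasses import dataclass
+from enum import Enum
+
+
+class Coverage(str, Enum):
+    FOUND = "found"
+    MISSING = "missing"  # source fetched, no entry for this version
+    UNAVAILABLE = "unavailable"  # source could not be fetched at all
+
+
+@dataclass(frozen=True)
+class SourceCheck:
+    """Hypothetical record of one attempt to read an official source."""
+
+    url: str
+    fetched: bool
+    text: str = ""
+
+
+def classify(check: SourceCheck, version: str) -> Coverage:
+    if not check.fetched:
+        return Coverage.UNAVAILABLE
+    # Naive substring match, for illustration only.
+    return Coverage.FOUND if version in check.text else Coverage.MISSING
+```
+
+Keeping `missing` and `unavailable` as separate values lets the report attach
+uncertainty to the first without treating it as a collection failure.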
+
+### 4. Behavior Probe Evidence
+
+Behavior probes test what signatures cannot show. They should be deterministic
+where possible and optional-live where credentials are required.
+
+```mermaid
+flowchart LR
+    A["Candidate versions installed"] --> B["Contract tests"]
+    A --> C["Adapter unit probes"]
+    A --> D["Optional live smoke"]
+    B --> E["behavior_probes.json"]
+    C --> E
+    D --> E
+    E --> F["Architecture decision gates"]
+```
+
+Behavior probes should cover these contracts:
+
+| Contract | Why API diffs are not enough | Example probe |
+| --- | --- | --- |
+| Request construction | Constructor signatures can stay stable while fields change meaning. | Assert adapter builds expected SDK options/config objects. |
+| Permission mapping | Permission mode names can stay present while policy behavior changes. | Strict/default/permissive tests for each adapter. |
+| Sandbox and workspace semantics | Behavior can shift across SDK or CLI layers without a Python signature change. | Codex sandbox enum and run argument contract tests, plus smoke where possible. |
+| Streaming and event order | New message types may not break imports but may be dropped. | Feed fake vendor messages and assert emitted event order. |
+| Structured output | Schema fields can exist but runtime may return prose or tool calls. | Live or fake structured-output task with schema validation. |
+| Session/resume | Resume options can exist but behavior may change. | Fake SDK request shape plus optional live resume smoke. |
+| MCP/tool support | MCP config may move from one module to another without a simple signature break. | Adapter MCP config tests and unsupported-feature assertions. |
+| Auth discovery | Supported auth sources differ by vendor and may change independently. | Availability probes that report source without scraping credentials. |
+
+Behavior probe output should be a first-class report artifact, for example:
+
+```text
+behavior_probes.json
+behavior_diffs.json
+```
+
+Each probe result should include:
+
+- probe name,
+- relevant package or adapter,
+- command or test function,
+- pass/fail/skip status,
+- stdout/stderr summary,
+- skipped reason when optional credentials are missing.
+
+`behavior_diffs.json` compares current-environment probes against
+candidate-version probes for resolver-selected updates. Breaking candidate probe
+changes block implementation deterministically before any local lock update.
+
+`behavior_probes.json` may include observed SDK fields or parameters that are
+not part of the adapter contract. `behavior_diffs.json` compares the required
+adapter contract, not every optional field. Public API and signature churn
+remains visible in `api_diffs.json` and probe details, but it should only block
+implementation when the required behavior contract fails or becomes ambiguous.
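+
+One way to express "compare the required contract, not every optional field"
+is sketched below. The probe result layout (`status` and `required` keys) is
+an assumption for illustration, not the actual `behavior_probes.json` schema:
+
+```python
+from typing import Any
+
+
+def breaking_behavior_diffs(
+    current: dict[str, dict[str, Any]],
+    candidate: dict[str, dict[str, Any]],
+) -> list[str]:
+    """Names of required probes that passed before but no longer pass."""
+    breaking = []
+    for name, result in current.items():
+        if not result.get("required", False):
+            continue  # optional observations stay visible but never block
+        after = candidate.get(name, {"status": "missing"})
+        if result["status"] == "pass" and after["status"] != "pass":
+            breaking.append(name)
+    return sorted(breaking)
+```
+
+In this sketch a required probe that is absent or skipped on the candidate
+counts as breaking, which matches the fail-closed intent of the gates.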
+
+### 5. Runtime-Generated Analysis
+
+After deterministic evidence is collected, the AI stages can interpret it:
+
+```mermaid
+sequenceDiagram
+    participant CLI as Local CLI
+    participant Registry as RuntimeRegistry
+    participant Runtime as Selected runtime adapter
+    participant Model as Vendor agent runtime
+
+    CLI->>Registry: resolve(runtime kind)
+    Registry->>Runtime: create adapter
+    CLI->>Runtime: AgentTask(direction-analysis)
+    Runtime->>Model: supported SDK call
+    Model-->>Runtime: structured AgentResult
+    Runtime-->>CLI: validated JSON
+    CLI->>Runtime: AgentTask(architecture-decision)
+    Runtime-->>CLI: validated JSON
+    CLI->>Runtime: AgentTask(review)
+    Runtime-->>CLI: validated JSON
+```
+
+The AI stages should receive compacted, source-referenced evidence. They should
+not be asked to inspect the filesystem directly during report-only analysis.
+
+## Decision Gates
+
+The agent should fail closed. Implementation is blocked when:
+
+- the resolver reports an update but candidate API diffs are missing,
+- release notes exist but were not collected,
+- release notes are unavailable and the API or behavior evidence is ambiguous,
+- behavior probes fail,
+- behavior probes are skipped for a contract that is required for the proposed
+  implementation,
+- the reviewer rejects the evidence or architecture decision,
+- `manual_design_required` is true,
+- recursive self-adaptation is required but no migration plan exists.
+
+An empty API diff can be valid. A missing API diff for an update candidate is
+not valid.
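+
+A fail-closed gate check could be sketched as a function that returns every
+blocking reason. `GateInputs` and its field names are hypothetical, and the
+sketch covers a subset of the gates (it omits the "release notes unavailable
+and evidence ambiguous" case):
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class GateInputs:
+    """Hypothetical summary of deterministic and reviewer facts for one run."""
+
+    update_candidates: set[str] = field(default_factory=set)
+    api_diffs: dict[str, dict] = field(default_factory=dict)
+    uncollected_release_notes: set[str] = field(default_factory=set)
+    failed_probes: set[str] = field(default_factory=set)
+    skipped_required_probes: set[str] = field(default_factory=set)
+    reviewer_approved: bool = False  # fail closed until a reviewer accepts
+    manual_design_required: bool = False
+    self_adaptation_required: bool = False
+    migration_plan_present: bool = False
+
+
+def blocked_reasons(gates: GateInputs) -> list[str]:
+    """Return every reason implementation is blocked; empty means allowed."""
+    reasons = []
+    for package in sorted(gates.update_candidates):
+        # An empty diff ({}) is valid evidence; an absent key is not.
+        if package not in gates.api_diffs:
+            reasons.append(f"missing candidate API diff: {package}")
+    for package in sorted(gates.uncollected_release_notes):
+        reasons.append(f"release notes not collected: {package}")
+    for probe in sorted(gates.failed_probes):
+        reasons.append(f"behavior probe failed: {probe}")
+    for probe in sorted(gates.skipped_required_probes):
+        reasons.append(f"required behavior probe skipped: {probe}")
+    if not gates.reviewer_approved:
+        reasons.append("reviewer rejected or did not approve the plan")
+    if gates.manual_design_required:
+        reasons.append("manual_design_required")
+    if gates.self_adaptation_required and not gates.migration_plan_present:
+        reasons.append("recursive self-adaptation without a migration plan")
+    return reasons
+```
+
+Collecting all reasons, rather than stopping at the first, gives the manual
+review checklist a complete list of what blocked the run.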
+
+## Recursive Self-Adaptation
+
+The SDK evolution agent uses `agent-runtime-kit` to update `agent-runtime-kit`.
+That makes runtime-layer changes recursive.
+
+```mermaid
+flowchart TD
+    A["Upstream SDK change"] --> B["agent-runtime-kit adapter/public API change"]
+    B --> C{"Does SDK evolution agent use the changed contract?"}
+    C -- "No" --> D["Normal adapter/public API change"]
+    C -- "Yes" --> E["Self-adaptation required"]
+    E --> F["Update example runtime usage"]
+    E --> G["Update schemas and prompts"]
+    E --> H["Update behavior probes"]
+    E --> I["Run reviewer through updated runtime"]
+```
+
+If a change affects `AgentTask`, `AgentResult`, `RuntimeRegistry`, runtime
+adapters, output schemas, event sinks, permission profiles, or typed unsupported
+feature errors, the report must call this out explicitly.
+
+## Changelog Source Strategy
+
+The agent should prefer official and primary sources:
+
+- package repository changelog files,
+- official release pages,
+- official docs changelog pages,
+- package metadata links,
+- repository releases.
+
+It should not scrape private credentials or authenticated browser sessions to
+obtain changelogs. If a source requires authentication, the report should mark
+that source unavailable and explain the limitation.
+
+For `claude-agent-sdk`, the Python changelog should be checked first. Claude
+Code and Agent SDK docs are useful supplemental direction-of-travel sources.
+
+For `openai-codex`, the Codex SDK docs and Codex changelog should be checked.
+The `openai/codex` release page is also relevant because the Python SDK depends
+on a bundled or pinned runtime.
+
+For `google-antigravity`, if the official changelog or repository does not have
+a package-version-specific entry, the agent should not pretend the source is
+complete. It should compensate with package metadata, examples, API snapshots,
+adapter contract tests, and live smoke where credentials are available.
+
+## Behavior Probe Strategy
+
+Behavior probes should be split into three tiers.
+
+### Tier 1: Always-On Unit Probes
+
+These use fake SDK objects and do not require credentials. They should run in
+normal CI.
+
+Examples:
+
+- Claude request shape and stream translation tests.
+- Codex approval mode, sandbox, thread item, and tool audit tests.
+- Antigravity permission/tool/MCP config tests.
+- unsupported-feature errors for non-portable options.
+
+### Tier 2: Installed SDK Contract Probes
+
+These introspect real installed SDK packages but do not call models.
+
+Examples:
+
+- `ClaudeAgentOptions` still accepts fields the adapter builds.
+- `openai_codex.AsyncThread.run` still exposes expected parameters.
+- `google.antigravity.LocalAgentConfig` still exposes expected config fields.
+
+These are stronger than raw public snapshots because they encode adapter
+assumptions.
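+
+A contract probe of this kind can be sketched with `inspect.signature`. The
+`FakeOptions` class below is a stand-in so the sketch runs without any vendor
+SDK installed; its parameter names are illustrative, not a real SDK surface:
+
+```python
+import inspect
+
+
+def missing_parameters(callable_obj, expected: set[str]) -> set[str]:
+    """Return expected parameter names the callable no longer declares."""
+    params = inspect.signature(callable_obj).parameters
+    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
+        # **kwargs accepts any keyword, so introspection cannot report a
+        # missing name; a behavior probe is needed instead.
+        return set()
+    return expected - set(params)
+
+
+class FakeOptions:
+    """Stand-in for a vendor options class; not a real SDK type."""
+
+    def __init__(self, model=None, cwd=None, permission_mode=None):
+        self.model = model
+        self.cwd = cwd
+        self.permission_mode = permission_mode
+```
+
+A probe would pass when `missing_parameters(FakeOptions, {"model", "cwd"})` is
+empty and fail, naming the fields, when the adapter builds a field the class
+no longer declares.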
+
+### Tier 3: Optional Live Probes
+
+These use local supported credentials and must never scrape credentials.
+
+Examples:
+
+- Claude one-turn smoke if Claude auth is configured.
+- Codex one-turn smoke using provider-owned local auth.
+- Antigravity structured-output smoke using API key or Google Application
+  Default Credentials.
+
+Live probes should be reported as pass/fail/skip. A skipped live probe should
+not automatically block a docs-only or test-only change, but it should increase
+uncertainty for runtime behavior changes.
+
+## Report Shape
+
+The report directory should include:
+
+```text
+config.json
+evidence.json
+release_notes.json
+api_snapshots/
+api_diffs.json
+behavior_probes.json
+behavior_diffs.json
+current_state.json
+direction_analysis.json
+architecture_decision.json
+implementation_summary.json
+review.json
+events.jsonl
+report.md
+```
+
+`report.md` should summarize:
+
+- package and resolver status,
+- release-note coverage,
+- API diff count and affected packages,
+- behavior probe status,
+- current-state baseline promotion status,
+- direction-of-travel themes,
+- architecture decision,
+- reviewer status,
+- implementation result,
+- uncertainty and manual review checklist.
+
+`current_state.json` should be the manifest that makes the next run
+artifact-aware. It should record:
+
+- evidence schema version,
+- generated timestamp,
+- source run ID,
+- commit SHA or explicit dirty-worktree marker,
+- lockfile hash,
+- package names and accepted current versions,
+- paths or content hashes for current API snapshots,
+- paths or content hashes for release-note evidence,
+- paths or content hashes for behavior probe results,
+- whether the baseline was promoted, refreshed, skipped, or blocked.
+
+Promotion rules should be conservative:
+
+- promote only after implementation and verification pass,
+- do not promote failed, blocked, report-only, or manual-design-required runs as
+  the new current state,
+- preserve the previous baseline so a bad promotion can be inspected,
+- refresh the current-state baseline when the evidence schema changes, even if
+  package versions did not change,
+- make the final report say exactly which artifacts became the new baseline.
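+
+The promotion rules can be sketched as a write-then-rename that keeps the
+previous manifest. The function names, the `.previous.json` naming, and the
+returned status strings are assumptions for illustration:
+
+```python
+import hashlib
+import json
+import os
+import shutil
+from pathlib import Path
+
+
+def file_sha256(path: Path) -> str:
+    """Content hash, usable for the lockfile or any recorded artifact."""
+    return hashlib.sha256(path.read_bytes()).hexdigest()
+
+
+def promote_baseline(state_path: Path, new_state: dict, verified: bool) -> str:
+    """Replace the manifest atomically, keeping the previous copy."""
+    if not verified:
+        return "skipped"  # failed, blocked, or report-only runs never promote
+    if state_path.exists():
+        previous = state_path.with_name(state_path.stem + ".previous.json")
+        shutil.copy2(state_path, previous)  # keep the old baseline inspectable
+    tmp_path = state_path.with_name(state_path.name + ".tmp")
+    tmp_path.write_text(json.dumps(new_state, indent=2, sort_keys=True))
+    os.replace(tmp_path, state_path)  # atomic when on the same filesystem
+    return "promoted"
+```
+
+Writing to a temporary file first means a crash mid-write leaves the old
+`current_state.json` intact instead of a truncated manifest.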
+
+## Caveats and Concerns
+
+Changelogs are incomplete. They often omit small behavior changes and may lag
+package releases.
+
+API snapshots are shallow. Python introspection can miss behavior encoded in
+runtime binaries, generated models, callbacks, subprocesses, environment
+variables, or remote services.
+
+Live probes are environment-sensitive. They prove that one local credential and
+runtime setup worked at one time. They do not replace unit or contract probes.
+
+AI review can be overconfident or overcautious. The reviewer should challenge
+evidence quality, but deterministic gates should own pass/fail decisions for
+missing diffs, failed probes, and missing required release-note evidence.
+
+Provider release documentation differs. Claude may expose rich changelogs. Codex
+may split behavior between SDK docs, changelog, GitHub releases, and CLI
+runtime. Antigravity may expose less written release context.
+
+Prerelease handling matters. Resolver output should drive update candidates
+because package metadata `latest` can point to a stable release while the lock
+already contains a newer prerelease.
+
+## Alternatives Considered
+
+### API Diffs Only
+
+Rejected. API diffs catch import and signature drift, but they do not prove
+behavioral compatibility. Relying on them alone is the current weak point.
+
+### Changelogs Only
+
+Rejected. Changelogs are useful direction evidence, but they are not complete
+and cannot prove local adapter behavior.
+
+### Run Full Live Agents For Every Provider Every Time
+
+Rejected as the default. It is too credential-dependent and would make local
+runs brittle. Live probes should be optional and reported clearly.
+
+### Dependabot-Style Lock Updates
+
+Rejected. The goal is architectural evolution, not generic dependency freshness.
+The agent must reason about provider-specific runtime capabilities and adapter
+contracts.
+
+### Lowest-Common-Denominator Runtime Abstraction
+
+Rejected. The package exists to provide a clean Python API while preserving
+vendor-specific capabilities, not to erase them.
+
+### Separate Agents Per Provider Only
+
+Partially useful but not sufficient. Provider-specific probes are valuable, but
+the top-level agent still needs a cross-provider architecture view so public API
+changes do not accidentally favor one runtime and flatten another.
+
+## Implemented Artifact Contract
+
+The example implements the deterministic evidence artifacts described above:
+
+- `release_notes.json` records official source checks and whether matching
+  version evidence was found, missing, or unavailable.
+- `behavior_probes.json` records current and candidate adapter-contract probes.
+- `behavior_diffs.json` records behavior differences between current and
+  candidate probes.
+- `current_state.json` records the run baseline, lockfile hash, accepted
+  package versions, artifact hashes, and promotion status.
+
+The implementation path is gated by deterministic checks before the local
+lockfile update runs. Missing candidate API diffs, unavailable required
+release-note evidence, breaking behavior diffs, reviewer rejection,
+`manual_design_required`, and unresolved recursive self-adaptation all block
+implementation. When implementation is allowed, the example applies the
+resolver-selected SDK lock update locally, runs verification, and writes the
+report artifacts. With `--draft-pr` it also commits, pushes, and opens a draft PR.
diff --git a/docs/sdk-evolution-agent.md b/docs/sdk-evolution-agent.md
index 263b439..950e00e 100644
--- a/docs/sdk-evolution-agent.md
+++ b/docs/sdk-evolution-agent.md
@@ -4,6 +4,10 @@ The SDK evolution agent is a local dogfood workflow for keeping
 agent-runtime-kit aligned with Claude Agent SDK, OpenAI Codex SDK, and Google
 Antigravity SDK as those upstream packages evolve.
 
+For the intended architecture, evidence contract, behavior probe strategy,
+changelog strategy, caveats, and alternatives, see
+[`docs/sdk-evolution-agent-design.md`](sdk-evolution-agent-design.md).
+
 Run it from the repository:
 
 ```bash
@@ -32,6 +36,13 @@ directory is created with private permissions before the Codex runtime starts;
 authenticate that Codex home through supported Codex login/API-key/access-token
 flows before using it for real Codex-backed runs.
 
+Codex-backed SDK evolution runs explicitly choose `gpt-5.5` with
+`reasoning_effort=xhigh` for the AI stages that analyze direction, decide the
+update plan, implement allowed changes, and review the result. This model policy
+is applied only to `codex-agent-sdk`; Claude and Antigravity runs keep their
+provider-native model selection because `gpt-5.5` is not a valid model override
+for those adapters.
+
 For Antigravity, local auth can use `GEMINI_API_KEY` / `GOOGLE_API_KEY` or
 Google Application Default Credentials. ADC runs use Vertex AI config; provide a
 project through ADC, `GOOGLE_CLOUD_PROJECT`, or `GCLOUD_PROJECT`, and optionally
@@ -46,8 +57,12 @@ Each run writes a timestamped directory under `reports/sdk-evolution/` with:
 
 - `config.json`
 - `evidence.json`
+- `release_notes.json`
 - `api_snapshots/`
 - `api_diffs.json`
+- `behavior_probes.json`
+- `behavior_diffs.json`
+- `current_state.json`
 - `direction_analysis.json`
 - `architecture_decision.json`
 - `implementation_summary.json`
@@ -56,7 +71,8 @@ Each run writes a timestamped directory under `reports/sdk-evolution/` with:
 - `report.md`
 
 The report separates deterministic facts from runtime-generated analysis and
-calls out uncertainty, recursive self-adaptation impact, implementation status,
+calls out uncertainty, release-note coverage, API diffs, behavior diffs,
+baseline promotion, recursive self-adaptation impact, implementation status,
 test results, reviewer output, and manual review items.
 
 ## Upstream Freshness
@@ -76,10 +92,28 @@ cutoff variables must not hide candidate releases.
 
 ## Candidate API Inspection
 
-By default, the command snapshots SDK APIs importable in the current
-environment. Use `--inspect-candidates` to install latest candidate SDK versions
-in temporary isolated virtualenvs for API snapshots and diffs. This avoids
-mutating the project lockfile or working environment.
+The command treats `uv.lock` as the current baseline. If the active `.venv`
+contains a different installed version, the agent inspects the locked baseline
+in a temporary isolated virtualenv instead of trusting the drifted environment.
+When a refresh preview is available, package update candidates come from the
+resolver's `uv lock --dry-run -P ...` output, not only from PyPI's `latest`
+metadata. For each resolver update candidate, the agent installs the target
+version in a temporary isolated virtualenv and writes an API snapshot plus
+`api_diffs.json` entry. This avoids false downgrade diffs for packages whose
+locked prerelease is newer than PyPI's stable latest field. Candidate inspection
+is always enabled for update candidates; `--inspect-candidates` remains accepted
+only for CLI compatibility.
+
+If `uv lock --dry-run -P ...` reports an SDK update but the run cannot produce a
+candidate-version API diff for that package, implementation is blocked and the
+architecture decision is marked `manual_design_required`. An empty added /
+removed / changed diff is valid; a missing diff object is not.
+
+Behavior probes intentionally separate observed SDK surface churn from adapter
+contract breakage. `behavior_probes.json` records fields and parameters seen in
+current and candidate packages, while `behavior_diffs.json` compares the
+required adapter contract. Optional field changes remain visible in the report
+and API diffs, but only breaking adapter-contract diffs block implementation.
 
 ## Implementation Gates
 
@@ -93,6 +127,9 @@ Implementation is still blocked when:
 
 - the architecture decision sets `manual_design_required`,
 - the reviewer rejects the evidence or design,
+- a resolver-selected update lacks a candidate API diff,
+- required release-note evidence could not be collected,
+- candidate behavior probes show a breaking adapter-contract difference,
 - required structured output or permission behavior is unsupported by the
   selected runtime,
 - recursive self-adaptation is required but no safe migration plan exists.
@@ -114,8 +151,13 @@ python -m examples.sdk_evolution_agent \
   --implementation-enabled \
   --create-branch \
   --branch-name sdk-evolution-update \
+  --pr-base main \
   --draft-pr
 ```
 
+When `--draft-pr` is set, the agent stages `uv.lock` and the run report
+directory, commits them using the `--commit-message` text, pushes the branch,
+and opens a draft PR with `gh`, targeting `--pr-base` when it is provided.
+
 The command uses local Git and `gh` authentication. It never auto-merges,
 auto-publishes, or scrapes unsupported credentials.
diff --git a/examples/sdk_evolution_agent/behavior.py b/examples/sdk_evolution_agent/behavior.py
new file mode 100644
index 0000000..2847111
--- /dev/null
+++ b/examples/sdk_evolution_agent/behavior.py
@@ -0,0 +1,484 @@
+"""Behavior and adapter-contract probes for SDK evolution runs."""
+
+from __future__ import annotations
+
+import importlib
+import importlib.metadata
+import inspect
+import json
+import subprocess
+import sys
+import tempfile
+import textwrap
+from collections.abc import Mapping, Sequence
+from pathlib import Path
+from typing import Any
+
+from examples.sdk_evolution_agent.models import BehaviorDiff, BehaviorProbeResult
+from examples.sdk_evolution_agent.snapshots import DEFAULT_MODULES
+
+
+def collect_behavior_evidence(
+    packages: Sequence[Mapping[str, object]],
+    update_versions: Mapping[str, str],
+) -> dict[str, Any]:
+    """Collect current/candidate behavior probes and compare them."""
+
+    results: list[BehaviorProbeResult] = []
+    for package in packages:
+        name = str(package.get("name") or "")
+        if not name:
+            continue
+        locked_version = _string_or_none(package.get("locked_version"))
+        installed_version = _string_or_none(package.get("installed_version"))
+        current_version = locked_version or installed_version
+        if locked_version and installed_version and locked_version != installed_version:
+            results.extend(probe_candidate_in_venv(name, locked_version, scope="current-baseline"))
+        else:
+            results.extend(probe_current_package(name, version=current_version))
+        candidate = update_versions.get(name)
+        if candidate:
+            results.extend(probe_candidate_in_venv(name, candidate, scope="candidate"))
+    diffs = diff_behavior_results(results)
+    return {
+        "results": list(results),
+        "diffs": list(diffs),
+        "summary": summarize_behavior(diffs),
+    }
+
+
+def probe_current_package(
+    package: str,
+    *,
+    version: str | None = None,
+) -> tuple[BehaviorProbeResult, ...]:
+    """Run behavior probes against the current Python environment."""
+
+    return tuple(_probe_package(package, version=version, scope="current-environment"))
+
+
+def probe_candidate_in_venv(
+    package: str,
+    version: str,
+    *,
+    scope: str = "candidate",
+    python: str = sys.executable,
+    timeout: int = 300,
+) -> tuple[BehaviorProbeResult, ...]:
+    """Run behavior probes against a pinned package version in an isolated virtualenv."""
+
+    with tempfile.TemporaryDirectory(prefix="ark-sdk-behavior-") as directory:
+        venv = Path(directory) / ".venv"
+        subprocess.run((python, "-m", "venv", str(venv)), check=True, timeout=timeout)
+        bin_dir = "Scripts" if sys.platform == "win32" else "bin"
+        venv_python = venv / bin_dir / "python"
+        subprocess.run(
+            (str(venv_python), "-m", "pip", "install", f"{package}=={version}"),
+            check=True,
+            text=True,
+            capture_output=True,
+            timeout=timeout,
+        )
+        completed = subprocess.run(
+            (str(venv_python), "-c", _PROBE_SCRIPT, package, version, scope),
+            check=True,
+            text=True,
+            capture_output=True,
+            timeout=timeout,
+        )
+    raw = json.loads(completed.stdout)
+    return tuple(BehaviorProbeResult(**item) for item in raw)
+
+
+def diff_behavior_results(results: Sequence[BehaviorProbeResult]) -> tuple[BehaviorDiff, ...]:
+    """Compare current and candidate behavior probes for each package/probe."""
+
+    grouped: dict[tuple[str, str], dict[str, BehaviorProbeResult]] = {}
+    for result in results:
+        grouped.setdefault((result.package, result.probe), {})[result.scope] = result
+    diffs: list[BehaviorDiff] = []
+    for (package, probe), scopes in sorted(grouped.items()):
+        before = scopes.get("current-baseline") or scopes.get("current-environment")
+        after = scopes.get("candidate") or scopes.get("isolated-venv")
+        if before is None or after is None:
+            continue
+        if before.status == after.status and _contract_details(before) == _contract_details(after):
+            severity = "none"
+            summary = "No behavior contract difference detected."
+        elif before.status == "pass" and after.status != "pass":
+            severity = "breaking"
+            summary = f"Candidate probe changed from pass to {after.status}."
+        elif before.status != after.status:
+            severity = "changed"
+            summary = f"Probe status changed from {before.status} to {after.status}."
+        else:
+            severity = "changed"
+            summary = "Probe details changed while status stayed the same."
+        diffs.append(
+            BehaviorDiff(
+                package=package,
+                from_version=before.version,
+                to_version=after.version,
+                probe=probe,
+                severity=severity,
+                summary=summary,
+                before_status=before.status,
+                after_status=after.status,
+            )
+        )
+    return tuple(diffs)
+
+
+def summarize_behavior(diffs: Sequence[BehaviorDiff]) -> dict[str, Any]:
+    """Return a compact behavior summary for reports and gates."""
+
+    breaking = [diff for diff in diffs if diff.severity == "breaking"]
+    changed = [diff for diff in diffs if diff.severity == "changed"]
+    return {
+        "breaking_count": len(breaking),
+        "changed_count": len(changed),
+        "unchanged_count": len([diff for diff in diffs if diff.severity == "none"]),
+        "status": "fail" if breaking else "changed" if changed else "pass",
+    }
+
+
+def _probe_package(
+    package: str,
+    *,
+    version: str | None,
+    scope: str,
+) -> tuple[BehaviorProbeResult, ...]:
+    if package == "claude-agent-sdk":
+        return (_probe_claude(version=version, scope=scope),)
+    if package == "openai-codex":
+        return (_probe_codex(version=version, scope=scope),)
+    if package == "openai-codex-cli-bin":
+        return (_probe_codex_cli_bin(version=version, scope=scope),)
+    if package == "google-antigravity":
+        return (_probe_antigravity(version=version, scope=scope),)
+    return (
+        BehaviorProbeResult(
+            package=package,
+            version=version,
+            scope=scope,
+            probe="package-import",
+            status="skip",
+            summary="No behavior probe is defined for this package.",
+        ),
+    )
+
+
+def _probe_claude(*, version: str | None, scope: str) -> BehaviorProbeResult:
+    package = "claude-agent-sdk"
+    try:
+        module = importlib.import_module("claude_agent_sdk")
+        options_cls = module.ClaudeAgentOptions
+    except Exception as exc:
+        return _failed(package, version, scope, "adapter-contract", exc)
+    fields = _fields(options_cls)
+    expected = {
+        "model",
+        "allowed_tools",
+        "disallowed_tools",
+        "permission_mode",
+        "system_prompt",
+        "cwd",
+        "mcp_servers",
+        "resume",
+        "env",
+        "max_budget_usd",
+        "output_format",
+    }
+    missing = sorted(expected - fields)
+    return BehaviorProbeResult(
+        package=package,
+        version=version,
+        scope=scope,
+        probe="adapter-contract",
+        status="fail" if missing else "pass",
+        summary=(
+            "ClaudeAgentOptions exposes required adapter fields."
+            if not missing
+            else "ClaudeAgentOptions is missing required adapter fields."
+        ),
+        details={"fields": sorted(fields), "required_fields": sorted(expected), "missing": missing},
+    )
+
+
+def _probe_codex(*, version: str | None, scope: str) -> BehaviorProbeResult:
+    package = "openai-codex"
+    try:
+        module = importlib.import_module("openai_codex")
+        run_params = set(inspect.signature(module.AsyncThread.run).parameters)
+        start_params = set(inspect.signature(module.AsyncCodex.thread_start).parameters)
+    except Exception as exc:
+        return _failed(package, version, scope, "adapter-contract", exc)
+    expected_run = {"cwd", "model", "approval_mode", "sandbox", "output_schema", "effort"}
+    expected_start = {"developer_instructions", "cwd", "model", "approval_mode", "sandbox"}
+    missing_run = sorted(expected_run - run_params)
+    missing_start = sorted(expected_start - start_params)
+    missing = missing_run + [f"thread_start.{item}" for item in missing_start]
+    return BehaviorProbeResult(
+        package=package,
+        version=version,
+        scope=scope,
+        probe="adapter-contract",
+        status="fail" if missing else "pass",
+        summary=(
+            "Codex thread APIs expose required adapter parameters."
+            if not missing
+            else "Codex thread APIs are missing required adapter parameters."
+        ),
+        details={
+            "run_params": sorted(run_params),
+            "start_params": sorted(start_params),
+            "required_run_params": sorted(expected_run),
+            "required_start_params": sorted(expected_start),
+            "missing": missing,
+        },
+    )
+
+
+def _probe_codex_cli_bin(*, version: str | None, scope: str) -> BehaviorProbeResult:
+    package = "openai-codex-cli-bin"
+    try:
+        installed = importlib.metadata.version(package)
+    except Exception as exc:
+        return _failed(package, version, scope, "binary-distribution", exc)
+    return BehaviorProbeResult(
+        package=package,
+        version=version or installed,
+        scope=scope,
+        probe="binary-distribution",
+        status="pass",
+        summary="Codex CLI binary distribution metadata is available.",
+        details={"installed_version": installed},
+    )
+
+
+def _probe_antigravity(*, version: str | None, scope: str) -> BehaviorProbeResult:
+    package = "google-antigravity"
+    try:
+        importlib.import_module(DEFAULT_MODULES[package])
+        importlib.import_module("google.antigravity.types")
+        importlib.import_module("google.antigravity.agent")
+        importlib.import_module("google.antigravity.hooks.policy")
+        config_module = importlib.import_module(
+            "google.antigravity.connections.local.local_connection_config"
+        )
+        config_cls = config_module.LocalAgentConfig
+    except Exception as exc:
+        return _failed(package, version, scope, "adapter-contract", exc)
+    fields = _fields(config_cls)
+    expected = {
+        "model",
+        "api_key",
+        "vertex",
+        "project",
+        "location",
+        "system_instructions",
+        "capabilities",
+        "policies",
+        "workspaces",
+        "conversation_id",
+        "save_dir",
+        "app_data_dir",
+        "response_schema",
+        "mcp_servers",
+    }
+    missing = sorted(expected - fields)
+    return BehaviorProbeResult(
+        package=package,
+        version=version,
+        scope=scope,
+        probe="adapter-contract",
+        status="fail" if missing else "pass",
+        summary=(
+            "Antigravity LocalAgentConfig exposes required adapter fields."
+            if not missing
+            else "Antigravity LocalAgentConfig is missing required adapter fields."
+        ),
+        details={"fields": sorted(fields), "required_fields": sorted(expected), "missing": missing},
+    )
+
+
+def _fields(cls: Any) -> set[str]:
+    if hasattr(cls, "model_fields"):
+        return set(cls.model_fields)
+    if hasattr(cls, "__dataclass_fields__"):
+        return set(cls.__dataclass_fields__)
+    try:
+        return set(inspect.signature(cls).parameters)
+    except (TypeError, ValueError):
+        return set()
+
+
+def _contract_details(result: BehaviorProbeResult) -> dict[str, Any]:
+    if result.probe != "adapter-contract":
+        return result.details
+    details = result.details
+    if "missing" not in details:
+        return details
+    contract: dict[str, Any] = {"missing": sorted(details.get("missing") or [])}
+    if "required_fields" in details:
+        contract["required_fields"] = sorted(details.get("required_fields") or [])
+    if "required_run_params" in details:
+        contract["required_run_params"] = sorted(details.get("required_run_params") or [])
+    if "required_start_params" in details:
+        contract["required_start_params"] = sorted(details.get("required_start_params") or [])
+    return contract
+
+
+def _failed(
+    package: str,
+    version: str | None,
+    scope: str,
+    probe: str,
+    exc: Exception,
+) -> BehaviorProbeResult:
+    return BehaviorProbeResult(
+        package=package,
+        version=version,
+        scope=scope,
+        probe=probe,
+        status="fail",
+        summary=str(exc),
+        details={"error": str(exc)},
+    )
+
+
+def _string_or_none(value: object) -> str | None:
+    if value is None:
+        return None
+    text = str(value)
+    return text or None
+
+
+_PROBE_SCRIPT = textwrap.dedent(
+    """
+    import importlib
+    import importlib.metadata
+    import inspect
+    import json
+    import sys
+
+    package, version, scope = sys.argv[1:4]
+
+    def fields(cls):
+        if hasattr(cls, "model_fields"):
+            return set(cls.model_fields)
+        if hasattr(cls, "__dataclass_fields__"):
+            return set(cls.__dataclass_fields__)
+        try:
+            return set(inspect.signature(cls).parameters)
+        except (TypeError, ValueError):
+            return set()
+
+    def failed(probe, exc):
+        return {
+            "package": package,
+            "version": version,
+            "scope": scope,
+            "probe": probe,
+            "status": "fail",
+            "summary": str(exc),
+            "details": {"error": str(exc)},
+        }
+
+    def result(probe, status, summary, details):
+        return {
+            "package": package,
+            "version": version,
+            "scope": scope,
+            "probe": probe,
+            "status": status,
+            "summary": summary,
+            "details": details,
+        }
+
+    try:
+        if package == "claude-agent-sdk":
+            module = importlib.import_module("claude_agent_sdk")
+            option_fields = fields(getattr(module, "ClaudeAgentOptions"))
+            expected = {
+                "model", "allowed_tools", "disallowed_tools", "permission_mode",
+                "system_prompt", "cwd", "mcp_servers", "resume", "env",
+                "max_budget_usd", "output_format",
+            }
+            missing = sorted(expected - option_fields)
+            payload = [result(
+                "adapter-contract",
+                "fail" if missing else "pass",
+                "ClaudeAgentOptions exposes required adapter fields." if not missing
+                else "ClaudeAgentOptions is missing required adapter fields.",
+                {
+                    "fields": sorted(option_fields),
+                    "required_fields": sorted(expected),
+                    "missing": missing,
+                },
+            )]
+        elif package == "openai-codex":
+            module = importlib.import_module("openai_codex")
+            run_params = set(inspect.signature(module.AsyncThread.run).parameters)
+            start_params = set(inspect.signature(module.AsyncCodex.thread_start).parameters)
+            expected_run = {"cwd", "model", "approval_mode", "sandbox", "output_schema", "effort"}
+            expected_start = {"developer_instructions", "cwd", "model", "approval_mode", "sandbox"}
+            missing_run = sorted(expected_run - run_params)
+            missing_start = sorted(expected_start - start_params)
+            missing = missing_run + [f"thread_start.{item}" for item in missing_start]
+            payload = [result(
+                "adapter-contract",
+                "fail" if missing else "pass",
+                "Codex thread APIs expose required adapter parameters." if not missing
+                else "Codex thread APIs are missing required adapter parameters.",
+                {
+                    "run_params": sorted(run_params),
+                    "start_params": sorted(start_params),
+                    "required_run_params": sorted(expected_run),
+                    "required_start_params": sorted(expected_start),
+                    "missing": missing,
+                },
+            )]
+        elif package == "openai-codex-cli-bin":
+            installed = importlib.metadata.version(package)
+            payload = [result(
+                "binary-distribution",
+                "pass",
+                "Codex CLI binary distribution metadata is available.",
+                {"installed_version": installed},
+            )]
+        elif package == "google-antigravity":
+            importlib.import_module("google.antigravity")
+            importlib.import_module("google.antigravity.types")
+            importlib.import_module("google.antigravity.agent")
+            importlib.import_module("google.antigravity.hooks.policy")
+            config_module = importlib.import_module(
+                "google.antigravity.connections.local.local_connection_config"
+            )
+            config_fields = fields(getattr(config_module, "LocalAgentConfig"))
+            expected = {
+                "model", "api_key", "vertex", "project", "location",
+                "system_instructions", "capabilities", "policies", "workspaces",
+                "conversation_id", "save_dir", "app_data_dir", "response_schema",
+                "mcp_servers",
+            }
+            missing = sorted(expected - config_fields)
+            payload = [result(
+                "adapter-contract",
+                "fail" if missing else "pass",
+                "Antigravity LocalAgentConfig exposes required adapter fields." if not missing
+                else "Antigravity LocalAgentConfig is missing required adapter fields.",
+                {
+                    "fields": sorted(config_fields),
+                    "required_fields": sorted(expected),
+                    "missing": missing,
+                },
+            )]
+        else:
+            payload = [result("package-import", "skip", "No behavior probe is defined.", {})]
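+    except Exception as exc:
+        # Match the in-process probe names so diffs group failures with passes.
+        probe = (
+            "binary-distribution" if package == "openai-codex-cli-bin" else "adapter-contract"
+        )
+        payload = [failed(probe, exc)]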
+
+    print(json.dumps(payload, sort_keys=True))
+    """
+).strip()
diff --git a/examples/sdk_evolution_agent/cli.py b/examples/sdk_evolution_agent/cli.py
index 14e57d7..d0a6e1c 100644
--- a/examples/sdk_evolution_agent/cli.py
+++ b/examples/sdk_evolution_agent/cli.py
@@ -3,17 +3,22 @@
 from __future__ import annotations
 
 import argparse
+import re
+from dataclasses import replace
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import Any
 
 from agent_runtime_kit import AgentRuntime, RuntimeRegistry
+from examples.sdk_evolution_agent.behavior import collect_behavior_evidence
 from examples.sdk_evolution_agent.collectors import (
     CommandRunner,
     PypiClient,
     collect_evidence,
+    run_lock_update,
     run_verification_commands,
 )
+from examples.sdk_evolution_agent.current_state import build_current_state
 from examples.sdk_evolution_agent.events import JsonlEventSink
 from examples.sdk_evolution_agent.models import (
     DEFAULT_PACKAGES,
@@ -21,7 +26,15 @@
     RunOptions,
     to_jsonable,
 )
-from examples.sdk_evolution_agent.pr import build_draft_pr_body, create_branch, create_draft_pr
+from examples.sdk_evolution_agent.pr import (
+    build_draft_pr_body,
+    commit_staged,
+    create_branch,
+    create_draft_pr,
+    push_branch,
+    stage_paths,
+)
+from examples.sdk_evolution_agent.release_notes import collect_release_notes
 from examples.sdk_evolution_agent.report import write_run_report
 from examples.sdk_evolution_agent.snapshots import (
     diff_snapshot_groups,
@@ -34,6 +47,13 @@
     run_analysis_pipeline,
 )
 
+DEFAULT_VERIFICATION_COMMANDS = (
+    "uv run ruff check .",
+    "uv run mypy",
+    "uv run pytest",
+    "uv lock --check",
+)
+
 
 async def main(argv: list[str] | None = None) -> int:
     """Parse CLI args and run the agent."""
@@ -78,11 +98,21 @@ def parse_args(argv: list[str] | None = None) -> RunOptions:
     parser.add_argument(
         "--inspect-candidates",
         action="store_true",
-        help="Inspect latest candidate SDK versions in temporary virtualenvs.",
+        default=True,
+        help=(
+            "Inspect resolver-selected candidate SDK versions in temporary virtualenvs. "
+            "Always enabled for update candidates; accepted for compatibility."
+        ),
     )
     parser.add_argument("--create-branch", action="store_true", help="Create a local branch first.")
     parser.add_argument("--branch-name", help="Branch name for optional branch creation.")
     parser.add_argument("--draft-pr", action="store_true", help="Create a draft PR with gh.")
+    parser.add_argument("--pr-base", help="Base branch for optional draft PR creation.")
+    parser.add_argument(
+        "--commit-message",
+        default="Run SDK evolution update",
+        help="Commit message for optional autonomous SDK update PR.",
+    )
     parser.add_argument(
         "--pr-title",
         default="Adapt agent-runtime-kit to upstream SDK evolution",
@@ -100,6 +130,8 @@ def parse_args(argv: list[str] | None = None) -> RunOptions:
         create_branch=args.create_branch,
         branch_name=args.branch_name,
         draft_pr=args.draft_pr,
+        pr_base=args.pr_base,
+        commit_message=args.commit_message,
         pr_title=args.pr_title,
     )
 
@@ -114,6 +146,7 @@ async def run_agent(
 ) -> Path:
     """Run the full local SDK evolution workflow."""
 
+    options = replace(options, inspect_candidates=True)
     run_id = datetime.now(tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ")
     report_root = (options.workspace / options.report_dir / run_id).resolve()
     event_log_path = report_root / "events.jsonl"
@@ -129,6 +162,18 @@ async def run_agent(
         event_sink=event_sink,
     )
     selected_runtime = runtime or resolve_runtime(options.runtime, registry=registry)
+    pre_run_results: list[dict[str, Any]] = []
+    if options.create_branch and options.branch_name:
+        branch_result = create_branch(
+            options.workspace,
+            options.branch_name,
+            command_runner=command_runner,
+        )
+        pre_run_results.append(to_jsonable(branch_result))
+        if branch_result.returncode != 0:
+            raise RuntimeError(
+                f"failed to create branch {options.branch_name}: {branch_result.stderr}"
+            )
     evidence = collect_evidence(
         options.workspace,
         packages=options.packages,
@@ -136,12 +181,20 @@ async def run_agent(
         pypi_client=pypi_client,
         command_runner=command_runner,
     )
-    snapshots = _collect_snapshots(evidence, inspect_candidates=options.inspect_candidates)
+    update_versions = _refresh_update_versions(evidence)
+    snapshots = _collect_snapshots(evidence)
     api_diffs = [to_jsonable(diff) for diff in diff_snapshot_groups(snapshots)]
+    release_notes = [
+        to_jsonable(item)
+        for item in collect_release_notes(evidence.get("packages", []), update_versions)
+    ]
+    behavior = to_jsonable(collect_behavior_evidence(evidence.get("packages", []), update_versions))
     direction, architecture, review = await run_analysis_pipeline(
         selected_runtime,
         evidence=evidence,
         api_diffs=api_diffs,
+        release_notes=release_notes,
+        behavior=behavior,
         context=RunContext(
             run_id=context.run_id,
             workspace=context.workspace,
@@ -161,76 +214,293 @@ async def run_agent(
         review=review,
         context=context,
     )
+    implementation.setdefault("verification_results", []).extend(pre_run_results)
     config = to_jsonable(options)
     config["run_id"] = run_id
     config["event_log_path"] = str(context.event_log_path)
-    report_path = write_run_report(
+
+    if options.implementation_enabled and implementation.get("allowed"):
+        implementation = _run_local_sdk_update(
+            options,
+            update_versions=update_versions,
+            implementation=implementation,
+            command_runner=command_runner,
+        )
+
+    promoted = bool(implementation.get("applied")) and _verification_passed(implementation)
+    current_state: dict[str, Any] = {
+        "promotion": {
+            "promoted": False,
+            "status": "pending-report-write",
+        }
+    }
+    report_path = _write_full_report(
         context,
         config=config,
         evidence=evidence,
         snapshots=[to_jsonable(snapshot) for snapshot in snapshots],
         api_diffs=api_diffs,
+        release_notes=release_notes,
+        behavior=behavior,
+        current_state=current_state,
         direction=direction,
         architecture=architecture,
         implementation=implementation,
         review=review,
     )
-    optional_results_changed = False
-    if options.implementation_enabled and implementation.get("applied"):
-        verification_results = run_verification_commands(
-            options.workspace,
-            tuple(str(item) for item in architecture.get("verification_commands", [])),
-            command_runner=command_runner,
-        )
-        implementation.setdefault("verification_results", []).extend(
-            to_jsonable(verification_results)
-        )
-        optional_results_changed = True
-    if options.create_branch and options.branch_name:
-        branch_result = create_branch(
-            options.workspace,
-            options.branch_name,
-            command_runner=command_runner,
-        )
-        implementation.setdefault("verification_results", []).append(to_jsonable(branch_result))
-        optional_results_changed = True
+    current_state = build_current_state(
+        context,
+        promoted=promoted,
+        status="promoted" if promoted else str(implementation.get("blocked_reason") or "skipped"),
+        implementation=implementation,
+    )
+    report_path = _write_full_report(
+        context,
+        config=config,
+        evidence=evidence,
+        snapshots=[to_jsonable(snapshot) for snapshot in snapshots],
+        api_diffs=api_diffs,
+        release_notes=release_notes,
+        behavior=behavior,
+        current_state=current_state,
+        direction=direction,
+        architecture=architecture,
+        implementation=implementation,
+        review=review,
+    )
+
     if options.draft_pr:
-        body = build_draft_pr_body(report_path.read_text(encoding="utf-8"))
-        pr_result = create_draft_pr(
+        git_results = _create_autonomous_pr(
             options.workspace,
-            title=options.pr_title,
-            body=body,
+            report_path=report_path,
+            options=options,
             command_runner=command_runner,
         )
-        implementation.setdefault("verification_results", []).append(to_jsonable(pr_result))
-        optional_results_changed = True
-    else:
-        body = None
-    if optional_results_changed:
-        report_path = write_run_report(
+        implementation.setdefault("verification_results", []).extend(git_results)
+        report_path = _write_full_report(
             context,
             config=config,
             evidence=evidence,
             snapshots=[to_jsonable(snapshot) for snapshot in snapshots],
             api_diffs=api_diffs,
+            release_notes=release_notes,
+            behavior=behavior,
+            current_state=current_state,
             direction=direction,
             architecture=architecture,
             implementation=implementation,
             review=review,
-            pr_body=body,
+        )
+        _commit_final_autonomous_pr_report(
+            options.workspace,
+            report_path=report_path,
+            options=options,
+            command_runner=command_runner,
         )
     return report_path
 
 
-def _collect_snapshots(evidence: dict[str, Any], *, inspect_candidates: bool) -> list[Any]:
+def _run_local_sdk_update(
+    options: RunOptions,
+    *,
+    update_versions: dict[str, str],
+    implementation: dict[str, Any],
+    command_runner: CommandRunner | None,
+) -> dict[str, Any]:
+    packages = tuple(sorted(update_versions))
+    if not packages:
+        return {
+            **implementation,
+            "applied": False,
+            "blocked_reason": "no resolver-selected SDK updates",
+        }
+    update_result = run_lock_update(
+        options.workspace,
+        packages,
+        command_runner=command_runner,
+    )
+    results = list(implementation.get("verification_results") or [])
+    results.append(to_jsonable(update_result))
+    applied = update_result.returncode == 0
+    changes = list(implementation.get("changes") or [])
+    if applied:
+        changes.append("Updated uv.lock for resolver-selected SDK packages: " + ", ".join(packages))
+        verification_commands = tuple(DEFAULT_VERIFICATION_COMMANDS)
+        verification_results = run_verification_commands(
+            options.workspace,
+            verification_commands,
+            command_runner=command_runner,
+        )
+        results.extend(to_jsonable(verification_results))
+    return {
+        **implementation,
+        "applied": applied,
+        "changes": changes,
+        "verification_results": results,
+        "blocked_reason": "" if applied else (update_result.stderr or update_result.stdout),
+    }
+
+
+def _write_full_report(
+    context: RunContext,
+    *,
+    config: dict[str, Any],
+    evidence: dict[str, Any],
+    snapshots: list[dict[str, Any]],
+    api_diffs: list[dict[str, Any]],
+    release_notes: list[dict[str, Any]],
+    behavior: dict[str, Any],
+    current_state: dict[str, Any],
+    direction: dict[str, Any],
+    architecture: dict[str, Any],
+    implementation: dict[str, Any],
+    review: dict[str, Any],
+    pr_body: str | None = None,
+) -> Path:
+    return write_run_report(
+        context,
+        config=config,
+        evidence=evidence,
+        snapshots=snapshots,
+        api_diffs=api_diffs,
+        release_notes=release_notes,
+        behavior=behavior,
+        current_state=current_state,
+        direction=direction,
+        architecture=architecture,
+        implementation=implementation,
+        review=review,
+        pr_body=pr_body,
+    )
+
+
+def _verification_passed(implementation: dict[str, Any]) -> bool:
+    results = implementation.get("verification_results")
+    if not isinstance(results, list):
+        return False
+    command_results = [item for item in results if isinstance(item, dict) and "returncode" in item]
+    return bool(command_results) and all(
+        int(item.get("returncode", 1)) == 0 for item in command_results
+    )
+
+
+def _create_autonomous_pr(
+    root: Path,
+    *,
+    report_path: Path,
+    options: RunOptions,
+    command_runner: CommandRunner | None,
+) -> list[dict[str, Any]]:
+    branch_name = options.branch_name or _current_branch(root, command_runner=command_runner)
+    body = build_draft_pr_body(report_path.read_text(encoding="utf-8"))
+    relative_report = _relative_path(root, report_path.parent)
+    paths = ("uv.lock", relative_report)
+    results = [
+        to_jsonable(stage_paths(root, paths, command_runner=command_runner)),
+        to_jsonable(
+            commit_staged(
+                root,
+                message=options.commit_message,
+                command_runner=command_runner,
+            )
+        ),
+    ]
+    if branch_name:
+        results.append(
+            to_jsonable(push_branch(root, branch_name=branch_name, command_runner=command_runner))
+        )
+    results.append(
+        to_jsonable(
+            create_draft_pr(
+                root,
+                title=options.pr_title,
+                body=body,
+                base=options.pr_base,
+                head=branch_name,
+                command_runner=command_runner,
+            )
+        )
+    )
+    return results
+
+
+def _commit_final_autonomous_pr_report(
+    root: Path,
+    *,
+    report_path: Path,
+    options: RunOptions,
+    command_runner: CommandRunner | None,
+) -> None:
+    branch_name = options.branch_name or _current_branch(root, command_runner=command_runner)
+    relative_report = _relative_path(root, report_path.parent)
+    results = [
+        stage_paths(root, (relative_report,), command_runner=command_runner),
+        commit_staged(
+            root,
+            message="Finalize SDK evolution report",
+            command_runner=command_runner,
+        ),
+    ]
+    if branch_name:
+        results.append(push_branch(root, branch_name=branch_name, command_runner=command_runner))
+    failed = [result for result in results if result.returncode != 0]
+    if failed:
+        detail = failed[0].stderr or failed[0].stdout
+        raise RuntimeError(f"failed to commit final autonomous PR report: {detail}")
+
+
+def _current_branch(root: Path, *, command_runner: CommandRunner | None) -> str:
+    runner = command_runner
+    if runner is None:
+        from examples.sdk_evolution_agent.collectors import run_command
+
+        runner = run_command
+    result = runner(("git", "branch", "--show-current"), cwd=root)
+    return result.stdout.strip() if result.returncode == 0 else ""
+
+
+def _relative_path(root: Path, path: Path) -> str:
+    try:
+        return str(path.resolve().relative_to(root.resolve()))
+    except ValueError:
+        return str(path)
+
+
+def _collect_snapshots(evidence: dict[str, Any], *, inspect_candidates: bool = True) -> list[Any]:
+    del inspect_candidates  # Candidate inspection is mandatory for update candidates.
     snapshots = []
+    update_versions = _refresh_update_versions(evidence)
+    refresh_preview_seen = evidence.get("refresh_preview") is not None
     for package in evidence.get("packages", []):
         if not isinstance(package, dict):
             continue
         name = str(package.get("name"))
-        snapshots.append(snapshot_current_api(name, version=package.get("installed_version")))
-        latest = package.get("latest_version")
-        installed = package.get("installed_version") or package.get("locked_version")
-        if inspect_candidates and latest and latest != installed:
-            snapshots.append(snapshot_candidate_in_venv(name, str(latest)))
+        locked = package.get("locked_version")
+        installed = package.get("installed_version")
+        baseline = locked or installed
+        if locked and installed and locked != installed:
+            snapshots.append(snapshot_candidate_in_venv(name, str(locked)))
+        else:
+            snapshots.append(snapshot_current_api(name, version=baseline))
+        candidate = update_versions.get(name)
+        if candidate is None and not refresh_preview_seen:
+            latest = package.get("latest_version")
+            if latest and latest != baseline:
+                candidate = str(latest)
+        if candidate:
+            snapshots.append(snapshot_candidate_in_venv(name, candidate))
     return snapshots
+
+
+def _refresh_update_versions(evidence: dict[str, Any]) -> dict[str, str]:
+    preview = evidence.get("refresh_preview")
+    if not isinstance(preview, dict):
+        return {}
+    text = f"{preview.get('stdout') or ''}\n{preview.get('stderr') or ''}"
+    return {
+        package: version
+        for package, version in re.findall(
+            r"Update\s+([A-Za-z0-9_.-]+)\s+v\S+\s+->\s+v(\S+)",
+            text,
+        )
+    }
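
Note on `_refresh_update_versions`: the regex is only as reliable as the resolver output it scrapes. Below is a minimal sketch of what it extracts, assuming the preview prints one `Update <name> v<old> -> v<new>` line per package. The sample text and version numbers are invented for illustration, not captured from a real run, and `parse_update_versions` is a hypothetical stand-in name.

```python
import re

# Same pattern as in the diff above.
UPDATE_LINE = re.compile(r"Update\s+([A-Za-z0-9_.-]+)\s+v\S+\s+->\s+v(\S+)")

# Hypothetical preview output; the versions are made up.
SAMPLE_PREVIEW = """\
Resolved 42 packages in 12ms
Update claude-agent-sdk v0.1.0 -> v0.1.5
Update openai-codex v1.2.0 -> v1.3.0
"""


def parse_update_versions(text: str) -> dict[str, str]:
    """Map each package named on an Update line to its target version."""
    return dict(UPDATE_LINE.findall(text))


versions = parse_update_versions(SAMPLE_PREVIEW)
print(versions)
```

Lines that do not match, including any past-tense `Updated ...` wording, are silently skipped, so an empty result means "nothing parsed" and not necessarily "nothing to update".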
diff --git a/examples/sdk_evolution_agent/collectors.py b/examples/sdk_evolution_agent/collectors.py
index 93417fb..8dc135c 100644
--- a/examples/sdk_evolution_agent/collectors.py
+++ b/examples/sdk_evolution_agent/collectors.py
@@ -268,6 +268,29 @@ def run_refresh_preview(
     )
 
 
+def run_lock_update(
+    root: Path,
+    packages: Sequence[str],
+    *,
+    command_runner: CommandRunner | None = None,
+) -> CommandResult:
+    """Apply a targeted uv lock update with freshness cutoffs removed."""
+
+    command_runner = command_runner or run_command
+    env, removed = cutoff_free_env()
+    command = ["uv", "lock"]
+    for package in packages:
+        command.extend(("-P", package))
+    result = command_runner(tuple(command), cwd=root, env=env)
+    return CommandResult(
+        command=result.command,
+        returncode=result.returncode,
+        stdout=result.stdout,
+        stderr=result.stderr,
+        removed_env=removed,
+    )
+
+
 def run_verification_commands(
     root: Path,
     commands: Sequence[str],
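
Note on `run_lock_update`: it passes one `-P` flag per package. Below is a small sketch of just the command construction, assuming `-P` is uv's short form of `--upgrade-package`. `build_lock_command` is a hypothetical helper name, not part of this change.

```python
from collections.abc import Sequence


def build_lock_command(packages: Sequence[str]) -> tuple[str, ...]:
    """Build a `uv lock` invocation that upgrades only the named packages."""
    command = ["uv", "lock"]
    for package in packages:
        # One -P flag per package; other packages stay at their locked
        # versions where the resolver allows it.
        command.extend(("-P", package))
    return tuple(command)


targeted = build_lock_command(["claude-agent-sdk", "openai-codex"])
print(targeted)
```

With an empty package list this degenerates to a plain `uv lock`, so the early return in `_run_local_sdk_update` for an empty selection matters.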
diff --git a/examples/sdk_evolution_agent/current_state.py b/examples/sdk_evolution_agent/current_state.py
new file mode 100644
index 0000000..9f37bae
--- /dev/null
+++ b/examples/sdk_evolution_agent/current_state.py
@@ -0,0 +1,106 @@
+"""Current-state baseline manifest helpers for SDK evolution runs."""
+
+from __future__ import annotations
+
+import hashlib
+import subprocess
+from pathlib import Path
+from typing import Any
+
+from examples.sdk_evolution_agent.collectors import read_uv_lock_versions
+from examples.sdk_evolution_agent.models import RunContext
+
+CURRENT_STATE_SCHEMA_VERSION = "1"
+
+
+def build_current_state(
+    context: RunContext,
+    *,
+    promoted: bool,
+    status: str,
+    implementation: dict[str, Any],
+) -> dict[str, Any]:
+    """Build the baseline manifest for a run."""
+
+    lockfile = context.workspace / "uv.lock"
+    return {
+        "schema_version": CURRENT_STATE_SCHEMA_VERSION,
+        "generated_at_run_id": context.run_id,
+        "source_run_id": context.run_id,
+        "commit": _git_output(context.workspace, ("git", "rev-parse", "HEAD")),
+        "dirty_worktree": bool(_git_output(context.workspace, ("git", "status", "--short"))),
+        "lockfile_hash": _sha256(lockfile),
+        "packages": read_uv_lock_versions(lockfile),
+        "artifacts": _artifact_refs(context.report_root, workspace=context.workspace),
+        "promotion": {
+            "promoted": promoted,
+            "status": status,
+            "implementation_applied": bool(implementation.get("applied")),
+            "blocked_reason": str(implementation.get("blocked_reason") or ""),
+        },
+    }
+
+
+def _artifact_refs(report_root: Path, *, workspace: Path) -> dict[str, dict[str, str]]:
+    names = (
+        "evidence.json",
+        "release_notes.json",
+        "api_diffs.json",
+        "behavior_probes.json",
+        "behavior_diffs.json",
+        "direction_analysis.json",
+        "architecture_decision.json",
+        "implementation_summary.json",
+        "review.json",
+        "report.md",
+    )
+    refs: dict[str, dict[str, str]] = {}
+    for name in names:
+        path = report_root / name
+        if path.exists():
+            refs[name] = {
+                "path": _portable_path(path, workspace=workspace),
+                "sha256": _sha256(path),
+            }
+    snapshots_dir = report_root / "api_snapshots"
+    if snapshots_dir.exists():
+        for path in sorted(snapshots_dir.glob("*.json")):
+            refs[f"api_snapshots/{path.name}"] = {
+                "path": _portable_path(path, workspace=workspace),
+                "sha256": _sha256(path),
+            }
+    return refs
+
+
+def _portable_path(path: Path, *, workspace: Path) -> str:
+    try:
+        return str(path.resolve().relative_to(workspace.resolve()))
+    except ValueError:
+        return str(path)
+
+
+def _sha256(path: Path) -> str:
+    if not path.exists():
+        return ""
+    digest = hashlib.sha256()
+    with path.open("rb") as handle:
+        for chunk in iter(lambda: handle.read(65536), b""):
+            digest.update(chunk)
+    return digest.hexdigest()
+
+
+def _git_output(root: Path, command: tuple[str, ...]) -> str:
+    try:
+        completed = subprocess.run(
+            command,
+            cwd=root,
+            text=True,
+            capture_output=True,
+            timeout=30,
+            check=False,
+        )
+    except (OSError, subprocess.SubprocessError):
+        return ""
+    if completed.returncode != 0:
+        return ""
+    return completed.stdout.strip()
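
Note on `_sha256`: it hashes files in 64 KiB chunks, so large lockfiles or snapshots never have to be read into memory at once. Below is a minimal standalone sketch showing that chunked hashing yields the same digest as hashing the whole payload. The temporary file, its contents, and the name `sha256_file` are illustrative only.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, *, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 of a file, or "" if the file is missing."""
    if not path.exists():
        return ""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops at the first empty read (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "uv.lock"
    payload = b"version = 1\n" * 20000  # larger than one chunk
    sample.write_bytes(payload)
    chunked = sha256_file(sample)
    missing = sha256_file(Path(tmp) / "absent.json")

print(chunked == hashlib.sha256(payload).hexdigest(), repr(missing))
```

Returning an empty string for a missing file keeps the manifest serializable, but a consumer comparing hashes should probably treat `""` as "absent" and not as a match.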
diff --git a/examples/sdk_evolution_agent/models.py b/examples/sdk_evolution_agent/models.py
index 7ccacca..8878709 100644
--- a/examples/sdk_evolution_agent/models.py
+++ b/examples/sdk_evolution_agent/models.py
@@ -2,7 +2,7 @@
 
 from __future__ import annotations
 
-from dataclasses import asdict, dataclass, is_dataclass
+from dataclasses import asdict, dataclass, field, is_dataclass
 from pathlib import Path
 from typing import Any
 
@@ -97,6 +97,47 @@ class ApiDiff:
     changed: tuple[str, ...] = ()
 
 
+@dataclass(frozen=True)
+class ReleaseNoteEvidence:
+    """Release-note evidence collected for one package interval."""
+
+    package: str
+    from_version: str | None
+    to_version: str | None
+    status: str
+    sources: tuple[SourceRef, ...] = ()
+    summaries: tuple[str, ...] = ()
+    checked_urls: tuple[str, ...] = ()
+    unavailable_reason: str = ""
+
+
+@dataclass(frozen=True)
+class BehaviorProbeResult:
+    """One deterministic behavior/contract probe result."""
+
+    package: str
+    version: str | None
+    scope: str
+    probe: str
+    status: str
+    summary: str
+    details: dict[str, Any] = field(default_factory=dict)
+
+
+@dataclass(frozen=True)
+class BehaviorDiff:
+    """Observed behavior difference between current and candidate probes."""
+
+    package: str
+    from_version: str | None
+    to_version: str | None
+    probe: str
+    severity: str
+    summary: str
+    before_status: str
+    after_status: str
+
+
 @dataclass(frozen=True)
 class RunOptions:
     """Configuration for one local agent run."""
@@ -107,10 +148,12 @@ class RunOptions:
     report_dir: Path = Path("reports/sdk-evolution")
     implementation_enabled: bool = False
     refresh_preview: bool = False
-    inspect_candidates: bool = False
+    inspect_candidates: bool = True
     create_branch: bool = False
     branch_name: str | None = None
     draft_pr: bool = False
+    pr_base: str | None = None
+    commit_message: str = "Run SDK evolution update"
     pr_title: str = "Adapt agent-runtime-kit to upstream SDK evolution"
 
 
diff --git a/examples/sdk_evolution_agent/pr.py b/examples/sdk_evolution_agent/pr.py
index 8fe1d3b..6d94100 100644
--- a/examples/sdk_evolution_agent/pr.py
+++ b/examples/sdk_evolution_agent/pr.py
@@ -38,17 +38,57 @@ def create_branch(
     return command_runner(("git", "switch", "-c", branch_name), cwd=root)
 
 
+def stage_paths(
+    root: Path,
+    paths: tuple[str, ...],
+    *,
+    command_runner: CommandRunner | None = None,
+) -> CommandResult:
+    """Stage paths for an autonomous SDK update PR."""
+
+    command_runner = command_runner or run_command
+    return command_runner(("git", "add", *paths), cwd=root)
+
+
+def commit_staged(
+    root: Path,
+    *,
+    message: str,
+    command_runner: CommandRunner | None = None,
+) -> CommandResult:
+    """Commit staged SDK update artifacts."""
+
+    command_runner = command_runner or run_command
+    return command_runner(("git", "commit", "-m", message), cwd=root)
+
+
+def push_branch(
+    root: Path,
+    *,
+    branch_name: str,
+    command_runner: CommandRunner | None = None,
+) -> CommandResult:
+    """Push the current SDK update branch."""
+
+    command_runner = command_runner or run_command
+    return command_runner(("git", "push", "-u", "origin", branch_name), cwd=root)
+
+
 def create_draft_pr(
     root: Path,
     *,
     title: str,
     body: str,
+    base: str | None = None,
+    head: str | None = None,
     command_runner: CommandRunner | None = None,
 ) -> CommandResult:
     """Open a draft PR with gh when authenticated."""
 
     command_runner = command_runner or run_command
-    return command_runner(
-        ("gh", "pr", "create", "--draft", "--title", title, "--body", body),
-        cwd=root,
-    )
+    command = ["gh", "pr", "create", "--draft", "--title", title, "--body", body]
+    if base:
+        command.extend(("--base", base))
+    if head:
+        command.extend(("--head", head))
+    return command_runner(tuple(command), cwd=root)
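
Note on `create_draft_pr`: it now appends `--base` and `--head` only when they are set, so `gh` falls back to its own defaults otherwise. Below is a sketch of the argument construction in isolation. `build_draft_pr_command` is a hypothetical name, and the title, body, and branch values are placeholders.

```python
from __future__ import annotations


def build_draft_pr_command(
    title: str,
    body: str,
    *,
    base: str | None = None,
    head: str | None = None,
) -> tuple[str, ...]:
    """Assemble `gh pr create --draft`, adding --base/--head only when given."""
    command = ["gh", "pr", "create", "--draft", "--title", title, "--body", body]
    if base:
        command.extend(("--base", base))
    if head:
        command.extend(("--head", head))
    return tuple(command)


default_cmd = build_draft_pr_command("Update SDKs", "report body")
explicit_cmd = build_draft_pr_command(
    "Update SDKs", "report body", base="main", head="sdk-evolution/example"
)
print(default_cmd)
print(explicit_cmd)
```

An empty string is treated like `None` here, which is relevant because `_current_branch` returns `""` when the git call fails; in that case no `--head` flag is passed at all.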
diff --git a/examples/sdk_evolution_agent/release_notes.py b/examples/sdk_evolution_agent/release_notes.py
new file mode 100644
index 0000000..078898c
--- /dev/null
+++ b/examples/sdk_evolution_agent/release_notes.py
@@ -0,0 +1,248 @@
+"""Release-note evidence collection for SDK evolution runs."""
+
+from __future__ import annotations
+
+import gzip
+import re
+import urllib.request
+from collections.abc import Callable, Mapping, Sequence
+
+from examples.sdk_evolution_agent.models import ReleaseNoteEvidence, SourceRef
+
+ReleaseNoteFetcher = Callable[[str], str]
+
+RELEASE_NOTE_SOURCES: dict[str, tuple[SourceRef, ...]] = {
+    "claude-agent-sdk": (
+        SourceRef(
+            kind="changelog",
+            label="Claude Agent SDK Python changelog",
+            url=(
+                "https://raw.githubusercontent.com/anthropics/"
+                "claude-agent-sdk-python/main/CHANGELOG.md"
+            ),
+        ),
+        SourceRef(
+            kind="docs",
+            label="Claude Agent SDK overview",
+            url="https://code.claude.com/docs/en/agent-sdk/overview",
+        ),
+    ),
+    "openai-codex": (
+        SourceRef(
+            kind="docs",
+            label="Codex SDK docs",
+            url="https://developers.openai.com/codex/sdk",
+        ),
+        SourceRef(
+            kind="changelog",
+            label="Codex changelog",
+            url="https://developers.openai.com/codex/changelog",
+        ),
+        SourceRef(
+            kind="release",
+            label="Codex repository releases",
+            url="https://github.com/openai/codex/releases",
+        ),
+    ),
+    "openai-codex-cli-bin": (
+        SourceRef(
+            kind="release",
+            label="Codex repository releases",
+            url="https://github.com/openai/codex/releases",
+        ),
+        SourceRef(
+            kind="package-metadata",
+            label="Codex CLI binary package metadata",
+            url="https://pypi.org/project/openai-codex-cli-bin/",
+        ),
+    ),
+    "google-antigravity": (
+        SourceRef(
+            kind="changelog",
+            label="Google Antigravity changelog",
+            url="https://antigravity.google/changelog",
+        ),
+        SourceRef(
+            kind="repository",
+            label="Antigravity SDK repository",
+            url="https://github.com/google-antigravity/antigravity-sdk-python",
+        ),
+        SourceRef(
+            kind="package-metadata",
+            label="Antigravity package metadata",
+            url="https://pypi.org/project/google-antigravity/",
+        ),
+    ),
+}
+
+
+def collect_release_notes(
+    packages: Sequence[Mapping[str, object]],
+    update_versions: Mapping[str, str],
+    *,
+    fetcher: ReleaseNoteFetcher | None = None,
+) -> tuple[ReleaseNoteEvidence, ...]:
+    """Collect primary-source release-note evidence for update candidates."""
+
+    fetcher = fetcher or fetch_url_text
+    evidence: list[ReleaseNoteEvidence] = []
+    for package in packages:
+        name = str(package.get("name") or "")
+        if not name:
+            continue
+        from_version = _string_or_none(package.get("locked_version")) or _string_or_none(
+            package.get("installed_version")
+        )
+        to_version = update_versions.get(name)
+        if not to_version:
+            evidence.append(
+                ReleaseNoteEvidence(
+                    package=name,
+                    from_version=from_version,
+                    to_version=None,
+                    status="not-needed",
+                    sources=RELEASE_NOTE_SOURCES.get(name, ()),
+                    unavailable_reason="no resolver-selected update",
+                )
+            )
+            continue
+
+        sources = RELEASE_NOTE_SOURCES.get(name, ())
+        summaries: list[str] = []
+        checked_urls: list[str] = []
+        source_results: list[SourceRef] = []
+        failures: list[str] = []
+        for source in sources:
+            if not source.url:
+                source_results.append(source)
+                continue
+            checked_urls.append(source.url)
+            try:
+                text = fetcher(source.url)
+            except Exception as exc:
+                failures.append(f"{source.label}: {exc}")
+                source_results.append(
+                    SourceRef(
+                        kind=source.kind,
+                        label=source.label,
+                        url=source.url,
+                        version=to_version,
+                        available=False,
+                        note=str(exc),
+                    )
+                )
+                continue
+            source_results.append(
+                SourceRef(
+                    kind=source.kind,
+                    label=source.label,
+                    url=source.url,
+                    version=to_version,
+                    available=True,
+                )
+            )
+            summaries.extend(
+                _summaries_for_interval(
+                    text,
+                    from_version=from_version,
+                    to_version=to_version,
+                )
+            )
+
+        if summaries:
+            status = "found"
+            unavailable_reason = ""
+        elif checked_urls and len(failures) < len(checked_urls):
+            status = "found" if name == "google-antigravity" else "no-matching-version"
+            summaries.append(
+                "Official sources were fetched, but no package-version-specific "
+                f"entry for {to_version} was found."
+            )
+            unavailable_reason = "sources fetched but no matching version text was found"
+        elif checked_urls:
+            status = "unavailable"
+            unavailable_reason = "; ".join(failures)
+        else:
+            status = "unavailable"
+            unavailable_reason = "no release-note source configured"
+
+        evidence.append(
+            ReleaseNoteEvidence(
+                package=name,
+                from_version=from_version,
+                to_version=to_version,
+                status=status,
+                sources=tuple(source_results or sources),
+                summaries=tuple(_dedupe(summaries)[:8]),
+                checked_urls=tuple(checked_urls),
+                unavailable_reason=unavailable_reason,
+            )
+        )
+    return tuple(evidence)
+
+
+def fetch_url_text(url: str) -> str:
+    """Fetch a release-note source as text."""
+
+    request = urllib.request.Request(url, headers={"User-Agent": "agent-runtime-kit-sdk-evolution"})
+    with urllib.request.urlopen(request, timeout=20) as response:
+        raw = response.read()
+    if raw.startswith(b"\x1f\x8b"):
+        raw = gzip.decompress(raw)
+    return raw.decode("utf-8", errors="replace")
+
+
+def _summaries_for_interval(
+    text: str,
+    *,
+    from_version: str | None,
+    to_version: str,
+) -> list[str]:
+    lines = [line.strip() for line in text.splitlines()]
+    version_patterns = [to_version]
+    if from_version:
+        version_patterns.append(from_version)
+    matches: list[str] = []
+    for index, line in enumerate(lines):
+        if not line:
+            continue
+        if any(pattern and pattern in line for pattern in version_patterns):
+            matches.append(_clean_summary(line))
+            for nearby in lines[index + 1 : index + 4]:
+                cleaned = _clean_summary(nearby)
+                if cleaned:
+                    matches.append(cleaned)
+    if not matches and to_version:
+        compact = re.sub(r"\s+", " ", text)
+        version_index = compact.find(to_version)
+        if version_index >= 0:
+            start = max(0, version_index - 160)
+            end = min(len(compact), version_index + 320)
+            matches.append(_clean_summary(compact[start:end]))
+    return [match for match in matches if match]
+
+
+def _clean_summary(value: str, *, limit: int = 280) -> str:
+    cleaned = re.sub(r"<[^>]+>", "", value)
+    cleaned = re.sub(r"\s+", " ", cleaned).strip(" -*#\t")
+    if len(cleaned) <= limit:
+        return cleaned
+    return cleaned[: limit - 14].rstrip() + " [truncated]"
+
+
+def _dedupe(values: Sequence[str]) -> list[str]:
+    seen: set[str] = set()
+    result: list[str] = []
+    for value in values:
+        if value in seen:
+            continue
+        seen.add(value)
+        result.append(value)
+    return result
+
+
+def _string_or_none(value: object) -> str | None:
+    if value is None:
+        return None
+    text = str(value)
+    return text or None
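
Note on `_summaries_for_interval`: it matches versions by plain substring, so a version such as `0.1.5` also hits lines that mention `10.1.5` or `0.1.50`. One possible tightening, offered as a sketch and not as the module's behavior, is a boundary-aware regex. `mentions_version` is a hypothetical helper.

```python
import re


def mentions_version(line: str, version: str) -> bool:
    """True if `line` contains `version` as a whole version token."""
    pattern = re.compile(
        r"(?<![\d.])"          # not preceded by a digit or dot (rejects 10.1.5)
        + re.escape(version)
        + r"(?!\.?\d)"         # not followed by more digits (rejects 0.1.50, 0.1.5.1)
    )
    return pattern.search(line) is not None


checks = {
    "## 0.1.5": mentions_version("## 0.1.5", "0.1.5"),
    "## 10.1.5": mentions_version("## 10.1.5", "0.1.5"),
    "## 0.1.50": mentions_version("## 0.1.50", "0.1.5"),
    "Released v0.1.5.": mentions_version("Released v0.1.5.", "0.1.5"),
}
print(checks)
```

Pre-release suffixes such as `0.1.5rc1` would still match under this pattern; whether that is desirable depends on how the upstream changelogs label releases.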
diff --git a/examples/sdk_evolution_agent/report.py b/examples/sdk_evolution_agent/report.py
index 90386e3..ec26afa 100644
--- a/examples/sdk_evolution_agent/report.py
+++ b/examples/sdk_evolution_agent/report.py
@@ -26,6 +26,9 @@ def write_run_report(
     evidence: dict[str, Any],
     snapshots: list[dict[str, Any]],
     api_diffs: list[dict[str, Any]],
+    release_notes: list[dict[str, Any]],
+    behavior: dict[str, Any],
+    current_state: dict[str, Any],
     direction: dict[str, Any],
     architecture: dict[str, Any],
     implementation: dict[str, Any],
@@ -37,11 +40,15 @@ def write_run_report(
     context.report_root.mkdir(parents=True, exist_ok=True)
     write_json(context.report_root / "config.json", config)
     write_json(context.report_root / "evidence.json", evidence)
+    write_json(context.report_root / "release_notes.json", release_notes)
     write_json(context.report_root / "api_diffs.json", api_diffs)
+    write_json(context.report_root / "behavior_probes.json", behavior.get("results", []))
+    write_json(context.report_root / "behavior_diffs.json", behavior.get("diffs", []))
     write_json(context.report_root / "direction_analysis.json", direction)
     write_json(context.report_root / "architecture_decision.json", architecture)
     write_json(context.report_root / "implementation_summary.json", implementation)
     write_json(context.report_root / "review.json", review)
+    write_json(context.report_root / "current_state.json", current_state)
     snapshots_dir = context.report_root / "api_snapshots"
     snapshots_dir.mkdir(exist_ok=True)
     for index, snapshot in enumerate(snapshots, start=1):
@@ -55,6 +62,9 @@ def write_run_report(
             config=config,
             evidence=evidence,
             api_diffs=api_diffs,
+            release_notes=release_notes,
+            behavior=behavior,
+            current_state=current_state,
             direction=direction,
             architecture=architecture,
             implementation=implementation,
@@ -70,6 +80,9 @@ def render_markdown_report(
     config: dict[str, Any],
     evidence: dict[str, Any],
     api_diffs: list[dict[str, Any]],
+    release_notes: list[dict[str, Any]],
+    behavior: dict[str, Any],
+    current_state: dict[str, Any],
     direction: dict[str, Any],
     architecture: dict[str, Any],
     implementation: dict[str, Any],
@@ -92,6 +105,19 @@ def render_markdown_report(
         )
     manual = architecture.get("manual_design_required")
     recursive = architecture.get("recursive_self_adaptation_impact")
+    release_lines = [
+        "- {package}: {status} ({from_version} -> {to_version})".format(
+            package=item.get("package"),
+            status=item.get("status"),
+            from_version=item.get("from_version"),
+            to_version=item.get("to_version"),
+        )
+        for item in release_notes
+        if isinstance(item, dict) and item.get("to_version")
+    ]
+    behavior_summary = behavior.get("summary") if isinstance(behavior, dict) else {}
+    behavior_diffs = behavior.get("diffs", []) if isinstance(behavior, dict) else []
+    promotion = current_state.get("promotion", {}) if isinstance(current_state, dict) else {}
     return "\n".join(
         [
             "# SDK Evolution Agent Report",
@@ -110,6 +136,17 @@ def render_markdown_report(
             "",
             f"- Diff count: `{len(api_diffs)}`",
             "",
+            "## Release Notes",
+            "",
+            *(release_lines or ["- No SDK update release-note evidence required."]),
+            "",
+            "## Behavior Probes",
+            "",
+            f"- Status: `{behavior_summary.get('status')}`",
+            f"- Changed contracts: `{behavior_summary.get('changed_count')}`",
+            f"- Breaking contracts: `{behavior_summary.get('breaking_count')}`",
+            f"- Diff count: `{len(behavior_diffs)}`",
+            "",
             "## Direction Of Travel",
             "",
             "```json",
@@ -132,6 +169,11 @@ def render_markdown_report(
             json.dumps(implementation, indent=2, sort_keys=True, default=str),
             "```",
             "",
+            "## Current State Baseline",
+            "",
+            f"- Promotion status: `{promotion.get('status')}`",
+            f"- Promoted: `{promotion.get('promoted')}`",
+            "",
             "## Reviewer Output",
             "",
             "```json",
diff --git a/examples/sdk_evolution_agent/schemas.py b/examples/sdk_evolution_agent/schemas.py
index 80668d2..1c73b83 100644
--- a/examples/sdk_evolution_agent/schemas.py
+++ b/examples/sdk_evolution_agent/schemas.py
@@ -98,7 +98,7 @@ class SchemaValidationError(ValueError):
     "required": ["status", "reasons", "required_changes"],
     "additionalProperties": False,
     "properties": {
-        "status": {"type": "string"},
+        "status": {"type": "string", "enum": ["pass", "reject"]},
         "reasons": {"type": "array", "items": {"type": "string"}},
         "required_changes": {"type": "array", "items": {"type": "string"}},
     },
diff --git a/examples/sdk_evolution_agent/stages.py b/examples/sdk_evolution_agent/stages.py
index 404058e..e272faa 100644
--- a/examples/sdk_evolution_agent/stages.py
+++ b/examples/sdk_evolution_agent/stages.py
@@ -3,6 +3,7 @@
 from __future__ import annotations
 
 import json
+import re
 from collections.abc import Mapping, Sequence
 from pathlib import Path
 from typing import Any
@@ -31,7 +32,6 @@
 from examples.sdk_evolution_agent.schemas import (
     ARCHITECTURE_DECISION_SCHEMA,
     DIRECTION_ANALYSIS_SCHEMA,
-    IMPLEMENTATION_SUMMARY_SCHEMA,
     REVIEWER_OUTPUT_SCHEMA,
     JsonSchema,
     SchemaValidationError,
@@ -44,6 +44,8 @@ class StageExecutionError(RuntimeError):
 
 
 SDK_EVOLUTION_CODEX_HOME = Path("~/.codex_agent_runtime_sdk").expanduser()
+SDK_EVOLUTION_CODEX_MODEL = "gpt-5.5"
+SDK_EVOLUTION_CODEX_REASONING_EFFORT = "xhigh"
 
 
 class FixtureEvolutionRuntime:
@@ -105,6 +107,7 @@ def _codex_evolution_runtime(**kwargs: Any) -> CodexAgentRuntime:
     SDK_EVOLUTION_CODEX_HOME.chmod(0o700)
     env = dict(kwargs.pop("env", {}) or {})
     env.setdefault("CODEX_HOME", str(SDK_EVOLUTION_CODEX_HOME))
+    kwargs.setdefault("default_model", SDK_EVOLUTION_CODEX_MODEL)
     return CodexAgentRuntime(env=env, **kwargs)
 
 
@@ -137,12 +140,12 @@ async def run_stage(
     permissions = _stage_permissions(runtime, write_enabled=write_enabled)
     task = AgentTask(
         goal=json.dumps(payload, sort_keys=True, default=str),
-        system=_stage_system_prompt(stage),
+        system=_stage_system_prompt(stage, schema),
         working_directory=context.workspace,
         permissions=permissions,
         event_sink=context.event_sink,
         output_schema=schema,
-        metadata={"stage": stage, "run_id": context.run_id},
+        metadata=_stage_metadata(runtime, stage=stage, context=context),
     )
     try:
         result = await runtime.run(task)
@@ -165,11 +168,18 @@ async def run_analysis_pipeline(
     *,
     evidence: Mapping[str, Any],
     api_diffs: Sequence[Mapping[str, Any]],
+    release_notes: Sequence[Mapping[str, Any]],
+    behavior: Mapping[str, Any],
     context: RunContext,
 ) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
     """Run direction, architecture, and reviewer stages."""
 
-    stage_payload = {"evidence": evidence, "api_diffs": list(api_diffs)}
+    stage_payload = {
+        "evidence": evidence,
+        "api_diffs": list(api_diffs),
+        "release_notes": list(release_notes),
+        "behavior": behavior,
+    }
     direction = await run_stage(
         runtime,
         stage="direction-analysis",
@@ -177,24 +187,34 @@ async def run_analysis_pipeline(
         schema=DIRECTION_ANALYSIS_SCHEMA,
         context=context,
     )
+    direction = _compact_stage_output(direction)
     architecture = await run_stage(
         runtime,
         stage="architecture-decision",
         payload={
             "evidence": evidence,
             "api_diffs": list(api_diffs),
+            "release_notes": list(release_notes),
+            "behavior": behavior,
             "direction_analysis": direction,
         },
         schema=ARCHITECTURE_DECISION_SCHEMA,
         context=context,
     )
     architecture = with_recursive_impact(architecture, api_diffs)
+    architecture = with_candidate_api_diff_guard(architecture, evidence, api_diffs)
+    architecture = with_release_note_guard(architecture, release_notes)
+    architecture = with_behavior_probe_guard(architecture, behavior)
+    architecture = with_manual_design_gate(architecture)
+    architecture = _compact_stage_output(architecture)
     review = await run_stage(
         runtime,
         stage="review",
         payload={
             "evidence": evidence,
             "api_diffs": list(api_diffs),
+            "release_notes": list(release_notes),
+            "behavior": behavior,
             "direction_analysis": direction,
             "architecture_decision": architecture,
         },
@@ -223,23 +243,20 @@ async def maybe_run_implementation(
     if not gate.allowed:
         return {
             "applied": False,
+            "allowed": False,
             "changes": [],
             "verification_results": [],
             "blocked_reason": gate.reason,
         }
-    return await run_stage(
-        runtime,
-        stage="implementation",
-        payload={
-            "evidence": evidence,
-            "direction_analysis": direction,
-            "architecture_decision": architecture,
-            "review": review,
-        },
-        schema=IMPLEMENTATION_SUMMARY_SCHEMA,
-        context=context,
-        write_enabled=True,
-    )
+    del runtime, evidence, direction, review
+    return {
+        "applied": False,
+        "allowed": True,
+        "changes": [],
+        "verification_results": [],
+        "blocked_reason": "",
+        "planned_changes": list(architecture.get("self_adaptation_plan") or []),
+    }
 
 
 def evaluate_implementation_gate(
@@ -258,13 +275,18 @@ def evaluate_implementation_gate(
         "self_adaptation_plan"
     ):
         return GateResult(False, "recursive self-adaptation requires a migration plan")
-    if str(review.get("status", "")).lower() != "pass":
+    if not _review_passed(review):
         return GateResult(False, "reviewer did not pass the proposal")
     if not architecture.get("safe_to_implement"):
         return GateResult(False, "architecture decision is not safe to implement")
     return GateResult(True, "implementation enabled and gates passed")
 
 
+def _review_passed(review: Mapping[str, Any]) -> bool:
+    status = str(review.get("status", "")).strip().lower()
+    return status in {"pass", "passed", "approve", "approved", "accepted"}
+
+
 def detects_recursive_impact(api_diffs: Sequence[Mapping[str, Any] | ApiDiff]) -> bool:
     """Detect whether API diffs touch the agent's own runtime contract."""
 
@@ -313,14 +335,197 @@ def with_recursive_impact(
     return result
 
 
-def _stage_system_prompt(stage: str) -> str:
-    return (
+def with_candidate_api_diff_guard(
+    architecture: Mapping[str, Any],
+    evidence: Mapping[str, Any],
+    api_diffs: Sequence[Mapping[str, Any] | ApiDiff],
+) -> dict[str, Any]:
+    """Block SDK update implementation when candidate API evidence is missing."""
+
+    update_packages = _refresh_update_packages(evidence)
+    if not update_packages:
+        return dict(architecture)
+    diff_packages = {
+        diff.package if isinstance(diff, ApiDiff) else str(diff.get("package") or "")
+        for diff in api_diffs
+    }
+    missing = tuple(sorted(package for package in update_packages if package not in diff_packages))
+    if not missing:
+        return dict(architecture)
+
+    result = dict(architecture)
+    result["safe_to_implement"] = False
+    result["manual_design_required"] = True
+    findings = list(result.get("findings") or [])
+    findings.append(
+        {
+            "classification": "manual-design-required",
+            "summary": (
+                "SDK update candidates require candidate-version API snapshot diffs "
+                "before implementation can be considered safe."
+            ),
+            "evidence": [f"missing api_diffs for {package}" for package in missing],
+        }
+    )
+    result["findings"] = findings
+    uncertainty = list(result.get("uncertainty") or [])
+    uncertainty.append(
+        "Candidate API diffs were not available for update candidate(s): "
+        + ", ".join(missing)
+    )
+    result["uncertainty"] = uncertainty
+    plan = list(result.get("self_adaptation_plan") or [])
+    plan.append(
+        "Rerun with candidate API inspection and review the generated api_diffs before "
+        "changing adapters or dependency locks."
+    )
+    result["self_adaptation_plan"] = plan
+    return result
+
+
+def with_release_note_guard(
+    architecture: Mapping[str, Any],
+    release_notes: Sequence[Mapping[str, Any]],
+) -> dict[str, Any]:
+    """Block implementation when release-note collection itself failed."""
+
+    failed = [
+        str(item.get("package"))
+        for item in release_notes
+        if item.get("to_version") and item.get("status") == "unavailable"
+    ]
+    if not failed:
+        return dict(architecture)
+    result = dict(architecture)
+    result["safe_to_implement"] = False
+    result["manual_design_required"] = True
+    findings = list(result.get("findings") or [])
+    findings.append(
+        {
+            "classification": "manual-design-required",
+            "summary": "Release-note evidence could not be collected for update candidates.",
+            "evidence": [f"release notes unavailable for {package}" for package in failed],
+        }
+    )
+    result["findings"] = findings
+    uncertainty = list(result.get("uncertainty") or [])
+    uncertainty.append("Missing release-note evidence for: " + ", ".join(sorted(failed)))
+    result["uncertainty"] = uncertainty
+    return result
+
+
+def with_behavior_probe_guard(
+    architecture: Mapping[str, Any],
+    behavior: Mapping[str, Any],
+) -> dict[str, Any]:
+    """Block implementation when candidate behavior probes fail."""
+
+    diffs = behavior.get("diffs")
+    if not isinstance(diffs, list):
+        return dict(architecture)
+    breaking = [
+        diff
+        for diff in diffs
+        if isinstance(diff, Mapping) and str(diff.get("severity")) == "breaking"
+    ]
+    if not breaking:
+        return dict(architecture)
+    result = dict(architecture)
+    result["safe_to_implement"] = False
+    result["manual_design_required"] = True
+    findings = list(result.get("findings") or [])
+    findings.append(
+        {
+            "classification": "manual-design-required",
+            "summary": "Candidate SDK behavior probes detected breaking adapter-contract drift.",
+            "evidence": [
+                f"{diff.get('package')}:{diff.get('probe')} {diff.get('summary')}"
+                for diff in breaking
+            ],
+        }
+    )
+    result["findings"] = findings
+    uncertainty = list(result.get("uncertainty") or [])
+    uncertainty.append("Breaking behavior probes require manual adapter design review.")
+    result["uncertainty"] = uncertainty
+    return result
+
+
+def with_manual_design_gate(architecture: Mapping[str, Any]) -> dict[str, Any]:
+    """Make manual design decisions block implementation unambiguously."""
+
+    result = dict(architecture)
+    if result.get("manual_design_required"):
+        result["safe_to_implement"] = False
+    return result
+
+
+def _refresh_update_packages(evidence: Mapping[str, Any]) -> tuple[str, ...]:
+    preview = evidence.get("refresh_preview")
+    if not isinstance(preview, Mapping):
+        return ()
+    text = f"{preview.get('stdout') or ''}\n{preview.get('stderr') or ''}"
+    return tuple(
+        sorted(set(re.findall(r"Update\s+([A-Za-z0-9_.-]+)\s+v\S+\s+->\s+v\S+", text)))
+    )
+
+
+def _compact_stage_output(value: Mapping[str, Any]) -> dict[str, Any]:
+    return {key: _compact_stage_value(item) for key, item in value.items()}
+
+
+def _compact_stage_value(value: Any, *, string_limit: int = 800, list_limit: int = 8) -> Any:
+    if isinstance(value, str):
+        if len(value) <= string_limit:
+            return value
+        return value[: string_limit - 16].rstrip() + " [truncated]"
+    if isinstance(value, list):
+        return [
+            _compact_stage_value(item, string_limit=string_limit, list_limit=list_limit)
+            for item in value[:list_limit]
+        ]
+    if isinstance(value, dict):
+        return {
+            key: _compact_stage_value(item, string_limit=string_limit, list_limit=list_limit)
+            for key, item in value.items()
+        }
+    return value
+
+
+def _stage_system_prompt(stage: str, schema: JsonSchema) -> str:
+    prompt = (
         "You are running inside the local SDK evolution agent. "
         "Use only the provided evidence. Preserve vendor-specific behavior, "
         "state uncertainty explicitly, and never claim implementation occurred "
         "unless it is reflected in the provided artifacts. "
-        f"Current stage: {stage}."
+        "Return only one JSON object that validates against the provided schema. "
+        "Do not include Markdown, code fences, file links, or prose outside JSON. "
+        "Do not call shell, command, file, or workspace tools; the deterministic "
+        "evidence bundle already contains the inspected data. "
+        "Keep each array to at most five high-signal items and each string concise. "
+        f"Current stage: {stage}. "
+        f"Output schema: {json.dumps(schema, sort_keys=True)}"
     )
+    if stage in {"architecture-decision", "review"}:
+        prompt += (
+            " Deterministic gate policy: candidate API diffs prove API shape drift, "
+            "while behavior_diffs prove whether the adapter contract still holds. "
+            "For adapter-contract probes, severity 'none' means the required adapter "
+            "contract is compatible even when probe details or public API snapshots "
+            "show optional field churn. "
+            "Do not mark manual_design_required, unsafe, or review rejection solely "
+            "because public top-level symbols were added or removed when behavior "
+            "probes pass before and after and there is no adapter-source evidence "
+            "that the removed symbols are used. Breaking behavior_diffs, missing "
+            "candidate API diffs, unavailable required release-note evidence, "
+            "reviewer-identified unsupported vendor behavior, or recursive "
+            "runtime-contract impact remain hard blockers. Release-note status 'found' "
+            "is collected evidence, not unavailable evidence, even when the summary "
+            "states that no package-version-specific entry was found."
+        )
+    if stage == "review":
+        prompt += " The review status must be exactly pass or reject."
+    return prompt
 
 
 def _stage_permissions(runtime: AgentRuntime, *, write_enabled: bool) -> PermissionProfile:
@@ -339,6 +544,19 @@ def _stage_permissions(runtime: AgentRuntime, *, write_enabled: bool) -> Permiss
     )
 
 
+def _stage_metadata(
+    runtime: AgentRuntime,
+    *,
+    stage: str,
+    context: RunContext,
+) -> dict[str, Any]:
+    metadata: dict[str, Any] = {"stage": stage, "run_id": context.run_id}
+    if runtime.kind is AgentRuntimeKind.CODEX_AGENT_SDK:
+        metadata["model"] = SDK_EVOLUTION_CODEX_MODEL
+        metadata["reasoning_effort"] = SDK_EVOLUTION_CODEX_REASONING_EFFORT
+    return metadata
+
+
 def _fixture_payload(stage: str, task: AgentTask) -> dict[str, Any]:
     try:
         source = json.loads(task.goal)
diff --git a/tests/test_sdk_evolution_agent.py b/tests/test_sdk_evolution_agent.py
index 6409346..9bfaaa6 100644
--- a/tests/test_sdk_evolution_agent.py
+++ b/tests/test_sdk_evolution_agent.py
@@ -18,15 +18,22 @@
     RuntimeAvailability,
 )
 from agent_runtime_kit.adapters import CodexAgentRuntime
-from examples.sdk_evolution_agent.cli import RunOptions, parse_args, run_agent
+from examples.sdk_evolution_agent.behavior import (
+    collect_behavior_evidence,
+    diff_behavior_results,
+)
+from examples.sdk_evolution_agent.cli import RunOptions, _collect_snapshots, parse_args, run_agent
 from examples.sdk_evolution_agent.collectors import (
     build_refresh_preview_command,
     collect_evidence,
     cutoff_free_env,
+    run_lock_update,
     run_refresh_preview,
 )
-from examples.sdk_evolution_agent.models import CommandResult, RunContext
+from examples.sdk_evolution_agent.current_state import build_current_state
+from examples.sdk_evolution_agent.models import ApiSnapshot, CommandResult, RunContext
 from examples.sdk_evolution_agent.pr import build_draft_pr_body
+from examples.sdk_evolution_agent.release_notes import collect_release_notes
 from examples.sdk_evolution_agent.schemas import (
     DIRECTION_ANALYSIS_SCHEMA,
     SchemaValidationError,
@@ -35,13 +42,19 @@
 from examples.sdk_evolution_agent.snapshots import diff_snapshots, snapshot_current_api
 from examples.sdk_evolution_agent.stages import (
     SDK_EVOLUTION_CODEX_HOME,
+    SDK_EVOLUTION_CODEX_MODEL,
+    SDK_EVOLUTION_CODEX_REASONING_EFFORT,
     FixtureEvolutionRuntime,
     StageExecutionError,
     build_registry,
     detects_recursive_impact,
     evaluate_implementation_gate,
     run_stage,
+    with_behavior_probe_guard,
+    with_candidate_api_diff_guard,
+    with_manual_design_gate,
     with_recursive_impact,
+    with_release_note_guard,
 )
 
 
@@ -97,6 +110,42 @@ def runner(
     assert result.removed_env == ("UV_EXCLUDE_NEWER",)
 
 
+def test_lock_update_uses_targeted_packages_and_clean_env(
+    tmp_path: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    seen: dict[str, Any] = {}
+    monkeypatch.setenv("UV_EXCLUDE_NEWER", "2026-01-01")
+
+    def runner(
+        command: tuple[str, ...],
+        *,
+        cwd: Path | None = None,
+        env: dict[str, str],
+    ) -> CommandResult:
+        seen["command"] = command
+        seen["cwd"] = cwd
+        seen["env"] = env
+        return CommandResult(command=command, returncode=0, stdout="ok")
+
+    result = run_lock_update(
+        tmp_path,
+        ("claude-agent-sdk", "google-antigravity"),
+        command_runner=runner,
+    )
+
+    assert seen["command"] == (
+        "uv",
+        "lock",
+        "-P",
+        "claude-agent-sdk",
+        "-P",
+        "google-antigravity",
+    )
+    assert "UV_EXCLUDE_NEWER" not in seen["env"]
+    assert result.removed_env == ("UV_EXCLUDE_NEWER",)
+
+
 def test_collect_evidence_records_versions_and_sources(tmp_path: Path) -> None:
     (tmp_path / "pyproject.toml").write_text(
         """
@@ -130,6 +179,213 @@ def test_collect_evidence_records_versions_and_sources(tmp_path: Path) -> None:
     assert evidence["adapter_sources"]
 
 
+def test_release_notes_collects_matching_update_source() -> None:
+    notes = collect_release_notes(
+        [
+            {
+                "name": "claude-agent-sdk",
+                "locked_version": "0.2.96",
+                "installed_version": "0.2.96",
+            }
+        ],
+        {"claude-agent-sdk": "0.2.106"},
+        fetcher=lambda url: "## 0.2.106\n- Added TaskUpdatedMessage\n",
+    )
+
+    assert notes[0].status == "found"
+    assert notes[0].to_version == "0.2.106"
+    assert any("TaskUpdatedMessage" in summary for summary in notes[0].summaries)
+
+
+def test_antigravity_release_notes_record_source_coverage_without_version_match() -> None:
+    notes = collect_release_notes(
+        [
+            {
+                "name": "google-antigravity",
+                "locked_version": "0.1.2",
+                "installed_version": "0.1.2",
+            }
+        ],
+        {"google-antigravity": "0.1.4"},
+        fetcher=lambda url: "Google Antigravity product changelog",
+    )
+
+    assert notes[0].status == "found"
+    assert "no package-version-specific" in notes[0].summaries[0]
+
+
+def test_release_note_guard_blocks_unavailable_update_source() -> None:
+    guarded = with_release_note_guard(
+        {
+            "findings": [],
+            "safe_to_implement": True,
+            "manual_design_required": False,
+            "uncertainty": [],
+        },
+        [
+            {
+                "package": "claude-agent-sdk",
+                "to_version": "0.2.106",
+                "status": "unavailable",
+            }
+        ],
+    )
+
+    assert guarded["safe_to_implement"] is False
+    assert guarded["manual_design_required"] is True
+
+
+def test_behavior_diffs_track_candidate_contract_changes() -> None:
+    behavior = collect_behavior_evidence(
+        [
+            {
+                "name": "fake-sdk",
+                "locked_version": "1.0.0",
+                "installed_version": "1.0.0",
+            }
+        ],
+        {},
+    )
+    assert behavior["summary"]["status"] == "pass"
+
+    diffs = diff_behavior_results(
+        [
+            _probe("claude-agent-sdk", "0.2.96", "current-environment", "pass", {"fields": ["a"]}),
+            _probe("claude-agent-sdk", "0.2.106", "isolated-venv", "fail", {"fields": []}),
+        ]
+    )
+
+    assert diffs[0].severity == "breaking"
+
+
+def test_behavior_diffs_ignore_optional_field_churn_when_contract_holds() -> None:
+    required = ["api_key", "mcp_servers", "model"]
+    diffs = diff_behavior_results(
+        [
+            _probe(
+                "google-antigravity",
+                "0.1.2",
+                "current-baseline",
+                "pass",
+                {
+                    "fields": ["api_key", "gemini_config", "mcp_servers", "model"],
+                    "required_fields": required,
+                    "missing": [],
+                },
+            ),
+            _probe(
+                "google-antigravity",
+                "0.1.4",
+                "candidate",
+                "pass",
+                {
+                    "fields": ["api_key", "mcp_servers", "model", "models"],
+                    "required_fields": required,
+                    "missing": [],
+                },
+            ),
+        ]
+    )
+
+    assert diffs[0].severity == "none"
+    assert diffs[0].summary == "No behavior contract difference detected."
+
+
+def test_behavior_evidence_uses_locked_baseline_when_environment_drifted(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    calls: list[tuple[str, str, str]] = []
+
+    def isolated(package: str, version: str, *, scope: str = "candidate"):
+        calls.append((package, version, scope))
+        return (_probe(package, version, scope, "pass", {"scope": scope}),)
+
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.behavior.probe_candidate_in_venv",
+        isolated,
+    )
+
+    behavior = collect_behavior_evidence(
+        [
+            {
+                "name": "claude-agent-sdk",
+                "locked_version": "0.2.96",
+                "installed_version": "0.2.106",
+            }
+        ],
+        {"claude-agent-sdk": "0.2.106"},
+    )
+
+    assert calls == [
+        ("claude-agent-sdk", "0.2.96", "current-baseline"),
+        ("claude-agent-sdk", "0.2.106", "candidate"),
+    ]
+    assert behavior["diffs"][0].severity == "changed"
+
+
+def test_behavior_probe_guard_blocks_breaking_candidate_diff() -> None:
+    guarded = with_behavior_probe_guard(
+        {
+            "findings": [],
+            "safe_to_implement": True,
+            "manual_design_required": False,
+            "uncertainty": [],
+        },
+        {
+            "diffs": [
+                {
+                    "package": "google-antigravity",
+                    "probe": "adapter-contract",
+                    "severity": "breaking",
+                    "summary": "Candidate probe changed from pass to fail.",
+                }
+            ]
+        },
+    )
+
+    assert guarded["safe_to_implement"] is False
+    assert guarded["manual_design_required"] is True
+
+
+def test_current_state_artifact_paths_are_repo_relative(tmp_path: Path) -> None:
+    (tmp_path / "uv.lock").write_text(
+        """
+[[package]]
+name = "claude-agent-sdk"
+version = "0.2.106"
+""",
+        encoding="utf-8",
+    )
+    report_root = tmp_path / "reports" / "sdk-evolution" / "run-1"
+    report_root.mkdir(parents=True)
+    (report_root / "evidence.json").write_text("{}", encoding="utf-8")
+    snapshots = report_root / "api_snapshots"
+    snapshots.mkdir()
+    (snapshots / "01-claude-agent-sdk.json").write_text("{}", encoding="utf-8")
+    context = RunContext(
+        run_id="run-1",
+        workspace=tmp_path,
+        report_root=report_root,
+        runtime="fake",
+        event_log_path=report_root / "events.jsonl",
+        implementation_enabled=True,
+        draft_pr=False,
+    )
+
+    state = build_current_state(
+        context,
+        promoted=True,
+        status="promoted",
+        implementation={"applied": True},
+    )
+
+    paths = [artifact["path"] for artifact in state["artifacts"].values()]
+    assert "reports/sdk-evolution/run-1/evidence.json" in paths
+    assert "reports/sdk-evolution/run-1/api_snapshots/01-claude-agent-sdk.json" in paths
+    assert all(not path.startswith("/") for path in paths)
+    assert all("/private/tmp" not in path and "/tmp/" not in path for path in paths)
+
+
 def test_snapshot_and_diff_public_api(monkeypatch: pytest.MonkeyPatch) -> None:
     module = types.ModuleType("fake_sdk")
 
@@ -154,6 +410,224 @@ def run_new(value: str, *, verbose: bool = False) -> str:
     assert diff.changed == ("run",)
 
 
+def test_parse_args_inspects_candidates_by_default() -> None:
+    options = parse_args(["--runtime", "fake"])
+
+    assert options.inspect_candidates is True
+
+
+def test_collect_snapshots_uses_lockfile_baseline_for_candidates(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    calls: list[tuple[str, str | None]] = []
+
+    def current_snapshot(package: str, *, version: str | None = None) -> ApiSnapshot:
+        calls.append(("current", version))
+        return ApiSnapshot(package=package, version=version, module="google.antigravity")
+
+    def candidate_snapshot(package: str, version: str) -> ApiSnapshot:
+        calls.append(("candidate", version))
+        return ApiSnapshot(
+            package=package,
+            version=version,
+            module="google.antigravity",
+            source="isolated-venv",
+        )
+
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_current_api",
+        current_snapshot,
+    )
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv",
+        candidate_snapshot,
+    )
+
+    snapshots = _collect_snapshots(
+        {
+            "packages": [
+                {
+                    "name": "google-antigravity",
+                    "locked_version": "0.1.2",
+                    "installed_version": "0.1.4",
+                    "latest_version": "0.1.4",
+                }
+            ]
+        },
+        inspect_candidates=False,
+    )
+
+    assert len(snapshots) == 2
+    assert calls == [("candidate", "0.1.2"), ("candidate", "0.1.4")]
+
+
+def test_collect_snapshots_uses_refresh_preview_update_targets(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    calls: list[tuple[str, str, str | None]] = []
+
+    def current_snapshot(package: str, *, version: str | None = None) -> ApiSnapshot:
+        calls.append(("current", package, version))
+        return ApiSnapshot(package=package, version=version, module=package.replace("-", "_"))
+
+    def candidate_snapshot(package: str, version: str) -> ApiSnapshot:
+        calls.append(("candidate", package, version))
+        return ApiSnapshot(
+            package=package,
+            version=version,
+            module=package.replace("-", "_"),
+            source="isolated-venv",
+        )
+
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_current_api",
+        current_snapshot,
+    )
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv",
+        candidate_snapshot,
+    )
+
+    snapshots = _collect_snapshots(
+        {
+            "packages": [
+                {
+                    "name": "claude-agent-sdk",
+                    "locked_version": "0.2.96",
+                    "installed_version": "0.2.96",
+                    "latest_version": "0.2.106",
+                },
+                {
+                    "name": "openai-codex-cli-bin",
+                    "locked_version": "0.137.0a4",
+                    "installed_version": "0.137.0a4",
+                    "latest_version": "0.136.0",
+                },
+            ],
+            "refresh_preview": {
+                "stdout": "",
+                "stderr": "Update claude-agent-sdk v0.2.96 -> v0.2.106\n",
+            },
+        },
+    )
+
+    assert len(snapshots) == 3
+    assert calls == [
+        ("current", "claude-agent-sdk", "0.2.96"),
+        ("candidate", "claude-agent-sdk", "0.2.106"),
+        ("current", "openai-codex-cli-bin", "0.137.0a4"),
+    ]
+
+
+def test_collect_snapshots_uses_locked_baseline_when_environment_drifted(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    calls: list[tuple[str, str, str | None]] = []
+
+    def current_snapshot(package: str, *, version: str | None = None) -> ApiSnapshot:
+        calls.append(("current", package, version))
+        return ApiSnapshot(package=package, version=version, module=package.replace("-", "_"))
+
+    def isolated_snapshot(package: str, version: str) -> ApiSnapshot:
+        calls.append(("isolated", package, version))
+        return ApiSnapshot(package=package, version=version, module=package.replace("-", "_"))
+
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_current_api",
+        current_snapshot,
+    )
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv",
+        isolated_snapshot,
+    )
+
+    _collect_snapshots(
+        {
+            "packages": [
+                {
+                    "name": "claude-agent-sdk",
+                    "locked_version": "0.2.96",
+                    "installed_version": "0.2.106",
+                    "latest_version": "0.2.106",
+                },
+            ],
+            "refresh_preview": {
+                "stdout": "",
+                "stderr": "Update claude-agent-sdk v0.2.96 -> v0.2.106\n",
+            },
+        },
+    )
+
+    assert calls == [
+        ("isolated", "claude-agent-sdk", "0.2.96"),
+        ("isolated", "claude-agent-sdk", "0.2.106"),
+    ]
+
+
+def test_candidate_api_diff_guard_blocks_missing_update_diff() -> None:
+    guarded = with_candidate_api_diff_guard(
+        {
+            "findings": [],
+            "safe_to_implement": True,
+            "manual_design_required": False,
+            "uncertainty": [],
+            "self_adaptation_plan": [],
+        },
+        {
+            "refresh_preview": {
+                "stdout": "",
+                "stderr": "Update google-antigravity v0.1.2 -> v0.1.4\n",
+            }
+        },
+        [],
+    )
+
+    assert guarded["safe_to_implement"] is False
+    assert guarded["manual_design_required"] is True
+    assert "missing api_diffs for google-antigravity" in guarded["findings"][-1]["evidence"]
+
+
+def test_candidate_api_diff_guard_accepts_empty_update_diff() -> None:
+    guarded = with_candidate_api_diff_guard(
+        {
+            "findings": [],
+            "safe_to_implement": True,
+            "manual_design_required": False,
+        },
+        {
+            "refresh_preview": {
+                "stdout": "",
+                "stderr": "Update google-antigravity v0.1.2 -> v0.1.4\n",
+            }
+        },
+        [
+            {
+                "package": "google-antigravity",
+                "from_version": "0.1.2",
+                "to_version": "0.1.4",
+                "added": [],
+                "removed": [],
+                "changed": [],
+            }
+        ],
+    )
+
+    assert guarded["safe_to_implement"] is True
+    assert guarded["manual_design_required"] is False
+
+
+def test_manual_design_gate_forces_safe_to_implement_false() -> None:
+    architecture = with_manual_design_gate(
+        {
+            "findings": [],
+            "safe_to_implement": True,
+            "manual_design_required": True,
+        }
+    )
+
+    assert architecture["safe_to_implement"] is False
+
+
 def test_schema_validation_rejects_missing_required_field() -> None:
     with pytest.raises(SchemaValidationError):
         validate_mapping({"packages": [], "themes": []}, DIRECTION_ANALYSIS_SCHEMA, name="stage")
@@ -186,6 +660,34 @@ async def test_stage_execution_uses_agent_task_runtime_primitives(tmp_path: Path
     assert runtime.task.working_directory == tmp_path
     assert runtime.task.permissions.filesystem is FilesystemAccess.READ_ONLY
     assert runtime.task.metadata["stage"] == "direction-analysis"
+    assert "model" not in runtime.task.metadata
+    assert "reasoning_effort" not in runtime.task.metadata
+
+
+@pytest.mark.asyncio
+async def test_codex_stage_execution_uses_gpt_55_xhigh_thinking(tmp_path: Path) -> None:
+    runtime = RecordingRuntime(kind=AgentRuntimeKind.CODEX_AGENT_SDK)
+    context = RunContext(
+        run_id="run-1",
+        workspace=tmp_path,
+        report_root=tmp_path / "reports",
+        runtime="codex-agent-sdk",
+        event_log_path=tmp_path / "events.jsonl",
+        implementation_enabled=False,
+        draft_pr=False,
+    )
+
+    await run_stage(
+        runtime,
+        stage="direction-analysis",
+        payload={"evidence": {}, "api_diffs": []},
+        schema=DIRECTION_ANALYSIS_SCHEMA,
+        context=context,
+    )
+
+    assert runtime.task is not None
+    assert runtime.task.metadata["model"] == SDK_EVOLUTION_CODEX_MODEL
+    assert runtime.task.metadata["reasoning_effort"] == SDK_EVOLUTION_CODEX_REASONING_EFFORT
 
 
 @pytest.mark.asyncio
@@ -289,8 +791,25 @@ def test_reviewer_rejection_blocks_implementation() -> None:
     assert "reviewer" in gate.reason
 
 
+def test_reviewer_approved_status_allows_implementation() -> None:
+    gate = evaluate_implementation_gate(
+        {
+            "safe_to_implement": True,
+            "manual_design_required": False,
+            "recursive_self_adaptation_impact": False,
+        },
+        {"status": "approved"},
+        implementation_enabled=True,
+    )
+
+    assert gate.allowed is True
+
+
 @pytest.mark.asyncio
-async def test_run_agent_report_only_generates_artifacts(tmp_path: Path) -> None:
+async def test_run_agent_report_only_generates_artifacts(
+    tmp_path: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
     (tmp_path / "pyproject.toml").write_text(
         """
 [project.optional-dependencies]
@@ -299,6 +818,23 @@ async def test_run_agent_report_only_generates_artifacts(tmp_path: Path) -> None
         encoding="utf-8",
     )
     (tmp_path / "uv.lock").write_text("", encoding="utf-8")
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_current_api",
+        lambda package, *, version=None: ApiSnapshot(
+            package=package,
+            version=version,
+            module=package.replace("-", "_"),
+        ),
+    )
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv",
+        lambda package, version: ApiSnapshot(
+            package=package,
+            version=version,
+            module=package.replace("-", "_"),
+            source="isolated-venv",
+        ),
+    )
 
     report_path = await run_agent(
         RunOptions(
@@ -307,6 +843,7 @@ async def test_run_agent_report_only_generates_artifacts(tmp_path: Path) -> None
             packages=("claude-agent-sdk",),
             report_dir=Path("reports"),
             implementation_enabled=False,
+            inspect_candidates=False,
         ),
         pypi_client=_fake_pypi,
         runtime=FixtureEvolutionRuntime(),
@@ -314,15 +851,119 @@ async def test_run_agent_report_only_generates_artifacts(tmp_path: Path) -> None
 
     assert report_path.exists()
     assert (report_path.parent / "evidence.json").exists()
+    assert (report_path.parent / "release_notes.json").exists()
     assert (report_path.parent / "api_diffs.json").exists()
+    assert (report_path.parent / "behavior_probes.json").exists()
+    assert (report_path.parent / "behavior_diffs.json").exists()
+    assert (report_path.parent / "current_state.json").exists()
     assert (report_path.parent / "direction_analysis.json").exists()
     assert (report_path.parent / "architecture_decision.json").exists()
     assert (report_path.parent / "implementation_summary.json").exists()
     assert (report_path.parent / "review.json").exists()
     assert (report_path.parent / "events.jsonl").exists()
+    assert '"package": "claude-agent-sdk"' in (report_path.parent / "api_diffs.json").read_text(
+        encoding="utf-8"
+    )
     assert "Recursive self-adaptation impact" in report_path.read_text(encoding="utf-8")
 
 
+@pytest.mark.asyncio
+async def test_run_agent_autonomous_pr_path(
+    tmp_path: Path,
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    (tmp_path / "pyproject.toml").write_text(
+        """
+[project.optional-dependencies]
+claude = ["claude-agent-sdk>=0.2"]
+""",
+        encoding="utf-8",
+    )
+    (tmp_path / "uv.lock").write_text(
+        """
+[[package]]
+name = "claude-agent-sdk"
+version = "0.2.1"
+""",
+        encoding="utf-8",
+    )
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_current_api",
+        lambda package, *, version=None: ApiSnapshot(
+            package=package,
+            version=version,
+            module=package.replace("-", "_"),
+        ),
+    )
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.snapshot_candidate_in_venv",
+        lambda package, version: ApiSnapshot(
+            package=package,
+            version=version,
+            module=package.replace("-", "_"),
+            source="isolated-venv",
+        ),
+    )
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.collect_release_notes",
+        lambda packages, updates: [],
+    )
+    monkeypatch.setattr(
+        "examples.sdk_evolution_agent.cli.collect_behavior_evidence",
+        lambda packages, updates: {"results": [], "diffs": [], "summary": {"status": "pass"}},
+    )
+    commands: list[tuple[str, ...]] = []
+
+    def runner(
+        command: tuple[str, ...],
+        *,
+        cwd: Path | None = None,
+        env: dict[str, str] | None = None,
+    ) -> CommandResult:
+        del cwd, env
+        commands.append(command)
+        if command[:3] == ("uv", "lock", "--dry-run"):
+            return CommandResult(
+                command=command,
+                returncode=0,
+                stderr="Update claude-agent-sdk v0.2.1 -> v0.3.0\n",
+            )
+        if command[:2] == ("uv", "lock"):
+            return CommandResult(command=command, returncode=0, stdout="updated")
+        return CommandResult(command=command, returncode=0, stdout="ok")
+
+    report_path = await run_agent(
+        RunOptions(
+            workspace=tmp_path,
+            runtime="fake",
+            packages=("claude-agent-sdk",),
+            report_dir=Path("reports"),
+            implementation_enabled=True,
+            refresh_preview=True,
+            create_branch=True,
+            branch_name="sdk-update-test",
+            draft_pr=True,
+            pr_base="main",
+        ),
+        pypi_client=_fake_pypi,
+        command_runner=runner,
+        runtime=PermissiveRuntime(),
+    )
+
+    assert report_path.exists()
+    assert ("git", "switch", "-c", "sdk-update-test") in commands
+    assert ("uv", "lock", "-P", "claude-agent-sdk") in commands
+    assert any(command[:3] == ("git", "commit", "-m") for command in commands)
+    assert any(command[:4] == ("gh", "pr", "create", "--draft") for command in commands)
+    assert ("git", "commit", "-m", "Finalize SDK evolution report") in commands
+    assert commands.count(("git", "push", "-u", "origin", "sdk-update-test")) == 2
+    pr_index = next(
+        i for i, command in enumerate(commands) if command[:3] == ("gh", "pr", "create")
+    )
+    finalize_index = commands.index(("git", "commit", "-m", "Finalize SDK evolution report"))
+    assert finalize_index > pr_index
+
+
 def test_parse_args_and_pr_body() -> None:
     options = parse_args(
         [
@@ -338,6 +979,7 @@ def test_parse_args_and_pr_body() -> None:
 
     assert options.runtime == "claude-agent-sdk"
     assert options.packages == ("claude-agent-sdk",)
+    assert options.inspect_candidates is True
     assert options.implementation_enabled is True
     assert options.draft_pr is True
     assert "No auto-merge" in body
@@ -347,6 +989,7 @@ def test_build_registry_injects_isolated_codex_home() -> None:
     runtime = build_registry().resolve(AgentRuntimeKind.CODEX_AGENT_SDK)
 
     assert isinstance(runtime, CodexAgentRuntime)
+    assert runtime._default_model == SDK_EVOLUTION_CODEX_MODEL
     assert runtime._env is not None
     assert runtime._env["CODEX_HOME"] == str(SDK_EVOLUTION_CODEX_HOME)
 
@@ -378,6 +1021,54 @@ async def cancel(self, task_id: str) -> None:
         del task_id
 
 
+class PermissiveRuntime(RecordingRuntime):
+    async def run(self, task: AgentTask) -> AgentResult:
+        self.task = task
+        stage = task.metadata["stage"]
+        if stage == "direction-analysis":
+            payload = {"packages": [], "themes": [], "uncertainty": []}
+        elif stage == "architecture-decision":
+            payload = {
+                "findings": [],
+                "safe_to_implement": True,
+                "manual_design_required": False,
+                "recursive_self_adaptation_impact": False,
+                "self_adaptation_plan": ["Update SDK lockfile."],
+                "verification_commands": [],
+                "uncertainty": [],
+            }
+        elif stage == "review":
+            payload = {"status": "pass", "reasons": [], "required_changes": []}
+        else:
+            payload = {
+                "applied": False,
+                "changes": [],
+                "verification_results": [],
+                "blocked_reason": "",
+            }
+        return AgentResult(output="{}", parsed_output=payload)
+
+
+def _probe(
+    package: str,
+    version: str,
+    scope: str,
+    status: str,
+    details: dict[str, Any],
+) -> Any:
+    from examples.sdk_evolution_agent.models import BehaviorProbeResult
+
+    return BehaviorProbeResult(
+        package=package,
+        version=version,
+        scope=scope,
+        probe="adapter-contract",
+        status=status,
+        summary=status,
+        details=details,
+    )
+
+
 def _fake_pypi(package: str) -> dict[str, Any]:
     assert package == "claude-agent-sdk"
     return {