Release: version packages #121

Open
github-actions[bot] wants to merge 1 commit into main from changeset-release/main

Conversation

@github-actions

@github-actions github-actions Bot commented Jun 23, 2026

This PR was opened by the Changesets release GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine: whenever you add more changesets to main, this PR will be updated.

Releases

@tangle-network/browser-agent-driver@0.35.0

Minor Changes

  • #122 b0f74a4 Thanks @drewstone! - The default provider is now credential-aware instead of hard-coded to openai. A bare run (no --provider/--model, no config-file provider) uses OpenAI when OPENAI_API_KEY is set, which is unchanged for existing users and CI, and otherwise falls back to an available provider (claude-code, which needs no key) rather than failing on a missing OpenAI key. An explicit provider in CLI flags or a config file is always honored, and the default model maps per-provider as before (e.g. gpt-5.4 → sonnet for claude-code). This removes the last place the no-flag path assumed OpenAI; the engine already supported openai/anthropic/google/claude-code/zai for both text and vision.

  • #124 a2055b2 Thanks @drewstone! - design-audit (reference-grounded): make the redesign engine job-first instead of aesthetic-first. The old engine grounded every page in a world-class exemplar's visual DNA and judged on visual craft, so it regressed functional pages into generic brochures — a docs page lost its table-of-contents and dense reference content for two marketing cards and a hero; an aggregator dropped from 30 items to 9; a status dashboard shed services into spacious cards. The fix:

    • Generator (reference/generate/prompt.ts): persona reframed from art director to product designer. New hard rules in priority order — task-first (design for the page's users and the job in its intent) → preserve functional affordances (never delete navigation/ToC/search to look cleaner) → preserve density where it is the value (docs/dashboards/feeds keep their item count) → right-size the intervention (never turn one kind of page into another) → the exemplar is a source of visual craft only, never a structural template.
    • Functional contract: a per-page preservation block derived from the page's own measured DNA (navigation-affordance count, layout density, archetype) so "keep what works" is concrete and data-driven, not exhortation — and density is required only when the page is actually measured dense, so a genuinely sparse page is never forced to stay dense.
    • Ranker/judge (reference/judge/prompt.ts): scores task fitness and functional preservation BEFORE visual craft; a polished direction that removes navigation or reduces density loses. "Fit to the reference" counts only as visual craft.

    Validated by re-running the regressed pages: docs now keeps its ToC + prev/next nav + dense code examples; HN keeps all 30 stories + nav; the status dashboard stays a dense service grid with real values. No provider coupling; flag-gated reference engine only.
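The functional contract described above can be illustrated with a minimal sketch. It assumes a simplified measurement shape: `MeasuredPage`, its field names, and `renderFunctionalContractSketch` are hypothetical and are not the repo's actual DesignDNA contract or prompt builder. The point it shows is the gating: each preservation rule is emitted only when the page's own measurements support it, so a sparse page is never told to stay dense.

```typescript
// Hypothetical shapes for illustration only; not the repo's real contract types.
type Density = "sparse" | "balanced" | "dense";

interface MeasuredPage {
  navCount: number;   // distinct navigation affordances measured on the page
  density: Density;   // measured layout density
  archetype?: string; // e.g. "docs", "dashboard", "feed"
}

// Build a preservation block from measurements; emit only what was measured.
function renderFunctionalContractSketch(page: MeasuredPage): string {
  const lines: string[] = [];
  if (page.navCount > 0) {
    lines.push(`- Keep all ${page.navCount} navigation affordances.`);
  }
  // Density is required only when the page is actually measured dense.
  if (page.density === "dense") {
    lines.push("- Preserve information density; do not reduce the item count.");
  }
  if (page.archetype) {
    lines.push(`- The page must remain a ${page.archetype} page.`);
  }
  // No measured signal means no contract block at all.
  return lines.length > 0 ? ["FUNCTIONAL CONTRACT", ...lines].join("\n") : "";
}

console.log(
  renderFunctionalContractSketch({ navCount: 3, density: "dense", archetype: "docs" }),
);
```

Returning an empty string when nothing was measured keeps the prompt free of vacuous rules, which matches the changelog's framing of the contract as data-driven rather than exhortation.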

Patch Changes

  • #123 20942c2 Thanks @drewstone! - design-audit (reference-grounded): enforce content fidelity so a redesign never fabricates content the page lacks. On a content-sparse page grounded against a dense exemplar, the generator would invent factual content to fill the layout (e.g. a placeholder page gaining a fake "Recent Activity" feed with timestamps, invented status/RFC/registry data), and the pairwise direction-ranker rewarded that invented density as "richer" — so applied to a real app the audit could inject fabricated data into the UI. Now the generator may restyle/regroup/re-rank only the page's real content (the exemplar governs how it looks, never what content it has; a sparse page stays proportionally restrained), the ranker penalises invented content as unfaithful instead of rewarding it, and the apply prompt carries a defense-in-depth "do not invent content" guardrail. No provider coupling.

  • #120 f11b899 Thanks @drewstone! - design-audit (reference-grounded): make redesign generation work with reasoning models. The generator capped output at 2200 tokens, which a reasoning model (e.g. GLM-5.2, o-series) spends on its thinking before the answer — so the JSON direction came back empty or truncated and the audit fell back with a misleading "no JSON object found". Raise the per-direction budget to 8000 (non-reasoning models stop at the closing brace and never use the extra, so it's free for them), and report empty vs truncated vs non-JSON output distinctly so a budget/limit issue is diagnosable. No coupling to any one provider — the engine already runs on openai/anthropic/google/claude-code/zai.
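To make the "empty vs truncated vs non-JSON" distinction from #120 concrete, here is a minimal sketch of one way to classify a model response. `classifyModelOutput` and `JsonOutcome` are hypothetical names chosen for this example; the repo's actual parser is not shown in this PR and may work differently.

```typescript
// Hypothetical classifier for illustration; not the repo's actual parser.
type JsonOutcome =
  | { kind: "ok"; value: unknown }
  | { kind: "empty" }      // the model returned nothing (e.g. budget spent on reasoning)
  | { kind: "truncated" }  // an object was opened but never closed
  | { kind: "non-json" };  // there is text, but no parseable JSON object

function classifyModelOutput(raw: string): JsonOutcome {
  const text = raw.trim();
  if (text.length === 0) return { kind: "empty" };

  const start = text.indexOf("{");
  if (start === -1) return { kind: "non-json" };

  // Walk from the first brace, tracking nesting depth and skipping braces in strings.
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) {
        try {
          return { kind: "ok", value: JSON.parse(text.slice(start, i + 1)) };
        } catch {
          return { kind: "non-json" };
        }
      }
    }
  }
  // The input ended while an object was still open: likely an output-token limit.
  return { kind: "truncated" };
}

console.log(classifyModelOutput('{"direction": "keep the ToC"').kind);
```

Reporting the three failure kinds separately is what lets a caller tell a token-budget problem ("empty" or "truncated") apart from a model that simply ignored the response format.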

tangletools
tangletools previously approved these changes Jun 23, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 70dd270f

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T09:34:16Z

tangletools
tangletools previously approved these changes Jun 23, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 1f81234d

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T10:44:42Z

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict: sound
Concerns: 0 (none)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 241.9s (2 bridge agents)
Total: 241.9s

💰 Value — sound

Standard Changesets release PR: bumps 0.34.0 → 0.35.0 with a correct minor (credential-aware provider, #122) + patch (reasoning-token headroom, #120); merges to trigger OIDC npm publish.

  • What it does: Generated by changesets/action on main. Bumps package.json version 0.34.0 → 0.35.0, appends a '## 0.35.0' section to CHANGELOG.md with one Minor entry (PR #122, credential-aware default provider) and one Patch entry (PR #120, redesign-generation token headroom for reasoning models + distinct empty/truncated/non-JSON diagnostics), and deletes the two consumed .changeset files (credential-aware-defa
  • Goals it achieves: Cut a release containing the two merged changes since 0.34.0. The minor bump reflects the credential-aware default-provider behavior change; the patch reflects the design-audit reasoning-model fix. Goal is to ship both to npm consumers.
  • Assessment: Correctly formed Changesets release PR. Semver is right: the provider-default change is additive/backward-compatible (OpenAI unchanged when OPENAI_API_KEY is set; only the bare no-key run stops hard-failing), so minor — not major — is appropriate. Patch entry is correctly classified. CHANGELOG reproduces changeset summaries verbatim with PR/commit attribution, consistent with the repo's documented
  • Better / existing approach: none — this is the right approach. The repo explicitly standardized on Changesets + OIDC (CLAUDE.md 'Releases'); this is the version-bump step of that exact flow. No existing in-repo alternative to reuse — version bumps and CHANGELOG generation are the tool's job.
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error

🎯 Usefulness — sound

Release bundles two in-grain, well-wired fixes: a credential-aware default provider that removes the last hardcoded openai on the no-flag path, and a design-audit token-budget/headroom fix that unblocks reasoning models — both reachable and correctly integrated.

  • Integration: Both changes land on live, central paths. resolveDefaultProvider() (provider-defaults.ts:111) is consumed by loadConfig (config.ts:249,256 — the single config entry), plus the run/test-runner/design-audit CLI commands (run.ts:192,332,340; test-runner.ts:1150; cli-design-audit.ts:291). The env-load ordering the design depends on is real: cli.ts:25 calls loadLocalEnvFiles at the top of `main()
  • Fit with existing patterns: Fits the codebase's grain precisely. The engine is already multi-provider (8-entry SupportedProvider union in provider-defaults.ts:1); #122 removes the LAST openai assumption on the bare-run path, exactly as framed. The design-audit fix follows the existing BrainGeneratorOptions.maxOutputTokens override seam and the fail-closed parser discipline — coerceNumber/coerceNumberArray preserve th
  • Real-world viability: Holds up past the happy path. resolveDefaultProvider is an idempotent env read with no race (env is stable post-startup); the no-key fall-through to claude-code is the designed keyless path, and resolveProviderApiKey still pulls ANTHROPIC_API_KEY for it when present. The generator uses Promise.allSettled so a single truncated/errored slot is dropped, never fatal, and the new `extractJson
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1
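The resolution order that #122 describes can be sketched as a small pure function. This is an illustration under stated assumptions, not the body of the repo's resolveDefaultProvider: `pickDefaultProvider` and `ProviderInputs` are hypothetical names, the provider union is narrowed to two entries, and the precedence of CLI flags over the config file is an assumption (the changelog only says an explicit provider from either source is always honored).

```typescript
// Hypothetical sketch; the real resolveDefaultProvider is not shown in this PR.
type Provider = "openai" | "claude-code";

interface ProviderInputs {
  cliProvider?: Provider;    // from --provider, if given
  configProvider?: Provider; // from a config file, if present
  env: Record<string, string | undefined>;
}

function pickDefaultProvider(inputs: ProviderInputs): Provider {
  // An explicit choice always wins. CLI-before-config is an assumption here.
  if (inputs.cliProvider) return inputs.cliProvider;
  if (inputs.configProvider) return inputs.configProvider;
  // Bare run: keep OpenAI when a key is available (unchanged for existing users).
  if (inputs.env.OPENAI_API_KEY) return "openai";
  // Otherwise fall back to the keyless provider instead of failing.
  return "claude-code";
}

console.log(pickDefaultProvider({ env: {} }));
```

Passing the environment in as a parameter, rather than reading a global, keeps the sketch deterministic and easy to test; whether the repo does the same is not stated in the source.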

No concerns — sound change, no better or existing approach found. ✅


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260623T114548Z

@tangletools

✅ No Blockers — 1f81234d

Readiness 95/100 · Confidence 70/100 · 0 findings (none)

| Metric | glm | deepseek | aggregate |
|---|---|---|---|
| Readiness | 95 | 95 | 95 |
| Confidence | 70 | 70 | 70 |
| Correctness | 95 | 95 | 95 |
| Security | 95 | 95 | 95 |
| Testing | 95 | 95 | 95 |
| Architecture | 95 | 95 | 95 |

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

No findings.


tangletools · 2026-06-23T11:48:44Z · trace

tangletools
tangletools previously approved these changes Jun 23, 2026

@tangletools tangletools left a comment

✅ Auto-approved PR — 6918e89a

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T17:47:43Z

@tangletools

✅ No Blockers — 6918e89a

Readiness 95/100 · Confidence 70/100 · 0 findings (none)

| Metric | glm | deepseek | aggregate |
|---|---|---|---|
| Readiness | 95 | 95 | 95 |
| Confidence | 70 | 70 | 70 |
| Correctness | 95 | 95 | 95 |
| Security | 95 | 95 | 95 |
| Testing | 95 | 95 | 95 |
| Architecture | 95 | 95 | 95 |

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

No findings.


tangletools · 2026-06-23T18:29:44Z · trace

@tangletools tangletools left a comment

🟡 Value Audit — sound-with-nits

Verdict: sound-with-nits
Concerns: 1 (1 weak-concern)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 104.7s (2 bridge agents)
Total: 104.7s

💰 Value — sound-with-nits

Release PR bundles a content-fidelity fix that stops the reference-grounded redesign engine from inventing page content the app doesn't have — coherent, defense-in-depth, and squarely in this codebase's anti-fabrication grain.

  • What it does: Adds 'do not fabricate content' rules to three prompt surfaces in the design-audit redesign pipeline: (1) the generator system prompt (generate/prompt.ts:54-65) forbids inventing metrics/feeds/dates and tells it to keep sparse pages restrained; (2) the pairwise judge system prompt (judge/prompt.ts:46-50,86) penalizes invented content as unfaithful instead of rewarding it as 'richer'; (3) the codin
  • Goals it achieves: Stop a real observed failure: a content-sparse page grounded against a dense exemplar caused the generator to fabricate factual content (e.g. a placeholder page gaining a fake 'Recent Activity' feed with timestamps), AND the pairwise ranker rewarded that invented density as 'richer' — so applying the audit to a real app could inject fabricated data into the UI. The fix makes the redesign restyle/r
  • Assessment: Good change. Three things make it sound: (a) It's defense-in-depth — fixing only the generator would leave the judge still rewarding fabrication, and fixing only those two would leave the coding-agent apply step free to re-invent content at implementation time; covering generate→judge→apply closes the loop. (b) It's in the grain: 'never fabricate' is a first-class invariant of this engine (~40 occ
  • Better / existing approach: No materially better approach for the immediate goal. I searched for an existing content-inventory/real-content mechanism (contentSnapshot|pageContent|realContent|contentInventory|textContent under src/design) — none exists; the only textContent uses are in tokens/extract.ts and measure/contrast.ts for fingerprinting/contrast, not content fidelity. So nothing to reuse or extend. A stronger long-te
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: opencode: opencode error

🎯 Usefulness — sound

Content-fidelity guardrails added at all three prompt layers (generator, judge, apply) of the reference-grounded redesign pipeline, fully reachable through production call paths and matching the codebase's existing prompt-constraint pattern.

  • Integration: All three modified builders are hot on production paths. buildApplyPrompt is written to .apply-prompt.md for every audited page at src/cli-design-audit.ts:455 and feeds runAgentEvolveLoop via evolve/index.ts:11. buildDirectionPrompt runs once per direction in src/design/audit/reference/generate/generator.ts:78. buildPairwisePrompt runs in both judges at src/design/audit/reference/judge/t
  • Fit with existing patterns: Perfectly in-grain. Every other guardrail in this pipeline (ANTI_POSITION_BIAS at judge/prompt.ts:38, RESPONSE_CONTRACT, the existing 'NEVER invent an exemplar id' rule at generate/prompt.ts:51) is also a prompt-string constraint, not a code-enforced invariant. The new CONTENT_FIDELITY constant and the added generator/apply bullets follow the identical pattern — no competing mechanism, no duplicat
  • Real-world viability: Low risk. The change is additive text inside prompt strings plus regression tests that assert the substrings survive (design-audit-evolve-agent.test.ts:42-47, design-audit-reference-generate.test.ts:193-205, design-audit-reference-judge.test.ts:121-128). There are no new code paths, no concurrency surface, no error-handling changes — the only 'input' is the prompt template itself, which is static.
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

💰 Value Audit

🟡 Forbidden-content example lists are restated in parallel across 3 prompts and can drift [maintenance]

The enumerated 'no fabricated metrics, counts, dates, statuses, activity feeds' list appears independently in generate/prompt.ts:55-57, judge/prompt.ts:48-49, and evolve/agent.ts:105 with slightly different wording and item sets. If the canonical set of fabricated-content shapes grows (e.g. 'fake testimonials', 'invented nav items'), three sites must be hand-updated in lockstep or the judge/generator/apply guardrails silently diverge. Role-tailored phrasing justifies not collapsing to one shared
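One way to reduce the drift risk raised in this concern, while still keeping role-tailored phrasing, is to share only the list of forbidden content shapes and let each prompt wrap it in its own wording. The sketch below is hypothetical: `FABRICATED_CONTENT_SHAPES`, `Role`, and `contentFidelityRule` are names invented for this example, and the sentences are not the repo's actual prompt text. Only the five list items come from the audit's quoted wording.

```typescript
// Hypothetical sketch: one canonical list, three role-specific phrasings.
const FABRICATED_CONTENT_SHAPES = [
  "metrics",
  "counts",
  "dates",
  "statuses",
  "activity feeds",
] as const;

type Role = "generator" | "judge" | "apply";

// Each role keeps its own sentence, but all of them read the same list,
// so adding a new shape updates every guardrail at once.
function contentFidelityRule(role: Role): string {
  const list = FABRICATED_CONTENT_SHAPES.join(", ");
  switch (role) {
    case "generator":
      return `Never invent ${list}; use only the page's real content.`;
    case "judge":
      return `Penalise any direction that invents ${list} as unfaithful.`;
    case "apply":
      return `Do not invent content (${list}) while implementing the redesign.`;
  }
}

console.log(contentFidelityRule("judge"));
```

Whether this trade is worth making is the reviewer's call; the audit itself notes that role-tailored phrasing is a reason the three sites were kept separate.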


value-audit · 20260623T183020Z

@tangletools tangletools left a comment

✅ Auto-approved PR — b79ee1c9

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-23T23:25:56Z

@tangletools

✅ No Blockers — b79ee1c9

Readiness 95/100 · Confidence 70/100 · 0 findings (none)

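| Metric | opencode-kimi | glm | deepseek | aggregate |
|---|---|---|---|---|
| Readiness | 95 | 95 | 95 | 95 |
| Confidence | 70 | 70 | 70 | 70 |
| Correctness | 95 | 95 | 95 | 95 |
| Security | 95 | 95 | 95 | 95 |
| Testing | 95 | 95 | 95 | 95 |
| Architecture | 95 | 95 | 95 | 95 |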

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

No findings.


tangletools · 2026-06-24T01:46:25Z · trace

@tangletools tangletools left a comment

🟢 Value Audit — sound

Verdict: sound
Concerns: 1 (1 low)
Heuristic: 0.0s
Duplication: 0.0s
Interrogation: 193.0s (2 bridge agents)
Total: 193.0s

💰 Value — sound

Reframes reference-grounded design audit from aesthetic copycat to task-first product design, with content-fidelity guardrails and data-driven functional preservation — a coherent, worthwhile change.

  • What it does: Changes the reference-grounded redesign engine's prompts from an 'art director' persona to a 'senior product designer' persona. The generator now has hard, priority-ordered rules: task fitness first, preserve navigation/wayfinding, preserve information density on dense pages, never turn one page type into another, use only the page's real content, and never fabricate metrics/feeds/sections. A per-
  • Goals it achieves: Fix the observed failure modes where reference-grounded redesigns turned functional pages (docs, dashboards, aggregators) into sparse marketing brochures by copying the reference's structure, and where sparse pages grounded against dense exemplars were padded with fabricated data/metrics/activity feeds. It makes the redesign serve the page's actual users and job rather than visual mimicry.
  • Assessment: Good change, built in the grain of the codebase. It aligns the reference-grounded engine with the task-first, product-designer framing already used by the v1 classifier (src/design/audit/classify.ts:42) and evaluator (src/design/audit/evaluate.ts:415). It reuses existing DesignDNA fields (src/design/audit/reference/contracts.ts:263-289) for the functional contract instead of inventing new measurem
  • Better / existing approach: none — this is the right approach. I searched the reference engine (src/design/audit/reference/generate, src/design/audit/reference/judge, src/design/audit/reference/artifact, src/design/audit/reference/dna, src/design/audit/reference/generate/parse.ts), the v1 audit path (src/design/audit/evaluate.ts, src/design/audit/classify.ts), and the rubric fragments (src/design/audit/rubric/fragments). No
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound

A coherent job-first reframe of the design-audit prompt layer — reachable from all production callers, data-driven off measured DNA, and robust to sparse/edge pages; no dead surface and no competing pattern.

  • Integration: All three changed prompt builders have live production callers in this PR's codebase: buildDirectionPrompt is called by generate/generator.ts:78 (the per-exemplar fan-out); buildPairwisePrompt/buildQualityPrompt by judge/text-judge.ts:46-47 and judge/vision-judge.ts:162-163; buildApplyPrompt is re-exported from evolve/index.ts:11 and called by cli-design-audit.ts:455. Nothing is orphaned.
  • Fit with existing patterns: Fits the established grain. The new renderFunctionalContract (generate/prompt.ts:126) is a structural twin of the pre-existing renderConstraints (generate/prompt.ts:149) — both read ctx fields, gate emission on presence, and emit a labeled block. The judge changes keep the same anti-position-bias/RESPONSE_CONTRACT skeleton and only swap the persona/priority ordering. No competing or duplicated cap
  • Real-world viability: Holds up off the happy path. The contract's three inputs are genuinely measured, not hardcoded: components.nav = distinctNavCount (dna/derive.ts:304), layout.density = deriveDensity (derive.ts:308,344), layout.archetype = deriveArchetype (derive.ts:345), and Density is the lowercase 'sparse'|'balanced'|'dense' union (contracts.ts:85) so the === 'dense' gate matches real output. Gating is defensive
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: todo added src/design/audit/reference/generate/prompt.ts

  • ' facts or use placeholders like "TODO" or "lorem ipsum".',

value-audit · 20260624T014759Z
