Skip to content

fix(agent-skills): write skill files atomically to prevent partial reads#1144

Merged
BYK merged 3 commits into
mainfrom
byk/fix/agent-skills-atomic-write
Jun 26, 2026
Merged

fix(agent-skills): write skill files atomically to prevent partial reads#1144
BYK merged 3 commits into
mainfrom
byk/fix/agent-skills-atomic-write

Conversation

@BYK

@BYK BYK commented Jun 26, 2026

Copy link
Copy Markdown
Member

Problem

installAgentSkills writes each skill file with an in-place writeFile,
which truncates the destination before rewriting it. AI coding agents
(Claude Code, OpenCode) discover skills by globbing the skills tree for
SKILL.md and parsing the YAML frontmatter. If an agent's skill scan lands
inside that truncate→rewrite window, it reads empty/partial frontmatter and
silently drops the skill.

Because the generated SKILL.md carries a build-timestamp version, its
content changes on every dev build, so the file is rewritten frequently —
making the race realistically reachable during normal install churn.

Note: this hardens the write side. The most common reason a freshly
installed skill doesn't appear is that OpenCode memoizes its skill scan at
session startup and only re-scans on restart — that's a separate,
agent-side concern and is not addressed here.

Fix

Add atomicWriteFile(destPath, content) and use it in writeSkillFiles:

  • Write to a uniquely-named temp file .<name>.<pid>.<rand>.tmp in the
    same directory as the destination, then rename() into place.
  • Same directory ⇒ same filesystem ⇒ the rename is atomic on POSIX, so a
    concurrent reader always sees the complete old or complete new file.
  • The temp name is dot-prefixed and .tmp-suffixed so it never matches the
    SKILL.md glob agents look for, even while it briefly exists.
  • On failure the temp file is removed best-effort (cleanup errors reported at
    debug level, no silent catch) and the original error is re-thrown.

Tests

  • New test asserts no .tmp leftovers remain in the skill dir or
    references/ after install (guards the rename/cleanup path).
  • agent-skills.test.ts: 13/13 pass · cli/setup.test.ts: 30/30 pass ·
    typecheck + biome clean.

In-place writeFile truncates SKILL.md before rewriting it. AI coding
agents (Claude Code, OpenCode) discover skills by globbing the skills
tree for SKILL.md and parsing frontmatter; a scan landing inside that
write window reads empty frontmatter and silently drops the skill.

Write to a uniquely-named temp file (.SKILL.md.<pid>.<rand>.tmp, which
never matches the SKILL.md glob) in the same directory, then rename() it
into place. Same-dir guarantees same filesystem, so the rename is atomic
on POSIX and a concurrent reader always sees the complete old or complete
new file. The temp file is cleaned up best-effort on failure.
@github-actions github-actions Bot added the risk: medium PR risk score: medium label Jun 26, 2026
@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor
PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://cli.sentry.dev/_preview/pr-1144/

Built to branch gh-pages at 2026-06-26 10:32 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Codecov Results 📊

✅ Patch coverage is 100.00%. Project has 5113 uncovered lines.
✅ Project coverage is 81.53%. Comparing base (base) to head (head).

Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
+ Coverage    81.52%    81.53%    +0.01%
==========================================
  Files          397       397         —
  Lines        27671     27679        +8
  Branches     17966     17966         —
==========================================
+ Hits         22556     22566       +10
- Misses        5115      5113        -2
- Partials      1867      1868        +1

Generated by Codecov Action

The sole prior atomic-write test ("leaves no temp files behind") passed even
when atomicWriteFile was reverted to a plain in-place writeFile — proven via
mutation — so it guarded nothing.

- Wrap node:fs/promises to observe the temp-file + rename mechanism and to
  inject a rename failure.
- Add a guard asserting every writeFile targets a `.tmp` staging path and the
  destination is published via rename. Fails under a non-atomic mutation.
- Add a rollback test: when rename fails, install returns null, the temp file
  is cleaned up, and no partial SKILL.md is left — covering the previously
  untested catch/cleanup branch.
- Clarify atomicWriteFile JSDoc: correct temp-file name pattern, note the
  POSIX-only atomicity guarantee, and document the dir-exists precondition.
@BYK

BYK commented Jun 26, 2026

Copy link
Copy Markdown
Member Author

CI note: E2E skill-eval failure is an Anthropic API outage, not a regression

The red E2E Tests job is the two test/e2e/skill-eval.test.ts LLM-eval cases (claude-sonnet-4-6/claude-opus-4-6 meets threshold) failing with:

[planner] API error: Invalid response body while trying to fetch https://api.anthropic.com/v1/messages: Premature close
AssertionError: expected 0 to be greater than or equal to 0.75

These call api.anthropic.com directly; the Premature close → score 0 is an upstream API outage. All 126 non-LLM E2E tests pass, and this PR only touches the skill-file write mechanism (agent-skills.ts), which cannot affect eval scores. Re-running once the API recovers should clear it.

Pushed: test hardening (commit 3393368)

An adversarial self-review found the original atomic-write test was tautological — it passed even when atomicWriteFile was reverted to a plain writeFile (verified by mutation). Replaced it with real guards:

  • asserts every writeFile targets a .tmp staging path and the dest is published via rename (fails under a non-atomic mutation);
  • a rollback test covering the previously-untested catch/cleanup branch (rename fails → returns null, temp cleaned up, no partial SKILL.md).

Add a test where both rename and the best-effort temp rm fail, so the inner
cleanup catch in atomicWriteFile (captureException, not rethrow) runs and the
original rename error still propagates to null. Brings agent-skills.ts to 100%
line/function coverage; the new function is now fully exercised.
@BYK BYK merged commit 97e7fcc into main Jun 26, 2026
30 checks passed
@BYK BYK deleted the byk/fix/agent-skills-atomic-write branch June 26, 2026 10:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

risk: medium PR risk score: medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant