Skip to content

fix: retry LSP/MCP toolsets after tool calls, covering env-wrapped commands (fixes #2457)#2458

Open
simonferquel-clanker wants to merge 4 commits intodocker:mainfrom
simonferquel-clanker:fix/retry-missing-lsp-mcp-2457
Open

fix: retry LSP/MCP toolsets after tool calls, covering env-wrapped commands (fixes #2457)#2458
simonferquel-clanker wants to merge 4 commits intodocker:mainfrom
simonferquel-clanker:fix/retry-missing-lsp-mcp-2457

Conversation

@simonferquel-clanker
Copy link
Copy Markdown
Contributor

Summary

Fixes #2457.

When an LSP or MCP server is configured but not available at session start (binary missing, flag file absent, service not yet running), the agent now retries within the same user turn — after each tool-call batch, without requiring a new user message.

Validated end-to-end with tmux for both MCP and LSP (including the env -C ./backend gopls pattern).

Root causes fixed

MCP — already partially retried, now also covers crashed connections

Toolset.Healthy() returns ts.started. When watchConnection cannot recover the connection it sets ts.started=false, causing StartableToolSet to retry Start() on the next reprobe.

LSP — env -C ./dir gopls was silently swallowed

env always starts successfully (cmd.Start() returns nil), so StartableToolSet.started was set to true even when the inner binary wasn't installed. The new watchExit goroutine is the sole owner of cmd.Wait; it resets h.cmd=nil when the process exits quickly, making Healthy()=false so the next reprobe retries Start().

Changes

File What changed
pkg/tools/startable.go HealthChecker interface; StartableToolSet checks health on already-started toolsets; ShouldReportFailure/ConsumeRecovery for deduplication
pkg/tools/builtin/lsp.go Healthy(), watchExit goroutine (sole cmd.Wait owner, no double-Wait race), exited chan for clean Stop()
pkg/tools/mcp/mcp.go Healthy()
pkg/runtime/loop.go reprobe() called after every tool-call batch; compares by name-set diff; updates agentTools in-place
pkg/runtime/event.go + client.go NoticeEvent / Notice()
pkg/tui/page/chat/runtime_events.go NoticeEventnotification.InfoCmd()
pkg/agent/agent.go Deduplicated warnings; recovery notices

Tests

  • pkg/tools/startable_test.go: 6 new tests covering ShouldReportFailure, ConsumeRecovery, HealthChecker integration
  • pkg/tools/builtin/lsp_test.go: 4 new tests including TestLSPTool_StartableToolSet_RetriesAfterProcessExits (flag-gated fake LSP, the env -C scenario)
  • pkg/agent/agent_test.go: 2 new tests (deduplication, recovery notice)
  • pkg/runtime/runtime_test.go: 2 new tests (TestReprobe_NewToolsAvailableAfterToolCall, TestReprobe_NoChangeMeansNoExtraEvents)

tmux e2e results

MCP (gate flag controls server availability):

✓ Shell  touch /tmp/e2e-test/mcp-ready
✓ hello  name=World  Hello, World! MCP is working.
Tools sidebar: 6 tools available  ← was 5

LSP (env -C /tmp/lsp-e2e python3 fake-gopls.py, gate flag controls inner script):

✓ Shell  touch /tmp/lsp-e2e/gopls-ready
✓ Get Workspace Info  (lsp_workspace called natively)
Tools sidebar: 20 tools available  ← was 5

Both in a single user turn, no new user message needed.

…mmands (docker#2457)

When an LSP or MCP server is not available at session start (binary missing,
service not yet running, etc.), the agent now retries automatically after
every tool-call batch — within the same user turn — without requiring a new
user message.

Key changes:

tools/startable.go
- Add HealthChecker interface: Healthy() bool
  Toolsets that wrap a process implement this to signal when the backing
  process has died unexpectedly (e.g. the binary was not yet installed).
- StartableToolSet.Start() now calls HealthChecker.Healthy() on an already-
  started toolset; if unhealthy it resets started=false and retries Start().
- Add ShouldReportFailure()/ConsumeRecovery() to deduplicate warnings and
  emit a one-time recovery notice instead of repeating the failure message.

tools/builtin/lsp.go
- Implement HealthChecker: Healthy() returns h.cmd != nil.
- Add watchExit goroutine as the sole owner of cmd.Wait. It resets h.cmd=nil
  when the process exits, making Healthy() return false so the next reprobe
  retries Start(). This correctly handles env -C ./backend gopls where env
  starts fine but the inner binary exits 127 if gopls is not installed.
- stopLocked waits on the exited channel instead of calling cmd.Wait itself,
  eliminating the double-Wait race.

tools/mcp/mcp.go
- Implement HealthChecker: Healthy() returns ts.started.
  When watchConnection cannot recover the connection, ts.started=false →
  Healthy()=false → StartableToolSet retries Start() on the next reprobe.

runtime/loop.go
- After processToolCalls (when calls were made), run reprobe():
  re-probes all configured toolsets via getTools() and returns updated
  agentTools if new tools appeared (by name-set diff, not just count).
  The very next tryModelWithFallback call in the same loop iteration
  receives the updated tool list — new tools are visible to the model
  within the same user turn.

runtime/event.go + client.go
- Add NoticeEvent/Notice() as informational counterpart to WarningEvent.

tui/page/chat/runtime_events.go
- Handle NoticeEvent → notification.InfoCmd().

Validated with tmux e2e tests for both MCP and LSP (env -C ./dir pattern):
single user turn installs the server via a tool call, reprobe fires, and
the next model response uses the new tools natively.

Fixes docker#2457

Assisted-By: docker-agent
@simonferquel-clanker simonferquel-clanker requested a review from a team as a code owner April 17, 2026 13:13
- stopLocked: change signature to void (always returned nil)
- agent_test: assert.Nil → assert.Empty for DrainWarnings/DrainNotices
- lsp_test: rename 'tools' local var to 'toolList' (importShadow)
- lsp_test: add explanation to nolint:errcheck directives (gocritic whyNoLint)
- startable_test, lsp.go: gofumpt formatting

Assisted-By: docker-agent
@rumpl
Copy link
Copy Markdown
Member

rumpl commented Apr 17, 2026

I'm not sure I understand the fix, isn't the lsp tooset startable? Wouldn't re-starting the lsp toolset be enough?

@simonferquel-clanker
Copy link
Copy Markdown
Contributor Author

Good question — the LSP toolset is Startable, but there are two separate problems that a naive re-start doesn't cover:

1. env -C ./backend gopls looks like a successful start

When the command is env -C ./backend gopls, exec.Command("env", ...).Start() always returns nilenv is always present. The fact that env then immediately runs gopls, finds it missing, and exits 127 is invisible to cmd.Start(). By the time we return from LSPTool.Start() / startLocked(), h.cmd has been set and the call returned nil, so StartableToolSet sets started = true and considers the toolset live.

A subsequent call to Start() hits the if s.started { return nil } short-circuit and never retries — even though the underlying process died in milliseconds.

2. The HealthChecker interface closes that gap

The fix adds a HealthChecker interface (Healthy() bool) that StartableToolSet checks on every Start() call even when started == true. LSPTool.Healthy() returns h.cmd != nil.

The new watchExit goroutine is the mechanism that transitions h.cmd from non-nil to nil: it is the sole owner of cmd.Wait(), blocks until the process exits, then resets h.cmd = nil. So after env gopls exits 127, Healthy() returns false, StartableToolSet resets started = false, and the next Start() call actually retries — which will succeed once gopls has been installed.

3. The reprobe loop ties it together within a single turn

Even with the above, without the reprobe the agent would only see new tools at the start of the next user turn (when getTools() is called at the top of the run loop). The reprobe() call after each tool-call batch re-runs ensureToolSetsAreStarted immediately, so if a shell tool call installed gopls, the very next model response in the same turn already has the LSP tools in its available-tool list — no new user message needed.

B1 (blocking) — mcp.go: add watcherAlive flag
  Healthy() now returns started||watcherAlive. A reprobe Start() call
  during watchConnection's backoff window sees Healthy()=true and skips,
  preventing a second watcher goroutine and the associated racing doStart
  calls, orphaned processes, and unsafe close/recreate of ts.restarted.
  watchConnection sets watcherAlive=false on all exit paths.

S1 — lsp.go: eliminate Healthy() stale-read window
  Add dead atomic.Bool published by watchExit *before* acquiring h.mu.
  Healthy() checks dead first (lock-free), then confirms h.cmd under lock.
  Eliminates the window between Wait() returning and the lock being taken.

S2 — lsp.go: rename stopLocked→stopAndWait, document mutex-release contract
  The name makes explicit that this function temporarily drops h.mu while
  waiting for watchExit. Added a guard comment in startLocked explaining
  the race window it protects against.

S3 — loop.go: suppress MCPInit* events in reprobe path
  getTools() gains an emitLifecycleEvents bool parameter. Normal calls pass
  true; reprobe passes false. Prevents TUI spinner flicker on every tool
  call batch.

S4 — loop.go: remove dead agentTools write, document reprobe's role
  reprobe() now returns void. The updated tool list is picked up by the
  next iteration's getTools() (single authoritative source). Doc updated.

S5 — startable.go: guard recovery notice until process is stable
  pendingRecovery is only set when Start() succeeds AND HealthChecker
  still reports Healthy() afterwards. Prevents a false 'now available'
  notice in the env-wrapper case where env starts OK but the inner binary
  exits immediately after Start() returns nil.

S6 — lsp_test.go: replace 50ms sleep with deterministic poll
  Replaced time.Sleep(50ms) with 10×5ms Healthy() checks. If watchExit
  fires and the process dies, the test catches it immediately rather than
  relying on a timing assumption that can fail under load.

N2 — startable.go: document health-check re-entry path in struct comment
N3 — startable_test.go: fix typo 'Become Unhealthy()' → 'becomeUnhealthy()'
N4 — startable.go: update HealthChecker doc ('non-blocking' → accurate wording)

Assisted-By: docker-agent
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Missing LSPs/MCPs should be retried on next turn / after tool calls, with AGENTS.md install instructions

2 participants