fix: retry LSP/MCP toolsets after tool calls, covering env-wrapped commands (fixes #2457) #2458
Conversation
fix: retry LSP/MCP toolsets after tool calls, covering env-wrapped commands (docker#2457)

When an LSP or MCP server is not available at session start (binary missing, service not yet running, etc.), the agent now retries automatically after every tool-call batch, within the same user turn, without requiring a new user message.

Key changes:

tools/startable.go
- Add a `HealthChecker` interface (`Healthy() bool`). Toolsets that wrap a process implement it to signal when the backing process has died unexpectedly (e.g. the binary was not yet installed).
- `StartableToolSet.Start()` now calls `HealthChecker.Healthy()` on an already-started toolset; if unhealthy, it resets `started=false` and retries `Start()`.
- Add `ShouldReportFailure()`/`ConsumeRecovery()` to deduplicate warnings and emit a one-time recovery notice instead of repeating the failure message.

tools/builtin/lsp.go
- Implement `HealthChecker`: `Healthy()` returns `h.cmd != nil`.
- Add a `watchExit` goroutine as the sole owner of `cmd.Wait`. It resets `h.cmd=nil` when the process exits, making `Healthy()` return false so the next reprobe retries `Start()`. This correctly handles `env -C ./backend gopls`, where `env` starts fine but the inner binary exits 127 if gopls is not installed.
- `stopLocked` waits on the `exited` channel instead of calling `cmd.Wait` itself, eliminating the double-Wait race.

tools/mcp/mcp.go
- Implement `HealthChecker`: `Healthy()` returns `ts.started`. When `watchConnection` cannot recover the connection, `ts.started=false` → `Healthy()=false` → `StartableToolSet` retries `Start()` on the next reprobe.

runtime/loop.go
- After `processToolCalls` (when calls were made), run `reprobe()`: re-probes all configured toolsets via `getTools()` and returns updated `agentTools` if new tools appeared (by name-set diff, not just count). The very next `tryModelWithFallback` call in the same loop iteration receives the updated tool list, so new tools are visible to the model within the same user turn.

runtime/event.go + client.go
- Add `NoticeEvent`/`Notice()` as an informational counterpart to `WarningEvent`.

tui/page/chat/runtime_events.go
- Handle `NoticeEvent` → `notification.InfoCmd()`.

Validated with tmux e2e tests for both MCP and LSP (the `env -C ./dir` pattern): a single user turn installs the server via a tool call, the reprobe fires, and the next model response uses the new tools natively.

Fixes docker#2457

Assisted-By: docker-agent
- stopLocked: change signature to void (always returned nil)
- agent_test: assert.Nil → assert.Empty for DrainWarnings/DrainNotices
- lsp_test: rename the 'tools' local var to 'toolList' (importShadow)
- lsp_test: add explanations to nolint:errcheck directives (gocritic whyNoLint)
- startable_test, lsp.go: gofumpt formatting

Assisted-By: docker-agent
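For reviewers skimming the diff, the `HealthChecker` re-entry in `StartableToolSet.Start()` can be sketched roughly as follows. This is a minimal sketch: the field names, the inner-toolset contract, and the fake toolset are illustrative, not the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// HealthChecker is the interface from the commit message: a toolset that
// wraps a process implements it to signal when that process has died.
type HealthChecker interface {
	Healthy() bool
}

// startableToolSet is a simplified stand-in for StartableToolSet.
type startableToolSet struct {
	mu      sync.Mutex
	started bool
	inner   interface{ Start() error } // assumed inner-toolset contract
}

// Start re-probes an already-started toolset: if the inner toolset reports
// itself unhealthy, reset started and try Start again.
func (s *startableToolSet) Start() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.started {
		if hc, ok := s.inner.(HealthChecker); ok && !hc.Healthy() {
			s.started = false // backing process died; fall through and retry
		} else {
			return nil // still healthy (or not health-checkable): no-op
		}
	}
	if err := s.inner.Start(); err != nil {
		return err
	}
	s.started = true
	return nil
}

// fakeToolset fails once (binary missing), then starts; healthy flips when
// the process "dies".
type fakeToolset struct {
	healthy bool
	fails   int
}

func (f *fakeToolset) Start() error {
	if f.fails > 0 {
		f.fails--
		return fmt.Errorf("binary not installed yet")
	}
	f.healthy = true
	return nil
}

func (f *fakeToolset) Healthy() bool { return f.healthy }

func main() {
	inner := &fakeToolset{fails: 1}
	ts := &startableToolSet{inner: inner}
	fmt.Println(ts.Start() != nil) // first probe fails: true
	fmt.Println(ts.Start() == nil) // retry succeeds: true
	inner.healthy = false          // simulate the process dying
	fmt.Println(ts.Start() == nil) // health check triggers a restart: true
}
```

The key property is that a toolset which does not implement `HealthChecker` keeps the old behavior: once started, `Start()` stays a no-op.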
I'm not sure I understand the fix. Isn't the LSP toolset Startable? Wouldn't re-starting the LSP toolset be enough?
Good question: the LSP toolset is Startable, but there are two separate problems that a naive re-start doesn't cover:

1. When the command is `env -C ./backend gopls`, `env` itself starts successfully (`cmd.Start()` returns nil), so the toolset is marked `started=true` even though the inner binary exits 127. A subsequent call to `Start()` is then a no-op, because as far as `StartableToolSet` is concerned the toolset is already running.

2. The `started` flag alone cannot tell us that the backing process has since died. The fix adds a `HealthChecker` interface that `StartableToolSet.Start()` probes on an already-started toolset: if `Healthy()` reports false, it resets `started=false` and retries. The new `watchExit` goroutine (the sole owner of `cmd.Wait`) resets `h.cmd=nil` when the process exits, which is what makes `Healthy()` return false.

3. The reprobe loop ties it together within a single turn. Even with the above, without the reprobe the agent would only see new tools at the start of the next user turn (when `getTools()` runs again for the new message).
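To make point 2 concrete, here is a sketch of the `watchExit` ownership pattern. The `handle` type and its field names are illustrative stand-ins for the real LSP handle; `env true` stands in for the env-wrapper failure mode (the wrapper starts fine, the wrapped work ends immediately).

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// handle is a simplified stand-in for the LSP client handle.
type handle struct {
	mu     sync.Mutex
	cmd    *exec.Cmd
	exited chan struct{}
}

func (h *handle) start(name string, args ...string) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	h.cmd = cmd
	h.exited = make(chan struct{})
	go h.watchExit(cmd, h.exited) // sole owner of cmd.Wait
	return nil
}

// watchExit is the only goroutine that calls cmd.Wait. When the process
// exits (e.g. `env` ran but the inner binary exited 127), it clears h.cmd
// so Healthy() reports false and the next reprobe retries start.
func (h *handle) watchExit(cmd *exec.Cmd, exited chan struct{}) {
	_ = cmd.Wait()
	h.mu.Lock()
	if h.cmd == cmd {
		h.cmd = nil
	}
	h.mu.Unlock()
	close(exited)
}

func (h *handle) Healthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.cmd != nil
}

func main() {
	h := &handle{}
	// Starts fine but exits immediately, like the missing-gopls case.
	if err := h.start("env", "true"); err != nil {
		panic(err)
	}
	<-h.exited // stop waits on this channel instead of calling cmd.Wait
	fmt.Println(h.Healthy()) // false: the next reprobe will retry start
}
```

Because `watchExit` is the only caller of `cmd.Wait`, stop paths wait on `exited` instead, which is what eliminates the double-Wait race.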
B1 (blocking) — mcp.go: add watcherAlive flag
Healthy() now returns started||watcherAlive. A reprobe Start() call
during watchConnection's backoff window sees Healthy()=true and skips,
preventing a second watcher goroutine and the associated racing doStart
calls, orphaned processes, and unsafe close/recreate of ts.restarted.
watchConnection sets watcherAlive=false on all exit paths.
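A minimal sketch of the B1 fix (field names are illustrative, not the real `mcp.go` code): the toolset stays "healthy" for as long as either the connection is up or a watcher is still retrying, so a concurrent reprobe cannot spawn a second watcher.

```go
package main

import (
	"fmt"
	"sync"
)

// mcpToolset sketches the B1 state machine.
type mcpToolset struct {
	mu           sync.Mutex
	started      bool // connection is up
	watcherAlive bool // watchConnection goroutine is still retrying
}

func (ts *mcpToolset) Healthy() bool {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	// Unhealthy only when the connection is down AND no watcher is
	// retrying: only then may StartableToolSet call Start again.
	return ts.started || ts.watcherAlive
}

func main() {
	ts := &mcpToolset{started: true}

	// Connection drops; the watcher enters its backoff window.
	ts.mu.Lock()
	ts.started, ts.watcherAlive = false, true
	ts.mu.Unlock()
	fmt.Println(ts.Healthy()) // true: reprobe skips, no second watcher

	// Watcher gives up and exits (sets watcherAlive=false on all paths).
	ts.mu.Lock()
	ts.watcherAlive = false
	ts.mu.Unlock()
	fmt.Println(ts.Healthy()) // false: reprobe may now restart
}
```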
S1 — lsp.go: eliminate Healthy() stale-read window
Add dead atomic.Bool published by watchExit *before* acquiring h.mu.
Healthy() checks dead first (lock-free), then confirms h.cmd under lock.
Eliminates the window between Wait() returning and the lock being taken.
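The S1 ordering can be sketched as below (a toy with illustrative names; `*struct{}` stands in for `*exec.Cmd`). The point is purely the publication order: `dead` is stored before the mutex is taken, so `Healthy()` never observes "Wait returned but cmd still non-nil".

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type handle struct {
	mu   sync.Mutex
	cmd  *struct{}   // stand-in for *exec.Cmd
	dead atomic.Bool // published by watchExit before it takes mu
}

// watchExitBody models the tail of watchExit, right after cmd.Wait returns.
func (h *handle) watchExitBody() {
	h.dead.Store(true) // lock-free publication, before acquiring h.mu
	h.mu.Lock()
	h.cmd = nil
	h.mu.Unlock()
}

func (h *handle) Healthy() bool {
	if h.dead.Load() { // fast path: no stale read even if h.mu is contended
		return false
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.cmd != nil
}

func main() {
	h := &handle{cmd: &struct{}{}}
	fmt.Println(h.Healthy()) // true: process still running
	h.watchExitBody()
	fmt.Println(h.Healthy()) // false: dead flag read lock-free
}
```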
S2 — lsp.go: rename stopLocked→stopAndWait, document mutex-release contract
The name makes explicit that this function temporarily drops h.mu while
waiting for watchExit. Added a guard comment in startLocked explaining
the race window it protects against.
S3 — loop.go: suppress MCPInit* events in reprobe path
getTools() gains an emitLifecycleEvents bool parameter. Normal calls pass
true; reprobe passes false. Prevents TUI spinner flicker on every tool
call batch.
S4 — loop.go: remove dead agentTools write, document reprobe's role
reprobe() no longer returns a value; the updated tool list is picked up by
the next iteration's getTools() (single authoritative source). Doc updated.
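For reference, the name-set diff that reprobe uses to detect new tools (per the commit message, by name rather than by count, so a swap is detected too) might look like this sketch; the `tool` type is a placeholder for the agent's tool type.

```go
package main

import "fmt"

type tool struct{ Name string }

// newToolNames returns the names present in after but not in before.
// Comparing name sets rather than lengths also catches the case where one
// tool disappears and another appears, leaving the count unchanged.
func newToolNames(before, after []tool) []string {
	seen := make(map[string]bool, len(before))
	for _, t := range before {
		seen[t.Name] = true
	}
	var added []string
	for _, t := range after {
		if !seen[t.Name] {
			added = append(added, t.Name)
		}
	}
	return added
}

func main() {
	before := []tool{{"shell"}, {"edit"}}
	after := []tool{{"shell"}, {"edit"}, {"gopls_definition"}}
	fmt.Println(newToolNames(before, after)) // [gopls_definition]
}
```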
S5 — startable.go: guard recovery notice until process is stable
pendingRecovery is only set when Start() succeeds AND HealthChecker
still reports Healthy() afterwards. Prevents a false 'now available'
notice in the env-wrapper case where env starts OK but the inner binary
exits immediately after Start() returns nil.
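A sketch of the S5 guard (field and method names are illustrative): the recovery notice is only armed when the post-Start health probe still reports the process alive, so the env-wrapper case stays quiet.

```go
package main

import (
	"fmt"
	"sync"
)

type toolset struct {
	mu              sync.Mutex
	hadFailure      bool // a failure warning was previously emitted
	pendingRecovery bool // a one-time "now available" notice is due
}

// onStartSucceeded runs after Start() returned nil; stillHealthy is the
// result of probing HealthChecker again afterwards.
func (s *toolset) onStartSucceeded(stillHealthy bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if stillHealthy {
		s.pendingRecovery = s.hadFailure
		s.hadFailure = false
	}
	// If the process already died again (env started OK, inner binary
	// exited), keep hadFailure set and emit nothing.
}

// ConsumeRecovery reports whether a recovery notice should be shown,
// at most once.
func (s *toolset) ConsumeRecovery() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	r := s.pendingRecovery
	s.pendingRecovery = false
	return r
}

func main() {
	s := &toolset{hadFailure: true}
	s.onStartSucceeded(false)        // env started OK but gopls exited
	fmt.Println(s.ConsumeRecovery()) // false: no premature notice
	s.onStartSucceeded(true)         // process is actually up this time
	fmt.Println(s.ConsumeRecovery()) // true: one-time recovery notice
}
```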
S6 — lsp_test.go: replace 50ms sleep with deterministic poll
Replaced time.Sleep(50ms) with 10×5ms Healthy() checks. If watchExit
fires and the process dies, the test catches it immediately rather than
relying on a timing assumption that can fail under load.
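The S6 poll shape, sketched with a fake target (names are illustrative): a bounded poll that returns as soon as the condition holds, instead of one fixed sleep that must be "long enough" under load.

```go
package main

import (
	"fmt"
	"time"
)

// target is a fake toolset that reports unhealthy after a few probes.
type target struct{ healthyAfter int }

func (t *target) Healthy() bool {
	t.healthyAfter--
	return t.healthyAfter > 0
}

// waitUnhealthy polls up to 10 times at 5ms intervals (the same total
// budget as the old 50ms sleep) and returns true as soon as the target
// reports unhealthy.
func waitUnhealthy(hc interface{ Healthy() bool }) bool {
	for i := 0; i < 10; i++ {
		if !hc.Healthy() {
			return true // caught the exit as soon as it happened
		}
		time.Sleep(5 * time.Millisecond)
	}
	return false
}

func main() {
	fmt.Println(waitUnhealthy(&target{healthyAfter: 3})) // true
}
```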
N2 — startable.go: document health-check re-entry path in struct comment
N3 — startable_test.go: fix typo 'Become Unhealthy()' → 'becomeUnhealthy()'
N4 — startable.go: update HealthChecker doc ('non-blocking' → accurate wording)
Assisted-By: docker-agent
Summary
Fixes #2457.
When an LSP or MCP server is configured but not available at session start (binary missing, flag file absent, service not yet running), the agent now retries within the same user turn — after each tool-call batch, without requiring a new user message.
Validated end-to-end with tmux for both MCP and LSP (including the `env -C ./backend gopls` pattern).

Root causes fixed

MCP — already partially retried, now also covers crashed connections

`Toolset.Healthy()` returns `ts.started`. When `watchConnection` cannot recover the connection it sets `ts.started=false`, causing `StartableToolSet` to retry `Start()` on the next reprobe.

LSP — `env -C ./dir gopls` was silently swallowed

`env` always starts successfully (`cmd.Start()` returns nil), so `StartableToolSet.started` was set to `true` even when the inner binary wasn't installed. The new `watchExit` goroutine is the sole owner of `cmd.Wait`; it resets `h.cmd=nil` when the process exits quickly, making `Healthy()=false` so the next reprobe retries `Start()`.

Changes
- `pkg/tools/startable.go`: `HealthChecker` interface; `StartableToolSet` checks health on already-started toolsets; `ShouldReportFailure`/`ConsumeRecovery` for deduplication
- `pkg/tools/builtin/lsp.go`: `Healthy()`, `watchExit` goroutine (sole `cmd.Wait` owner, no double-Wait race), `exited` chan for a clean `Stop()`
- `pkg/tools/mcp/mcp.go`: `Healthy()`
- `pkg/runtime/loop.go`: `reprobe()` called after every tool-call batch; compares by name-set diff; updates `agentTools` in-place
- `pkg/runtime/event.go` + `client.go`: `NoticeEvent`/`Notice()`
- `pkg/tui/page/chat/runtime_events.go`: `NoticeEvent` → `notification.InfoCmd()`
- `pkg/agent/agent.go`

Tests

- `pkg/tools/startable_test.go`: 6 new tests covering `ShouldReportFailure`, `ConsumeRecovery`, `HealthChecker` integration
- `pkg/tools/builtin/lsp_test.go`: 4 new tests, including `TestLSPTool_StartableToolSet_RetriesAfterProcessExits` (flag-gated fake LSP, the `env -C` scenario)
- `pkg/agent/agent_test.go`: 2 new tests (deduplication, recovery notice)
- `pkg/runtime/runtime_test.go`: 2 new tests (`TestReprobe_NewToolsAvailableAfterToolCall`, `TestReprobe_NoChangeMeansNoExtraEvents`)

tmux e2e results
MCP (gate flag controls server availability):
LSP (`env -C /tmp/lsp-e2e python3 fake-gopls.py`, gate flag controls inner script):

Both in a single user turn, no new user message needed.