fix: retry LSP/MCP toolsets after tool calls, covering env-wrapped commands (fixes #2457) #2458
Conversation
fix: retry LSP/MCP toolsets after tool calls, covering env-wrapped commands (docker#2457)

When an LSP or MCP server is not available at session start (binary missing, service not yet running, etc.), the agent now retries automatically after every tool-call batch, within the same user turn, without requiring a new user message.

Key changes:

tools/startable.go
- Add a `HealthChecker` interface (`Healthy() bool`). Toolsets that wrap a process implement it to signal when the backing process has died unexpectedly (e.g. the binary was not yet installed).
- `StartableToolSet.Start()` now calls `HealthChecker.Healthy()` on an already-started toolset; if unhealthy, it resets `started=false` and retries `Start()`.
- Add `ShouldReportFailure()`/`ConsumeRecovery()` to deduplicate warnings and emit a one-time recovery notice instead of repeating the failure message.

tools/builtin/lsp.go
- Implement `HealthChecker`: `Healthy()` returns `h.cmd != nil`.
- Add a `watchExit` goroutine as the sole owner of `cmd.Wait`. It resets `h.cmd=nil` when the process exits, making `Healthy()` return false so the next reprobe retries `Start()`. This correctly handles `env -C ./backend gopls`, where `env` starts fine but the inner binary exits 127 if gopls is not installed.
- `stopLocked` waits on the `exited` channel instead of calling `cmd.Wait` itself, eliminating the double-Wait race.

tools/mcp/mcp.go
- Implement `HealthChecker`: `Healthy()` returns `ts.started`. When `watchConnection` cannot recover the connection, `ts.started=false` → `Healthy()=false` → `StartableToolSet` retries `Start()` on the next reprobe.

runtime/loop.go
- After `processToolCalls` (when calls were made), run `reprobe()`: re-probes all configured toolsets via `getTools()` and returns updated `agentTools` if new tools appeared (by name-set diff, not just count). The very next `tryModelWithFallback` call in the same loop iteration receives the updated tool list, so new tools are visible to the model within the same user turn.

runtime/event.go + client.go
- Add `NoticeEvent`/`Notice()` as an informational counterpart to `WarningEvent`.

tui/page/chat/runtime_events.go
- Handle `NoticeEvent` → `notification.InfoCmd()`.

Validated with tmux e2e tests for both MCP and LSP (the `env -C ./dir` pattern): a single user turn installs the server via a tool call, the reprobe fires, and the next model response uses the new tools natively.

Fixes docker#2457

Assisted-By: docker-agent
- stopLocked: change signature to void (always returned nil)
- agent_test: assert.Nil → assert.Empty for DrainWarnings/DrainNotices
- lsp_test: rename the 'tools' local var to 'toolList' (importShadow)
- lsp_test: add explanations to nolint:errcheck directives (gocritic whyNoLint)
- startable_test, lsp.go: gofumpt formatting

Assisted-By: docker-agent
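For reviewers skimming the diff, the `HealthChecker` re-entry in `StartableToolSet.Start()` can be sketched roughly as follows. This is a minimal sketch: the field names, the inner-toolset contract, and the fake toolset are illustrative, not the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// HealthChecker is the interface from the commit message: a toolset that
// wraps a process implements it to signal when that process has died.
type HealthChecker interface {
	Healthy() bool
}

// startableToolSet is a simplified stand-in for StartableToolSet.
type startableToolSet struct {
	mu      sync.Mutex
	started bool
	inner   interface{ Start() error } // assumed inner-toolset contract
}

// Start re-probes an already-started toolset: if the inner toolset reports
// itself unhealthy, reset started and try Start again.
func (s *startableToolSet) Start() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.started {
		if hc, ok := s.inner.(HealthChecker); ok && !hc.Healthy() {
			s.started = false // backing process died; fall through and retry
		} else {
			return nil // still healthy (or not health-checkable): no-op
		}
	}
	if err := s.inner.Start(); err != nil {
		return err
	}
	s.started = true
	return nil
}

// fakeToolset fails once (binary missing), then starts; healthy flips when
// the process "dies".
type fakeToolset struct {
	healthy bool
	fails   int
}

func (f *fakeToolset) Start() error {
	if f.fails > 0 {
		f.fails--
		return fmt.Errorf("binary not installed yet")
	}
	f.healthy = true
	return nil
}

func (f *fakeToolset) Healthy() bool { return f.healthy }

func main() {
	inner := &fakeToolset{fails: 1}
	ts := &startableToolSet{inner: inner}
	fmt.Println(ts.Start() != nil) // first probe fails: true
	fmt.Println(ts.Start() == nil) // retry succeeds: true
	inner.healthy = false          // simulate the process dying
	fmt.Println(ts.Start() == nil) // health check triggers a restart: true
}
```

The key property is that a toolset which does not implement `HealthChecker` keeps the old behavior: once started, `Start()` stays a no-op.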
I'm not sure I understand the fix. Isn't the LSP toolset Startable? Wouldn't re-starting the LSP toolset be enough?
Good question: the LSP toolset is Startable, but there are two separate problems that a naive re-start doesn't cover:

1. When the command is `env -C ./backend gopls`, `env` itself starts successfully (`cmd.Start()` returns nil), so the toolset is marked `started=true` even though the inner binary exits 127. A subsequent call to `Start()` is then a no-op, because as far as `StartableToolSet` is concerned the toolset is already running.

2. The `started` flag alone cannot tell us that the backing process has since died. The fix adds a `HealthChecker` interface that `StartableToolSet.Start()` probes on an already-started toolset: if `Healthy()` reports false, it resets `started=false` and retries. The new `watchExit` goroutine (the sole owner of `cmd.Wait`) resets `h.cmd=nil` when the process exits, which is what makes `Healthy()` return false.

3. The reprobe loop ties it together within a single turn. Even with the above, without the reprobe the agent would only see new tools at the start of the next user turn (when `getTools()` runs again for the new message).
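To make point 2 concrete, here is a sketch of the `watchExit` ownership pattern. The `handle` type and its field names are illustrative stand-ins for the real LSP handle; `env true` stands in for the env-wrapper failure mode (the wrapper starts fine, the wrapped work ends immediately).

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// handle is a simplified stand-in for the LSP client handle.
type handle struct {
	mu     sync.Mutex
	cmd    *exec.Cmd
	exited chan struct{}
}

func (h *handle) start(name string, args ...string) error {
	h.mu.Lock()
	defer h.mu.Unlock()
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}
	h.cmd = cmd
	h.exited = make(chan struct{})
	go h.watchExit(cmd, h.exited) // sole owner of cmd.Wait
	return nil
}

// watchExit is the only goroutine that calls cmd.Wait. When the process
// exits (e.g. `env` ran but the inner binary exited 127), it clears h.cmd
// so Healthy() reports false and the next reprobe retries start.
func (h *handle) watchExit(cmd *exec.Cmd, exited chan struct{}) {
	_ = cmd.Wait()
	h.mu.Lock()
	if h.cmd == cmd {
		h.cmd = nil
	}
	h.mu.Unlock()
	close(exited)
}

func (h *handle) Healthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.cmd != nil
}

func main() {
	h := &handle{}
	// Starts fine but exits immediately, like the missing-gopls case.
	if err := h.start("env", "true"); err != nil {
		panic(err)
	}
	<-h.exited // stop waits on this channel instead of calling cmd.Wait
	fmt.Println(h.Healthy()) // false: the next reprobe will retry start
}
```

Because `watchExit` is the only caller of `cmd.Wait`, stop paths wait on `exited` instead, which is what eliminates the double-Wait race.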
B1 (blocking) — mcp.go: add watcherAlive flag
Healthy() now returns started||watcherAlive. A reprobe Start() call
during watchConnection's backoff window sees Healthy()=true and skips,
preventing a second watcher goroutine and the associated racing doStart
calls, orphaned processes, and unsafe close/recreate of ts.restarted.
watchConnection sets watcherAlive=false on all exit paths.
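A minimal sketch of the B1 fix (field names are illustrative, not the real `mcp.go` code): the toolset stays "healthy" for as long as either the connection is up or a watcher is still retrying, so a concurrent reprobe cannot spawn a second watcher.

```go
package main

import (
	"fmt"
	"sync"
)

// mcpToolset sketches the B1 state machine.
type mcpToolset struct {
	mu           sync.Mutex
	started      bool // connection is up
	watcherAlive bool // watchConnection goroutine is still retrying
}

func (ts *mcpToolset) Healthy() bool {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	// Unhealthy only when the connection is down AND no watcher is
	// retrying: only then may StartableToolSet call Start again.
	return ts.started || ts.watcherAlive
}

func main() {
	ts := &mcpToolset{started: true}

	// Connection drops; the watcher enters its backoff window.
	ts.mu.Lock()
	ts.started, ts.watcherAlive = false, true
	ts.mu.Unlock()
	fmt.Println(ts.Healthy()) // true: reprobe skips, no second watcher

	// Watcher gives up and exits (sets watcherAlive=false on all paths).
	ts.mu.Lock()
	ts.watcherAlive = false
	ts.mu.Unlock()
	fmt.Println(ts.Healthy()) // false: reprobe may now restart
}
```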
S1 — lsp.go: eliminate Healthy() stale-read window
Add dead atomic.Bool published by watchExit *before* acquiring h.mu.
Healthy() checks dead first (lock-free), then confirms h.cmd under lock.
Eliminates the window between Wait() returning and the lock being taken.
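The S1 ordering can be sketched as below (a toy with illustrative names; `*struct{}` stands in for `*exec.Cmd`). The point is purely the publication order: `dead` is stored before the mutex is taken, so `Healthy()` never observes "Wait returned but cmd still non-nil".

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type handle struct {
	mu   sync.Mutex
	cmd  *struct{}   // stand-in for *exec.Cmd
	dead atomic.Bool // published by watchExit before it takes mu
}

// watchExitBody models the tail of watchExit, right after cmd.Wait returns.
func (h *handle) watchExitBody() {
	h.dead.Store(true) // lock-free publication, before acquiring h.mu
	h.mu.Lock()
	h.cmd = nil
	h.mu.Unlock()
}

func (h *handle) Healthy() bool {
	if h.dead.Load() { // fast path: no stale read even if h.mu is contended
		return false
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.cmd != nil
}

func main() {
	h := &handle{cmd: &struct{}{}}
	fmt.Println(h.Healthy()) // true: process still running
	h.watchExitBody()
	fmt.Println(h.Healthy()) // false: dead flag read lock-free
}
```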
S2 — lsp.go: rename stopLocked→stopAndWait, document mutex-release contract
The name makes explicit that this function temporarily drops h.mu while
waiting for watchExit. Added a guard comment in startLocked explaining
the race window it protects against.
S3 — loop.go: suppress MCPInit* events in reprobe path
getTools() gains an emitLifecycleEvents bool parameter. Normal calls pass
true; reprobe passes false. Prevents TUI spinner flicker on every tool
call batch.
S4 — loop.go: remove dead agentTools write, document reprobe's role
reprobe() no longer returns a value; the updated tool list is picked up by
the next iteration's getTools() (single authoritative source). Doc updated.
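For reference, the name-set diff that reprobe uses to detect new tools (per the commit message, by name rather than by count, so a swap is detected too) might look like this sketch; the `tool` type is a placeholder for the agent's tool type.

```go
package main

import "fmt"

type tool struct{ Name string }

// newToolNames returns the names present in after but not in before.
// Comparing name sets rather than lengths also catches the case where one
// tool disappears and another appears, leaving the count unchanged.
func newToolNames(before, after []tool) []string {
	seen := make(map[string]bool, len(before))
	for _, t := range before {
		seen[t.Name] = true
	}
	var added []string
	for _, t := range after {
		if !seen[t.Name] {
			added = append(added, t.Name)
		}
	}
	return added
}

func main() {
	before := []tool{{"shell"}, {"edit"}}
	after := []tool{{"shell"}, {"edit"}, {"gopls_definition"}}
	fmt.Println(newToolNames(before, after)) // [gopls_definition]
}
```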
S5 — startable.go: guard recovery notice until process is stable
pendingRecovery is only set when Start() succeeds AND HealthChecker
still reports Healthy() afterwards. Prevents a false 'now available'
notice in the env-wrapper case where env starts OK but the inner binary
exits immediately after Start() returns nil.
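A sketch of the S5 guard (field and method names are illustrative): the recovery notice is only armed when the post-Start health probe still reports the process alive, so the env-wrapper case stays quiet.

```go
package main

import (
	"fmt"
	"sync"
)

type toolset struct {
	mu              sync.Mutex
	hadFailure      bool // a failure warning was previously emitted
	pendingRecovery bool // a one-time "now available" notice is due
}

// onStartSucceeded runs after Start() returned nil; stillHealthy is the
// result of probing HealthChecker again afterwards.
func (s *toolset) onStartSucceeded(stillHealthy bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if stillHealthy {
		s.pendingRecovery = s.hadFailure
		s.hadFailure = false
	}
	// If the process already died again (env started OK, inner binary
	// exited), keep hadFailure set and emit nothing.
}

// ConsumeRecovery reports whether a recovery notice should be shown,
// at most once.
func (s *toolset) ConsumeRecovery() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	r := s.pendingRecovery
	s.pendingRecovery = false
	return r
}

func main() {
	s := &toolset{hadFailure: true}
	s.onStartSucceeded(false)        // env started OK but gopls exited
	fmt.Println(s.ConsumeRecovery()) // false: no premature notice
	s.onStartSucceeded(true)         // process is actually up this time
	fmt.Println(s.ConsumeRecovery()) // true: one-time recovery notice
}
```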
S6 — lsp_test.go: replace 50ms sleep with deterministic poll
Replaced time.Sleep(50ms) with 10×5ms Healthy() checks. If watchExit
fires and the process dies, the test catches it immediately rather than
relying on a timing assumption that can fail under load.
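The S6 poll shape, sketched with a fake target (names are illustrative): a bounded poll that returns as soon as the condition holds, instead of one fixed sleep that must be "long enough" under load.

```go
package main

import (
	"fmt"
	"time"
)

// target is a fake toolset that reports unhealthy after a few probes.
type target struct{ healthyAfter int }

func (t *target) Healthy() bool {
	t.healthyAfter--
	return t.healthyAfter > 0
}

// waitUnhealthy polls up to 10 times at 5ms intervals (the same total
// budget as the old 50ms sleep) and returns true as soon as the target
// reports unhealthy.
func waitUnhealthy(hc interface{ Healthy() bool }) bool {
	for i := 0; i < 10; i++ {
		if !hc.Healthy() {
			return true // caught the exit as soon as it happened
		}
		time.Sleep(5 * time.Millisecond)
	}
	return false
}

func main() {
	fmt.Println(waitUnhealthy(&target{healthyAfter: 3})) // true
}
```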
N2 — startable.go: document health-check re-entry path in struct comment
N3 — startable_test.go: fix typo 'Become Unhealthy()' → 'becomeUnhealthy()'
N4 — startable.go: update HealthChecker doc ('non-blocking' → accurate wording)
Assisted-By: docker-agent
Summary
Fixes #2457.
When an LSP or MCP server is configured but not available at session start (binary missing, flag file absent, service not yet running), the agent now retries within the same user turn — after each tool-call batch, without requiring a new user message.
Validated end-to-end with tmux for both MCP and LSP (including the `env -C ./backend gopls` pattern).

Root causes fixed

MCP — already partially retried, now also covers crashed connections

`Toolset.Healthy()` returns `ts.started`. When `watchConnection` cannot recover the connection it sets `ts.started=false`, causing `StartableToolSet` to retry `Start()` on the next reprobe.

LSP — `env -C ./dir gopls` was silently swallowed

`env` always starts successfully (`cmd.Start()` returns nil), so `StartableToolSet.started` was set to `true` even when the inner binary wasn't installed. The new `watchExit` goroutine is the sole owner of `cmd.Wait`; it resets `h.cmd=nil` when the process exits quickly, making `Healthy()=false` so the next reprobe retries `Start()`.

Changes
- `pkg/tools/startable.go`: `HealthChecker` interface; `StartableToolSet` checks health on already-started toolsets; `ShouldReportFailure`/`ConsumeRecovery` for deduplication
- `pkg/tools/builtin/lsp.go`: `Healthy()`, `watchExit` goroutine (sole `cmd.Wait` owner, no double-Wait race), `exited` chan for a clean `Stop()`
- `pkg/tools/mcp/mcp.go`: `Healthy()`
- `pkg/runtime/loop.go`: `reprobe()` called after every tool-call batch; compares by name-set diff; updates `agentTools` in-place
- `pkg/runtime/event.go` + `client.go`: `NoticeEvent`/`Notice()`
- `pkg/tui/page/chat/runtime_events.go`: `NoticeEvent` → `notification.InfoCmd()`
- `pkg/agent/agent.go`

Tests

- `pkg/tools/startable_test.go`: 6 new tests covering `ShouldReportFailure`, `ConsumeRecovery`, `HealthChecker` integration
- `pkg/tools/builtin/lsp_test.go`: 4 new tests, including `TestLSPTool_StartableToolSet_RetriesAfterProcessExits` (flag-gated fake LSP, the `env -C` scenario)
- `pkg/agent/agent_test.go`: 2 new tests (deduplication, recovery notice)
- `pkg/runtime/runtime_test.go`: 2 new tests (`TestReprobe_NewToolsAvailableAfterToolCall`, `TestReprobe_NoChangeMeansNoExtraEvents`)

tmux e2e results
MCP (gate flag controls server availability):
LSP (`env -C /tmp/lsp-e2e python3 fake-gopls.py`, gate flag controls inner script):

Both in a single user turn, no new user message needed.