Skip to content

HttpServletStreamableServerTransportProvider orphans sessions on first transient SSE write failure (no grace period) #920

@andersenleo

Description

@andersenleo

Summary

In HttpServletStreamableServerTransportProvider, a single failed SSE write
immediately removes the session from the in-memory map, so the client's next
POST gets Session not found even though the failure was transient (LB
response-timeout, NEG rebalance, pod eviction, laptop sleep, mobile network
blip). The client is forced into a full initialize round-trip and loses
any server-side session state.

Adding a short, configurable grace period before sessions.remove(...)
— during which a reconnect with the same session id reattaches — would solve
this without touching where sessions are stored or how resumability works.

A secondary, unrelated observation about a duplicate asyncContext.complete()
call is noted at the bottom.

Relationship to existing work

I want to flag upfront how this differs from related open issues/PRs, so it's
easy to triage.

The ask here is narrower than any of those: don't drop the session on the
first transient write failure
.

Version

  • io.modelcontextprotocol.sdk:mcp-core 1.1.0
  • JDK 25, Tomcat (Spring Boot 4), behind a GKE HTTP(S) LB

Reproduction

  1. Run any gateway built on HttpServletStreamableServerTransportProvider
    behind an L7 LB that caps per-response duration below the session lifetime
    (GKE BackendConfig.timeoutSec ≤ 60 s is a clean repro).

  2. Connect an MCP client (seen with claude-code, but any streamable-HTTP
    client triggers it).

  3. Wait for the LB to close the SSE stream (~60 s in our case).

  4. Observe — server log:

    KeepAliveScheduler: Failed to send keep-alive ping to session ...:
      Did not observe any item or terminal signal within 10000ms in 'source(MonoCreate)'
    ServletStreamableServerTransportProvider:
      Failed to send message to session ...: Client disconnected
    

    Client gets Session not found on its next POST.

Root cause

HttpServletStreamableMcpSessionTransport.sendMessage (v1.1.0, lines 738–767)
hard-codes session removal in the catch block:

@Override
public Mono<Void> sendMessage(McpSchema.JSONRPCMessage message, String messageId) {
    return Mono.fromRunnable(() -> {
        ...
        try {
            ...
            String jsonText = jsonMapper.writeValueAsString(message);
            HttpServletStreamableServerTransportProvider.this.sendEvent(writer, MESSAGE_EVENT_TYPE, jsonText,
                    messageId != null ? messageId : this.sessionId);
            ...
        }
        catch (Exception e) {
            logger.error("Failed to send message to session {}: {}", this.sessionId, e.getMessage());
            HttpServletStreamableServerTransportProvider.this.sessions.remove(this.sessionId); // <— here
            this.asyncContext.complete();
        }
        ...
    });
}

No grace, no policy hook, no listener. sessions is a private final Map
and HttpServletStreamableMcpSessionTransport is a private inner class, so
downstream apps cannot override the behaviour without reflection.

Proposal

Two shapes, from least to most invasive:

1. Configurable session-retention grace period (minimal, recommended)

Add Builder.sessionReconnectGracePeriod(Duration), defaulting to
Duration.ZERO (current behaviour, fully backward compatible). On write
failure:

  • mark the session as detached (new flag on the session transport);
  • schedule a removal task at now + grace on a shared scheduled executor;
  • on an incoming GET /mcp whose session id matches, clear the detach flag,
    cancel the scheduled removal, and let the existing Last-Event-ID replay
    path (SDK lines 327–349) or Support last event Id for resumability of sse #830's event-store path run.

With grace = Duration.ZERO, behaviour is identical to today. With
grace > 0, transient blips no longer orphan clients. Composes naturally
with #914 (pluggable store) and #830 (replay).

2. Session lifecycle listener (larger, general-purpose)

Add a SessionLifecycleListener interface on the builder with
onSessionDetached, onSessionReconnected, onSessionClosed. Ship the
current eager-remove behaviour as the default listener; apps can register
a custom listener with whatever retention policy suits their deployment.

I'd prefer option 1 — it solves the concrete problem without opening a
broader API-surface question.

Secondary observation — duplicate asyncContext.complete()

Same file: the catch block in sendMessage (line 761) and close()
(line ~811) both call asyncContext.complete(). When a write failure is
followed by a close, the second call races with the servlet container's
state machine and produces:

Failed to complete async context ... Async state [COMPLETING]

Cosmetic, but pollutes logs. A flag (e.g. asyncContextCompleted set on
first call, checked before the second) would silence it.

Willing to contribute

Happy to open a PR for either of the above if the maintainers would welcome
the contribution — just want to confirm the approach and that option 1 is
the direction you'd prefer before writing code.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions