chore: extract main into an embeddable internal/app package#5259

Open
siavashs wants to merge 10 commits into
prometheus:mainfrom
siavashs:feat/app/pkg

Conversation


@siavashs siavashs commented May 29, 2026

Summary

Resolves the long-standing TODO from #406 by extracting the Alertmanager process logic out of cmd/alertmanager/main.go into a new internal/app package, and giving it a lifecycle API (New / Start / Addr / Reload / Stop) so tests and other binaries can embed Alertmanager in-process instead of building and shelling out to the compiled binary.

cmd/alertmanager/main.go shrinks from 722 → 207 lines and now owns only: kingpin flag parsing, logger construction, versioncollector registration, feature-flag / GOMEMLIMIT side effects, and translating OS signals into context cancellation (SIGINT/SIGTERM) plus reload events (SIGHUP) consumed by app.Run.

What this changes

A reusable internal/app package

The package is mainly intended for internal use, but other projects embedding Alertmanager can use it as well.

A lifecycle API for embedders

New(opts) (*App, error)
(*App).Start() error
(*App).Addr() string             // first listener
(*App).Addrs() []string          // all listeners
(*App).Reload(ctx) error
(*App).Stop(ctx) error

Run is a thin wrapper (New + Start + serveLoop + Stop) with a deferred Stop so cleanup also runs on panic. Listeners are bound at New time so Addr()/Addrs() report the real bound ports (including kernel-assigned :0) before serving starts.
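
To illustrate why binding listeners at New time matters, here is a minimal, self-contained sketch of the same lifecycle shape. The miniApp type, newMiniApp, and probe are hypothetical stand-ins written for this example; they are not the PR's App and omit everything except eager binding, Start, Addr, and Stop.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// miniApp is a hypothetical stand-in for the PR's App type. It only
// illustrates the New / Start / Addr / Stop shape described above.
type miniApp struct {
	ln   net.Listener
	srv  *http.Server
	errc chan error
}

// newMiniApp binds the listener eagerly so Addr is meaningful before Start.
func newMiniApp(addr string) (*miniApp, error) {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "OK")
	})
	return &miniApp{ln: ln, srv: &http.Server{Handler: mux}, errc: make(chan error, 1)}, nil
}

// Addr reports the real bound address, including a kernel-assigned port.
func (a *miniApp) Addr() string { return a.ln.Addr().String() }

// Start serves on the already-bound listener in a background goroutine.
func (a *miniApp) Start() {
	go func() { a.errc <- a.srv.Serve(a.ln) }()
}

// Stop shuts the server down gracefully. It assumes Start was called.
func (a *miniApp) Stop(ctx context.Context) error {
	if err := a.srv.Shutdown(ctx); err != nil {
		return err
	}
	if err := <-a.errc; !errors.Is(err, http.ErrServerClosed) {
		return err
	}
	return nil
}

// probe runs one New -> Addr -> Start -> GET -> Stop round trip on ":0".
func probe() (port string, status int, err error) {
	a, err := newMiniApp("127.0.0.1:0")
	if err != nil {
		return "", 0, err
	}
	// The port is already known here, before any request is served.
	if _, port, err = net.SplitHostPort(a.Addr()); err != nil {
		return "", 0, err
	}
	a.Start()
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://" + a.Addr() + "/-/healthy")
	if err != nil {
		_ = a.Stop(ctx)
		return port, 0, err
	}
	_, _ = io.Copy(io.Discard, resp.Body)
	_ = resp.Body.Close()
	return port, resp.StatusCode, a.Stop(ctx)
}

func main() {
	port, status, err := probe()
	fmt.Println(port, status, err)
}
```

Because the socket is bound in the constructor, a request issued right after Start waits in the kernel's accept queue instead of failing with connection refused.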

Per-instance state, so multiple instances can coexist

The Prometheus collectors that used to be package-level promauto variables in cmd/alertmanager/main.go are now built per Run() against opts.Registerer, which is threaded through every collaborator (versioncollector excepted, which stays process-global in main.go). This unblocks running multiple Alertmanager instances in the same process without duplicate-registration panics.

Deterministic, observable shutdown

setup registers teardown steps on a LIFO cleanup stack that Stop drains in reverse, mirroring Go's defer semantics so shutdown ordering follows construction order automatically. Each step is named; Stop runs them all, logs any failure by name, and returns the errors joined with the HTTP-shutdown error. The HTTP shutdown honors a single timeout derived from the context passed to Stop.
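
The cleanup-stack idea can be sketched in a few lines. This is an illustration under assumptions: the cleanupStack and step types, their method names, and the component names in demo are invented for the example and are not the PR's a.onStop implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// step is one named teardown action. The type and field names are
// illustrative, not taken from the PR.
type step struct {
	name string
	fn   func() error
}

// cleanupStack collects teardown steps in construction order.
type cleanupStack struct {
	steps []step
}

// push registers a step; later pushes run earlier, like defer.
func (s *cleanupStack) push(name string, fn func() error) {
	s.steps = append(s.steps, step{name: name, fn: fn})
}

// drain runs every step in LIFO order, even if some fail, and joins the
// failures (each wrapped with its step name) into a single error.
func (s *cleanupStack) drain() error {
	var errs []error
	for i := len(s.steps) - 1; i >= 0; i-- {
		st := s.steps[i]
		if err := st.fn(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", st.name, err))
		}
	}
	s.steps = nil
	return errors.Join(errs...)
}

// demo registers three components in construction order, tears them down,
// and records the order in which the teardown steps actually ran.
func demo() ([]string, error) {
	var s cleanupStack
	var order []string
	for _, name := range []string{"nflog", "silences", "dispatcher"} {
		name := name
		s.push(name, func() error {
			order = append(order, name)
			if name == "silences" {
				return errors.New("flush failed")
			}
			return nil
		})
	}
	err := s.drain()
	return order, err
}

func main() {
	order, err := demo()
	fmt.Println(order)
	fmt.Println(err)
}
```

A failing step does not stop the drain: the remaining steps still run, and the caller receives one joined error that names the step that failed.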

Isolated config-reload logic

The config-scoped subgraph lives in a dedicated reloader type. reloader.reload performs the stop-old → build-new → wait-for-loading → atomic-swap sequence, and reloader.stop tears down the live inhibitor and dispatcher at shutdown. The long-lived singletons (nflog, silences, alerts, cluster peer, API, event recorder, tracing) are constructed once and updated in place on reload (apih.Update, eventRec/tracing ApplyConfig) rather than rebuilt.

Behavioural notes

  • prometheus.DefaultRegisterer is no longer referenced inside app.Run; the binary still passes it in via Options.Registerer, so on-disk behaviour is identical.
  • srv.Shutdown now actually runs on Run exit (previously the deferred srv.Close lived inside the listen goroutine and never ran in practice because os.Exit killed the process first). Behaviour for the binary is unchanged; embedded callers now get clean HTTP teardown.
  • Systemd socket activation and vsock:// listen addresses work under both the binary and embedders; the external URL is derived from the actual bound address.
  • tracingManager.Stop is part of the cleanup stack and always runs, not just on ctx.Done().
  • Start/Stop are concurrency-safe and /-/reload works in embedded mode (no deadlock without Run/serveLoop).
  • --cluster.listen-address default moved from a const in cmd/alertmanager to the exported app.DefaultClusterAddr.

Known follow-ups (out of scope)

  • matcher/compat.InitFromFlags still mutates package-level state; multi-instance tests with different feature flags will collide. Tracked separately.
  • Migrating the v2 acceptance harness to drive app.Run directly instead of building and spawning the binary. Now mechanically possible thanks to Addr() / Stop() on *App; left for a follow-up PR to keep this one reviewable.

Verification

Tests live in internal/app (lifecycle, listen, options, url, reloader, cluster). Highlights:

  • TestApp_StartStop — boot, probe /-/healthy, stop, stop again (idempotency).
  • TestApp_TwoSequentialInstances / TestApp_TwoConcurrentInstances — multiple instances in one process; guards the metrics-per-Registerer behaviour.
  • TestApp_ClusteredStartStop — gossip clustering enabled; exercises the peer-dependent paths and clusterWait.
  • TestApp_ConcurrentStartStop — races Start and Stop (run under -race).
  • TestApp_EmbeddedReloadDoesNotDeadlock: /-/reload completes in embedded mode.
  • TestApp_New_SetupFailureDoesNotDeadlock — setup-failure rollback doesn't block.
  • TestApp_Run_ContextCancel — end-to-end Run wrapper with ctx cancellation.
  • TestApp_serveLoop / TestApp_Stop_AggregatesCleanupErrors — serve-loop exits and aggregated shutdown errors.
  • TestReloader_* — component swap on reload, error path leaves prior state intact, nil-safe stop.
  • TestListenAll_* / TestParseVsockPort / TestOptions_Validate / TestClusterWait.

internal/app statement coverage is ~84%; the full suite passes under -race.

Diff size

 cmd/alertmanager/main.go         | -515  (722 → 207)
 internal/app/app.go              | +444
 internal/app/lifecycle.go        | +319
 internal/app/lifecycle_test.go   | +374
 internal/app/reloader.go         | +237
 internal/app/reloader_test.go    | +178
 internal/app/listen_test.go      | +116
 internal/app/options.go          | +112
 internal/app/metrics.go          |  +98
 internal/app/listen.go           |  +96
 internal/app/options_test.go     |  +76
 internal/app/cluster_test.go     |  +67
 internal/app/url.go              |  +57
 internal/app/cluster.go          |  +29
 go.mod                           |   +2 / -2
 internal/app/url_test.go         | rename (was cmd/alertmanager/main_test.go)

Closes #406

Pull Request Checklist

Please check all the applicable boxes.

  • Please list all open issue(s) discussed with maintainers related to this change
  • Is this a new Receiver integration?
  • Is this a bugfix?
    • I have added tests that can reproduce the bug which pass with this bugfix applied
  • Is this a new feature?
    • I have added tests that test the new feature's functionality
  • Does this change affect performance?
    • I have provided benchmarks comparison that shows performance is improved or is not degraded
      • You can use benchstat to compare benchmarks
    • I have added new benchmarks if required or requested by maintainers
  • Is this a breaking change?
    • My changes do not break the existing cluster messages
    • My changes do not break the existing api
  • I have added/updated the required documentation
  • I have signed-off my commits
  • I will follow best practices for contributing to this project

Which user-facing changes does this PR introduce?

NONE

Summary by CodeRabbit

  • Refactor

    • Consolidated startup/shutdown into a dedicated App lifecycle with a single-run entrypoint and cleaner boot, graceful shutdown, and controlled config reloads.
  • New Features

    • App-owned Prometheus metrics for HTTP handlers and config counts.
    • Eager listener binding so the server reports real bound addresses early.
    • External URL normalization and validation.
    • Config reloader that atomically swaps dispatcher/inhibitor state.
  • Bug Fixes

    • Fixed races/deadlocks around start/reload/stop and improved clustered startup robustness.
  • Tests

    • Expanded tests for lifecycle, listeners, options, URL handling, reloader, and metrics.


coderabbitai Bot commented May 29, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 46a153ea-02dc-4177-87da-aafd3ed3eac8

📥 Commits

Reviewing files that changed from the base of the PR and between fd8d73e and 768678b.

📒 Files selected for processing (1)
  • internal/app/reloader.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/app/reloader.go

📝 Walkthrough

Walkthrough

Extracts Alertmanager startup into internal/app: adds app.Options and App lifecycle (New/Start/Stop/Reload), listener/URL/metrics helpers, a config reloader that atomically swaps dispatcher/inhibitor, and delegates cmd/alertmanager to app.Run after flag parsing and signal setup.

Changes

App Package Refactoring and Extraction

| Layer / File(s) | Summary |
| --- | --- |
| Options and validation: internal/app/options.go, internal/app/options_test.go | Adds Options with DefaultClusterAddr and validate() plus unit tests covering missing dependencies and web config validation. |
| Listener binding and vsock parsing: internal/app/listen.go, internal/app/listen_test.go | Implements eager listener binding (systemd/TCP/vsock), listenAll/listenOne, parseVsockPort, and tests for no-listeners, multiple binds, cleanup on partial failure, and vsock parsing. |
| Cluster helper: internal/app/cluster.go, internal/app/cluster_test.go | Adds clusterWait(p, timeout) func() time.Duration and a test validating wait duration equals time.Duration(peer.Position()) * timeout. |
| URL and metrics helpers: internal/app/url.go, internal/app/url_test.go, internal/app/metrics.go | Adds extURL to normalize external URLs, moves URL test to app package, and introduces app-owned Prometheus metrics plus handler instrumentation. |
| Run entrypoint and setup wiring: internal/app/app.go | Adds exported Run(ctx, opts) and (*App).setup() that validates options; initializes metrics/logger/data dir; prepares gossip/cluster; constructs nflog/silences/alerts; binds listeners; initializes tracing and coordinator; builds reloader, applies initial config, and creates instrumented http.Server. |
| Config reloader implementation and tests: internal/app/reloader.go, internal/app/reloader_test.go | Implements reloader that parses templates, builds integrations for referenced receivers only, constructs and starts inhibitor and dispatcher in order, swaps active components atomically, and includes tests for swap, rollback on error, and nil-safe stop. |
| App lifecycle type and runtime control + tests: internal/app/lifecycle.go, internal/app/lifecycle_test.go | Introduces App type with New, Start, Stop, Reload, Addr, Addrs, serveLoop, reload multiplexing, graceful shutdown with LIFO teardown and aggregated errors, plus extensive lifecycle and concurrency tests. |
| cmd/alertmanager main.go refactor: cmd/alertmanager/main.go | Shrinks main.go to handle flags, logger, signal/reload channel, updates flag defaults (--cluster.listen-address -> app.DefaultClusterAddr, dispatch.start-delay), moves versioncollector registration after logger creation, constructs app.Options, and calls app.Run. |
| go.mod updates: go.mod | Promotes github.com/coreos/go-systemd/v22 and github.com/mdlayher/vsock from indirect to direct requires. |
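
The cluster helper summarized above is small enough to sketch directly from its description. The positioner interface and fixedPeer type are assumptions made for this example; the summary only states that the real peer exposes Position() and that the wait equals time.Duration(peer.Position()) * timeout.

```go
package main

import (
	"fmt"
	"time"
)

// positioner is a hypothetical minimal interface for this sketch; the
// real peer type has a much larger surface.
type positioner interface {
	Position() int
}

// clusterWait returns a function that computes how long this peer should
// wait, scaling the timeout by the peer's position in the cluster.
func clusterWait(p positioner, timeout time.Duration) func() time.Duration {
	return func() time.Duration {
		return time.Duration(p.Position()) * timeout
	}
}

// fixedPeer is a test double that always reports the same position.
type fixedPeer int

func (f fixedPeer) Position() int { return int(f) }

func main() {
	wait := clusterWait(fixedPeer(2), 15*time.Second)
	fmt.Println(wait()) // prints "30s"
}
```

Returning a closure rather than a fixed duration means the wait is recomputed from the peer's current position each time it is called.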

Sequence Diagram(s)

sequenceDiagram
  participant CMD as cmd/alertmanager
  participant AppRun as app.Run
  participant Setup as App.setup
  participant Reloader as reloader.reload
  participant HTTP as http.Server
  participant Cluster as cluster.Peer
  CMD->>AppRun: Run(ctx, opts)
  AppRun->>Setup: New(opts) / setup()
  Setup->>Cluster: optionally create/join gossip peer
  Setup->>Reloader: build reloader and subscribe reloads
  Setup->>HTTP: bind listeners and create server
  Reloader->>Reloader: parse templates, build integrations
  Reloader->>Reloader: start inhibitor -> start dispatcher -> atomic swap
  AppRun->>HTTP: Start ServeMultiple (serve goroutine)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

component/dispatch, component/inhibition

Suggested reviewers

  • Spaceman1701
  • TheMeier
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.27% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'chore: extract main into an embeddable internal/app package' directly and clearly describes the primary change: extracting main logic into a reusable internal/app package for embedding. |
| Description check | ✅ Passed | The PR description comprehensively covers objectives, implementation details, behavioural notes, testing, and verification. It follows the template structure with sign-off confirmation and proper checklist completion. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #406's objectives: main logic is extracted into an embeddable internal/app package with a complete lifecycle API (New/Start/Addr/Reload/Stop/Run), per-instance registries enabling multiple instances, and comprehensive test coverage (~84% statement coverage, passes under -race). |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to extracting main logic into internal/app: cmd/alertmanager/main.go refactored, internal/app files added with lifecycle/helpers/tests, go.mod dependencies updated. Follow-ups (matcher/compat state, v2 harness migration) are explicitly noted as out of scope. |


@siavashs siavashs added the kind/cleanup and go labels May 29, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/alertmanager/main.go`:
- Line 205: The shutdown log is hardcoded to "Received SIGTERM…" which
misreports SIGINT/Ctrl+C; change the logger.Info call in main.go (the location
that currently logs "Received SIGTERM, exiting gracefully...") to log the actual
signal received (use the signal variable from the signal.Notify/select or, if
you cancel via ctx, log ctx.Err() or a generic "shutting down" message) so the
message reflects the real cause; update the handler that calls app.Run and the
signal.Notify/select branch to pass the received os.Signal (or its String())
into logger.Info instead of the fixed "SIGTERM" text.

In `@internal/app/lifecycle.go`:
- Around line 142-163: The Stop method can block forever on the "for range
a.srvc" if Start's serve goroutine never closes a.srvc; change Stop to perform a
non-blocking drain of a.srvc instead of a blocking range so Stop returns safely
even if Start wasn't run. Specifically, update App.Stop to replace the for range
over a.srvc with a loop that repeatedly attempts a non-blocking receive from
a.srvc (e.g., select with a receive case and a default case) until the channel
is drained/closed or there is nothing to read; reference symbols: App.Stop,
a.srvc, Start, New.
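
A non-blocking drain of the kind suggested above might look like the following sketch. The function and variable names are illustrative; only the select-with-default pattern is the point.

```go
package main

import (
	"errors"
	"fmt"
)

// drainNonBlocking reads whatever is currently available on ch and returns
// as soon as the channel is empty or closed. Unlike `for range ch`, it
// never waits for a close that may not come.
func drainNonBlocking(ch <-chan error) []error {
	var errs []error
	for {
		select {
		case err, ok := <-ch:
			if !ok {
				return errs // closed: nothing more will arrive
			}
			errs = append(errs, err)
		default:
			return errs // open but empty (or nil): do not block
		}
	}
}

// demoDrain simulates a serve-error channel that nobody ever closes, as
// would happen if the serving goroutine had never been started.
func demoDrain() int {
	srvc := make(chan error, 2)
	srvc <- errors.New("listener failed")
	return len(drainNonBlocking(srvc))
}

func main() {
	fmt.Println(demoDrain()) // prints 1
}
```

The trade-off is that a non-blocking drain can miss an error sent just after it returns, so it suits shutdown paths where returning promptly matters more than collecting every late error.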

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cfbc0c18-07d3-4da0-85e9-8fea36348cfa

📥 Commits

Reviewing files that changed from the base of the PR and between cc7bf21 and a59bd1f.

📒 Files selected for processing (9)
  • cmd/alertmanager/main.go
  • internal/app/app.go
  • internal/app/cluster.go
  • internal/app/lifecycle.go
  • internal/app/lifecycle_test.go
  • internal/app/metrics.go
  • internal/app/options.go
  • internal/app/url.go
  • internal/app/url_test.go

Move the body of run() from cmd/alertmanager/main.go into a new
internal/app package so Alertmanager can be embedded in tests and
other binaries without shelling out to a compiled binary. Resolves
the long-standing TODO from prometheus#406.

cmd/alertmanager/main.go shrinks from 724 to 196 lines and is now
responsible only for kingpin flag parsing, logger construction,
versioncollector registration, feature-flag / GOMEMLIMIT side
effects, and translating OS signals into context cancellation
(SIGINT/SIGTERM) plus reload events (SIGHUP) consumed by app.Run.

The new internal/app package is split into:

  * options.go  - Options struct, validate(), DefaultClusterAddr
  * app.go      - Run(ctx, opts) error
  * metrics.go  - per-instance Prometheus metrics struct
  * cluster.go  - clusterWait helper
  * url.go      - extURL helper (+ url_test.go for TestExternalURL)

The six previously package-level promauto.NewXxx variables in
cmd/alertmanager/main.go are now constructed per Run() invocation
against opts.Registerer. Combined with threading the registerer
through every collaborator (versioncollector excepted, which stays
in main.go as a process-global), this unblocks running multiple
Alertmanager instances in the same process without duplicate-
registration panics.

Behavioural notes:

  * prometheus.DefaultRegisterer is no longer referenced inside
    app.Run; the binary still passes it in via Options.Registerer
    so on-disk behaviour is identical.
  * app.Run defers srv.Shutdown(5s) on exit. Previously the
    deferred srv.Close lived inside the listen goroutine and never
    ran in practice because os.Exit killed the process first.
    Behaviour for the binary is unchanged; embedded callers now
    get clean HTTP teardown.
  * --cluster.listen-address default moved from a const in
    cmd/alertmanager to the exported app.DefaultClusterAddr.

Known follow-ups intentionally out of scope:

  * matcher/compat.InitFromFlags still mutates package-level
    state; multi-instance tests with different feature flags will
    collide.
  * Richer App lifecycle (New/Start/Addr/Reload/Stop) for tests
    that need :0-port discovery or programmatic reload.
  * Migrating the v2 acceptance harness to use app.Run directly
    instead of building and spawning the binary.

Verification: `go build ./...`, `go vet ./...`, and
`go test -count=1 ./...` all pass, including the existing
test/with_api_v2/acceptance suite which continues to build and
spawn the binary end-to-end.

Closes prometheus#406

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/app/lifecycle.go`:
- Around line 96-113: The Start method can deadlock because the registered
/-/reload handler blocks on sending to the unbuffered a.webReload channel (and
errors from web.ServeMultiple are sent to a.srvc) while the only consumer
(serveLoop, invoked by Run) may not be running for embedders; fix by ensuring
Start spawns the reload-and-error drain loop so a.webReload and a.opts.Reload
are drained even when Run/serveLoop is not used, or alternatively make the
reload handler do a non-blocking send/fail-fast: add a goroutine in Start that
runs the same logic as serveLoop (draining a.webReload, a.opts.Reload and
forwarding errors to a.reload/ reload handler) and ensure web.ServeMultiple
errors sent to a.srvc are observed (do not close a.srvc before draining), or
change the handler to select { case a.webReload <- errc: default: respond with
an immediate error } so the handler never blocks when the drain loop is absent;
update Start, the /-/reload handler, and any use of a.srvc/a.webReload
accordingly.
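
The fail-fast alternative mentioned above can be sketched as a handler that never blocks on the send. This is an assumption-laden illustration: reloadHandler, statusWithoutConsumer, and the choice of 503 as the immediate error are invented here and are not what the PR ultimately implements.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// reloadHandler forwards a reload request to webReload without blocking.
// If no consumer is currently receiving, it fails fast instead of hanging.
// The 503 status is a choice made for this sketch.
func reloadHandler(webReload chan<- chan error) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		errc := make(chan error, 1)
		select {
		case webReload <- errc:
			// A consumer accepted the request; wait for its verdict.
			if err := <-errc; err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			w.WriteHeader(http.StatusOK)
		default:
			http.Error(w, "reload loop not running", http.StatusServiceUnavailable)
		}
	}
}

// statusWithoutConsumer exercises the handler with nobody reading from
// the unbuffered webReload channel, i.e. the embedded case without Run.
func statusWithoutConsumer() int {
	webReload := make(chan chan error)
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/-/reload", nil)
	reloadHandler(webReload)(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusWithoutConsumer()) // prints 503
}
```

With an unbuffered channel the send case only succeeds when a consumer is already blocked in a receive, so this variant trades occasional spurious rejections for a handler that can never deadlock.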

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6a09cb35-1507-4077-b2b1-721ea7afc541

📥 Commits

Reviewing files that changed from the base of the PR and between a59bd1f and 8741088.

📒 Files selected for processing (9)
  • cmd/alertmanager/main.go
  • internal/app/app.go
  • internal/app/cluster.go
  • internal/app/lifecycle.go
  • internal/app/lifecycle_test.go
  • internal/app/metrics.go
  • internal/app/options.go
  • internal/app/url.go
  • internal/app/url_test.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • internal/app/url_test.go
  • internal/app/options.go
  • internal/app/cluster.go
  • internal/app/metrics.go
  • internal/app/url.go
  • internal/app/lifecycle_test.go
  • internal/app/app.go


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/app/lifecycle.go (1)

53-55: ⚡ Quick win

Stale doc comments still point at serveLoop.

webReload is now consumed by reloadRouter, not serveLoop. The same staleness applies to the Reload docstring on Line 175 ("Safe to call concurrently with serveLoop"), since serveLoop no longer routes reloads. In code this deadlock-sensitive, accurate "who consumes this channel" comments matter for future maintainers.

📝 Suggested doc fixes
 	// webReload is the channel exposed by httpserver.Register for the
-	// /-/reload HTTP endpoint. We read from it in serveLoop.
+	// /-/reload HTTP endpoint. We read from it in reloadRouter.
 	webReload chan chan error
 // Reload triggers a configuration reload (the programmatic equivalent of
-// SIGHUP). Safe to call concurrently with serveLoop.
+// SIGHUP). Safe to call concurrently with the running App.
 func (a *App) Reload(_ context.Context) error {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/lifecycle.go` around lines 53 - 55, Doc comments are stale:
update the comment for the webReload channel and the Reload docstring (mentions
of "serveLoop") to reflect that reloads are now consumed by reloadRouter, not
serveLoop; locate the declaration webReload and the Reload method/docstring and
change references from serveLoop to reloadRouter and adjust wording about
concurrency to say "Safe to call concurrently with reloadRouter" (or similar) so
the consumer is accurate and deadlock-sensitive guidance is preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/app/lifecycle_test.go`:
- Around line 161-168: The tests TestApp_EmbeddedReloadDoesNotDeadlock and
TestApp_New_SetupFailureDoesNotDeadlock use require.NoError/Equal/Error inside
spawned goroutines (the anonymous go func that closes done), which can call
t.FailNow from a child goroutine; change these to not call require from the
goroutine: either (A) replace require.* with assert.* inside the goroutine
(e.g., assert.NoError/assert.Equal/assert.Error) or (B) capture the goroutine
results by sending error/status values down a channel (use the existing done
channel or a new result channel) and perform require.* assertions on those
results in the main test goroutine after <-done; update the anonymous functions
and their callers (the POST to "/-/reload" and the setup-failure goroutine) to
use one of these patterns so all require.* calls run on the main test goroutine.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7df0d969-feab-4bf1-8cf6-684bb59bd394

📥 Commits

Reviewing files that changed from the base of the PR and between 8741088 and b273f3f.

📒 Files selected for processing (3)
  • internal/app/app.go
  • internal/app/lifecycle.go
  • internal/app/lifecycle_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/app/app.go

introduce an App lifecycle so tests and embedders can drive
Alertmanager without OS signals or os/exec, and discover the bound
HTTP address even when listening on ":0".

API:

  New(opts) (*App, error)
  (*App).Start() error
  (*App).Addr() string             // first listener
  (*App).Addrs() []string          // all listeners
  (*App).Reload(ctx) error
  (*App).Stop(ctx) error

Run is preserved as a thin wrapper (New + Start + serveLoop + Stop)
with a deferred Stop on a fresh 30s context so cleanup also runs on
panic, matching the implicit panic-safety of the previous defer-
based implementation.

Internally, setup uses a cleanup stack (a.onStop) that Stop drains
in LIFO order, mirroring Go's defer semantics so the source order
of the old `defer X` lines in Run is preserved verbatim and the
shutdown ordering does not depend on hand-written reverse-order
code. Listeners are bound at New time via a new listenAll helper
that calls net.Listen directly (so Addr is meaningful before
Start); web.ServeMultiple is then invoked in Start. Systemd socket
activation is not supported when embedding and returns an explicit
error pointing callers back to cmd/alertmanager.

Stop honors its context parameter for the HTTP shutdown step,
capped at 5s, so callers passing a tighter deadline get faster
teardown and callers passing context.Background get the default.
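
The capped-timeout behaviour described in this commit message falls out of context.WithTimeout, which keeps the earlier of the parent's deadline and the new one. A minimal sketch, with hypothetical helper names (maxHTTPShutdown, shutdownContext, remainingWithParentTimeout):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// maxHTTPShutdown mirrors the 5s cap mentioned in the commit message.
const maxHTTPShutdown = 5 * time.Second

// shutdownContext derives the HTTP-shutdown context from the caller's.
// context.WithTimeout keeps the earlier of the parent deadline and the cap.
func shutdownContext(parent context.Context) (context.Context, context.CancelFunc) {
	return context.WithTimeout(parent, maxHTTPShutdown)
}

// remainingWithParentTimeout reports how long the derived context allows.
// d <= 0 means the caller passed context.Background().
func remainingWithParentTimeout(d time.Duration) time.Duration {
	parent := context.Background()
	if d > 0 {
		var cancel context.CancelFunc
		parent, cancel = context.WithTimeout(parent, d)
		defer cancel()
	}
	ctx, cancel := shutdownContext(parent)
	defer cancel()
	deadline, _ := ctx.Deadline()
	return time.Until(deadline)
}

func main() {
	// Background parent: the cap applies. Tight parent: the parent wins.
	fmt.Println(remainingWithParentTimeout(0).Round(time.Second))
	fmt.Println(remainingWithParentTimeout(time.Second).Round(time.Second))
}
```

No explicit minimum computation is needed, because a derived context can never outlive its parent.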

Tests cover: single instance round-trip; two sequential instances
in the same process (guards the Phase A metrics-per-Registerer
fix against duplicate-registration panics); two concurrent
instances on distinct ephemeral ports; and the Run wrapper
end-to-end with ctx cancellation. All pass under -race.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs siavashs marked this pull request as ready for review May 29, 2026 11:30
@siavashs siavashs requested a review from a team as a code owner May 29, 2026 11:30
Comment thread internal/app/app.go
	if d := disp.Load(); d != nil {
		d.Stop()
	}
})

Here in setup() we may also need to register an onStop(...) for the inhibitor; otherwise we leak it at Stop():

a.onStop(func() {
	if i := inhibitor.Load(); i != nil {
		i.Stop()
	}
})

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/app/lifecycle_test.go (1)

205-213: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stabilize the embedded reload regression test startup path.

http.Post is issued immediately after Start(), so this can fail with connection refused before the listener is ready, creating flakes unrelated to the deadlock behavior.

Proposed fix
 func TestApp_EmbeddedReloadDoesNotDeadlock(t *testing.T) {
 	a, err := New(testOptions(t))
 	require.NoError(t, err)
 	require.NoError(t, a.Start())
 	defer func() { _ = a.Stop(t.Context()) }()
+	client := &http.Client{Timeout: 2 * time.Second}
+
+	require.Eventually(t, func() bool {
+		resp, err := client.Get("http://" + a.Addr() + "/-/healthy")
+		if err != nil {
+			return false
+		}
+		_, _ = io.Copy(io.Discard, resp.Body)
+		_ = resp.Body.Close()
+		return resp.StatusCode == http.StatusOK
+	}, 5*time.Second, 50*time.Millisecond)

 	type reloadResult struct {
 		err    error
 		status int
 	}
 	resultCh := make(chan reloadResult, 1)
 	go func() {
-		resp, err := http.Post("http://"+a.Addr()+"/-/reload", "", nil)
+		resp, err := client.Post("http://"+a.Addr()+"/-/reload", "", nil)
 		if err != nil {
 			resultCh <- reloadResult{err: err}
 			return
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/lifecycle_test.go` around lines 205 - 213, The test posts to
"http://"+a.Addr()+"/-/reload" immediately after calling a.Start(), which can
race with listener setup and cause transient connection-refused flakes; replace
the single http.Post call with a short retry/wait loop that probes the listener
(e.g., net.Dial("tcp", a.Addr()) or repeated http.Post attempts) with a small
backoff and overall timeout before sending the request, and only proceed to
io.Copy/resp.Body.Close() and sending resultCh <- reloadResult once a successful
connection/response is obtained; update the goroutine that currently uses
http.Post, a.Addr(), resultCh, and reloadResult to implement this
retry-and-timeout logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c018ba7-289c-4178-bd2b-888bb5ea0e2b

📥 Commits

Reviewing files that changed from the base of the PR and between cdd9230 and 52cb2f5.

📒 Files selected for processing (8)
  • go.mod
  • internal/app/app.go
  • internal/app/lifecycle.go
  • internal/app/lifecycle_test.go
  • internal/app/listen.go
  • internal/app/listen_test.go
  • internal/app/options.go
  • internal/app/options_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/app/options.go
  • internal/app/app.go

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/app/lifecycle.go (1)

116-120: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix nil-channel panic when Stop races Start initialization

The below code does not show removed or replaced code.

		a.started.Store(true)
		a.routerQuit = make(chan struct{})
		a.routerDone = make(chan struct{})

		// reloadRouter consumes /-/reload requests and opts.Reload sends so

In internal/app/lifecycle.go, Start calls a.started.Store(true) before a.routerQuit / a.routerDone are allocated (lines 116-118). If Stop runs concurrently, its if a.started.Load() path (lines 198-221) can execute close(a.routerQuit) and <-a.routerDone while those channels are still nil. Closing a nil channel panics, and receiving from a nil channel blocks forever.

Serialize the lifecycle transition (e.g., only set started after routerQuit/routerDone are allocated and the related goroutines are ready, or guard the Stop close/wait with a shared mutex/state).

Also applies to: 198-221


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ae85aabc-9577-4342-a55a-0f7062e46ee4

📥 Commits

Reviewing files that changed from the base of the PR and between cdd9230 and 52cb2f5.

📒 Files selected for processing (8)
  • go.mod
  • internal/app/app.go
  • internal/app/lifecycle.go
  • internal/app/lifecycle_test.go
  • internal/app/listen.go
  • internal/app/listen_test.go
  • internal/app/options.go
  • internal/app/options_test.go
🚧 Files skipped from review as they are similar to previous changes (7)
  • internal/app/options_test.go
  • internal/app/listen.go
  • internal/app/options.go
  • internal/app/lifecycle_test.go
  • internal/app/app.go
  • go.mod
  • internal/app/listen_test.go

siavashs added 2 commits June 5, 2026 11:58
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Extract the configuration-scoped subgraph (routes, receivers, pipeline,
inhibitor, dispatcher) out of the monolithic setup() into a dedicated
reloader type. reloader.reload performs the subtle stop-old/build-new/
wait-for-loading/atomic-swap sequence and reloader.stop tears down the
live inhibitor+dispatcher, so the swap ordering now lives in one cohesive,
independently testable place. The long-lived singletons stay in setup and
continue to be updated in place (apih.Update, eventRec/tracing ApplyConfig)
rather than rebuilt.

Make shutdown observable and harder to misuse:
- onStop now registers named, error-returning teardown steps; Stop runs
  them all LIFO, logs each failure by name, and returns the errors joined
  with the HTTP-shutdown error instead of swallowing them.
- Add trackClose to register a resource's Close right at its acquisition,
  closing the "acquire here, register cleanup elsewhere" gap that
  previously leaked goroutines/handles.

Add tests and raise internal/app coverage from ~77% to ~84%:
- reloader_test: component swap on reload, error path leaves prior state
  intact, nil-safe stop.
- cluster_test / TestApp_ClusteredStartStop: exercise clusterWait and the
  peer-dependent branches in setup and reload.
- TestApp_serveLoop: listener-error, clean-close, and ctx-cancel exits.
- TestApp_Stop_AggregatesCleanupErrors and TestApp_Start_BeforeNewFails.

No behavioral change to teardown order or HTTP-drain sequencing.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/app/app.go (1)

122-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Register gossip peer teardown immediately after cluster.Create.

The cleanup is registered too late. If Line 154, Line 190, Line 230, or Line 324 returns an error after the peer is created at Line 124, Stop has no peer cleanup to run, so the startup-failure path leaks the gossip peer's sockets/goroutines. internal/app/cluster_test.go Lines 32-51 already show that peer.Leave(...) is safe on a freshly created peer, so this teardown can be installed right after cluster.Create succeeds and before the later setup steps run.

💡 Suggested fix
-	var peer *cluster.Peer
+	var (
+		peer         *cluster.Peer
+		settleCancel = func() {}
+	)
 	if opts.ClusterBindAddr != "" {
 		peer, err = cluster.Create(
 			logger.With("component", "cluster"),
 			reg,
@@
 		if err != nil {
 			return fmt.Errorf("unable to initialize gossip mesh: %w", err)
 		}
+		a.onStop("cluster peer leave", func() error {
+			settleCancel()
+			if err := peer.Leave(10 * time.Second); err != nil {
+				return fmt.Errorf("unable to leave gossip mesh: %w", err)
+			}
+			return nil
+		})
 		m.clusterEnabled.Set(1)
 	}
@@
-		settleCtx, settleCancel := context.WithTimeout(context.Background(), opts.SettleTimeout)
-		a.onStop("cluster peer leave", func() error {
-			settleCancel()
-			if err := peer.Leave(10 * time.Second); err != nil {
-				return fmt.Errorf("unable to leave gossip mesh: %w", err)
-			}
-			return nil
-		})
+		settleCtx, cancel := context.WithTimeout(context.Background(), opts.SettleTimeout)
+		settleCancel = cancel
 		go peer.Settle(settleCtx, opts.GossipInterval*10)
 		eventRec.SetClusterPeer(peer)
 	}

Also applies to: 247-258

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/app.go` around lines 122 - 145, After cluster.Create returns a
non-nil peer, immediately register the peer teardown with the existing cleanup
hook (e.g. a.onStop("cluster peer leave", func() error { return
peer.Leave(10 * time.Second) })) before performing any subsequent setup or
calling m.clusterEnabled.Set(1); do this right after the successful
cluster.Create call in the first block that assigns peer, and remove the late
registration from the second block (the 247-258 region), so that any later
early return will still invoke peer.Leave and avoid leaking
sockets/goroutines.
internal/app/lifecycle.go (1)

166-175: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle closed Options.Reload to avoid reload hot-loop in reloadRouter.

If opts.Reload is ever closed, case <-a.opts.Reload is permanently ready, so reloadRouter will continuously call a.coordinator.Reload() until shutdown (the binary never closes this channel, but embedders could). Gate the case on ok and disable it after the first closed read.

Suggested fix
 func (a *App) reloadRouter() {
 	defer close(a.routerDone)
+	reloadCh := a.opts.Reload
 	for {
 		select {
 		case <-a.routerQuit:
 			return
-		case <-a.opts.Reload:
+		case _, ok := <-reloadCh:
+			if !ok {
+				reloadCh = nil
+				continue
+			}
 			if err := a.coordinator.Reload(); err != nil {
 				a.logger.Error("configuration reload failed", "err", err)
 			}
 		case errc := <-a.webReload:
 			errc <- a.coordinator.Reload()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/lifecycle.go` around lines 166 - 175, The select branch reading
from a.opts.Reload should detect a closed channel and disable further reloads to
avoid a hot-loop; copy a.opts.Reload into a local variable before the loop
(e.g., reloadCh := a.opts.Reload), receive with the comma-ok pattern (case _, ok
:= <-reloadCh), and if ok call a.coordinator.Reload(), otherwise set reloadCh =
nil so the select no longer treats that case as permanently ready; update the
reloadRouter select logic accordingly to reference reloadCh and
a.coordinator.Reload.
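
The nil-channel technique behind this fix can be shown in isolation. The sketch below uses a hypothetical `drain` function, not the PR's reloadRouter: once the reload channel is closed, the local copy is set to nil, and since a receive from a nil channel blocks forever, that select case is effectively switched off.

```go
package main

import "fmt"

// drain counts the values received on reload until quit is closed.
// It is a hypothetical reduction of the router loop to its select logic.
func drain(reload <-chan struct{}, quit <-chan struct{}) int {
	n := 0
	for {
		select {
		case <-quit:
			return n
		case _, ok := <-reload:
			if !ok {
				// A closed channel is always ready. Setting the local copy
				// to nil disables this case instead of spinning on it.
				reload = nil
				continue
			}
			n++
		}
	}
}

func main() {
	reload := make(chan struct{}) // unbuffered: each send waits for a receive
	quit := make(chan struct{})
	result := make(chan int)
	go func() { result <- drain(reload, quit) }()

	for i := 0; i < 3; i++ {
		reload <- struct{}{}
	}
	close(reload) // without the ok check this would cause a busy loop
	close(quit)
	fmt.Println(<-result) // prints 3
}
```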
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/app/reloader.go`:
- Around line 90-93: The reload() implementation currently mutates long-lived
components (e.g., eventRec.ApplyConfig and tracingMgr.ApplyConfig) before the
new runtime graph is fully built, which can leave a partially applied config on
failure; change the flow to build the entire new runtime graph first (create new
dispatcher/inhibitor/tracing manager and prepare new event recorder config)
without calling ApplyConfig on existing shared instances, then atomically
swap/publish the new instances and finally call ApplyConfig on the new
components (or call ApplyConfig during construction of new instances);
alternatively, if you must apply to existing instances, implement a rollback
path that restores prior state on any subsequent error. Target symbols:
reload(), eventRec.ApplyConfig, tracingMgr.ApplyConfig, the dispatcher/inhibitor
publish/stop logic, and any code paths at the other mentioned locations (around
lines where ApplyConfig is called) to ensure no in-place mutation happens before
a successful full build.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 78d02d37-9dc1-4753-aa05-9cc4b909a5e3

📥 Commits

Reviewing files that changed from the base of the PR and between 52cb2f5 and 0103e36.

📒 Files selected for processing (6)
  • internal/app/app.go
  • internal/app/cluster_test.go
  • internal/app/lifecycle.go
  • internal/app/lifecycle_test.go
  • internal/app/reloader.go
  • internal/app/reloader_test.go

Comment thread internal/app/reloader.go Outdated
siavashs added 4 commits June 5, 2026 12:48
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/app/reloader.go (1)

159-160: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Publish the new inhibitor only after it has finished loading.

internal/app/app.go wires the API mute callback to r.inhibitor.Load(), so storing newInhibitor before Run/WaitForLoading exposes an incomplete inhibition cache to concurrent API requests during reload. The dispatcher already avoids this by publishing only after WaitForLoading; the inhibitor should follow the same pattern.

Suggested fix
 	newInhibitor := inhibit.NewInhibitor(r.alerts, conf.InhibitRules, r.logger, r.eventRec)
-	r.inhibitor.Store(newInhibitor)

 	// An interface value that holds a nil concrete value is non-nil.
 	// Therefore we explicitly pass an empty interface, to detect if the
 	// cluster is not enabled in notify.
@@
 	// First, start the inhibitor so the inhibition cache can populate.
 	// Wait for this to load alerts before starting the dispatcher so
 	// we don't accidentally notify for an alert that will be inhibited.
 	go newInhibitor.Run()
 	newInhibitor.WaitForLoading()
+	r.inhibitor.Store(newInhibitor)

 	// Next, start the dispatcher and wait for it to load before swapping
 	// the disp pointer. This ensures that the API doesn't see the new
 	// dispatcher before it finishes populating the aggrGroups.

Also applies to: 185-188, 221-225

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/reloader.go` around lines 159 - 160, The new inhibitor instance
created via inhibit.NewInhibitor(...) should finish loading before it is
published to r.inhibitor; start it with go newInhibitor.Run(), block on
newInhibitor.WaitForLoading() (the same sequence used by the dispatcher), and
only then call r.inhibitor.Store(newInhibitor) so the API callback that calls
r.inhibitor.Load() won't see an incomplete cache. Apply the same change to the
other places where you create and immediately store a new inhibitor (the other
newInhibitor creation sites) so each new instance has finished loading before
being published.
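
The publish-after-load ordering can be sketched with `sync/atomic`'s generic `atomic.Pointer` (Go 1.19+). The `cache` and `reloader` types below are hypothetical stand-ins for the inhibitor and the PR's reloader; only the ordering (run, wait for loading, then Store) mirrors the suggested fix.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cache is a hypothetical stand-in for a component that must load its state
// before readers may use it.
type cache struct {
	ready chan struct{}
	items []string
}

func newCache() *cache { return &cache{ready: make(chan struct{})} }

// run populates the cache, then signals that loading has finished.
func (c *cache) run(src []string) {
	c.items = append(c.items, src...)
	close(c.ready)
}

// waitForLoading blocks until run has finished populating the cache.
func (c *cache) waitForLoading() { <-c.ready }

// reloader holds the currently published cache for concurrent readers.
type reloader struct {
	current atomic.Pointer[cache]
}

// reload builds and loads the replacement first and publishes it last, so a
// concurrent current.Load() sees either the old cache or a fully loaded one.
func (r *reloader) reload(src []string) {
	next := newCache()
	go next.run(src)
	next.waitForLoading()
	r.current.Store(next)
}

func main() {
	var r reloader
	r.reload([]string{"rule-a", "rule-b"})
	fmt.Println(len(r.current.Load().items)) // prints 2
}
```

Until the first reload completes, Load returns nil, so readers in this sketch would need a nil check.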

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cee9d348-7225-4edf-a338-8c048a2bbc41

📥 Commits

Reviewing files that changed from the base of the PR and between ad9f3d2 and fd8d73e.

📒 Files selected for processing (4)
  • internal/app/app.go
  • internal/app/lifecycle.go
  • internal/app/lifecycle_test.go
  • internal/app/reloader.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • internal/app/lifecycle_test.go
  • internal/app/lifecycle.go
  • internal/app/app.go

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs
Contributor Author

siavashs commented Jun 5, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Labels

go Pull requests that update Go code kind/cleanup

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extract main function into package

2 participants