chore: extract main into an embeddable internal/app package by siavashs · Pull Request #5259 · prometheus/alertmanager

siavashs · 2026-05-29T10:30:15Z

Summary

Resolves the long-standing TODO from #406 by extracting the Alertmanager process logic out of cmd/alertmanager/main.go into a new internal/app package, and giving it a lifecycle API (New / Start / Addr / Reload / Stop) so tests and other binaries can embed Alertmanager in-process instead of building and shelling out to the compiled binary.

cmd/alertmanager/main.go shrinks from 722 → 207 lines and now owns only: kingpin flag parsing, logger construction, versioncollector registration, feature-flag / GOMEMLIMIT side effects, and translating OS signals into context cancellation (SIGINT/SIGTERM) plus reload events (SIGHUP) consumed by app.Run.

What this changes

A reusable `internal/app` package

The package is mainly intended for internal use, it can be used by other projects embedding Alertmanager as well.

A lifecycle API for embedders

New(opts) (*App, error)
(*App).Start() error
(*App).Addr() string             // first listener
(*App).Addrs() []string          // all listeners
(*App).Reload(ctx) error
(*App).Stop(ctx) error

Run is a thin wrapper (New + Start + serveLoop + Stop) with a deferred Stop so cleanup also runs on panic. Listeners are bound at New time so Addr()/Addrs() report the real bound ports (including kernel-assigned :0) before serving starts.

Per-instance state, so multiple instances can coexist

The Prometheus collectors that used to be package-level promauto variables in cmd/alertmanager/main.go are now built per Run() against opts.Registerer, which is threaded through every collaborator (versioncollector excepted, which stays process-global in main.go). This unblocks running multiple Alertmanager instances in the same process without duplicate-registration panics.

Deterministic, observable shutdown

setup registers teardown steps on a LIFO cleanup stack that Stop drains in reverse, mirroring Go's defer semantics so shutdown ordering follows construction order automatically. Each step is named; Stop runs them all, logs any failure by name, and returns the errors joined with the HTTP-shutdown error. The HTTP shutdown honors a single timeout derived from the context passed to Stop.

Isolated config-reload logic

The config-scoped subgraph lives in a dedicated reloader type. reloader.reload performs the stop-old → build-new → wait-for-loading → atomic-swap sequence, and reloader.stop tears down the live inhibitor and dispatcher at shutdown. The long-lived singletons (nflog, silences, alerts, cluster peer, API, event recorder, tracing) are constructed once and updated in place on reload (apih.Update, eventRec/tracing ApplyConfig) rather than rebuilt.

Behavioural notes

prometheus.DefaultRegisterer is no longer referenced inside app.Run; the binary still passes it in via Options.Registerer, so on-disk behaviour is identical.
srv.Shutdown now actually runs on Run exit (previously the deferred srv.Close lived inside the listen goroutine and never ran in practice because os.Exit killed the process first). Behaviour for the binary is unchanged; embedded callers now get clean HTTP teardown.
Systemd socket activation and vsock:// listen addresses work under both the binary and embedders; the external URL is derived from the actual bound address.
tracingManager.Stop is part of the cleanup stack and always runs, not just on ctx.Done().
Start/Stop are concurrency-safe and /-/reload works in embedded mode (no deadlock without Run/serveLoop).
--cluster.listen-address default moved from a const in cmd/alertmanager to the exported app.DefaultClusterAddr.

Known follow-ups (out of scope)

matcher/compat.InitFromFlags still mutates package-level state; multi-instance tests with different feature flags will collide. Tracked separately.
Migrating the v2 acceptance harness to drive app.Run directly instead of building and spawning the binary. Now mechanically possible thanks to Addr() / Stop() on *App; left for a follow-up PR to keep this one reviewable.

Verification

Tests live in internal/app (lifecycle, listen, options, url, reloader, cluster). Highlights:

TestApp_StartStop — boot, probe /-/healthy, stop, stop again (idempotency).
TestApp_TwoSequentialInstances / TestApp_TwoConcurrentInstances — multiple instances in one process; guards the metrics-per-Registerer behaviour.
TestApp_ClusteredStartStop — gossip clustering enabled; exercises the peer-dependent paths and clusterWait.
TestApp_ConcurrentStartStop — races Start and Stop (run under -race).
TestApp_EmbeddedReloadDoesNotDeadlock — /-/reload completes in embedded mode.
TestApp_New_SetupFailureDoesNotDeadlock — setup-failure rollback doesn't block.
TestApp_Run_ContextCancel — end-to-end Run wrapper with ctx cancellation.
TestApp_serveLoop / TestApp_Stop_AggregatesCleanupErrors — serve-loop exits and aggregated shutdown errors.
TestReloader_* — component swap on reload, error path leaves prior state intact, nil-safe stop.
TestListenAll_* / TestParseVsockPort / TestOptions_Validate / TestClusterWait.

internal/app statement coverage is ~84%; the full suite passes under -race.

Diff size

 cmd/alertmanager/main.go         | -515  (722 → 207)
 internal/app/app.go              | +444
 internal/app/lifecycle.go        | +319
 internal/app/lifecycle_test.go   | +374
 internal/app/reloader.go         | +237
 internal/app/reloader_test.go    | +178
 internal/app/listen_test.go      | +116
 internal/app/options.go          | +112
 internal/app/metrics.go          |  +98
 internal/app/listen.go           |  +96
 internal/app/options_test.go     |  +76
 internal/app/cluster_test.go     |  +67
 internal/app/url.go              |  +57
 internal/app/cluster.go          |  +29
 go.mod                           |   +2 / -2
 internal/app/url_test.go         | rename (was cmd/alertmanager/main_test.go)

Closes #406

Pull Request Checklist

Please check all the applicable boxes.

Which user-facing changes does this PR introduce?

NONE

Summary by CodeRabbit

Refactor
- Consolidated startup/shutdown into a dedicated App lifecycle with a single-run entrypoint and cleaner boot, graceful shutdown, and controlled config reloads.
New Features
- App-owned Prometheus metrics for HTTP handlers and config counts.
- Eager listener binding so the server reports real bound addresses early.
- External URL normalization and validation.
- Config reloader that atomically swaps dispatcher/inhibitor state.
Bug Fixes
- Fixed races/deadlocks around start/reload/stop and improved clustered startup robustness.
Tests
- Expanded tests for lifecycle, listeners, options, URL handling, reloader, and metrics.

coderabbitai · 2026-05-29T10:30:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 46a153ea-02dc-4177-87da-aafd3ed3eac8

📥 Commits

Reviewing files that changed from the base of the PR and between fd8d73e and 768678b.

📒 Files selected for processing (1)

internal/app/reloader.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/app/reloader.go

📝 Walkthrough

Walkthrough

Extracts Alertmanager startup into internal/app: adds app.Options and App lifecycle (New/Start/Stop/Reload), listener/URL/metrics helpers, a config reloader that atomically swaps dispatcher/inhibitor, and delegates cmd/alertmanager to app.Run after flag parsing and signal setup.

Changes

App Package Refactoring and Extraction

Layer / File(s)	Summary
Options and validation `internal/app/options.go`, `internal/app/options_test.go`	Adds `Options` with `DefaultClusterAddr` and `validate()` plus unit tests covering missing dependencies and web config validation.
Listener binding and vsock parsing `internal/app/listen.go`, `internal/app/listen_test.go`	Implements eager listener binding (systemd/TCP/vsock), `listenAll`/`listenOne`, `parseVsockPort`, and tests for no-listeners, multiple binds, cleanup on partial failure, and vsock parsing.
Cluster helper `internal/app/cluster.go`, `internal/app/cluster_test.go`	Adds `clusterWait(p, timeout) func() time.Duration` and a test validating wait duration equals `time.Duration(peer.Position()) * timeout`.
URL and metrics helpers `internal/app/url.go`, `internal/app/url_test.go`, `internal/app/metrics.go`	Adds `extURL` to normalize external URLs, moves URL test to `app` package, and introduces app-owned Prometheus metrics plus handler instrumentation.
Run entrypoint and setup wiring `internal/app/app.go`	Adds exported `Run(ctx, opts)` and `(*App).setup()` that validates options; initializes metrics/logger/data dir; prepares gossip/cluster; constructs nflog/silences/alerts; binds listeners; initializes tracing and coordinator; builds `reloader`, applies initial config, and creates instrumented `http.Server`.
Config reloader implementation and tests `internal/app/reloader.go`, `internal/app/reloader_test.go`	Implements `reloader` that parses templates, builds integrations for referenced receivers only, constructs and starts inhibitor and dispatcher in order, swaps active components atomically, and includes tests for swap, rollback on error, and nil-safe stop.
App lifecycle type and runtime control + tests `internal/app/lifecycle.go`, `internal/app/lifecycle_test.go`	Introduces `App` type with `New`, `Start`, `Stop`, `Reload`, `Addr`, `Addrs`, `serveLoop`, reload multiplexing, graceful shutdown with LIFO teardown and aggregated errors, plus extensive lifecycle and concurrency tests.
cmd/alertmanager main.go refactor `cmd/alertmanager/main.go`	Shrinks main.go to handle flags, logger, signal/reload channel, updates flag defaults (`--cluster.listen-address` -> `app.DefaultClusterAddr`, `dispatch.start-delay`), moves `versioncollector` registration after logger creation, constructs `app.Options`, and calls `app.Run`.
go.mod updates `go.mod`	Promotes `github.com/coreos/go-systemd/v22` and `github.com/mdlayher/vsock` from indirect to direct requires.

Sequence Diagram(s)

sequenceDiagram
  participant CMD as cmd/alertmanager
  participant AppRun as app.Run
  participant Setup as App.setup
  participant Reloader as reloader.reload
  participant HTTP as http.Server
  participant Cluster as cluster.Peer
  CMD->>AppRun: Run(ctx, opts)
  AppRun->>Setup: New(opts) / setup()
  Setup->>Cluster: optionally create/join gossip peer
  Setup->>Reloader: build reloader and subscribe reloads
  Setup->>HTTP: bind listeners and create server
  Reloader->>Reloader: parse templates, build integrations
  Reloader->>Reloader: start inhibitor -> start dispatcher -> atomic swap
  AppRun->>HTTP: Start ServeMultiple (serve goroutine)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

prometheus/alertmanager#5246: Related to event-recorder output model changes interacting with event-recorder wiring in the new app startup.
prometheus/alertmanager#5072: Overlaps with event-recorder and initial config/apply ordering changes in the new app lifecycle.

Suggested labels

component/dispatch, component/inhibition

Suggested reviewers

Spaceman1701
TheMeier

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.27% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'chore: extract main into an embeddable internal/app package' directly and clearly describes the primary change: extracting main logic into a reusable internal/app package for embedding.
Description check	✅ Passed	The PR description comprehensively covers objectives, implementation details, behavioural notes, testing, and verification. It follows the template structure with sign-off confirmation and proper checklist completion.
Linked Issues check	✅ Passed	The PR fully addresses issue `#406`'s objectives: main logic is extracted into an embeddable internal/app package with a complete lifecycle API (New/Start/Addr/Reload/Stop/Run), per-instance registries enabling multiple instances, and comprehensive test coverage (~84% statement coverage, passes under -race).
Out of Scope Changes check	✅ Passed	All changes are directly scoped to extracting main logic into internal/app: cmd/alertmanager/main.go refactored, internal/app files added with lifecycle/helpers/tests, go.mod dependencies updated. Follow-ups (matcher/compat state, v2 harness migration) are explicitly noted as out of scope.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/alertmanager/main.go`:
- Line 205: The shutdown log is hardcoded to "Received SIGTERM…" which
misreports SIGINT/Ctrl+C; change the logger.Info call in main.go (the location
that currently logs "Received SIGTERM, exiting gracefully...") to log the actual
signal received (use the signal variable from the signal.Notify/select or, if
you cancel via ctx, log ctx.Err() or a generic "shutting down" message) so the
message reflects the real cause; update the handler that calls app.Run and the
signal.Notify/select branch to pass the received os.Signal (or its String())
into logger.Info instead of the fixed "SIGTERM" text.

In `@internal/app/lifecycle.go`:
- Around line 142-163: The Stop method can block forever on the "for range
a.srvc" if Start's serve goroutine never closes a.srvc; change Stop to perform a
non-blocking drain of a.srvc instead of a blocking range so Stop returns safely
even if Start wasn't run. Specifically, update App.Stop to replace the for range
over a.srvc with a loop that repeatedly attempts a non-blocking receive from
a.srvc (e.g., select with a receive case and a default case) until the channel
is drained/closed or there is nothing to read; reference symbols: App.Stop,
a.srvc, Start, New.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cfbc0c18-07d3-4da0-85e9-8fea36348cfa

📥 Commits

Reviewing files that changed from the base of the PR and between cc7bf21 and a59bd1f.

📒 Files selected for processing (9)

cmd/alertmanager/main.go
internal/app/app.go
internal/app/cluster.go
internal/app/lifecycle.go
internal/app/lifecycle_test.go
internal/app/metrics.go
internal/app/options.go
internal/app/url.go
internal/app/url_test.go

Move the body of run() from cmd/alertmanager/main.go into a new internal/app package so Alertmanager can be embedded in tests and other binaries without shelling out to a compiled binary. Resolves the long-standing TODO from prometheus#406. cmd/alertmanager/main.go shrinks from 724 to 196 lines and is now responsible only for kingpin flag parsing, logger construction, versioncollector registration, feature-flag / GOMEMLIMIT side effects, and translating OS signals into context cancellation (SIGINT/SIGTERM) plus reload events (SIGHUP) consumed by app.Run. The new internal/app package is split into: * options.go - Options struct, validate(), DefaultClusterAddr * app.go - Run(ctx, opts) error * metrics.go - per-instance Prometheus metrics struct * cluster.go - clusterWait helper * url.go - extURL helper (+ url_test.go for TestExternalURL) The six previously package-level promauto.NewXxx variables in cmd/alertmanager/main.go are now constructed per Run() invocation against opts.Registerer. Combined with threading the registerer through every collaborator (versioncollector excepted, which stays in main.go as a process-global), this unblocks running multiple Alertmanager instances in the same process without duplicate- registration panics. Behavioural notes: * prometheus.DefaultRegisterer is no longer referenced inside app.Run; the binary still passes it in via Options.Registerer so on-disk behaviour is identical. * app.Run defers srv.Shutdown(5s) on exit. Previously the deferred srv.Close lived inside the listen goroutine and never ran in practice because os.Exit killed the process first. Behaviour for the binary is unchanged; embedded callers now get clean HTTP teardown. * --cluster.listen-address default moved from a const in cmd/alertmanager to the exported app.DefaultClusterAddr. Known follow-ups intentionally out of scope: * matcher/compat.InitFromFlags still mutates package-level state; multi-instance tests with different feature flags will collide. * Richer App lifecycle (New/Start/Addr/Reload/Stop) for tests that need :0-port discovery or programmatic reload. * Migrating the v2 acceptance harness to use app.Run directly instead of building and spawning the binary. Verification: `go build ./...`, `go vet ./...`, and `go test -count=1 ./...` all pass, including the existing test/with_api_v2/acceptance suite which continues to build and spawn the binary end-to-end. Closes prometheus#406 Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/app/lifecycle.go`:
- Around line 96-113: The Start method can deadlock because the registered
/-/reload handler blocks on sending to the unbuffered a.webReload channel (and
errors from web.ServeMultiple are sent to a.srvc) while the only consumer
(serveLoop, invoked by Run) may not be running for embedders; fix by ensuring
Start spawns the reload-and-error drain loop so a.webReload and a.opts.Reload
are drained even when Run/serveLoop is not used, or alternatively make the
reload handler do a non-blocking send/fail-fast: add a goroutine in Start that
runs the same logic as serveLoop (draining a.webReload, a.opts.Reload and
forwarding errors to a.reload/ reload handler) and ensure web.ServeMultiple
errors sent to a.srvc are observed (do not close a.srvc before draining), or
change the handler to select { case a.webReload <- errc: default: respond with
an immediate error } so the handler never blocks when the drain loop is absent;
update Start, the /-/reload handler, and any use of a.srvc/a.webReload
accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6a09cb35-1507-4077-b2b1-721ea7afc541

📥 Commits

Reviewing files that changed from the base of the PR and between a59bd1f and 8741088.

📒 Files selected for processing (9)

cmd/alertmanager/main.go
internal/app/app.go
internal/app/cluster.go
internal/app/lifecycle.go
internal/app/lifecycle_test.go
internal/app/metrics.go
internal/app/options.go
internal/app/url.go
internal/app/url_test.go

🚧 Files skipped from review as they are similar to previous changes (7)

internal/app/url_test.go
internal/app/options.go
internal/app/cluster.go
internal/app/metrics.go
internal/app/url.go
internal/app/lifecycle_test.go
internal/app/app.go

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

internal/app/lifecycle.go (1)

53-55: ⚡ Quick win

Stale doc comments still point at serveLoop.

webReload is now consumed by reloadRouter, not serveLoop. The same staleness applies to the Reload docstring on Line 175 ("Safe to call concurrently with serveLoop"), since serveLoop no longer routes reloads. In code this deadlock-sensitive, accurate "who consumes this channel" comments matter for future maintainers.

📝 Suggested doc fixes

 	// webReload is the channel exposed by httpserver.Register for the
-	// /-/reload HTTP endpoint. We read from it in serveLoop.
+	// /-/reload HTTP endpoint. We read from it in reloadRouter.
 	webReload chan chan error

 // Reload triggers a configuration reload (the programmatic equivalent of
-// SIGHUP). Safe to call concurrently with serveLoop.
+// SIGHUP). Safe to call concurrently with the running App.
 func (a *App) Reload(_ context.Context) error {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/lifecycle.go` around lines 53 - 55, Doc comments are stale:
update the comment for the webReload channel and the Reload docstring (mentions
of "serveLoop") to reflect that reloads are now consumed by reloadRouter, not
serveLoop; locate the declaration webReload and the Reload method/docstring and
change references from serveLoop to reloadRouter and adjust wording about
concurrency to say "Safe to call concurrently with reloadRouter" (or similar) so
the consumer is accurate and deadlock-sensitive guidance is preserved.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/app/lifecycle_test.go`:
- Around line 161-168: The tests TestApp_EmbeddedReloadDoesNotDeadlock and
TestApp_New_SetupFailureDoesNotDeadlock use require.NoError/Equal/Error inside
spawned goroutines (the anonymous go func that closes done), which can call
t.FailNow from a child goroutine; change these to not call require from the
goroutine: either (A) replace require.* with assert.* inside the goroutine
(e.g., assert.NoError/assert.Equal/assert.Error) or (B) capture the goroutine
results by sending error/status values down a channel (use the existing done
channel or a new result channel) and perform require.* assertions on those
results in the main test goroutine after <-done; update the anonymous functions
and their callers (the POST to "/-/reload" and the setup-failure goroutine) to
use one of these patterns so all require.* calls run on the main test goroutine.

---

Nitpick comments:
In `@internal/app/lifecycle.go`:
- Around line 53-55: Doc comments are stale: update the comment for the
webReload channel and the Reload docstring (mentions of "serveLoop") to reflect
that reloads are now consumed by reloadRouter, not serveLoop; locate the
declaration webReload and the Reload method/docstring and change references from
serveLoop to reloadRouter and adjust wording about concurrency to say "Safe to
call concurrently with reloadRouter" (or similar) so the consumer is accurate
and deadlock-sensitive guidance is preserved.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7df0d969-feab-4bf1-8cf6-684bb59bd394

📥 Commits

Reviewing files that changed from the base of the PR and between 8741088 and b273f3f.

📒 Files selected for processing (3)

internal/app/app.go
internal/app/lifecycle.go
internal/app/lifecycle_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

internal/app/app.go

introduce an App lifecycle so tests and embedders can drive Alertmanager without OS signals or os/exec, and discover the bound HTTP address even when listening on ":0". API: New(opts) (*App, error) (*App).Start() error (*App).Addr() string // first listener (*App).Addrs() []string // all listeners (*App).Reload(ctx) error (*App).Stop(ctx) error Run is preserved as a thin wrapper (New + Start + serveLoop + Stop) with a deferred Stop on a fresh 30s context so cleanup also runs on panic, matching the implicit panic-safety of the previous defer- based implementation. Internally, setup uses a cleanup stack (a.onStop) that Stop drains in LIFO order, mirroring Go's defer semantics so the source order of the old `defer X` lines in Run is preserved verbatim and the shutdown ordering does not depend on hand-written reverse-order code. Listeners are bound at New time via a new listenAll helper that calls net.Listen directly (so Addr is meaningful before Start); web.ServeMultiple is then invoked in Start. Systemd socket activation is not supported when embedding and returns an explicit error pointing callers back to cmd/alertmanager. Stop honors its context parameter for the HTTP shutdown step, capped at 5s, so callers passing a tighter deadline get faster teardown and callers passing context.Background get the default. Tests cover: single instance round-trip; two sequential instances in the same process (guards the Phase A metrics-per-Registerer fix against duplicate-registration panics); two concurrent instances on distinct ephemeral ports; and the Run wrapper end-to-end with ctx cancellation. All pass under -race. Signed-off-by: Siavash Safi <siavash@cloudflare.com>

ultrotter · 2026-06-04T17:54:47Z

+		if d := disp.Load(); d != nil {
+			d.Stop()
+		}
+	})


Here in setup() we may also need to do an onStop(...) for the inhibitor or we leak it at Stop()

a.onStop(func() {
if i := inhibitor.Load(); i != nil { i.Stop() }
})

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/app/lifecycle_test.go (1)

205-213: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stabilize the embedded reload regression test startup path.

http.Post is issued immediately after Start(), so this can fail with connection refused before the listener is ready, creating flakes unrelated to the deadlock behavior.

Proposed fix

 func TestApp_EmbeddedReloadDoesNotDeadlock(t *testing.T) {
 	a, err := New(testOptions(t))
 	require.NoError(t, err)
 	require.NoError(t, a.Start())
 	defer func() { _ = a.Stop(t.Context()) }()
+	client := &http.Client{Timeout: 2 * time.Second}
+
+	require.Eventually(t, func() bool {
+		resp, err := client.Get("http://" + a.Addr() + "/-/healthy")
+		if err != nil {
+			return false
+		}
+		_, _ = io.Copy(io.Discard, resp.Body)
+		_ = resp.Body.Close()
+		return resp.StatusCode == http.StatusOK
+	}, 5*time.Second, 50*time.Millisecond)

 	type reloadResult struct {
 		err    error
 		status int
 	}
 	resultCh := make(chan reloadResult, 1)
 	go func() {
-		resp, err := http.Post("http://"+a.Addr()+"/-/reload", "", nil)
+		resp, err := client.Post("http://"+a.Addr()+"/-/reload", "", nil)
 		if err != nil {
 			resultCh <- reloadResult{err: err}
 			return
 		}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/lifecycle_test.go` around lines 205 - 213, The test posts to
"http://"+a.Addr()+"/-/reload" immediately after calling a.Start(), which can
race with listener setup and cause transient connection-refused flakes; replace
the single http.Post call with a short retry/wait loop that probes the listener
(e.g., net.Dial("tcp", a.Addr()) or repeated http.Post attempts) with a small
backoff and overall timeout before sending the request, and only proceed to
io.Copy/resp.Body.Close() and sending resultCh <- reloadResult once a successful
connection/response is obtained; update the goroutine that currently uses
http.Post, a.Addr(), resultCh, and reloadResult to implement this
retry-and-timeout logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/app/lifecycle_test.go`:
- Around line 205-213: The test posts to "http://"+a.Addr()+"/-/reload"
immediately after calling a.Start(), which can race with listener setup and
cause transient connection-refused flakes; replace the single http.Post call
with a short retry/wait loop that probes the listener (e.g., net.Dial("tcp",
a.Addr()) or repeated http.Post attempts) with a small backoff and overall
timeout before sending the request, and only proceed to
io.Copy/resp.Body.Close() and sending resultCh <- reloadResult once a successful
connection/response is obtained; update the goroutine that currently uses
http.Post, a.Addr(), resultCh, and reloadResult to implement this
retry-and-timeout logic.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3c018ba7-289c-4178-bd2b-888bb5ea0e2b

📥 Commits

Reviewing files that changed from the base of the PR and between cdd9230 and 52cb2f5.

📒 Files selected for processing (8)

go.mod
internal/app/app.go
internal/app/lifecycle.go
internal/app/lifecycle_test.go
internal/app/listen.go
internal/app/listen_test.go
internal/app/options.go
internal/app/options_test.go

🚧 Files skipped from review as they are similar to previous changes (2)

internal/app/options.go
internal/app/app.go

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/app/lifecycle.go (1)
116-120: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix nil-channel panic when Stop races Start initialization

The below code does not show removed or replaced code.
		a.started.Store(true)
		a.routerQuit = make(chan struct{})
		a.routerDone = make(chan struct{})

		// reloadRouter consumes /-/reload requests and opts.Reload sends so
On internal/app/lifecycle.go, Start publishes a.started.Store(true) before a.routerQuit / a.routerDone are allocated (lines 116-118). If Stop runs concurrently, the if a.started.Load() path (198-221) can execute close(a.routerQuit) and <-a.routerDone while those channels are still nil, panicking.

Serialize the lifecycle transition (e.g., only set started after routerQuit/routerDone are allocated and the related goroutines are ready, or guard the Stop close/wait with a shared mutex/state).

Also applies to: 198-221

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ae85aabc-9577-4342-a55a-0f7062e46ee4

📥 Commits

Reviewing files that changed from the base of the PR and between cdd9230 and 52cb2f5.

📒 Files selected for processing (8)

go.mod
internal/app/app.go
internal/app/lifecycle.go
internal/app/lifecycle_test.go
internal/app/listen.go
internal/app/listen_test.go
internal/app/options.go
internal/app/options_test.go

🚧 Files skipped from review as they are similar to previous changes (7)

internal/app/options_test.go
internal/app/listen.go
internal/app/options.go
internal/app/lifecycle_test.go
internal/app/app.go
go.mod
internal/app/listen_test.go

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

Extract the configuration-scoped subgraph (routes, receivers, pipeline, inhibitor, dispatcher) out of the monolithic setup() into a dedicated reloader type. reloader.reload performs the subtle stop-old/build-new/ wait-for-loading/atomic-swap sequence and reloader.stop tears down the live inhibitor+dispatcher, so the swap ordering now lives in one cohesive, independently testable place. The long-lived singletons stay in setup and continue to be updated in place (apih.Update, eventRec/tracing ApplyConfig) rather than rebuilt. Make shutdown observable and harder to misuse: - onStop now registers named, error-returning teardown steps; Stop runs them all LIFO, logs each failure by name, and returns the errors joined with the HTTP-shutdown error instead of swallowing them. - Add trackClose to register a resource's Close right at its acquisition, closing the "acquire here, register cleanup elsewhere" gap that previously leaked goroutines/handles. Add tests and raise internal/app coverage from ~77% to ~84%: - reloader_test: component swap on reload, error path leaves prior state intact, nil-safe stop. - cluster_test / TestApp_ClusteredStartStop: exercise clusterWait and the peer-dependent branches in setup and reload. - TestApp_serveLoop: listener-error, clean-close, and ctx-cancel exits. - TestApp_Stop_AggregatesCleanupErrors and TestApp_Start_BeforeNewFails. No behavioral change to teardown order or HTTP-drain sequencing. Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

internal/app/app.go (1)

122-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Register gossip peer teardown immediately after cluster.Create.

The cleanup is registered too late. If Line 154, Line 190, Line 230, or Line 324 returns an error after the peer is created at Line 124, Stop has no peer cleanup to run, so the startup-failure path leaks the gossip peer's sockets/goroutines. internal/app/cluster_test.go Lines 32-51 already show that peer.Leave(...) is safe on a freshly created peer, so this teardown can be installed right after cluster.Create succeeds and before the later setup steps run.

💡 Suggested fix

-	var peer *cluster.Peer
+	var (
+		peer         *cluster.Peer
+		settleCancel = func() {}
+	)
 	if opts.ClusterBindAddr != "" {
 		peer, err = cluster.Create(
 			logger.With("component", "cluster"),
 			reg,
@@
 		if err != nil {
 			return fmt.Errorf("unable to initialize gossip mesh: %w", err)
 		}
+		a.onStop("cluster peer leave", func() error {
+			settleCancel()
+			if err := peer.Leave(10 * time.Second); err != nil {
+				return fmt.Errorf("unable to leave gossip mesh: %w", err)
+			}
+			return nil
+		})
 		m.clusterEnabled.Set(1)
 	}
@@
-		settleCtx, settleCancel := context.WithTimeout(context.Background(), opts.SettleTimeout)
-		a.onStop("cluster peer leave", func() error {
-			settleCancel()
-			if err := peer.Leave(10 * time.Second); err != nil {
-				return fmt.Errorf("unable to leave gossip mesh: %w", err)
-			}
-			return nil
-		})
+		settleCtx, cancel := context.WithTimeout(context.Background(), opts.SettleTimeout)
+		settleCancel = cancel
 		go peer.Settle(settleCtx, opts.GossipInterval*10)
 		eventRec.SetClusterPeer(peer)
 	}

Also applies to: 247-258

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/app.go` around lines 122 - 145, After cluster.Create returns a
non-nil peer, immediately register the peer teardown with the Stop/cleanup
handler (e.g. Stop.Register(func(ctx context.Context){ _ = peer.Leave(ctx) }))
before performing any subsequent setup or calling m.clusterEnabled.Set(1); do
this right after the successful cluster.Create call in the first block that
assigns peer and in the second similar block (the 247-258 region) so that any
later early returns will still invoke peer.Leave and avoid leaking
sockets/goroutines.

internal/app/lifecycle.go (1)

166-175: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle closed Options.Reload to avoid reload hot-loop in reloadRouter.

If opts.Reload is ever closed, case <-a.opts.Reload is permanently ready, so reloadRouter will continuously call a.coordinator.Reload() until shutdown (the binary never closes this channel, but embedders could). Gate the case on ok and disable it after the first closed read.

Suggested fix

 func (a *App) reloadRouter() {
 	defer close(a.routerDone)
+	reloadCh := a.opts.Reload
 	for {
 		select {
 		case <-a.routerQuit:
 			return
-		case <-a.opts.Reload:
+		case _, ok := <-reloadCh:
+			if !ok {
+				reloadCh = nil
+				continue
+			}
 			if err := a.coordinator.Reload(); err != nil {
 				a.logger.Error("configuration reload failed", "err", err)
 			}
 		case errc := <-a.webReload:
 			errc <- a.coordinator.Reload()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/lifecycle.go` around lines 166 - 175, The select branch reading
from a.opts.Reload should detect a closed channel and disable further reloads to
avoid a hot-loop; change the case to receive with the comma-ok pattern from
a.opts.Reload (e.g., v, ok := <-a.opts.Reload) and if ok call
a.coordinator.Reload(), otherwise set a.opts.Reload = nil (or otherwise disable
that case) so the select no longer treats it as permanently ready; update the
reloadRouter/select logic accordingly to reference a.opts.Reload and
a.coordinator.Reload.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/app/reloader.go`:
- Around line 90-93: The reload() implementation currently mutates long-lived
components (e.g., eventRec.ApplyConfig and tracingMgr.ApplyConfig) before the
new runtime graph is fully built, which can leave a partially applied config on
failure; change the flow to build the entire new runtime graph first (create new
dispatcher/inhibitor/tracing manager and prepare new event recorder config)
without calling ApplyConfig on existing shared instances, then atomically
swap/publish the new instances and finally call ApplyConfig on the new
components (or call ApplyConfig during construction of new instances);
alternatively, if you must apply to existing instances, implement a rollback
path that restores prior state on any subsequent error. Target symbols:
reload(), eventRec.ApplyConfig, tracingMgr.ApplyConfig, the dispatcher/inhibitor
publish/stop logic, and any code paths at the other mentioned locations (around
lines where ApplyConfig is called) to ensure no in-place mutation happens before
a successful full build.

---

Outside diff comments:
In `@internal/app/app.go`:
- Around line 122-145: After cluster.Create returns a non-nil peer, immediately
register the peer teardown with the Stop/cleanup handler (e.g.
Stop.Register(func(ctx context.Context){ _ = peer.Leave(ctx) })) before
performing any subsequent setup or calling m.clusterEnabled.Set(1); do this
right after the successful cluster.Create call in the first block that assigns
peer and in the second similar block (the 247-258 region) so that any later
early returns will still invoke peer.Leave and avoid leaking sockets/goroutines.

In `@internal/app/lifecycle.go`:
- Around line 166-175: The select branch reading from a.opts.Reload should
detect a closed channel and disable further reloads to avoid a hot-loop; change
the case to receive with the comma-ok pattern from a.opts.Reload (e.g., v, ok :=
<-a.opts.Reload) and if ok call a.coordinator.Reload(), otherwise set
a.opts.Reload = nil (or otherwise disable that case) so the select no longer
treats it as permanently ready; update the reloadRouter/select logic accordingly
to reference a.opts.Reload and a.coordinator.Reload.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 78d02d37-9dc1-4753-aa05-9cc4b909a5e3

📥 Commits

Reviewing files that changed from the base of the PR and between 52cb2f5 and 0103e36.

📒 Files selected for processing (6)

internal/app/app.go
internal/app/cluster_test.go
internal/app/lifecycle.go
internal/app/lifecycle_test.go
internal/app/reloader.go
internal/app/reloader_test.go

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

internal/app/reloader.go (1)

159-160: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Publish the new inhibitor only after it has finished loading.

internal/app/app.go wires the API mute callback to r.inhibitor.Load(), so storing newInhibitor before Run/WaitForLoading exposes an incomplete inhibition cache to concurrent API requests during reload. The dispatcher already avoids this by publishing only after WaitForLoading; the inhibitor should follow the same pattern.

Suggested fix

 	newInhibitor := inhibit.NewInhibitor(r.alerts, conf.InhibitRules, r.logger, r.eventRec)
-	r.inhibitor.Store(newInhibitor)

 	// An interface value that holds a nil concrete value is non-nil.
 	// Therefore we explicitly pass an empty interface, to detect if the
 	// cluster is not enabled in notify.
@@
 	// First, start the inhibitor so the inhibition cache can populate.
 	// Wait for this to load alerts before starting the dispatcher so
 	// we don't accidentally notify for an alert that will be inhibited.
 	go newInhibitor.Run()
 	newInhibitor.WaitForLoading()
+	r.inhibitor.Store(newInhibitor)

 	// Next, start the dispatcher and wait for it to load before swapping
 	// the disp pointer. This ensures that the API doesn't see the new
 	// dispatcher before it finishes populating the aggrGroups.

Also applies to: 185-188, 221-225

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/app/reloader.go` around lines 159 - 160, The new inhibitor instance
created via inhibit.NewInhibitor(...) should finish loading before publishing it
to r.inhibitor; call the inhibitor's Run() and/or WaitForLoading() (the same
sequence used by the dispatcher) and only after those return store it with
r.inhibitor.Store(newInhibitor) so the API callback that calls
r.inhibitor.Load() won't see an incomplete cache. Apply the same change to the
other places where you create and immediately store a new inhibitor (the other
newInhibitor creation sites) so each new instance is
Run/WaitForLoading-completed before being published.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/app/reloader.go`:
- Around line 159-160: The new inhibitor instance created via
inhibit.NewInhibitor(...) should finish loading before publishing it to
r.inhibitor; call the inhibitor's Run() and/or WaitForLoading() (the same
sequence used by the dispatcher) and only after those return store it with
r.inhibitor.Store(newInhibitor) so the API callback that calls
r.inhibitor.Load() won't see an incomplete cache. Apply the same change to the
other places where you create and immediately store a new inhibitor (the other
newInhibitor creation sites) so each new instance is
Run/WaitForLoading-completed before being published.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cee9d348-7225-4edf-a338-8c048a2bbc41

📥 Commits

Reviewing files that changed from the base of the PR and between ad9f3d2 and fd8d73e.

📒 Files selected for processing (4)

internal/app/app.go
internal/app/lifecycle.go
internal/app/lifecycle_test.go
internal/app/reloader.go

🚧 Files skipped from review as they are similar to previous changes (3)

internal/app/lifecycle_test.go
internal/app/lifecycle.go
internal/app/app.go

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

siavashs · 2026-06-05T11:46:04Z

@coderabbitai review

coderabbitai · 2026-06-05T11:46:10Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

siavashs added kind/cleanup go Pull requests that update Go code labels May 29, 2026

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread cmd/alertmanager/main.go Outdated

Comment thread internal/app/lifecycle.go

siavashs force-pushed the feat/app/pkg branch from a59bd1f to 8741088 Compare May 29, 2026 10:49

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread internal/app/lifecycle.go

siavashs force-pushed the feat/app/pkg branch from 8741088 to b273f3f Compare May 29, 2026 11:11

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread internal/app/lifecycle_test.go

siavashs force-pushed the feat/app/pkg branch from b273f3f to cdd9230 Compare May 29, 2026 11:30

siavashs marked this pull request as ready for review May 29, 2026 11:30

siavashs requested a review from a team as a code owner May 29, 2026 11:30

siavashs requested review from Spaceman1701 and ultrotter May 29, 2026 13:20

ultrotter reviewed Jun 4, 2026

View reviewed changes

address commend and add more tests

52cb2f5

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai Bot reviewed Jun 4, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

siavashs added 2 commits June 5, 2026 11:58

address rabbit comments

a05e47c

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread internal/app/reloader.go Outdated

siavashs added 4 commits June 5, 2026 12:48

fix lint issues

ad9f3d2

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

fix potential peer leak on setup failure

303ebc9

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

fix partial config apply on failed reload

ec08a16

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

fix opts.Reload hot-loop for embedders

fd8d73e

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

publish the new inhibitor only after it has finished loading

768678b

Signed-off-by: Siavash Safi <siavash@cloudflare.com>

Conversation

siavashs commented May 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What this changes

A reusable internal/app package

A lifecycle API for embedders

Per-instance state, so multiple instances can coexist

Deterministic, observable shutdown

Isolated config-reload logic

Behavioural notes

Known follow-ups (out of scope)

Verification

Diff size

Pull Request Checklist

Which user-facing changes does this PR introduce?

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ultrotter Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

siavashs commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

siavashs commented May 29, 2026 •

edited by coderabbitai Bot

Loading

A reusable `internal/app` package

coderabbitai Bot commented May 29, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading