chore: extract main into an embeddable internal/app package#5259
chore: extract main into an embeddable internal/app package#5259siavashs wants to merge 10 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughExtracts Alertmanager startup into internal/app: adds app.Options and App lifecycle (New/Start/Stop/Reload), listener/URL/metrics helpers, a config reloader that atomically swaps dispatcher/inhibitor, and delegates cmd/alertmanager to app.Run after flag parsing and signal setup. ChangesApp Package Refactoring and Extraction
Sequence Diagram(s)sequenceDiagram
participant CMD as cmd/alertmanager
participant AppRun as app.Run
participant Setup as App.setup
participant Reloader as reloader.reload
participant HTTP as http.Server
participant Cluster as cluster.Peer
CMD->>AppRun: Run(ctx, opts)
AppRun->>Setup: New(opts) / setup()
Setup->>Cluster: optionally create/join gossip peer
Setup->>Reloader: build reloader and subscribe reloads
Setup->>HTTP: bind listeners and create server
Reloader->>Reloader: parse templates, build integrations
Reloader->>Reloader: start inhibitor -> start dispatcher -> atomic swap
AppRun->>HTTP: Start ServeMultiple (serve goroutine)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cmd/alertmanager/main.go`:
- Line 205: The shutdown log is hardcoded to "Received SIGTERM…" which
misreports SIGINT/Ctrl+C; change the logger.Info call in main.go (the location
that currently logs "Received SIGTERM, exiting gracefully...") to log the actual
signal received (use the signal variable from the signal.Notify/select or, if
you cancel via ctx, log ctx.Err() or a generic "shutting down" message) so the
message reflects the real cause; update the handler that calls app.Run and the
signal.Notify/select branch to pass the received os.Signal (or its String())
into logger.Info instead of the fixed "SIGTERM" text.
In `@internal/app/lifecycle.go`:
- Around line 142-163: The Stop method can block forever on the "for range
a.srvc" if Start's serve goroutine never closes a.srvc; change Stop to perform a
non-blocking drain of a.srvc instead of a blocking range so Stop returns safely
even if Start wasn't run. Specifically, update App.Stop to replace the for range
over a.srvc with a loop that repeatedly attempts a non-blocking receive from
a.srvc (e.g., select with a receive case and a default case) until the channel
is drained/closed or there is nothing to read; reference symbols: App.Stop,
a.srvc, Start, New.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cfbc0c18-07d3-4da0-85e9-8fea36348cfa
📒 Files selected for processing (9)
cmd/alertmanager/main.gointernal/app/app.gointernal/app/cluster.gointernal/app/lifecycle.gointernal/app/lifecycle_test.gointernal/app/metrics.gointernal/app/options.gointernal/app/url.gointernal/app/url_test.go
Move the body of run() from cmd/alertmanager/main.go into a new internal/app package so Alertmanager can be embedded in tests and other binaries without shelling out to a compiled binary. Resolves the long-standing TODO from prometheus#406. cmd/alertmanager/main.go shrinks from 724 to 196 lines and is now responsible only for kingpin flag parsing, logger construction, versioncollector registration, feature-flag / GOMEMLIMIT side effects, and translating OS signals into context cancellation (SIGINT/SIGTERM) plus reload events (SIGHUP) consumed by app.Run. The new internal/app package is split into: * options.go - Options struct, validate(), DefaultClusterAddr * app.go - Run(ctx, opts) error * metrics.go - per-instance Prometheus metrics struct * cluster.go - clusterWait helper * url.go - extURL helper (+ url_test.go for TestExternalURL) The six previously package-level promauto.NewXxx variables in cmd/alertmanager/main.go are now constructed per Run() invocation against opts.Registerer. Combined with threading the registerer through every collaborator (versioncollector excepted, which stays in main.go as a process-global), this unblocks running multiple Alertmanager instances in the same process without duplicate- registration panics. Behavioural notes: * prometheus.DefaultRegisterer is no longer referenced inside app.Run; the binary still passes it in via Options.Registerer so on-disk behaviour is identical. * app.Run defers srv.Shutdown(5s) on exit. Previously the deferred srv.Close lived inside the listen goroutine and never ran in practice because os.Exit killed the process first. Behaviour for the binary is unchanged; embedded callers now get clean HTTP teardown. * --cluster.listen-address default moved from a const in cmd/alertmanager to the exported app.DefaultClusterAddr. Known follow-ups intentionally out of scope: * matcher/compat.InitFromFlags still mutates package-level state; multi-instance tests with different feature flags will collide. * Richer App lifecycle (New/Start/Addr/Reload/Stop) for tests that need :0-port discovery or programmatic reload. * Migrating the v2 acceptance harness to use app.Run directly instead of building and spawning the binary. Verification: `go build ./...`, `go vet ./...`, and `go test -count=1 ./...` all pass, including the existing test/with_api_v2/acceptance suite which continues to build and spawn the binary end-to-end. Closes prometheus#406 Signed-off-by: Siavash Safi <siavash@cloudflare.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/app/lifecycle.go`:
- Around line 96-113: The Start method can deadlock because the registered
/-/reload handler blocks on sending to the unbuffered a.webReload channel (and
errors from web.ServeMultiple are sent to a.srvc) while the only consumer
(serveLoop, invoked by Run) may not be running for embedders; fix by ensuring
Start spawns the reload-and-error drain loop so a.webReload and a.opts.Reload
are drained even when Run/serveLoop is not used, or alternatively make the
reload handler do a non-blocking send/fail-fast: add a goroutine in Start that
runs the same logic as serveLoop (draining a.webReload, a.opts.Reload and
forwarding errors to a.reload/ reload handler) and ensure web.ServeMultiple
errors sent to a.srvc are observed (do not close a.srvc before draining), or
change the handler to select { case a.webReload <- errc: default: respond with
an immediate error } so the handler never blocks when the drain loop is absent;
update Start, the /-/reload handler, and any use of a.srvc/a.webReload
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6a09cb35-1507-4077-b2b1-721ea7afc541
📒 Files selected for processing (9)
cmd/alertmanager/main.gointernal/app/app.gointernal/app/cluster.gointernal/app/lifecycle.gointernal/app/lifecycle_test.gointernal/app/metrics.gointernal/app/options.gointernal/app/url.gointernal/app/url_test.go
🚧 Files skipped from review as they are similar to previous changes (7)
- internal/app/url_test.go
- internal/app/options.go
- internal/app/cluster.go
- internal/app/metrics.go
- internal/app/url.go
- internal/app/lifecycle_test.go
- internal/app/app.go
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/app/lifecycle.go (1)
53-55: ⚡ Quick winStale doc comments still point at
serveLoop.
webReloadis now consumed byreloadRouter, notserveLoop. The same staleness applies to theReloaddocstring on Line 175 ("Safe to call concurrently with serveLoop"), sinceserveLoopno longer routes reloads. In code this deadlock-sensitive, accurate "who consumes this channel" comments matter for future maintainers.📝 Suggested doc fixes
// webReload is the channel exposed by httpserver.Register for the - // /-/reload HTTP endpoint. We read from it in serveLoop. + // /-/reload HTTP endpoint. We read from it in reloadRouter. webReload chan chan error// Reload triggers a configuration reload (the programmatic equivalent of -// SIGHUP). Safe to call concurrently with serveLoop. +// SIGHUP). Safe to call concurrently with the running App. func (a *App) Reload(_ context.Context) error {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/app/lifecycle.go` around lines 53 - 55, Doc comments are stale: update the comment for the webReload channel and the Reload docstring (mentions of "serveLoop") to reflect that reloads are now consumed by reloadRouter, not serveLoop; locate the declaration webReload and the Reload method/docstring and change references from serveLoop to reloadRouter and adjust wording about concurrency to say "Safe to call concurrently with reloadRouter" (or similar) so the consumer is accurate and deadlock-sensitive guidance is preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/app/lifecycle_test.go`:
- Around line 161-168: The tests TestApp_EmbeddedReloadDoesNotDeadlock and
TestApp_New_SetupFailureDoesNotDeadlock use require.NoError/Equal/Error inside
spawned goroutines (the anonymous go func that closes done), which can call
t.FailNow from a child goroutine; change these to not call require from the
goroutine: either (A) replace require.* with assert.* inside the goroutine
(e.g., assert.NoError/assert.Equal/assert.Error) or (B) capture the goroutine
results by sending error/status values down a channel (use the existing done
channel or a new result channel) and perform require.* assertions on those
results in the main test goroutine after <-done; update the anonymous functions
and their callers (the POST to "/-/reload" and the setup-failure goroutine) to
use one of these patterns so all require.* calls run on the main test goroutine.
---
Nitpick comments:
In `@internal/app/lifecycle.go`:
- Around line 53-55: Doc comments are stale: update the comment for the
webReload channel and the Reload docstring (mentions of "serveLoop") to reflect
that reloads are now consumed by reloadRouter, not serveLoop; locate the
declaration webReload and the Reload method/docstring and change references from
serveLoop to reloadRouter and adjust wording about concurrency to say "Safe to
call concurrently with reloadRouter" (or similar) so the consumer is accurate
and deadlock-sensitive guidance is preserved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7df0d969-feab-4bf1-8cf6-684bb59bd394
📒 Files selected for processing (3)
internal/app/app.gointernal/app/lifecycle.gointernal/app/lifecycle_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/app/app.go
introduce an App lifecycle so tests and embedders can drive Alertmanager without OS signals or os/exec, and discover the bound HTTP address even when listening on ":0". API: New(opts) (*App, error) (*App).Start() error (*App).Addr() string // first listener (*App).Addrs() []string // all listeners (*App).Reload(ctx) error (*App).Stop(ctx) error Run is preserved as a thin wrapper (New + Start + serveLoop + Stop) with a deferred Stop on a fresh 30s context so cleanup also runs on panic, matching the implicit panic-safety of the previous defer- based implementation. Internally, setup uses a cleanup stack (a.onStop) that Stop drains in LIFO order, mirroring Go's defer semantics so the source order of the old `defer X` lines in Run is preserved verbatim and the shutdown ordering does not depend on hand-written reverse-order code. Listeners are bound at New time via a new listenAll helper that calls net.Listen directly (so Addr is meaningful before Start); web.ServeMultiple is then invoked in Start. Systemd socket activation is not supported when embedding and returns an explicit error pointing callers back to cmd/alertmanager. Stop honors its context parameter for the HTTP shutdown step, capped at 5s, so callers passing a tighter deadline get faster teardown and callers passing context.Background get the default. Tests cover: single instance round-trip; two sequential instances in the same process (guards the Phase A metrics-per-Registerer fix against duplicate-registration panics); two concurrent instances on distinct ephemeral ports; and the Run wrapper end-to-end with ctx cancellation. All pass under -race. Signed-off-by: Siavash Safi <siavash@cloudflare.com>
| if d := disp.Load(); d != nil { | ||
| d.Stop() | ||
| } | ||
| }) |
There was a problem hiding this comment.
Here in setup() we may also need to do an onStop(...) for the inhibitor or we leak it at Stop()
a.onStop(func() {
if i := inhibitor.Load(); i != nil { i.Stop() }
})
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/app/lifecycle_test.go (1)
205-213:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winStabilize the embedded reload regression test startup path.
http.Postis issued immediately afterStart(), so this can fail withconnection refusedbefore the listener is ready, creating flakes unrelated to the deadlock behavior.Proposed fix
func TestApp_EmbeddedReloadDoesNotDeadlock(t *testing.T) { a, err := New(testOptions(t)) require.NoError(t, err) require.NoError(t, a.Start()) defer func() { _ = a.Stop(t.Context()) }() + client := &http.Client{Timeout: 2 * time.Second} + + require.Eventually(t, func() bool { + resp, err := client.Get("http://" + a.Addr() + "/-/healthy") + if err != nil { + return false + } + _, _ = io.Copy(io.Discard, resp.Body) + _ = resp.Body.Close() + return resp.StatusCode == http.StatusOK + }, 5*time.Second, 50*time.Millisecond) type reloadResult struct { err error status int } resultCh := make(chan reloadResult, 1) go func() { - resp, err := http.Post("http://"+a.Addr()+"/-/reload", "", nil) + resp, err := client.Post("http://"+a.Addr()+"/-/reload", "", nil) if err != nil { resultCh <- reloadResult{err: err} return }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/app/lifecycle_test.go` around lines 205 - 213, The test posts to "http://"+a.Addr()+"/-/reload" immediately after calling a.Start(), which can race with listener setup and cause transient connection-refused flakes; replace the single http.Post call with a short retry/wait loop that probes the listener (e.g., net.Dial("tcp", a.Addr()) or repeated http.Post attempts) with a small backoff and overall timeout before sending the request, and only proceed to io.Copy/resp.Body.Close() and sending resultCh <- reloadResult once a successful connection/response is obtained; update the goroutine that currently uses http.Post, a.Addr(), resultCh, and reloadResult to implement this retry-and-timeout logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@internal/app/lifecycle_test.go`:
- Around line 205-213: The test posts to "http://"+a.Addr()+"/-/reload"
immediately after calling a.Start(), which can race with listener setup and
cause transient connection-refused flakes; replace the single http.Post call
with a short retry/wait loop that probes the listener (e.g., net.Dial("tcp",
a.Addr()) or repeated http.Post attempts) with a small backoff and overall
timeout before sending the request, and only proceed to
io.Copy/resp.Body.Close() and sending resultCh <- reloadResult once a successful
connection/response is obtained; update the goroutine that currently uses
http.Post, a.Addr(), resultCh, and reloadResult to implement this
retry-and-timeout logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3c018ba7-289c-4178-bd2b-888bb5ea0e2b
📒 Files selected for processing (8)
go.modinternal/app/app.gointernal/app/lifecycle.gointernal/app/lifecycle_test.gointernal/app/listen.gointernal/app/listen_test.gointernal/app/options.gointernal/app/options_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/app/options.go
- internal/app/app.go
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/app/lifecycle.go (1)
116-120:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFix nil-channel panic when
StopracesStartinitializationThe below code does not show removed or replaced code.
a.started.Store(true) a.routerQuit = make(chan struct{}) a.routerDone = make(chan struct{}) // reloadRouter consumes /-/reload requests and opts.Reload sends soOn
internal/app/lifecycle.go,Startpublishesa.started.Store(true)beforea.routerQuit/a.routerDoneare allocated (lines 116-118). IfStopruns concurrently, theif a.started.Load()path (198-221) can executeclose(a.routerQuit)and<-a.routerDonewhile those channels are still nil, panicking.Serialize the lifecycle transition (e.g., only set
startedafterrouterQuit/routerDoneare allocated and the related goroutines are ready, or guard theStopclose/wait with a shared mutex/state).Also applies to: 198-221
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ae85aabc-9577-4342-a55a-0f7062e46ee4
📒 Files selected for processing (8)
go.modinternal/app/app.gointernal/app/lifecycle.gointernal/app/lifecycle_test.gointernal/app/listen.gointernal/app/listen_test.gointernal/app/options.gointernal/app/options_test.go
🚧 Files skipped from review as they are similar to previous changes (7)
- internal/app/options_test.go
- internal/app/listen.go
- internal/app/options.go
- internal/app/lifecycle_test.go
- internal/app/app.go
- go.mod
- internal/app/listen_test.go
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Extract the configuration-scoped subgraph (routes, receivers, pipeline, inhibitor, dispatcher) out of the monolithic setup() into a dedicated reloader type. reloader.reload performs the subtle stop-old/build-new/ wait-for-loading/atomic-swap sequence and reloader.stop tears down the live inhibitor+dispatcher, so the swap ordering now lives in one cohesive, independently testable place. The long-lived singletons stay in setup and continue to be updated in place (apih.Update, eventRec/tracing ApplyConfig) rather than rebuilt. Make shutdown observable and harder to misuse: - onStop now registers named, error-returning teardown steps; Stop runs them all LIFO, logs each failure by name, and returns the errors joined with the HTTP-shutdown error instead of swallowing them. - Add trackClose to register a resource's Close right at its acquisition, closing the "acquire here, register cleanup elsewhere" gap that previously leaked goroutines/handles. Add tests and raise internal/app coverage from ~77% to ~84%: - reloader_test: component swap on reload, error path leaves prior state intact, nil-safe stop. - cluster_test / TestApp_ClusteredStartStop: exercise clusterWait and the peer-dependent branches in setup and reload. - TestApp_serveLoop: listener-error, clean-close, and ctx-cancel exits. - TestApp_Stop_AggregatesCleanupErrors and TestApp_Start_BeforeNewFails. No behavioral change to teardown order or HTTP-drain sequencing. Signed-off-by: Siavash Safi <siavash@cloudflare.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/app/app.go (1)
122-145:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRegister gossip peer teardown immediately after
cluster.Create.The cleanup is registered too late. If Line 154, Line 190, Line 230, or Line 324 returns an error after the peer is created at Line 124,
Stophas no peer cleanup to run, so the startup-failure path leaks the gossip peer's sockets/goroutines.internal/app/cluster_test.goLines 32-51 already show thatpeer.Leave(...)is safe on a freshly created peer, so this teardown can be installed right aftercluster.Createsucceeds and before the later setup steps run.💡 Suggested fix
- var peer *cluster.Peer + var ( + peer *cluster.Peer + settleCancel = func() {} + ) if opts.ClusterBindAddr != "" { peer, err = cluster.Create( logger.With("component", "cluster"), reg, @@ if err != nil { return fmt.Errorf("unable to initialize gossip mesh: %w", err) } + a.onStop("cluster peer leave", func() error { + settleCancel() + if err := peer.Leave(10 * time.Second); err != nil { + return fmt.Errorf("unable to leave gossip mesh: %w", err) + } + return nil + }) m.clusterEnabled.Set(1) } @@ - settleCtx, settleCancel := context.WithTimeout(context.Background(), opts.SettleTimeout) - a.onStop("cluster peer leave", func() error { - settleCancel() - if err := peer.Leave(10 * time.Second); err != nil { - return fmt.Errorf("unable to leave gossip mesh: %w", err) - } - return nil - }) + settleCtx, cancel := context.WithTimeout(context.Background(), opts.SettleTimeout) + settleCancel = cancel go peer.Settle(settleCtx, opts.GossipInterval*10) eventRec.SetClusterPeer(peer) }Also applies to: 247-258
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/app/app.go` around lines 122 - 145, After cluster.Create returns a non-nil peer, immediately register the peer teardown with the Stop/cleanup handler (e.g. Stop.Register(func(ctx context.Context){ _ = peer.Leave(ctx) })) before performing any subsequent setup or calling m.clusterEnabled.Set(1); do this right after the successful cluster.Create call in the first block that assigns peer and in the second similar block (the 247-258 region) so that any later early returns will still invoke peer.Leave and avoid leaking sockets/goroutines.internal/app/lifecycle.go (1)
166-175:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winHandle closed
Options.Reloadto avoid reload hot-loop inreloadRouter.If
opts.Reloadis ever closed,case <-a.opts.Reloadis permanently ready, soreloadRouterwill continuously calla.coordinator.Reload()until shutdown (the binary never closes this channel, but embedders could). Gate the case onokand disable it after the first closed read.Suggested fix
func (a *App) reloadRouter() { defer close(a.routerDone) + reloadCh := a.opts.Reload for { select { case <-a.routerQuit: return - case <-a.opts.Reload: + case _, ok := <-reloadCh: + if !ok { + reloadCh = nil + continue + } if err := a.coordinator.Reload(); err != nil { a.logger.Error("configuration reload failed", "err", err) } case errc := <-a.webReload: errc <- a.coordinator.Reload()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/app/lifecycle.go` around lines 166 - 175, The select branch reading from a.opts.Reload should detect a closed channel and disable further reloads to avoid a hot-loop; change the case to receive with the comma-ok pattern from a.opts.Reload (e.g., v, ok := <-a.opts.Reload) and if ok call a.coordinator.Reload(), otherwise set a.opts.Reload = nil (or otherwise disable that case) so the select no longer treats it as permanently ready; update the reloadRouter/select logic accordingly to reference a.opts.Reload and a.coordinator.Reload.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/app/reloader.go`:
- Around line 90-93: The reload() implementation currently mutates long-lived
components (e.g., eventRec.ApplyConfig and tracingMgr.ApplyConfig) before the
new runtime graph is fully built, which can leave a partially applied config on
failure; change the flow to build the entire new runtime graph first (create new
dispatcher/inhibitor/tracing manager and prepare new event recorder config)
without calling ApplyConfig on existing shared instances, then atomically
swap/publish the new instances and finally call ApplyConfig on the new
components (or call ApplyConfig during construction of new instances);
alternatively, if you must apply to existing instances, implement a rollback
path that restores prior state on any subsequent error. Target symbols:
reload(), eventRec.ApplyConfig, tracingMgr.ApplyConfig, the dispatcher/inhibitor
publish/stop logic, and any code paths at the other mentioned locations (around
lines where ApplyConfig is called) to ensure no in-place mutation happens before
a successful full build.
---
Outside diff comments:
In `@internal/app/app.go`:
- Around line 122-145: After cluster.Create returns a non-nil peer, immediately
register the peer teardown with the Stop/cleanup handler (e.g.
Stop.Register(func(ctx context.Context){ _ = peer.Leave(ctx) })) before
performing any subsequent setup or calling m.clusterEnabled.Set(1); do this
right after the successful cluster.Create call in the first block that assigns
peer and in the second similar block (the 247-258 region) so that any later
early returns will still invoke peer.Leave and avoid leaking sockets/goroutines.
In `@internal/app/lifecycle.go`:
- Around line 166-175: The select branch reading from a.opts.Reload should
detect a closed channel and disable further reloads to avoid a hot-loop; change
the case to receive with the comma-ok pattern from a.opts.Reload (e.g., v, ok :=
<-a.opts.Reload) and if ok call a.coordinator.Reload(), otherwise set
a.opts.Reload = nil (or otherwise disable that case) so the select no longer
treats it as permanently ready; update the reloadRouter/select logic accordingly
to reference a.opts.Reload and a.coordinator.Reload.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 78d02d37-9dc1-4753-aa05-9cc4b909a5e3
📒 Files selected for processing (6)
internal/app/app.gointernal/app/cluster_test.gointernal/app/lifecycle.gointernal/app/lifecycle_test.gointernal/app/reloader.gointernal/app/reloader_test.go
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/app/reloader.go (1)
159-160:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPublish the new inhibitor only after it has finished loading.
internal/app/app.gowires the API mute callback tor.inhibitor.Load(), so storingnewInhibitorbeforeRun/WaitForLoadingexposes an incomplete inhibition cache to concurrent API requests during reload. The dispatcher already avoids this by publishing only afterWaitForLoading; the inhibitor should follow the same pattern.Suggested fix
newInhibitor := inhibit.NewInhibitor(r.alerts, conf.InhibitRules, r.logger, r.eventRec) - r.inhibitor.Store(newInhibitor) // An interface value that holds a nil concrete value is non-nil. // Therefore we explicitly pass an empty interface, to detect if the // cluster is not enabled in notify. @@ // First, start the inhibitor so the inhibition cache can populate. // Wait for this to load alerts before starting the dispatcher so // we don't accidentally notify for an alert that will be inhibited. go newInhibitor.Run() newInhibitor.WaitForLoading() + r.inhibitor.Store(newInhibitor) // Next, start the dispatcher and wait for it to load before swapping // the disp pointer. This ensures that the API doesn't see the new // dispatcher before it finishes populating the aggrGroups.Also applies to: 185-188, 221-225
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/app/reloader.go` around lines 159 - 160, The new inhibitor instance created via inhibit.NewInhibitor(...) should finish loading before publishing it to r.inhibitor; call the inhibitor's Run() and/or WaitForLoading() (the same sequence used by the dispatcher) and only after those return store it with r.inhibitor.Store(newInhibitor) so the API callback that calls r.inhibitor.Load() won't see an incomplete cache. Apply the same change to the other places where you create and immediately store a new inhibitor (the other newInhibitor creation sites) so each new instance is Run/WaitForLoading-completed before being published.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@internal/app/reloader.go`:
- Around line 159-160: The new inhibitor instance created via
inhibit.NewInhibitor(...) should finish loading before publishing it to
r.inhibitor; call the inhibitor's Run() and/or WaitForLoading() (the same
sequence used by the dispatcher) and only after those return store it with
r.inhibitor.Store(newInhibitor) so the API callback that calls
r.inhibitor.Load() won't see an incomplete cache. Apply the same change to the
other places where you create and immediately store a new inhibitor (the other
newInhibitor creation sites) so each new instance is
Run/WaitForLoading-completed before being published.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cee9d348-7225-4edf-a338-8c048a2bbc41
📒 Files selected for processing (4)
internal/app/app.gointernal/app/lifecycle.gointernal/app/lifecycle_test.gointernal/app/reloader.go
🚧 Files skipped from review as they are similar to previous changes (3)
- internal/app/lifecycle_test.go
- internal/app/lifecycle.go
- internal/app/app.go
Signed-off-by: Siavash Safi <siavash@cloudflare.com>
|
@coderabbitai review |
✅ Action performedReview finished.
|
Summary
Resolves the long-standing TODO from #406 by extracting the Alertmanager process logic out of
cmd/alertmanager/main.gointo a newinternal/apppackage, and giving it a lifecycle API (New/Start/Addr/Reload/Stop) so tests and other binaries can embed Alertmanager in-process instead of building and shelling out to the compiled binary.cmd/alertmanager/main.goshrinks from 722 → 207 lines and now owns only: kingpin flag parsing, logger construction,versioncollectorregistration, feature-flag / GOMEMLIMIT side effects, and translating OS signals into context cancellation (SIGINT/SIGTERM) plus reload events (SIGHUP) consumed byapp.Run.What this changes
A reusable
internal/apppackageThe package is mainly intended for internal use, it can be used by other projects embedding Alertmanager as well.
A lifecycle API for embedders
Runis a thin wrapper (New+Start+serveLoop+Stop) with a deferredStopso cleanup also runs on panic. Listeners are bound atNewtime soAddr()/Addrs()report the real bound ports (including kernel-assigned:0) before serving starts.Per-instance state, so multiple instances can coexist
The Prometheus collectors that used to be package-level
promautovariables incmd/alertmanager/main.goare now built perRun()againstopts.Registerer, which is threaded through every collaborator (versioncollectorexcepted, which stays process-global inmain.go). This unblocks running multiple Alertmanager instances in the same process without duplicate-registration panics.Deterministic, observable shutdown
setupregisters teardown steps on a LIFO cleanup stack thatStopdrains in reverse, mirroring Go'sdefersemantics so shutdown ordering follows construction order automatically. Each step is named;Stopruns them all, logs any failure by name, and returns the errors joined with the HTTP-shutdown error. The HTTP shutdown honors a single timeout derived from the context passed toStop.Isolated config-reload logic
The config-scoped subgraph lives in a dedicated
reloadertype.reloader.reloadperforms the stop-old → build-new → wait-for-loading → atomic-swap sequence, andreloader.stoptears down the live inhibitor and dispatcher at shutdown. The long-lived singletons (nflog, silences, alerts, cluster peer, API, event recorder, tracing) are constructed once and updated in place on reload (apih.Update,eventRec/tracingApplyConfig) rather than rebuilt.Behavioural notes
prometheus.DefaultRegistereris no longer referenced insideapp.Run; the binary still passes it in viaOptions.Registerer, so on-disk behaviour is identical.srv.Shutdownnow actually runs onRunexit (previously the deferredsrv.Closelived inside the listen goroutine and never ran in practice becauseos.Exitkilled the process first). Behaviour for the binary is unchanged; embedded callers now get clean HTTP teardown.vsock://listen addresses work under both the binary and embedders; the external URL is derived from the actual bound address.tracingManager.Stopis part of the cleanup stack and always runs, not just onctx.Done().Start/Stopare concurrency-safe and/-/reloadworks in embedded mode (no deadlock withoutRun/serveLoop).--cluster.listen-addressdefault moved from a const incmd/alertmanagerto the exportedapp.DefaultClusterAddr.Known follow-ups (out of scope)
matcher/compat.InitFromFlagsstill mutates package-level state; multi-instance tests with different feature flags will collide. Tracked separately.app.Rundirectly instead of building and spawning the binary. Now mechanically possible thanks toAddr()/Stop()on*App; left for a follow-up PR to keep this one reviewable.Verification
Tests live in
internal/app(lifecycle, listen, options, url, reloader, cluster). Highlights:TestApp_StartStop— boot, probe/-/healthy, stop, stop again (idempotency).TestApp_TwoSequentialInstances/TestApp_TwoConcurrentInstances— multiple instances in one process; guards the metrics-per-Registerer behaviour.TestApp_ClusteredStartStop— gossip clustering enabled; exercises the peer-dependent paths andclusterWait.TestApp_ConcurrentStartStop— racesStartandStop(run under-race).TestApp_EmbeddedReloadDoesNotDeadlock—/-/reloadcompletes in embedded mode.TestApp_New_SetupFailureDoesNotDeadlock— setup-failure rollback doesn't block.TestApp_Run_ContextCancel— end-to-endRunwrapper with ctx cancellation.TestApp_serveLoop/TestApp_Stop_AggregatesCleanupErrors— serve-loop exits and aggregated shutdown errors.TestReloader_*— component swap on reload, error path leaves prior state intact, nil-safe stop.TestListenAll_*/TestParseVsockPort/TestOptions_Validate/TestClusterWait.internal/appstatement coverage is ~84%; the full suite passes under-race.Diff size
Closes #406
Pull Request Checklist
Please check all the applicable boxes.
benchstatto compare benchmarksWhich user-facing changes does this PR introduce?
Summary by CodeRabbit
Refactor
New Features
Bug Fixes
Tests