fix(agent): systematic failure recovery for agent tunnel enrollment #1802
Pull request overview
This PR hardens agent tunnel enrollment (Rust agent.exe up + Windows MSI custom actions) to be failure-recoverable end-to-end, avoiding partial on-disk state (orphaned certs, clobbered gateway-ca.pem, truncated agent.json) when either phase fails.
Changes:
- Windows MSI (C#): Adds marker-driven rollback for tunnel enrollment, atomic file writes for agent.json/marker, and concurrent stdout/stderr draining to avoid pipe deadlocks.
- Agent (Rust): Makes enrollment persistence transactional (validate config before writes, back up/restore gateway-ca.pem on failure) and writes agent.json atomically while ensuring parent directories exist.
- Installer sequencing (C#): Adds a dedicated rollback custom action tied to the enrollment feature and per-install installId marker.
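The pipe-deadlock point in the list above concerns the C# installer code; as a rough analogue only, the same idea can be sketched in Rust. The function names `drain_both` and `run_and_capture` are hypothetical and do not come from this PR. The sketch reads stderr on a helper thread while the calling thread reads stdout, and only waits for the child once both pipes are drained, so a chatty child can never block on a full pipe buffer.

```rust
use std::io::{self, Read};
use std::process::{Command, ExitStatus, Stdio};
use std::thread;

/// Hypothetical helper: read two streams to the end concurrently.
/// stderr is drained on a helper thread, stdout on the calling thread.
fn drain_both<O, E>(mut out: O, mut err: E) -> io::Result<(Vec<u8>, Vec<u8>)>
where
    O: Read,
    E: Read + Send + 'static,
{
    let err_thread = thread::spawn(move || {
        let mut buf = Vec::new();
        err.read_to_end(&mut buf).map(|_| buf)
    });

    let mut out_buf = Vec::new();
    out.read_to_end(&mut out_buf)?;

    let err_buf = err_thread
        .join()
        .map_err(|_| io::Error::new(io::ErrorKind::Other, "stderr reader panicked"))??;
    Ok((out_buf, err_buf))
}

/// Hypothetical wrapper: spawn a child, drain both pipes, then wait.
/// Waiting first and reading afterwards is the ordering that can deadlock.
fn run_and_capture(cmd: &mut Command) -> io::Result<(ExitStatus, Vec<u8>, Vec<u8>)> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    let stderr = child.stderr.take().expect("stderr was piped");

    let (out, err) = drain_both(stdout, stderr)?;
    let status = child.wait()?;
    Ok((status, out, err))
}
```

In plain Rust, `Command::output()` already collects both streams safely; the manual version is shown only to make the ordering (drain first, wait last) explicit.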
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| package/AgentWindowsManaged/Actions/CustomActions.cs | Adds marker-driven rollback + atomic writes + improved process output handling for enrollment/rollback. |
| package/AgentWindowsManaged/Actions/AgentActions.cs | Schedules the new rollback custom action and bubbles installId through CA data. |
| devolutions-agent/src/enrollment.rs | Makes enrollment persistence rollback-safe (backup/restore CA, cleanup on failure) and validates config before writing. |
| devolutions-agent/src/config.rs | Ensures config directory exists and writes agent.json via temp + rename for atomicity. |
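The "temp + rename" write described for config.rs can be sketched as follows. This is a minimal illustration, not the PR's actual `save_config`: the helper name `write_atomically` and the `.tmp` suffix are assumptions made for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical helper: replace `path` with `contents` without ever exposing
/// a partially written file at `path`.
fn write_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Create the parent directory first so a clean machine does not fail.
    if let Some(parent) = path.parent() {
        if !parent.as_os_str().is_empty() {
            fs::create_dir_all(parent)?;
        }
    }

    // Use a sibling temp file (assumed suffix: ".tmp") so the rename stays on
    // the same volume.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let result = (|| -> io::Result<()> {
        let mut file = File::create(&tmp_path)?;
        file.write_all(contents)?;
        file.sync_all()?; // flush data before the rename makes it visible
        drop(file); // close the handle before renaming
        fs::rename(&tmp_path, path)
    })();

    if result.is_err() {
        // Best-effort cleanup; the original file at `path` is left untouched.
        let _ = fs::remove_file(&tmp_path);
    }
    result
}
```

Because the target is only ever swapped in by the final rename, a failure midway leaves either the old file or no file at `path`, never a truncated one.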
Benoît Cortier (CBenoit) left a comment
LGTM, but address Copilot’s relevant comments before merging
Agent tunnel enrollment spans two non-transactional write phases: the agent's `up` (Rust: cert/key + fixed-name gateway-ca.pem + agent.json) and the MSI custom action (advertise subnets/domains + rollback bookkeeping). A failure in either left the machine in a partial state: orphaned cert files, a clobbered gateway-ca.pem, or a half-written agent.json that even rollback couldn't recover from.

Make the whole enrollment recoverable end to end.

Agent (Rust):
- persist_enrollment_response is transactional: load/validate config before any write (corrupt agent.json fails before touching disk), back up the fixed-name gateway-ca.pem, and roll back partial cert/CA writes on any failure.
- save_config creates its parent directory (fixes fresh standalone `up`) and writes atomically (temp + rename) so a mid-write failure never corrupts agent.json.

Installer (C#):
- EnrollAgentTunnel snapshots the pre-enrollment Tunnel section and gateway-ca into a per-install rollback marker, written atomically and treated as required: if it can't be recorded, the enrollment is undone inline and the custom action fails.
- New marker-driven RollbackEnrollAgentTunnel (Execute.rollback) only cleans up / restores when this install recorded a marker, so it never touches pre-existing or partial state; it restores the original Tunnel section and gateway-ca, and deletes the certs this install wrote.
- All agent.json writes go through an atomic temp-replace helper.
- Drain stdout/stderr concurrently with WaitForExit (pipe-buffer deadlock) and fail loudly when advertise subnets/domains can't persist.

Changelog: ignore
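The transactional persistence described in the commit message above (validate first, back up the fixed-name CA, roll back on any failure) can be sketched roughly as follows. This is not the PR's `persist_enrollment_response`: the `Enrollment` struct, the `client.crt` / `client.key` file names, and the two closures standing in for config validation and saving are all assumptions for illustration.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the enrollment response payload.
struct Enrollment<'a> {
    client_cert: &'a [u8],
    client_key: &'a [u8],
    gateway_ca: &'a [u8],
}

/// Sketch of a rollback-safe persist step. `validate_config` stands in for
/// loading/validating agent.json; `save_config` for writing it back.
fn persist_enrollment(
    dir: &Path,
    enrollment: &Enrollment<'_>,
    validate_config: impl FnOnce() -> io::Result<()>,
    save_config: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    // 1. Fail before touching disk if the existing config is unusable.
    validate_config()?;

    // 2. Remember the previous fixed-name CA (if any) so it can be restored.
    let ca_path = dir.join("gateway-ca.pem");
    let previous_ca = match fs::read(&ca_path) {
        Ok(bytes) => Some(bytes),
        Err(e) if e.kind() == io::ErrorKind::NotFound => None,
        Err(e) => return Err(e),
    };

    // 3. Perform the writes, tracking every cert file this call creates.
    //    File names are assumed; certs are assumed to be new per enrollment.
    let mut written: Vec<PathBuf> = Vec::new();
    let result = (|| -> io::Result<()> {
        let files = [
            ("client.crt", enrollment.client_cert),
            ("client.key", enrollment.client_key),
        ];
        for (name, bytes) in files {
            let path = dir.join(name);
            written.push(path.clone());
            fs::write(&path, bytes)?;
        }
        fs::write(&ca_path, enrollment.gateway_ca)?;
        save_config()
    })();

    // 4. On any failure, delete what was written and restore the old CA.
    if result.is_err() {
        for path in &written {
            let _ = fs::remove_file(path);
        }
        match &previous_ca {
            Some(bytes) => {
                let _ = fs::write(&ca_path, bytes);
            }
            None => {
                let _ = fs::remove_file(&ca_path);
            }
        }
    }
    result
}
```

The rollback in step 4 is best effort: it ignores secondary errors so that the original failure is the one reported to the caller.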
Merged b3ab657 into master
Summary
Agent tunnel enrollment spans two non-transactional write phases: the agent's `up` (Rust: client cert/key + the fixed-name `gateway-ca.pem` + `agent.json`) and the MSI custom action (advertise subnets/domains + rollback bookkeeping). A failure in either phase left the machine in a partial state: orphaned cert files, a clobbered `gateway-ca.pem`, or a half-written `agent.json` that even a rollback couldn't recover from.

This makes the whole enrollment recoverable end to end: every failure path leaves the machine exactly as it was before enroll.
Agent (Rust)
- `persist_enrollment_response` is now transactional: load/validate the config before any write (a corrupt `agent.json` fails before touching disk), back up the fixed-name `gateway-ca.pem`, and roll back partial cert/CA writes on any failure.
- `save_config` creates its parent directory (fixes a fresh standalone `agent.exe up` on a clean machine) and writes atomically (temp + rename) so a mid-write failure never truncates `agent.json`.

Installer (C#)
- `EnrollAgentTunnel` snapshots the pre-enrollment `Tunnel` section and `gateway-ca.pem` into a per-install rollback marker (`%TEMP%\{installId}-tunnel-rollback.json`), written atomically. The marker is required: if it can't be recorded, the enrollment is undone inline and the custom action fails.
- `RollbackEnrollAgentTunnel` (`Execute.rollback`) only cleans up / restores when this install recorded a marker, so it never touches pre-existing or partial state. It restores the original `Tunnel` section and `gateway-ca.pem`, and deletes the certs this install wrote.
- All `agent.json` writes go through an atomic temp-replace helper, so the rollback can always re-parse it.
- Drains stdout/stderr concurrently with `WaitForExit` (fixes a pipe-buffer deadlock that could kill a healthy `up`) and fails loudly when operator-supplied advertise subnets/domains can't be persisted (instead of silently dropping them).

Failure-recovery matrix
- `up` fails mid-write (bad JSON / save error)
- `up` ok, marker write fails
- `up` ok, advertisements write fails

Test
- `cargo check -p devolutions-agent`: clean.
- `dotnet build DevolutionsAgent.csproj -c Debug`: 0 errors (8 pre-existing WiX CNDL warnings), MSI builds.

Changelog: ignore