fix(agent): systematic failure recovery for agent tunnel enrollment #1802

Merged
irvingouj@Devolutions (irvingoujAtDevolution) merged 1 commit into master from fix/agent-installer-enroll-hardening
May 29, 2026


Contributor

@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) commented May 26, 2026

Summary

Agent tunnel enrollment spans two non-transactional write phases — the agent's up (Rust: client cert/key + the fixed-name gateway-ca.pem + agent.json) and the MSI custom action (advertise subnets/domains + rollback bookkeeping). A failure in either phase left the machine in a partial state: orphaned cert files, a clobbered gateway-ca.pem, or a half-written agent.json that even a rollback couldn't recover from.

This PR makes the whole enrollment recoverable end to end: every failure path leaves the machine exactly as it was before enrollment.

Agent (Rust)

  • persist_enrollment_response is now transactional: load/validate the config before any write (a corrupt agent.json fails before touching disk), back up the fixed-name gateway-ca.pem, and roll back partial cert/CA writes on any failure.
  • save_config creates its parent directory (fixes fresh standalone agent.exe up on a clean machine) and writes atomically (temp + rename) so a mid-write failure never truncates agent.json.
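The temp + rename write described for save_config can be sketched as follows. This is a minimal illustration, not the code from this PR: the names `save_atomically` and `write_and_sync` and the `.tmp` suffix are assumptions made for the example.

```rust
use std::ffi::OsString;
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `contents` to `path` and flushes it to disk.
fn write_and_sync(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(contents)?;
    file.sync_all()
}

/// Hypothetical helper: replaces `path` with `contents` via a sibling temp file,
/// so a failure mid-write leaves the previous file untouched.
pub fn save_atomically(path: &Path, contents: &[u8]) -> io::Result<()> {
    // Create the parent directory so a first run on a clean machine succeeds.
    if let Some(parent) = path.parent() {
        if !parent.as_os_str().is_empty() {
            fs::create_dir_all(parent)?;
        }
    }

    // Assumed naming scheme: "<file>.tmp" in the same directory, which keeps
    // the rename on a single volume.
    let mut tmp_name: OsString = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    // Write the temp file completely, then swap it into place.
    let result = write_and_sync(&tmp_path, contents).and_then(|()| fs::rename(&tmp_path, path));

    // On any failure, remove the temp file; the original is still intact.
    if result.is_err() {
        let _ = fs::remove_file(&tmp_path);
    }
    result
}
```

Because the destination is only ever touched by the final rename, a reader (including a later rollback) sees either the old file or the complete new one.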

Installer (C#)

  • EnrollAgentTunnel snapshots the pre-enrollment Tunnel section and gateway-ca.pem into a per-install rollback marker (%TEMP%\{installId}-tunnel-rollback.json), written atomically. The marker is required: if it can't be recorded, the enrollment is undone inline and the custom action fails.
  • New marker-driven RollbackEnrollAgentTunnel (Execute.rollback) only cleans up / restores when this install recorded a marker — so it never touches pre-existing or partial state. It restores the original Tunnel section and gateway-ca.pem, and deletes the certs this install wrote.
  • All agent.json writes go through an atomic temp-replace helper, so the rollback can always re-parse it.
  • Drains stdout/stderr concurrently with WaitForExit (fixes a pipe-buffer deadlock that could kill a healthy up) and fails loudly when operator-supplied advertise subnets/domains can't be persisted (instead of silently dropping them).
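The pipe-buffer deadlock arises when the parent waits for exit while the child is blocked writing to a full, unread pipe. The installer fix is in C#; the sketch below shows the same idea in Rust for illustration only. `read_all`, `drain_both`, and `wait_drained` are hypothetical names, and in real Rust code the standard library's `Child::wait_with_output` already covers this case.

```rust
use std::io::{self, Read};
use std::process::{Child, ExitStatus};
use std::thread;

/// Reads a stream to the end and returns the raw bytes.
fn read_all<R: Read>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    reader.read_to_end(&mut buf)?;
    Ok(buf)
}

/// Drains both streams on separate threads so neither pipe can fill up
/// while the other is being read.
pub fn drain_both<O, E>(stdout: O, stderr: E) -> io::Result<(Vec<u8>, Vec<u8>)>
where
    O: Read + Send,
    E: Read + Send,
{
    let (out, err) = thread::scope(|scope| {
        let out = scope.spawn(move || read_all(stdout));
        let err = scope.spawn(move || read_all(stderr));
        (
            out.join().expect("stdout reader panicked"),
            err.join().expect("stderr reader panicked"),
        )
    });
    Ok((out?, err?))
}

/// Hypothetical wrapper: drains a child's piped output, then collects its exit status.
/// Assumes the child was spawned with both stdout and stderr set to `Stdio::piped()`.
pub fn wait_drained(mut child: Child) -> io::Result<(ExitStatus, Vec<u8>, Vec<u8>)> {
    let stdout = child
        .stdout
        .take()
        .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "stdout was not piped"))?;
    let stderr = child
        .stderr
        .take()
        .ok_or_else(|| io::Error::new(io::ErrorKind::Other, "stderr was not piped"))?;

    // Both pipes reach EOF once the child closes them, so waiting afterwards cannot block on I/O.
    let (out, err) = drain_both(stdout, stderr)?;
    let status = child.wait()?;
    Ok((status, out, err))
}
```

Reading the two streams one after the other would not be enough: a child that fills stderr while the parent is still blocked on stdout would stall both sides.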

Failure-recovery matrix

| Failure point | Recovery |
| --- | --- |
| Empty / pre-write failure | No marker → rollback no-op; nothing was written |
| up fails mid-write (bad json / save error) | Rust self-rolls-back its partial writes → non-zero exit → no marker |
| up ok, marker write fails | Inline cleanup + custom action fails |
| up ok, advertisements write fails | Marker present → rollback restores |
| Later MSI action fails after success | Marker present → rollback restores |
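The "no marker → no-op, marker present → restore" rule in the matrix can be modelled with a short sketch. This is not the installer's C# code: `RollbackMarker`, its fields, and `RollbackOutcome` are hypothetical, the real marker is a JSON file that also carries the Tunnel section, and treating "no original CA" as "delete the CA file" is an assumption of the example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical in-memory form of the per-install rollback marker.
pub struct RollbackMarker {
    /// Where gateway-ca.pem lives.
    pub ca_path: PathBuf,
    /// Bytes of gateway-ca.pem before enrollment; `None` is assumed to mean it did not exist.
    pub original_ca: Option<Vec<u8>>,
    /// Cert/key files this install wrote.
    pub written_files: Vec<PathBuf>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RollbackOutcome {
    /// No marker was recorded, so nothing is touched.
    NoMarker,
    /// Pre-enrollment state was restored.
    Restored,
}

/// Deletes a file, treating "already gone" as success.
fn remove_if_exists(path: &Path) -> io::Result<()> {
    match fs::remove_file(path) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}

/// Rolls back only what this install recorded in its marker.
pub fn rollback(marker: Option<&RollbackMarker>) -> io::Result<RollbackOutcome> {
    let marker = match marker {
        Some(marker) => marker,
        // Pre-existing or partial state from elsewhere is left alone.
        None => return Ok(RollbackOutcome::NoMarker),
    };

    // Delete the certs this install wrote.
    for file in &marker.written_files {
        remove_if_exists(file)?;
    }

    // Put the original CA back, or remove the one enrollment created.
    match &marker.original_ca {
        Some(bytes) => fs::write(&marker.ca_path, bytes)?,
        None => remove_if_exists(&marker.ca_path)?,
    }
    Ok(RollbackOutcome::Restored)
}
```

Keying every destructive step on the marker is what makes the first two rows of the matrix safe: if enrollment never got far enough to record one, the rollback has nothing to act on.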

Test

  • cargo check -p devolutions-agent — clean.
  • dotnet build DevolutionsAgent.csproj -c Debug — 0 errors (8 pre-existing WiX CNDL warnings), MSI builds.

Changelog: ignore

@github-actions

Let maintainers know that an action is required on their side

  • Add the label release-required when you request a maintainer to cut a new release (Devolutions Gateway, Devolutions Agent, Jetsocat, PowerShell module)

  • Add the label release-blocker if a follow-up is required before cutting a new release

  • Add the label publish-required when you request a maintainer to publish libraries (Devolutions.Gateway.Utils, OpenAPI clients, etc.)

  • Add the label publish-blocker if a follow-up is required before publishing libraries

@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) changed the title from "fix(agent-installer): harden EnrollAgentTunnel against pipe deadlock and silent config loss" to "fix(agent-installer): systematic failure recovery for agent tunnel enrollment" May 26, 2026
@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) changed the title from "fix(agent-installer): systematic failure recovery for agent tunnel enrollment" to "fix(agent): systematic failure recovery for agent tunnel enrollment" May 26, 2026
@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) marked this pull request as ready for review May 26, 2026 21:54
Contributor

Copilot AI left a comment


Pull request overview

This PR hardens agent tunnel enrollment (Rust agent.exe up + Windows MSI custom actions) to be failure-recoverable end-to-end, avoiding partial on-disk state (orphaned certs, clobbered gateway-ca.pem, truncated agent.json) when either phase fails.

Changes:

  • Windows MSI (C#): Adds marker-driven rollback for tunnel enrollment, atomic file writes for agent.json/marker, and concurrent stdout/stderr draining to avoid pipe deadlocks.
  • Agent (Rust): Makes enrollment persistence transactional (validate config before writes, back up/restore gateway-ca.pem on failure) and writes agent.json atomically while ensuring parent directories exist.
  • Installer sequencing (C#): Adds a dedicated rollback custom action tied to the enrollment feature and per-install installId marker.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| package/AgentWindowsManaged/Actions/CustomActions.cs | Adds marker-driven rollback + atomic writes + improved process output handling for enrollment/rollback. |
| package/AgentWindowsManaged/Actions/AgentActions.cs | Schedules the new rollback custom action and bubbles installId through CA data. |
| devolutions-agent/src/enrollment.rs | Makes enrollment persistence rollback-safe (backup/restore CA, cleanup on failure) and validates config before writing. |
| devolutions-agent/src/config.rs | Ensures config directory exists and writes agent.json via temp + rename for atomicity. |


Comment thread package/AgentWindowsManaged/Actions/CustomActions.cs Outdated
Comment thread package/AgentWindowsManaged/Actions/CustomActions.cs
Comment thread package/AgentWindowsManaged/Actions/CustomActions.cs Outdated
Comment thread package/AgentWindowsManaged/Actions/CustomActions.cs
Comment thread package/AgentWindowsManaged/Actions/CustomActions.cs
Comment thread devolutions-agent/src/config.rs Outdated
Comment thread package/AgentWindowsManaged/Actions/CustomActions.cs Outdated
Member

@CBenoit Benoît Cortier (CBenoit) left a comment


LGTM, but address Copilot’s relevant comments before merging

Agent tunnel enrollment spans two non-transactional write phases — the
agent's `up` (Rust: cert/key + fixed-name gateway-ca.pem + agent.json)
and the MSI custom action (advertise subnets/domains + rollback
bookkeeping). A failure in either left the machine in a partial state:
orphaned cert files, a clobbered gateway-ca.pem, or a half-written
agent.json that even rollback couldn't recover from. Make the whole
enrollment recoverable end to end.

Agent (Rust):
- persist_enrollment_response is transactional: load/validate config
  before any write (corrupt agent.json fails before touching disk), back
  up the fixed-name gateway-ca.pem, and roll back partial cert/CA writes
  on any failure.
- save_config creates its parent directory (fixes fresh standalone `up`)
  and writes atomically (temp + rename) so a mid-write failure never
  corrupts agent.json.

Installer (C#):
- EnrollAgentTunnel snapshots the pre-enrollment Tunnel section and
  gateway-ca into a per-install rollback marker, written atomically and
  treated as required: if it can't be recorded, the enrollment is undone
  inline and the CA fails.
- New marker-driven RollbackEnrollAgentTunnel (Execute.rollback) only
  cleans up / restores when this install recorded a marker, so it never
  touches pre-existing or partial state; restores the original Tunnel
  section and gateway-ca, deletes the certs this install wrote.
- All agent.json writes go through an atomic temp-replace helper.
- Drain stdout/stderr concurrently with WaitForExit (pipe-buffer
  deadlock) and fail loudly when advertise subnets/domains can't persist.

Changelog: ignore
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) merged commit b3ab657 into master May 29, 2026
43 checks passed
@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) deleted the fix/agent-installer-enroll-hardening branch May 29, 2026 16:47

4 participants