
[NAS Backup] Suppress Errors in Disk Usage Calculation that Caused Backup to Fail.#13424

Open
daviftorres wants to merge 10 commits into apache:main from daviftorres:nas-backup-failed

Conversation

@daviftorres

@daviftorres daviftorres commented Jun 15, 2026

Contributor

Description

This PR prevents a backup job from being marked as failed when the backup itself has succeeded and only the statistics section of the script (the disk usage calculation) hit an error.

image

It also appears to fix some silent failures I previously reported in #11727.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@daviftorres

daviftorres commented Jun 15, 2026

Contributor Author

This is the equivalent command for applying the fix:

sed -i 's_du -sb $dest | cut -f1_du -sb $dest 2>/dev/null | cut -f1 || true_g' /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/nasbackup.sh

We haven't confirmed the exact root cause of the du failure yet. As a precaution, we applied this fix to all servers and will monitor backups over the next few days.

In the meantime, I am running tests with stderr redirected using 2>>/var/log/cloudstack/agent/nasbackup.err so I can capture the actual error message.
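A minimal sketch of that diagnostic approach, assuming bash and GNU du. The helper name `diagnose_du` and its arguments are hypothetical and not part of nasbackup.sh. It appends du's stderr and exit status to a log file while still printing a usable size.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of nasbackup.sh): measure a directory with du,
# append any error text and the exit status to a log file, and print the size.
diagnose_du() {
    local dest="$1" err_log="$2" out size rc=0

    # Keep du separate from cut so that du's own exit status is visible.
    out=$(du -sb "$dest" 2>>"$err_log") || rc=$?

    if [ "$rc" -ne 0 ]; then
        echo "$(date '+%F %T') du exited with status $rc for $dest" >>"$err_log"
    fi

    # du prints "<bytes><TAB><path>"; keep only the first column.
    size=$(printf '%s\n' "$out" | cut -f1)

    # Fall back to 0 when du printed nothing.
    echo -n "${size:-0}"
}
```

For example, `diagnose_du "$dest" /var/log/cloudstack/agent/nasbackup.err` would leave both du's message and the exit status in the error file for later inspection.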

@daviftorres daviftorres marked this pull request as ready for review June 16, 2026 15:02
@daviftorres

daviftorres commented Jun 17, 2026

Contributor Author

Proposed Changes Rationale

backup_size=$(du -sb "$dest" 2>/dev/null | cut -f1) || true
  • NFS issues may cause du command to fail.
  • A size retrieval failure should not invalidate a successful backup.
timeout 60 umount "$mount_point" 2>/dev/null || true
rmdir "$mount_point" 2>/dev/null || true
  • Another process may keep the device busy (e.g., parallel backups).
  • Network issues may cause hangs on NFS.
  • Cleanup failures should not invalidate a successful backup.
echo -n "$backup_size"
  • Outputs the size at the end to confirm the script completed past the potentially problematic commands.
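The rationale above can be sketched as two small helpers. This is only an illustration under stated assumptions: it assumes bash, GNU du, and a script running with `set -eo pipefail`. The function names `safe_backup_size` and `safe_cleanup` are hypothetical and do not appear in the PR.

```shell
#!/usr/bin/env bash
# Assumption: the surrounding script aborts on errors, including pipeline errors.
set -eo pipefail

# Hypothetical helper: print the size of a directory in bytes, or 0 if du fails.
safe_backup_size() {
    local dest="$1" size
    # With pipefail, a failing du fails the whole pipeline; fall back to 0
    # instead of letting the error abort the script.
    size=$(du -sb "$dest" 2>/dev/null | cut -f1) || size=0
    echo -n "${size:-0}"
}

# Hypothetical helper: unmount and remove a mount point without aborting.
safe_cleanup() {
    local mount_point="$1"
    # timeout guards against an unmount that hangs, e.g. on an unreachable NFS server.
    timeout 60 umount "$mount_point" 2>/dev/null || true
    rmdir "$mount_point" 2>/dev/null || true
}
```

Note that this version reports 0 even when du printed a partial total before failing. Whether a partial size is preferable to 0 is a design choice the thread leaves open.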

@daviftorres

Contributor Author

Dear @abh1sar, could you help me with this bug? Regards,

@codecov

codecov Bot commented Jun 18, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 18.94%. Comparing base (5ed4894) to head (8dea747).
⚠️ Report is 32 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13424      +/-   ##
============================================
- Coverage     18.94%   18.94%   -0.01%     
+ Complexity    18376    18375       -1     
============================================
  Files          6192     6192              
  Lines        556550   556558       +8     
  Branches      67954    67955       +1     
============================================
- Hits         105454   105453       -1     
- Misses       439517   439526       +9     
  Partials      11579    11579              
Flag Coverage Δ
uitests 3.51% <ø> (-0.01%) ⬇️
unittests 20.15% <ø> (-0.01%) ⬇️


Copilot AI review requested due to automatic review settings June 18, 2026 20:07

Copilot AI left a comment

Contributor

Pull request overview

This PR adjusts the KVM NAS backup script’s “statistics/cleanup” section so that failures while computing backup disk usage (and related cleanup commands) don’t cause an otherwise successful backup job to be marked as failed.

Changes:

  • Capture du output into backup_size and suppress du stderr to avoid failing the script during size calculation.
  • Add timeout around umount and suppress errors from umount/rmdir.
  • Emit the computed backup size at the end of backup_running_vm().


Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated

@DaanHoogland DaanHoogland left a comment

Contributor

clgtm

@DaanHoogland DaanHoogland requested a review from abh1sar June 23, 2026 06:53
Copilot AI review requested due to automatic review settings June 23, 2026 12:20

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Copilot AI review requested due to automatic review settings June 24, 2026 20:55

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated

@abh1sar abh1sar left a comment

Contributor

Hi @daviftorres,
would it be possible to test the script changes in your environment, where the issue is reproducible?
Some log output would be helpful.

Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
Copilot AI review requested due to automatic review settings June 25, 2026 13:19

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread scripts/vm/hypervisor/kvm/nasbackup.sh Outdated
@daviftorres

Contributor Author

"would it be possible to test script changes in your env where it is reproducible?"

image
2026-06-25 13:22:42,360 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-1:[]) (logid:e4aca261) Request:Seq 218-281474976710814:  { Cmd , MgmtId: 90520736259046, via: 218, Ver: v1, Flags: 100111, [{"org.apache.cloudstack.backup.TakeBackupCommand":{"vmName":"i-84-6493-VM","backupPath":"i-84-6493-VM/2026.06.25.13.22.42","backupRepoType":"nfs","backupRepoAddress":"10.4.2.145:/mnt/VAN3-NAS01-STOR-POOL-A1/VAN3-NAS01-DS-01","wait":"0","bypassHostMaintenance":"false"}}] }
2026-06-25 13:22:42,360 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-1:[]) (logid:e4aca261) Processing command: org.apache.cloudstack.backup.TakeBackupCommand
2026-06-25 13:25:04,120 DEBUG [cloud.agent.Agent] (AgentRequest-Handler-1:[]) (logid:e4aca261) Seq 218-281474976710814:  { Ans: , MgmtId: 90520736259046, via: 218, Ver: v1, Flags: 110, [{"org.apache.cloudstack.backup.BackupAnswer":{"size":"(47.67 GB) 51190241222","result":"true","details":"Job type:         Completed

I could not find anything else in the logs related to the backup job.

Copilot AI review requested due to automatic review settings June 26, 2026 01:49

Copilot AI left a comment

Contributor

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@abh1sar

abh1sar commented Jun 26, 2026

Contributor

@blueorangutan package

@blueorangutan


@abh1sar a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan


Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18383

@DaanHoogland

Contributor

@blueorangutan test

@blueorangutan


@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@blueorangutan


[SF] Trillian test result (tid-16435)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 49818 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr13424-t16435-kvm-ol8.zip
Smoke tests completed. 151 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@weizhouapache weizhouapache added this to the 4.23.0 milestone Jun 29, 2026
daviftorres and others added 9 commits June 29, 2026 08:58
Handle potential errors when calculating disk usage.
Add timeout for unmounting backup mount point and cleanup.
Co-authored-by: Abhisar Sinha <63767682+abh1sar@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@DaanHoogland

Contributor

@davift

davift commented Jun 29, 2026


@daviftorres , can you look at https://github.com/apache/cloudstack/actions/runs/28373728590/job/84079353133?pr=13424#step:7:11412 ?

Hey @DaanHoogland , sure!

trim trailing whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

If I understand correctly, it is caused by the echo command, which emits a trailing newline by default. I previously had -n, but I accepted GitHub Copilot's advice to remove it to prevent an error. Honestly, I am not sure whether it is related. I am re-adding the argument now and will see whether the build fails again.

Please point me in the right direction if you see that I am going the wrong way. Regards!
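For context, the hook output above reports that files were modified, which suggests the `trailing-whitespace` hook is flagging spaces or tabs at the end of lines in the script source rather than anything echo prints at runtime. A hedged sketch for locating such lines locally follows, assuming a grep with POSIX character classes. The helper name `find_trailing_whitespace` is hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print "<line number>:<line>" for every line in a file
# that ends with at least one space or tab.
find_trailing_whitespace() {
    # grep exits non-zero when nothing matches; treat that as success.
    grep -nE '[[:blank:]]+$' "$1" || true
}
```

Running `find_trailing_whitespace scripts/vm/hypervisor/kvm/nasbackup.sh` from the repository root would then list the candidate lines, if any.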

Copilot AI review requested due to automatic review settings June 29, 2026 15:15

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

rmdir $mount_point
backup_size=$(du -sb "$dest" 2>>"$logFile" | cut -f1) || { log -ne "WARNING: du failed for $dest, reporting size as 0"; backup_size=0; }

timeout "$UNMOUNT_TIMEOUT" umount "$mount_point" 2>>"$logFile" || { log "WARNING: umount of $mount_point failed or timed out"; true; }
Comment on lines +201 to +202
timeout "$UNMOUNT_TIMEOUT" umount "$mount_point" 2>>"$logFile" || { log "WARNING: umount of $mount_point failed or timed out"; true; }
rmdir "$mount_point" 2>>"$logFile" || { log "WARNING: rmdir of $mount_point failed"; true; }
7 participants