Skip to content

[ES-1911239] Retry transient S3 errors on staging PUT/GET/REMOVE#361

Open
vikrantpuppala wants to merge 1 commit into
mainfrom
fix/es-1911239-staging-retry
Open

[ES-1911239] Retry transient S3 errors on staging PUT/GET/REMOVE#361
vikrantpuppala wants to merge 1 commit into
mainfrom
fix/es-1911239-staging-retry

Conversation

@vikrantpuppala
Copy link
Copy Markdown
Collaborator

@vikrantpuppala vikrantpuppala commented May 21, 2026

Summary

Sibling of ES-1892645 / PR #355 (the CloudFetch retry fix that just merged). Same FactSet customer, same root cause class (transient S3 5xx), different code path.

The three staging-operation HTTP wrappers in connection.go (handleStagingPut, handleStagingGet, handleStagingRemove) make a single client.Do(req) call with no retry. Under FactSet's load test (~30 PUTs / 2 min against a UC external volume), S3 intermittently returns 503 SlowDown and the driver fails the entire SQL statement permanently:

staging operation over HTTP was unsuccessful: 503-<S3 SlowDown body>

Changes

  • connection.go — adds doStagingRequestWithRetry, a per-conn helper that wraps a func(attempt int) (*http.Request, error) factory in a retry loop. All three handleStaging* methods use it.
  • PUT body lifecyclehttp.Client.Do consumes the request body (an *os.File) on each attempt, so the retry helper Seek(0, SeekStart)s the file between attempts. The file is also wrapped in io.NopCloser so the client can't close it; the outer defer dat.Close() owns the lifecycle.
  • internal/retry/retry.go (new) — factors out RetryableStatuses, IsRetryableStatus, and Backoff so the CloudFetch path and the staging path share one implementation. Addresses the "two divergent retry implementations" follow-up from [ES-1892645] Retry transient S3 errors in CloudFetch downloads #355.
  • internal/rows/arrowbased/batchloader.go — migrated to use the shared retry package. Same behavior, no functional change.

Retry semantics — consistent with PR #355 (the customer asked)

Behavior This PR (staging) PR #355 (CloudFetch)
Retryable statuses 408/429/500/502/503/504 identical (shared via internal/retry)
Backoff curve Exponential with equal jitter, capped at RetryWaitMax identical
Integer Retry-After honored yes (capped at RetryWaitMax) identical
Context cancel during backoff aborts promptly identical
Config knobs RetryMax / RetryWaitMin / RetryWaitMax identical

Test plan

Added TestConn_handleStagingRetry in connection_test.go with 8 subtests:

  • PUT retries transient 503 and eventually succeeds
  • GET retries transient 503 and eventually succeeds
  • REMOVE retries transient 503 and eventually succeeds
  • PUT retries transient HTTP 500
  • PUT fails after exhausting retries on persistent 503 (verifies attempt count = RetryMax+1)
  • PUT does not retry non-retryable status (403)
  • PUT replays the file body on each retry — server verifies full payload bytes arrive on attempts 1, 2, and 3 (catches the Seek/NopCloser regression class)
  • PUT respects context cancellation during backoff

Plus moved TestCloudFetchBackoff and TestCloudFetchRetryableStatus into the new internal/retry/retry_test.go package (no behavioral changes, just relocated to live with the shared helpers).

Verified

  • All new tests fail on main (origin/main SHA a97b104) — confirmed reproduction of the bug.
  • All new tests pass on this branch.
  • Full test suite passes locally: go test ./... -short.
  • go vet ./... clean.
  • gofmt -l clean (only pre-existing generated-file diffs in internal/cli_service/).

Related-pattern audit

Per the /fix-github-issue Step 7, searched for other single-shot client.Do(req) sites that might need the same treatment:

Site Status
internal/rows/arrowbased/batchloader.go (CloudFetch) already retried via #355; now shares helpers
telemetry/exporter.go already has its own retry loop
telemetry/featureflag.go feature-flag fetch, not customer data path — out of scope
auth/tokenprovider/exchange.go token exchange — different concern (auth), file a separate ticket if needed

No other staging-like sites need retrofitting in this PR.

Customer-facing answers (from JIRA)

Are we going to expect consistent behavior across both the in-flight CloudFetch fix and this more recent PUT/GET/REMOVE use case?

Yes — same retryable statuses, same backoff curve, same config knobs. The two paths now share one implementation in internal/retry.

Will the backoff retry be configurable?

Yes, via the existing RetryMax / RetryWaitMin / RetryWaitMax config knobs. No new API surface; the staging path opts into the same retry budget the driver already exposes for Thrift and (post-#355) CloudFetch.

This pull request and its description were written by Isaac.

@vikrantpuppala vikrantpuppala force-pushed the fix/es-1911239-staging-retry branch from 4e2e074 to 46921fe Compare May 21, 2026 06:58
@gopalldb
Copy link
Copy Markdown
Collaborator

P1 (Important)

  1. RetryMax = 0 user-config quirk now applies to staging too
  • config/config.go WithDefaults overwrites explicit RetryMax = 0 with the default 4. Users can't disable retries except via -1 (clamped to 0 at connection.go:766-768). This PR widens the blast radius to a
    third call site without documenting the workaround.
  • Suggested fix: Note -1 as the opt-out sentinel in the PR description / comment, or do a follow-up to make explicit zero stick.
  1. REMOVE is now retried — "503 then 404" surfaces as failure
  • Post-fix, if the first DELETE actually applied but returned 503, a retry will see 404 (non-retryable) and surface a terminal error instead of success.
  • Suggested fix: Treat 404 on REMOVE after at least one retry as success, or document the change and add a test case for the 503-then-404 sequence.

P2 (Minor)

  1. http.Request.GetBody not set on PUT — auto-redirect replay won't work (pre-existing; presigned S3 doesn't redirect anyway).
  2. Context-cancellation test asserts elapsed < 1s with 500ms backoffs — tight headroom on slow CI. Bump to 2s.
  3. "PUT replays the file body" test has a buffered-channel mutex around append but reads len() outside it. Single-threaded by construction, but the idiom is misleading — use sync.Mutex or drop it.
  4. "Non-retryable 403" test could also assert elapsed < 50ms to prove no backoff fired.
  5. lastBody in the exhausted-retries error message is unbounded — consider truncating to ~512 bytes for unwieldy proxy bodies.
  6. Error-string prefix differs from CloudFetch path (staging operation over HTTP was unsuccessful vs. errArrowRowsCloudFetchDownloadFailure). Not a regression.
  7. Pre-PR REMOVE error string had a literal , nil typo that's silently fixed by unifying into doStagingRequestWithRetry. Worth mentioning in the PR body.
  8. New http.Client{} per call — no keep-alive across staging ops. Same as before, not a regression.

@vikrantpuppala vikrantpuppala force-pushed the fix/es-1911239-staging-retry branch from 46921fe to bd4d7d4 Compare May 21, 2026 10:49
@vikrantpuppala
Copy link
Copy Markdown
Collaborator Author

Thanks for the review! Force-pushed bd4d7d4 addressing all P1 items + most P2. Per-comment response:

P1

1. RetryMax = 0 user-config quirk — agreed, this is a real footgun and now applies to a third call site. Not fixing in this PR because making explicit 0 stick changes behavior for existing users who currently get 4 retries silently (i.e. anyone who didn't realize 0 was clobbered) — that's a behavior-change discussion worth its own PR.

For this PR, documenting the -1 opt-out: users who want to disable retries on the staging path should set retries=-1 in the DSN (or RetryMax: -1 programmatically). The staging helper clamps negative to 0 (connection.go: c.cfg.RetryMax; if retryMax < 0 { retryMax = 0 }), giving exactly 1 attempt with no retry. Same behavior the Thrift and CloudFetch paths offer.

I'll file a follow-up to make explicit 0 stick across all three paths.

2. REMOVE 503-then-404 surfaces as failure — fixed in bd4d7d4. Went with the aggressive idempotent-delete semantics: 404 on REMOVE is now always treated as success, not just after a retry has fired. Rationale: DELETE's post-condition is "object absent", which is already true if you get 404. This also means REMOVE of a non-existent object — which previously failed pre-PR — now succeeds. Behavior change worth noting; documented in the PR body.

Plumbed via a new acceptStatus func(int) bool parameter on the retry helper:

  • New doStagingRequestWithRetryAccept takes the predicate; back-compat doStagingRequestWithRetry keeps the default 2xx-only behavior for PUT/GET.
  • REMOVE passes func(s int) bool { return statusInSuccessRange(s) || s == http.StatusNotFound }.

Three new tests cover it:

  • REMOVE treats 503-then-404 as success (idempotent delete) — the exact sequence you flagged.
  • REMOVE treats first-attempt 404 as success (idempotent delete) — pins the documented behavior change.
  • PUT first-attempt 404 still fails (only REMOVE treats 404 as success) — guard test, prevents the 404-as-success behavior leaking into PUT/GET.

P2

3. Request.GetBody not set on PUT for redirect replay — won't fix per your own note (pre-existing, S3 presigned URLs don't redirect). Acknowledged.

4. Context-cancel test elapsed bound bumped 1s → 2s — done in bd4d7d4. The cancel fires at 50ms with 500ms+ backoff per retry × 5 retries = 2.5s+ floor without cancel; 2s elapsed bound still proves cancel landed promptly while giving slow CI headroom.

5. PUT replays the file body test channel-mutex — fixed in bd4d7d4. Replaced the chan struct{} with a proper sync.Mutex; len() is read inside the lock now. Race detector clean (go test -race).

6. 403 test elapsed assertion — added in bd4d7d4: assert.Less(t, elapsed, 50*time.Millisecond, "non-retryable status must not trigger backoff"). With RetryWaitMin=1ms, any accidental backoff would be observable.

7. lastBody unbounded in error messages — fixed in bd4d7d4. Added truncateErrorBody with maxStagingErrorBodyBytes = 512. Applied to both the non-retryable-status path and the exhausted-retries path. Truncated bodies append ... (N bytes total, truncated) so the reader knows.

8. Error-string prefix differs from CloudFetch path — acknowledged, not a regression (the prefixes describe different operations, both human-readable).

9. Pre-PR REMOVE error-string , nil typo silently fixed — yes, noting here for the record: the pre-PR handleStagingRemove had a literal ", nil" string suffix that's gone post-PR because all three handlers now share doStagingRequestWithRetryAccept's single error-formatting path. Minor unintentional cleanup.

10. New http.Client{} per call, no keep-alive — acknowledged, not a regression (matches pre-PR shape). Worth a future optimization but out of scope for an incident fix.


Diff stat

connection.go      | +67 lines
connection_test.go | +88 lines

Verification

  • All 11 staging-retry tests pass (8 original + 3 new for 404 behavior).
  • go test -race ./... clean for the new tests.
  • go vet and golangci-lint clean (post-format).

Co-authored-by: Isaac

Sibling of ES-1892645/PR #355. The three staging-operation HTTP wrappers
in connection.go (handleStagingPut/Get/Remove) make a single client.Do
call with no retry — any single transient S3 5xx (e.g. 503 SlowDown
during load) fails the entire SQL statement permanently.

Adds retry-with-exponential-backoff to the staging path with the same
semantics as the CloudFetch fix:
- Retryable statuses: 408/429/500/502/503/504
- Equal-jitter exponential backoff capped at RetryWaitMax
- Integer Retry-After response header honored
- Context cancellation aborts backoff promptly
- Reuses existing RetryMax/RetryWaitMin/RetryWaitMax config knobs
  (consistent with the CloudFetch path the customer asked about)

The PUT path needs special handling: http.Client.Do consumes the request
body (an *os.File), so the retry helper rewinds the file with Seek(0,
SeekStart) between attempts and wraps it in io.NopCloser so the client
can't close the file on us.

Factors the shared retry primitives (RetryableStatuses, IsRetryableStatus,
Backoff) into a new internal/retry package so the CloudFetch path
(internal/rows/arrowbased/batchloader.go) and the staging path share one
implementation. This addresses the "two divergent retry implementations"
follow-up from the #355 review.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
@vikrantpuppala vikrantpuppala force-pushed the fix/es-1911239-staging-retry branch from bd4d7d4 to 0be1256 Compare May 21, 2026 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants