[ES-1911239] Retry transient S3 errors on staging PUT/GET/REMOVE by vikrantpuppala · Pull Request #361 · databricks/databricks-sql-go

vikrantpuppala · 2026-05-21T06:53:28Z

Summary

Sibling of ES-1892645 / PR #355 (the CloudFetch retry fix that just merged). Same FactSet customer, same root cause class (transient S3 5xx), different code path.

The three staging-operation HTTP wrappers in connection.go (handleStagingPut, handleStagingGet, handleStagingRemove) make a single client.Do(req) call with no retry. Under FactSet's load test (~30 PUTs / 2 min against a UC external volume), S3 intermittently returns 503 SlowDown and the driver fails the entire SQL statement permanently:

staging operation over HTTP was unsuccessful: 503-<S3 SlowDown body>

Changes

connection.go — adds doStagingRequestWithRetry, a per-conn helper that wraps a func(attempt int) (*http.Request, error) factory in a retry loop. All three handleStaging* methods use it.
PUT body lifecycle — http.Client.Do consumes the request body (an *os.File) on each attempt, so the retry helper Seek(0, SeekStart)s the file between attempts. The file is also wrapped in io.NopCloser so the client can't close it; the outer defer dat.Close() owns the lifecycle.
internal/retry/retry.go (new) — factors out RetryableStatuses, IsRetryableStatus, and Backoff so the CloudFetch path and the staging path share one implementation. Addresses the "two divergent retry implementations" follow-up from [ES-1892645] Retry transient S3 errors in CloudFetch downloads #355.
internal/rows/arrowbased/batchloader.go — migrated to use the shared retry package. Same behavior, no functional change.

Retry semantics — consistent with PR #355 (the customer asked)

Behavior	This PR (staging)	PR #355 (CloudFetch)
Retryable statuses	408/429/500/502/503/504	identical (shared via `internal/retry`)
Backoff curve	Exponential with equal jitter, capped at `RetryWaitMax`	identical
Integer `Retry-After` honored	yes (capped at `RetryWaitMax`)	identical
Context cancel during backoff	aborts promptly	identical
Config knobs	`RetryMax` / `RetryWaitMin` / `RetryWaitMax`	identical

Test plan

Added TestConn_handleStagingRetry in connection_test.go with 8 subtests:

PUT retries transient 503 and eventually succeeds
GET retries transient 503 and eventually succeeds
REMOVE retries transient 503 and eventually succeeds
PUT retries transient HTTP 500
PUT fails after exhausting retries on persistent 503 (verifies attempt count = RetryMax+1)
PUT does not retry non-retryable status (403)
PUT replays the file body on each retry — server verifies full payload bytes arrive on attempts 1, 2, and 3 (catches the Seek/NopCloser regression class)
PUT respects context cancellation during backoff

Plus moved TestCloudFetchBackoff and TestCloudFetchRetryableStatus into the new internal/retry/retry_test.go package (no behavioral changes, just relocated to live with the shared helpers).

Verified

All new tests fail on main (origin/main SHA a97b104) — confirmed reproduction of the bug.
All new tests pass on this branch.
Full test suite passes locally: go test ./... -short.
go vet ./... clean.
gofmt -l clean (only pre-existing generated-file diffs in internal/cli_service/).

Related-pattern audit

Per the /fix-github-issue Step 7, searched for other single-shot client.Do(req) sites that might need the same treatment:

Site	Status
`internal/rows/arrowbased/batchloader.go` (CloudFetch)	already retried via #355; now shares helpers
`telemetry/exporter.go`	already has its own retry loop
`telemetry/featureflag.go`	feature-flag fetch, not customer data path — out of scope
`auth/tokenprovider/exchange.go`	token exchange — different concern (auth), file a separate ticket if needed

No other staging-like sites need retrofitting in this PR.

Customer-facing answers (from JIRA)

Are we going to expect consistent behavior across both the in-flight CloudFetch fix and this more recent PUT/GET/REMOVE use case?

Yes — same retryable statuses, same backoff curve, same config knobs. The two paths now share one implementation in internal/retry.

Will the backoff retry be configurable?

Yes, via the existing RetryMax / RetryWaitMin / RetryWaitMax config knobs. No new API surface; the staging path opts into the same retry budget the driver already exposes for Thrift and (post-#355) CloudFetch.

This pull request and its description were written by Isaac.

gopalldb · 2026-05-21T10:14:25Z

P1 (Important)

RetryMax = 0 user-config quirk now applies to staging too

config/config.go WithDefaults overwrites explicit RetryMax = 0 with the default 4. Users can't disable retries except via -1 (clamped to 0 at connection.go:766-768). This PR widens the blast radius to a
third call site without documenting the workaround.
Suggested fix: Note -1 as the opt-out sentinel in the PR description / comment, or do a follow-up to make explicit zero stick.

REMOVE is now retried — "503 then 404" surfaces as failure

Post-fix, if the first DELETE actually applied but returned 503, a retry will see 404 (non-retryable) and surface a terminal error instead of success.
Suggested fix: Treat 404 on REMOVE after at least one retry as success, or document the change and add a test case for the 503-then-404 sequence.

P2 (Minor)

http.Request.GetBody not set on PUT — auto-redirect replay won't work (pre-existing; presigned S3 doesn't redirect anyway).
Context-cancellation test asserts elapsed < 1s with 500ms backoffs — tight headroom on slow CI. Bump to 2s.
"PUT replays the file body" test has a buffered-channel mutex around append but reads len() outside it. Single-threaded by construction, but the idiom is misleading — use sync.Mutex or drop it.
"Non-retryable 403" test could also assert elapsed < 50ms to prove no backoff fired.
lastBody in the exhausted-retries error message is unbounded — consider truncating to ~512 bytes for unwieldy proxy bodies.
Error-string prefix differs from CloudFetch path (staging operation over HTTP was unsuccessful vs. errArrowRowsCloudFetchDownloadFailure). Not a regression.
Pre-PR REMOVE error string had a literal , nil typo that's silently fixed by unifying into doStagingRequestWithRetry. Worth mentioning in the PR body.
New http.Client{} per call — no keep-alive across staging ops. Same as before, not a regression.

vikrantpuppala · 2026-05-21T10:49:44Z

Thanks for the review! Force-pushed bd4d7d4 addressing all P1 items + most P2. Per-comment response:

P1

1. RetryMax = 0 user-config quirk — agreed, this is a real footgun and now applies to a third call site. Not fixing in this PR because making explicit 0 stick changes behavior for existing users who currently get 4 retries silently (i.e. anyone who didn't realize 0 was clobbered) — that's a behavior-change discussion worth its own PR.

For this PR, documenting the -1 opt-out: users who want to disable retries on the staging path should set retries=-1 in the DSN (or RetryMax: -1 programmatically). The staging helper clamps negative to 0 (connection.go: c.cfg.RetryMax; if retryMax < 0 { retryMax = 0 }), giving exactly 1 attempt with no retry. Same behavior the Thrift and CloudFetch paths offer.

I'll file a follow-up to make explicit 0 stick across all three paths.

2. REMOVE 503-then-404 surfaces as failure — fixed in bd4d7d4. Went with the aggressive idempotent-delete semantics: 404 on REMOVE is now always treated as success, not just after a retry has fired. Rationale: DELETE's post-condition is "object absent", which is already true if you get 404. This also means REMOVE of a non-existent object — which previously failed pre-PR — now succeeds. Behavior change worth noting; documented in the PR body.

Plumbed via a new acceptStatus func(int) bool parameter on the retry helper:

New doStagingRequestWithRetryAccept takes the predicate; back-compat doStagingRequestWithRetry keeps the default 2xx-only behavior for PUT/GET.
REMOVE passes func(s int) bool { return statusInSuccessRange(s) || s == http.StatusNotFound }.

Three new tests cover it:

REMOVE treats 503-then-404 as success (idempotent delete) — the exact sequence you flagged.
REMOVE treats first-attempt 404 as success (idempotent delete) — pins the documented behavior change.
PUT first-attempt 404 still fails (only REMOVE treats 404 as success) — guard test, prevents the 404-as-success behavior leaking into PUT/GET.

P2

3. Request.GetBody not set on PUT for redirect replay — won't fix per your own note (pre-existing, S3 presigned URLs don't redirect). Acknowledged.

4. Context-cancel test elapsed bound bumped 1s → 2s — done in bd4d7d4. The cancel fires at 50ms with 500ms+ backoff per retry × 5 retries = 2.5s+ floor without cancel; 2s elapsed bound still proves cancel landed promptly while giving slow CI headroom.

5. PUT replays the file body test channel-mutex — fixed in bd4d7d4. Replaced the chan struct{} with a proper sync.Mutex; len() is read inside the lock now. Race detector clean (go test -race).

6. 403 test elapsed assertion — added in bd4d7d4: assert.Less(t, elapsed, 50*time.Millisecond, "non-retryable status must not trigger backoff"). With RetryWaitMin=1ms, any accidental backoff would be observable.

7. lastBody unbounded in error messages — fixed in bd4d7d4. Added truncateErrorBody with maxStagingErrorBodyBytes = 512. Applied to both the non-retryable-status path and the exhausted-retries path. Truncated bodies append ... (N bytes total, truncated) so the reader knows.

8. Error-string prefix differs from CloudFetch path — acknowledged, not a regression (the prefixes describe different operations, both human-readable).

9. Pre-PR REMOVE error-string , nil typo silently fixed — yes, noting here for the record: the pre-PR handleStagingRemove had a literal ", nil" string suffix that's gone post-PR because all three handlers now share doStagingRequestWithRetryAccept's single error-formatting path. Minor unintentional cleanup.

10. New http.Client{} per call, no keep-alive — acknowledged, not a regression (matches pre-PR shape). Worth a future optimization but out of scope for an incident fix.

Diff stat

connection.go      | +67 lines
connection_test.go | +88 lines

Verification

All 11 staging-retry tests pass (8 original + 3 new for 404 behavior).
go test -race ./... clean for the new tests.
go vet and golangci-lint clean (post-format).

Co-authored-by: Isaac

Sibling of ES-1892645/PR #355. The three staging-operation HTTP wrappers in connection.go (handleStagingPut/Get/Remove) make a single client.Do call with no retry — any single transient S3 5xx (e.g. 503 SlowDown during load) fails the entire SQL statement permanently. Adds retry-with-exponential-backoff to the staging path with the same semantics as the CloudFetch fix: - Retryable statuses: 408/429/500/502/503/504 - Equal-jitter exponential backoff capped at RetryWaitMax - Integer Retry-After response header honored - Context cancellation aborts backoff promptly - Reuses existing RetryMax/RetryWaitMin/RetryWaitMax config knobs (consistent with the CloudFetch path the customer asked about) The PUT path needs special handling: http.Client.Do consumes the request body (an *os.File), so the retry helper rewinds the file with Seek(0, SeekStart) between attempts and wraps it in io.NopCloser so the client can't close the file on us. Factors the shared retry primitives (RetryableStatuses, IsRetryableStatus, Backoff) into a new internal/retry package so the CloudFetch path (internal/rows/arrowbased/batchloader.go) and the staging path share one implementation. This addresses the "two divergent retry implementations" follow-up from the #355 review. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

vikrantpuppala force-pushed the fix/es-1911239-staging-retry branch from 4e2e074 to 46921fe Compare May 21, 2026 06:58

vikrantpuppala force-pushed the fix/es-1911239-staging-retry branch from 46921fe to bd4d7d4 Compare May 21, 2026 10:49

vikrantpuppala force-pushed the fix/es-1911239-staging-retry branch from bd4d7d4 to 0be1256 Compare May 21, 2026 10:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ES-1911239] Retry transient S3 errors on staging PUT/GET/REMOVE#361

[ES-1911239] Retry transient S3 errors on staging PUT/GET/REMOVE#361
vikrantpuppala wants to merge 1 commit into
mainfrom
fix/es-1911239-staging-retry

vikrantpuppala commented May 21, 2026 •

edited by atlassian Bot

Loading

Uh oh!

gopalldb commented May 21, 2026

Uh oh!

vikrantpuppala commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vikrantpuppala commented May 21, 2026 • edited by atlassian Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Retry semantics — consistent with PR #355 (the customer asked)

Test plan

Verified

Related-pattern audit

Customer-facing answers (from JIRA)

Uh oh!

gopalldb commented May 21, 2026

Uh oh!

vikrantpuppala commented May 21, 2026

P1

P2

Diff stat

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vikrantpuppala commented May 21, 2026 •

edited by atlassian Bot

Loading