[ES-1892645] Retry transient S3 errors in CloudFetch downloads by jayantsing-db · Pull Request #355 · databricks/databricks-sql-go

jayantsing-db · 2026-05-07T19:48:16Z

fetchBatchBytes had no retry on HTTP failures, so any single 5xx from S3 cancelled the entire query. With thousands of concurrent GETs on large result sets, even a sub-percent per-request failure rate makes at least one failure near-certain. Adds exponential backoff with equal jitter for 408/429/500/502/503/504 plus connection errors, honoring the existing RetryMax / RetryWaitMin / RetryWaitMax config and parseable integer Retry-After response headers. Link expiry is re-checked after each backoff so retries don't outlive the presigned URL.

Co-authored-by: Isaac

gopalldb · 2026-05-14T08:23:22Z

+	if half <= 0 {
+		return base
+	}
+	return half + time.Duration(rand.Int63n(int64(half))) //nolint:gosec // G404: jitter only, non-cryptographic


This uses the global rand. Concurrent calls across MaxDownloadThreads goroutines will contend on the global rand mutex. Probably
immaterial at typical MaxDownloadThreads values, but for many-thousands-concurrent-downloads workloads (which is the stated motivation), worth swapping in a per-goroutine rand.Rand or math/rand/v2's Uint64N.

Won't fix — at MaxDownloadThreads=10 (default) the global rand mutex is held for sub-microsecond Int63n draws, and only once per backoff event (not per request). Per-goroutine rand.Rand is non-trivial code for no measurable gain. Happy to revisit if a future config tunes MaxDownloadThreads significantly higher.

gopalldb · 2026-05-14T08:25:43Z

+
+	if lastStatus != 0 {
+		msg := fmt.Sprintf("%s: %s %d (after %d retries)", errArrowRowsCloudFetchDownloadFailure, "HTTP error", lastStatus, retryMax)
+		return nil, dbsqlerrint.NewDriverError(ctx, msg, nil)


shall we include lastErr here as well instead of nil?

Won't fix — lastErr is nil by construction at this point. The HTTP-status branch (line ~475) sets lastErr = nil on every iteration whenever a response is received, so when the loop exits with lastStatus != 0, there's no underlying error to wrap. The status code is already captured in the message. Added a code comment near this branch documenting the invariant so a future reader doesn't "fix" it by passing lastErr.

gopalldb

Added some minor comments

fetchBatchBytes had no retry on HTTP failures, so any single 5xx from S3 cancelled the entire query. With thousands of concurrent GETs on large result sets, even a sub-percent per-request failure rate makes at least one failure near-certain. Adds exponential backoff with equal jitter for 408/429/500/502/503/504 plus connection errors, honoring the existing RetryMax / RetryWaitMin / RetryWaitMax config and parseable integer Retry-After response headers. Link expiry is re-checked after each backoff so retries don't outlive the presigned URL. Co-authored-by: Isaac Signed-off-by: Jayant Singh <jayant.singh@databricks.com>

Two follow-ups from review feedback: 1. fetchBatchBytes returned res.Body on 200 OK and let the caller's io.ReadAll happen outside the retry loop, so a TCP RST or truncated body during streaming surfaced as a hard failure. With multi-MB S3 objects this is the dominant transient mode at scale and exactly the failure we want covered. Buffer the body inside the retry loop and treat read errors as transient. Decompression stays outside the loop — malformed LZ4 is data corruption, not transient. 2. The expiry-between-retries test used ExpiryTime = floor(now), which could already be expired before iter 0's check if construction crossed a Unix-second boundary (attempts=0 instead of 1). Add 1s of guaranteed headroom and bump RetryWaitMin to 4s so the post-backoff sleep deterministically pushes floor(now) past expiry on iter 1. Co-authored-by: Isaac Signed-off-by: Jayant Singh <jayant.singh@databricks.com>

Signed-off-by: Jayant Singh <jayant.singh@databricks.com>

vikrantpuppala · 2026-05-21T06:29:01Z

Force-pushed b511e86 to rebase onto current main and address review findings. Summary of changes from 4b72cac:

Rebase / conflict resolution

Rebased onto current main (was behind by Fix CloudFetch goroutine leak that retains Arrow buffers after Close #357, Prepare for v1.11.1 release #358, [SIRT-1753] Bump go-jose/go-jose/v3 to v3.0.5 (CVE-2026-34986) #360).
Resolved the Fix CloudFetch goroutine leak that retains Arrow buffers after Close #357 conflict in Run(): all three result-send sites in the retry-aware Run() now go through cft.sendResult(...) (the goroutine-leak guard from Fix CloudFetch goroutine leak that retains Arrow buffers after Close #357) instead of raw cft.resultChan <- .... Without this, Fix CloudFetch goroutine leak that retains Arrow buffers after Close #357's fix for issue Memory issue: Heap retention after CloudFetch burst on v1.7.1 #356 silently regressed — and the regression is worse than before Fix CloudFetch goroutine leak that retains Arrow buffers after Close #357 because the new code holds a fully-buffered []byte in the result instead of a small io.ReadCloser.

Fixes

time.After(wait) in the backoff select → t := time.NewTimer(wait) with t.Stop() + drain on ctx.Done(). Prevents up to MaxDownloadThreads lingering timers (each up to RetryWaitMax=30s) when a query is cancelled mid-retry.
Dropped the dead if res.Header != nil guard around res.Header.Get("Retry-After") — http.Response.Header is always non-nil after httpClient.Do.
writeTruncatedOK test helper: replaced t.Fatal with t.Errorf + return — FailNow is undefined behavior from a non-test goroutine (it's called from inside an httptest handler).
Added a code comment at the exhaust-after-retries error path explaining why lastErr is nil here by construction (resolves @gopalldb's open thread — see below).

Regression test

Added TestCloudFetchIterator_CloseReleasesAfterRetry. Companion to TestCloudFetchIterator_CloseReleasesInFlightDownloads but exercises the retry path (server flaps 503 → 200) so the goroutine produces its result after having gone through cloudFetchBackoff. Verified locally:
- With sendResult (this PR): test passes, all goroutines exit after Close().
- With raw cft.resultChan <- (simulating a naive Fix CloudFetch goroutine leak that retains Arrow buffers after Close #357 conflict resolution): test fails with 18 leaked cloudFetchDownloadTask.Run goroutines.

Open review threads — responses

Global rand mutex contention: won't fix. At MaxDownloadThreads=10 default, rand.Int63n is sub-microsecond under lock and called only once per backoff (not per request). Per-goroutine rand.Rand is non-trivial code for no measurable gain.
Pass lastErr instead of nil on HTTP-status exhaust: won't fix. The HTTP-status branch explicitly sets lastErr = nil on every iteration, so when the loop exits with lastStatus != 0, lastErr is nil by construction. The status code is captured in the error message. Now documented as a code comment.

Known tradeoffs (worth flagging, not addressed in this PR)

LZ4 path now holds rawBody (compressed) + buf (decompressed) simultaneously during decompression. Necessary for retry-on-truncated-body to work (can't replay a consumed stream). ~200-700 MB extra peak heap on the FactSet workload at default concurrency.
logCloudFetchSpeed now reports actual len(buf) instead of res.ContentLength — strictly more accurate, but emits speed log lines for chunked responses that previously produced none.
downloadMs telemetry now silently includes backoff time when retries fire. Future follow-up: separate cloudFetchRetryCount dimension.

Deferred follow-ups (not blocking)

Retry-After: 0 jitter floor (hot-loop guard if a misbehaving server returns 0).
Retry-After integer overflow guard.

Co-authored-by: Isaac

Sibling of ES-1892645/PR #355. The three staging-operation HTTP wrappers in connection.go (handleStagingPut/Get/Remove) make a single client.Do call with no retry — any single transient S3 5xx (e.g. 503 SlowDown during load) fails the entire SQL statement permanently. Adds retry-with-exponential-backoff to the staging path with the same semantics as the CloudFetch fix: - Retryable statuses: 408/429/500/502/503/504 - Equal-jitter exponential backoff capped at RetryWaitMax - Integer Retry-After response header honored - Context cancellation aborts backoff promptly - Reuses existing RetryMax/RetryWaitMin/RetryWaitMax config knobs (consistent with the CloudFetch path the customer asked about) The PUT path needs special handling: http.Client.Do consumes the request body (an *os.File), so the retry helper rewinds the file with Seek(0, SeekStart) between attempts and wraps it in io.NopCloser so the client can't close the file on us. Factors the shared retry primitives (RetryableStatuses, IsRetryableStatus, Backoff) into a new internal/retry package so the CloudFetch path (internal/rows/arrowbased/batchloader.go) and the staging path share one implementation. This addresses the "two divergent retry implementations" follow-up from the #355 review. Co-authored-by: Isaac Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

## Summary Sibling of [ES-1892645](https://databricks.atlassian.net/browse/ES-1892645) / PR #355 (the CloudFetch retry fix that just merged). Same FactSet customer, same root cause class (transient S3 5xx), different code path. The three staging-operation HTTP wrappers in `connection.go` (`handleStagingPut`, `handleStagingGet`, `handleStagingRemove`) make a single `client.Do(req)` call with no retry. Under FactSet's load test (~30 PUTs / 2 min against a UC external volume), S3 intermittently returns `503 SlowDown` and the driver fails the entire SQL statement permanently: ``` staging operation over HTTP was unsuccessful: 503-<S3 SlowDown body> ``` ## Changes - **`connection.go`** — adds `doStagingRequestWithRetry`, a per-conn helper that wraps a `func(attempt int) (*http.Request, error)` factory in a retry loop. All three `handleStaging*` methods use it. - **PUT body lifecycle** — `http.Client.Do` consumes the request body (an `*os.File`) on each attempt, so the retry helper `Seek(0, SeekStart)`s the file between attempts. The file is also wrapped in `io.NopCloser` so the client can't close it; the outer `defer dat.Close()` owns the lifecycle. - **`internal/retry/retry.go`** (new) — factors out `RetryableStatuses`, `IsRetryableStatus`, and `Backoff` so the CloudFetch path and the staging path share one implementation. Addresses the "two divergent retry implementations" follow-up from #355. - **`internal/rows/arrowbased/batchloader.go`** — migrated to use the shared `retry` package. Same behavior, no functional change. ## Retry semantics — consistent with PR #355 (the customer asked) | Behavior | This PR (staging) | PR #355 (CloudFetch) | |---|---|---| | Retryable statuses | 408/429/500/502/503/504 | identical (shared via `internal/retry`) | | Backoff curve | Exponential with equal jitter, capped at `RetryWaitMax` | identical | | Integer `Retry-After` honored | yes (capped at `RetryWaitMax`) | identical | | Context cancel during backoff | aborts promptly | identical | | Config knobs | `RetryMax` / `RetryWaitMin` / `RetryWaitMax` | identical | ## Test plan Added `TestConn_handleStagingRetry` in `connection_test.go` with 8 subtests: - [x] PUT retries transient 503 and eventually succeeds - [x] GET retries transient 503 and eventually succeeds - [x] REMOVE retries transient 503 and eventually succeeds - [x] PUT retries transient HTTP 500 - [x] PUT fails after exhausting retries on persistent 503 (verifies attempt count = `RetryMax+1`) - [x] PUT does not retry non-retryable status (403) - [x] **PUT replays the file body on each retry** — server verifies full payload bytes arrive on attempts 1, 2, and 3 (catches the `Seek`/`NopCloser` regression class) - [x] PUT respects context cancellation during backoff Plus moved `TestCloudFetchBackoff` and `TestCloudFetchRetryableStatus` into the new `internal/retry/retry_test.go` package (no behavioral changes, just relocated to live with the shared helpers). ### Verified - [x] All new tests fail on `main` (`origin/main` SHA `a97b104`) — confirmed reproduction of the bug. - [x] All new tests pass on this branch. - [x] Full test suite passes locally: `go test ./... -short`. - [x] `go vet ./...` clean. - [x] `gofmt -l` clean (only pre-existing generated-file diffs in `internal/cli_service/`). ## Related-pattern audit Per the `/fix-github-issue` Step 7, searched for other single-shot `client.Do(req)` sites that might need the same treatment: | Site | Status | |---|---| | `internal/rows/arrowbased/batchloader.go` (CloudFetch) | already retried via #355; now shares helpers | | `telemetry/exporter.go` | already has its own retry loop | | `telemetry/featureflag.go` | feature-flag fetch, not customer data path — out of scope | | `auth/tokenprovider/exchange.go` | token exchange — different concern (auth), file a separate ticket if needed | No other staging-like sites need retrofitting in this PR. ## Customer-facing answers (from JIRA) > Are we going to expect consistent behavior across both the in-flight CloudFetch fix and this more recent PUT/GET/REMOVE use case? **Yes** — same retryable statuses, same backoff curve, same config knobs. The two paths now share one implementation in `internal/retry`. > Will the backoff retry be configurable? **Yes, via the existing `RetryMax` / `RetryWaitMin` / `RetryWaitMax` config knobs.** No new API surface; the staging path opts into the same retry budget the driver already exposes for Thrift and (post-#355) CloudFetch. This pull request and its description were written by Isaac. [ES-1892645]: https://databricks.atlassian.net/browse/ES-1892645?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>

jayantsing-db force-pushed the jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors branch from 5d1790d to 4b72cac Compare May 7, 2026 20:24

gopalldb reviewed May 14, 2026

View reviewed changes

gopalldb approved these changes May 14, 2026

View reviewed changes

jayantsing-db added 3 commits May 21, 2026 06:18

Improvements

b511e86

Signed-off-by: Jayant Singh <jayant.singh@databricks.com>

vikrantpuppala force-pushed the jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors branch from 4b72cac to b511e86 Compare May 21, 2026 06:28

vikrantpuppala requested a review from gopalldb May 21, 2026 06:29

gopalldb approved these changes May 21, 2026

View reviewed changes

vikrantpuppala merged commit a97b104 into main May 21, 2026
3 checks passed

vikrantpuppala deleted the jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors branch May 21, 2026 06:34

vikrantpuppala mentioned this pull request May 21, 2026

[ES-1911239] Retry transient S3 errors on staging PUT/GET/REMOVE #361

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ES-1892645] Retry transient S3 errors in CloudFetch downloads#355

[ES-1892645] Retry transient S3 errors in CloudFetch downloads#355
vikrantpuppala merged 3 commits into
mainfrom
jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors

jayantsing-db commented May 7, 2026

Uh oh!

gopalldb May 14, 2026

Uh oh!

vikrantpuppala May 21, 2026

Uh oh!

gopalldb May 14, 2026

Uh oh!

vikrantpuppala May 21, 2026

Uh oh!

gopalldb left a comment

Uh oh!

vikrantpuppala commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jayantsing-db commented May 7, 2026

Uh oh!

gopalldb May 14, 2026

Choose a reason for hiding this comment

Uh oh!

vikrantpuppala May 21, 2026

Choose a reason for hiding this comment

Uh oh!

gopalldb May 14, 2026

Choose a reason for hiding this comment

Uh oh!

vikrantpuppala May 21, 2026

Choose a reason for hiding this comment

Uh oh!

gopalldb left a comment

Choose a reason for hiding this comment

Uh oh!

vikrantpuppala commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants