Skip to content

[ES-1892645] Retry transient S3 errors in CloudFetch downloads#355

Merged
vikrantpuppala merged 3 commits into
mainfrom
jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors
May 21, 2026
Merged

[ES-1892645] Retry transient S3 errors in CloudFetch downloads#355
vikrantpuppala merged 3 commits into
mainfrom
jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors

Conversation

@jayantsing-db
Copy link
Copy Markdown
Contributor

fetchBatchBytes had no retry on HTTP failures, so any single 5xx from S3 cancelled the entire query. With thousands of concurrent GETs on large result sets, even a sub-percent per-request failure rate makes at least one failure near-certain. Adds exponential backoff with equal jitter for 408/429/500/502/503/504 plus connection errors, honoring the existing RetryMax / RetryWaitMin / RetryWaitMax config and parseable integer Retry-After response headers. Link expiry is re-checked after each backoff so retries don't outlive the presigned URL.

Co-authored-by: Isaac

@jayantsing-db jayantsing-db force-pushed the jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors branch from 5d1790d to 4b72cac Compare May 7, 2026 20:24
if half <= 0 {
return base
}
return half + time.Duration(rand.Int63n(int64(half))) //nolint:gosec // G404: jitter only, non-cryptographic
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses the global rand. Concurrent calls across MaxDownloadThreads goroutines will contend on the global rand mutex. Probably
immaterial at typical MaxDownloadThreads values, but for many-thousands-concurrent-downloads workloads (which is the stated motivation), worth swapping in a per-goroutine rand.Rand or math/rand/v2's Uint64N.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — at MaxDownloadThreads=10 (default) the global rand mutex is held for sub-microsecond Int63n draws, and only once per backoff event (not per request). Per-goroutine rand.Rand is non-trivial code for no measurable gain. Happy to revisit if a future config tunes MaxDownloadThreads significantly higher.

Comment thread internal/rows/arrowbased/batchloader.go Outdated

if lastStatus != 0 {
msg := fmt.Sprintf("%s: %s %d (after %d retries)", errArrowRowsCloudFetchDownloadFailure, "HTTP error", lastStatus, retryMax)
return nil, dbsqlerrint.NewDriverError(ctx, msg, nil)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we include lastErr here as well instead of nil?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't fix — lastErr is nil by construction at this point. The HTTP-status branch (line ~475) sets lastErr = nil on every iteration whenever a response is received, so when the loop exits with lastStatus != 0, there's no underlying error to wrap. The status code is already captured in the message. Added a code comment near this branch documenting the invariant so a future reader doesn't "fix" it by passing lastErr.

Copy link
Copy Markdown
Collaborator

@gopalldb gopalldb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some minor comments

fetchBatchBytes had no retry on HTTP failures, so any single 5xx from
S3 cancelled the entire query. With thousands of concurrent GETs on
large result sets, even a sub-percent per-request failure rate makes
at least one failure near-certain. Adds exponential backoff with
equal jitter for 408/429/500/502/503/504 plus connection errors,
honoring the existing RetryMax / RetryWaitMin / RetryWaitMax config
and parseable integer Retry-After response headers. Link expiry is
re-checked after each backoff so retries don't outlive the presigned
URL.

Co-authored-by: Isaac
Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
Two follow-ups from review feedback:

1. fetchBatchBytes returned res.Body on 200 OK and let the caller's
   io.ReadAll happen outside the retry loop, so a TCP RST or truncated
   body during streaming surfaced as a hard failure. With multi-MB S3
   objects this is the dominant transient mode at scale and exactly
   the failure we want covered. Buffer the body inside the retry loop
   and treat read errors as transient. Decompression stays outside
   the loop — malformed LZ4 is data corruption, not transient.

2. The expiry-between-retries test used ExpiryTime = floor(now), which
   could already be expired before iter 0's check if construction
   crossed a Unix-second boundary (attempts=0 instead of 1). Add 1s of
   guaranteed headroom and bump RetryWaitMin to 4s so the post-backoff
   sleep deterministically pushes floor(now) past expiry on iter 1.

Co-authored-by: Isaac
Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
@vikrantpuppala vikrantpuppala force-pushed the jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors branch from 4b72cac to b511e86 Compare May 21, 2026 06:28
@vikrantpuppala
Copy link
Copy Markdown
Collaborator

Force-pushed b511e86 to rebase onto current main and address review findings. Summary of changes from 4b72cac:

Rebase / conflict resolution

Fixes

  • time.After(wait) in the backoff selectt := time.NewTimer(wait) with t.Stop() + drain on ctx.Done(). Prevents up to MaxDownloadThreads lingering timers (each up to RetryWaitMax=30s) when a query is cancelled mid-retry.
  • Dropped the dead if res.Header != nil guard around res.Header.Get("Retry-After")http.Response.Header is always non-nil after httpClient.Do.
  • writeTruncatedOK test helper: replaced t.Fatal with t.Errorf + returnFailNow is undefined behavior from a non-test goroutine (it's called from inside an httptest handler).
  • Added a code comment at the exhaust-after-retries error path explaining why lastErr is nil here by construction (resolves @gopalldb's open thread — see below).

Regression test

  • Added TestCloudFetchIterator_CloseReleasesAfterRetry. Companion to TestCloudFetchIterator_CloseReleasesInFlightDownloads but exercises the retry path (server flaps 503 → 200) so the goroutine produces its result after having gone through cloudFetchBackoff. Verified locally:

Open review threads — responses

  • Global rand mutex contention: won't fix. At MaxDownloadThreads=10 default, rand.Int63n is sub-microsecond under lock and called only once per backoff (not per request). Per-goroutine rand.Rand is non-trivial code for no measurable gain.
  • Pass lastErr instead of nil on HTTP-status exhaust: won't fix. The HTTP-status branch explicitly sets lastErr = nil on every iteration, so when the loop exits with lastStatus != 0, lastErr is nil by construction. The status code is captured in the error message. Now documented as a code comment.

Known tradeoffs (worth flagging, not addressed in this PR)

  • LZ4 path now holds rawBody (compressed) + buf (decompressed) simultaneously during decompression. Necessary for retry-on-truncated-body to work (can't replay a consumed stream). ~200-700 MB extra peak heap on the FactSet workload at default concurrency.
  • logCloudFetchSpeed now reports actual len(buf) instead of res.ContentLength — strictly more accurate, but emits speed log lines for chunked responses that previously produced none.
  • downloadMs telemetry now silently includes backoff time when retries fire. Future follow-up: separate cloudFetchRetryCount dimension.

Deferred follow-ups (not blocking)

  • Retry-After: 0 jitter floor (hot-loop guard if a misbehaving server returns 0).
  • Retry-After integer overflow guard.

Co-authored-by: Isaac

@vikrantpuppala vikrantpuppala requested a review from gopalldb May 21, 2026 06:29
@vikrantpuppala vikrantpuppala merged commit a97b104 into main May 21, 2026
3 checks passed
@vikrantpuppala vikrantpuppala deleted the jayantsing-db/fix/es-1892645-cloudfetch-retry-transient-errors branch May 21, 2026 06:34
vikrantpuppala added a commit that referenced this pull request May 21, 2026
Sibling of ES-1892645/PR #355. The three staging-operation HTTP wrappers
in connection.go (handleStagingPut/Get/Remove) make a single client.Do
call with no retry — any single transient S3 5xx (e.g. 503 SlowDown
during load) fails the entire SQL statement permanently.

Adds retry-with-exponential-backoff to the staging path with the same
semantics as the CloudFetch fix:
- Retryable statuses: 408/429/500/502/503/504
- Equal-jitter exponential backoff capped at RetryWaitMax
- Integer Retry-After response header honored
- Context cancellation aborts backoff promptly
- Reuses existing RetryMax/RetryWaitMin/RetryWaitMax config knobs
  (consistent with the CloudFetch path the customer asked about)

The PUT path needs special handling: http.Client.Do consumes the request
body (an *os.File), so the retry helper rewinds the file with Seek(0,
SeekStart) between attempts and wraps it in io.NopCloser so the client
can't close the file on us.

Factors the shared retry primitives (RetryableStatuses, IsRetryableStatus,
Backoff) into a new internal/retry package so the CloudFetch path
(internal/rows/arrowbased/batchloader.go) and the staging path share one
implementation. This addresses the "two divergent retry implementations"
follow-up from the #355 review.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
vikrantpuppala added a commit that referenced this pull request May 21, 2026
Sibling of ES-1892645/PR #355. The three staging-operation HTTP wrappers
in connection.go (handleStagingPut/Get/Remove) make a single client.Do
call with no retry — any single transient S3 5xx (e.g. 503 SlowDown
during load) fails the entire SQL statement permanently.

Adds retry-with-exponential-backoff to the staging path with the same
semantics as the CloudFetch fix:
- Retryable statuses: 408/429/500/502/503/504
- Equal-jitter exponential backoff capped at RetryWaitMax
- Integer Retry-After response header honored
- Context cancellation aborts backoff promptly
- Reuses existing RetryMax/RetryWaitMin/RetryWaitMax config knobs
  (consistent with the CloudFetch path the customer asked about)

The PUT path needs special handling: http.Client.Do consumes the request
body (an *os.File), so the retry helper rewinds the file with Seek(0,
SeekStart) between attempts and wraps it in io.NopCloser so the client
can't close the file on us.

Factors the shared retry primitives (RetryableStatuses, IsRetryableStatus,
Backoff) into a new internal/retry package so the CloudFetch path
(internal/rows/arrowbased/batchloader.go) and the staging path share one
implementation. This addresses the "two divergent retry implementations"
follow-up from the #355 review.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
vikrantpuppala added a commit that referenced this pull request May 21, 2026
Sibling of ES-1892645/PR #355. The three staging-operation HTTP wrappers
in connection.go (handleStagingPut/Get/Remove) make a single client.Do
call with no retry — any single transient S3 5xx (e.g. 503 SlowDown
during load) fails the entire SQL statement permanently.

Adds retry-with-exponential-backoff to the staging path with the same
semantics as the CloudFetch fix:
- Retryable statuses: 408/429/500/502/503/504
- Equal-jitter exponential backoff capped at RetryWaitMax
- Integer Retry-After response header honored
- Context cancellation aborts backoff promptly
- Reuses existing RetryMax/RetryWaitMin/RetryWaitMax config knobs
  (consistent with the CloudFetch path the customer asked about)

The PUT path needs special handling: http.Client.Do consumes the request
body (an *os.File), so the retry helper rewinds the file with Seek(0,
SeekStart) between attempts and wraps it in io.NopCloser so the client
can't close the file on us.

Factors the shared retry primitives (RetryableStatuses, IsRetryableStatus,
Backoff) into a new internal/retry package so the CloudFetch path
(internal/rows/arrowbased/batchloader.go) and the staging path share one
implementation. This addresses the "two divergent retry implementations"
follow-up from the #355 review.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
vikrantpuppala added a commit that referenced this pull request May 21, 2026
## Summary

Sibling of
[ES-1892645](https://databricks.atlassian.net/browse/ES-1892645) / PR
#355 (the CloudFetch retry fix that just merged). Same FactSet customer,
same root cause class (transient S3 5xx), different code path.

The three staging-operation HTTP wrappers in `connection.go`
(`handleStagingPut`, `handleStagingGet`, `handleStagingRemove`) make a
single `client.Do(req)` call with no retry. Under FactSet's load test
(~30 PUTs / 2 min against a UC external volume), S3 intermittently
returns `503 SlowDown` and the driver fails the entire SQL statement
permanently:

```
staging operation over HTTP was unsuccessful: 503-<S3 SlowDown body>
```

## Changes

- **`connection.go`** — adds `doStagingRequestWithRetry`, a per-conn
helper that wraps a `func(attempt int) (*http.Request, error)` factory
in a retry loop. All three `handleStaging*` methods use it.
- **PUT body lifecycle** — `http.Client.Do` consumes the request body
(an `*os.File`) on each attempt, so the retry helper `Seek(0,
SeekStart)`s the file between attempts. The file is also wrapped in
`io.NopCloser` so the client can't close it; the outer `defer
dat.Close()` owns the lifecycle.
- **`internal/retry/retry.go`** (new) — factors out `RetryableStatuses`,
`IsRetryableStatus`, and `Backoff` so the CloudFetch path and the
staging path share one implementation. Addresses the "two divergent
retry implementations" follow-up from #355.
- **`internal/rows/arrowbased/batchloader.go`** — migrated to use the
shared `retry` package. Same behavior, no functional change.

## Retry semantics — consistent with PR #355 (the customer asked)

| Behavior | This PR (staging) | PR #355 (CloudFetch) |
|---|---|---|
| Retryable statuses | 408/429/500/502/503/504 | identical (shared via
`internal/retry`) |
| Backoff curve | Exponential with equal jitter, capped at
`RetryWaitMax` | identical |
| Integer `Retry-After` honored | yes (capped at `RetryWaitMax`) |
identical |
| Context cancel during backoff | aborts promptly | identical |
| Config knobs | `RetryMax` / `RetryWaitMin` / `RetryWaitMax` |
identical |

## Test plan

Added `TestConn_handleStagingRetry` in `connection_test.go` with 8
subtests:

- [x] PUT retries transient 503 and eventually succeeds
- [x] GET retries transient 503 and eventually succeeds
- [x] REMOVE retries transient 503 and eventually succeeds
- [x] PUT retries transient HTTP 500
- [x] PUT fails after exhausting retries on persistent 503 (verifies
attempt count = `RetryMax+1`)
- [x] PUT does not retry non-retryable status (403)
- [x] **PUT replays the file body on each retry** — server verifies full
payload bytes arrive on attempts 1, 2, and 3 (catches the
`Seek`/`NopCloser` regression class)
- [x] PUT respects context cancellation during backoff

Plus moved `TestCloudFetchBackoff` and `TestCloudFetchRetryableStatus`
into the new `internal/retry/retry_test.go` package (no behavioral
changes, just relocated to live with the shared helpers).

### Verified

- [x] All new tests fail on `main` (`origin/main` SHA `a97b104`) —
confirmed reproduction of the bug.
- [x] All new tests pass on this branch.
- [x] Full test suite passes locally: `go test ./... -short`.
- [x] `go vet ./...` clean.
- [x] `gofmt -l` clean (only pre-existing generated-file diffs in
`internal/cli_service/`).

## Related-pattern audit

Per the `/fix-github-issue` Step 7, searched for other single-shot
`client.Do(req)` sites that might need the same treatment:

| Site | Status |
|---|---|
| `internal/rows/arrowbased/batchloader.go` (CloudFetch) | already
retried via #355; now shares helpers |
| `telemetry/exporter.go` | already has its own retry loop |
| `telemetry/featureflag.go` | feature-flag fetch, not customer data
path — out of scope |
| `auth/tokenprovider/exchange.go` | token exchange — different concern
(auth), file a separate ticket if needed |

No other staging-like sites need retrofitting in this PR.

## Customer-facing answers (from JIRA)

> Are we going to expect consistent behavior across both the in-flight
CloudFetch fix and this more recent PUT/GET/REMOVE use case?

**Yes** — same retryable statuses, same backoff curve, same config
knobs. The two paths now share one implementation in `internal/retry`.

> Will the backoff retry be configurable?

**Yes, via the existing `RetryMax` / `RetryWaitMin` / `RetryWaitMax`
config knobs.** No new API surface; the staging path opts into the same
retry budget the driver already exposes for Thrift and (post-#355)
CloudFetch.

This pull request and its description were written by Isaac.

[ES-1892645]:
https://databricks.atlassian.net/browse/ES-1892645?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants