[ES-1911239] Retry transient S3 errors on staging PUT/GET/REMOVE #361
vikrantpuppala wants to merge 1 commit
Thanks for the review! Force-pushed [...].

P1

1. For this PR, documenting the [...]. I'll file a follow-up to make [...] explicit.
2. REMOVE 503-then-404 surfaces as failure: fixed in [...]. Plumbed via a new [...]. Three new tests cover it: [...]

P2

3. [...]
4. Context-cancel test elapsed bound bumped 1s → 2s: done in [...].
5. [...]
6. 403 test elapsed assertion: added in [...].
7. [...]
8. Error-string prefix differs from CloudFetch path: acknowledged, not a regression (the prefixes describe different operations, both human-readable).
9. Pre-PR REMOVE error-string [...]
10. New [...]

Diff stat
Verification

Co-authored-by: Isaac
Sibling of ES-1892645/PR #355. The three staging-operation HTTP wrappers in connection.go (handleStagingPut/Get/Remove) make a single client.Do call with no retry, so any single transient S3 5xx (e.g. 503 SlowDown during load) fails the entire SQL statement permanently.

Adds retry-with-exponential-backoff to the staging path with the same semantics as the CloudFetch fix:

- Retryable statuses: 408/429/500/502/503/504
- Equal-jitter exponential backoff capped at RetryWaitMax
- Integer Retry-After response header honored
- Context cancellation aborts backoff promptly
- Reuses existing RetryMax/RetryWaitMin/RetryWaitMax config knobs (consistent with the CloudFetch path the customer asked about)

The PUT path needs special handling: http.Client.Do consumes the request body (an *os.File), so the retry helper rewinds the file with Seek(0, SeekStart) between attempts and wraps it in io.NopCloser so the client can't close the file on us.

Factors the shared retry primitives (RetryableStatuses, IsRetryableStatus, Backoff) into a new internal/retry package so the CloudFetch path (internal/rows/arrowbased/batchloader.go) and the staging path share one implementation. This addresses the "two divergent retry implementations" follow-up from the #355 review.

Co-authored-by: Isaac
Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>
Summary
Sibling of ES-1892645 / PR #355 (the CloudFetch retry fix that just merged). Same FactSet customer, same root cause class (transient S3 5xx), different code path.
The three staging-operation HTTP wrappers in connection.go (handleStagingPut, handleStagingGet, handleStagingRemove) make a single client.Do(req) call with no retry. Under FactSet's load test (~30 PUTs / 2 min against a UC external volume), S3 intermittently returns 503 SlowDown and the driver fails the entire SQL statement permanently.

Changes
- connection.go: adds doStagingRequestWithRetry, a per-conn helper that wraps a func(attempt int) (*http.Request, error) factory in a retry loop. All three handleStaging* methods use it.
- PUT path: http.Client.Do consumes the request body (an *os.File) on each attempt, so the retry helper calls Seek(0, SeekStart) on the file between attempts. The file is also wrapped in io.NopCloser so the client can't close it; the outer defer dat.Close() owns the lifecycle.
- internal/retry/retry.go (new): factors out RetryableStatuses, IsRetryableStatus, and Backoff so the CloudFetch path and the staging path share one implementation. Addresses the "two divergent retry implementations" follow-up from #355 ([ES-1892645] Retry transient S3 errors in CloudFetch downloads).
- internal/rows/arrowbased/batchloader.go: migrated to use the shared retry package. Same behavior, no functional change.

Retry semantics: consistent with PR #355 (the customer asked)
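- Retryable statuses: 408/429/500/502/503/504 (from internal/retry)
- Backoff: equal-jitter exponential, capped at RetryWaitMax
- Retry-After: honored ([...] RetryWaitMax)
- Config knobs: RetryMax / RetryWaitMin / RetryWaitMax

Test plan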
Added TestConn_handleStagingRetry in connection_test.go with 8 subtests:

- [...] RetryMax+1)
- [...] Seek/NopCloser regression class)

Plus moved TestCloudFetchBackoff and TestCloudFetchRetryableStatus into the new internal/retry/retry_test.go package (no behavioral changes, just relocated to live with the shared helpers).

Verified
- [...] main (origin/main SHA a97b104): confirmed reproduction of the bug.
- [...] go test ./... -short.
- go vet ./... clean.
- gofmt -l clean (only pre-existing generated-file diffs in internal/cli_service/).

Related-pattern audit
Per the /fix-github-issue Step 7, searched for other single-shot client.Do(req) sites that might need the same treatment:

- internal/rows/arrowbased/batchloader.go (CloudFetch)
- telemetry/exporter.go
- telemetry/featureflag.go
- auth/tokenprovider/exchange.go

No other staging-like sites need retrofitting in this PR.
Customer-facing answers (from JIRA)
Yes: same retryable statuses, same backoff curve, same config knobs. The two paths now share one implementation in internal/retry.

Yes, via the existing RetryMax / RetryWaitMin / RetryWaitMax config knobs. No new API surface; the staging path opts into the same retry budget the driver already exposes for Thrift and (post-#355) CloudFetch.

This pull request and its description were written by Isaac.