feat: Add S3 archive FileIO support #8143

Open
Shekharrajak wants to merge 6 commits into apache:master from Shekharrajak:feature/paimon-5510-s3-archive

Conversation

@Shekharrajak

Contributor

Ref #5510 (comment)

Purpose

Implements S3-backed archive, restore, and unarchive operations for Paimon FileIO by mapping StorageType to S3 storage classes and issuing same-key S3 copy/restore requests.

Follow-up PRs will add support for OSS and the other supported object stores. Every storage backend that is not supported will throw an unsupported-operation exception.
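The mapping idea described above can be illustrated with a minimal sketch. The `Tier` enum and its values are hypothetical stand-ins, not the PR's actual `StorageType` values, and the chosen pairing with S3 storage classes is an assumption made only for illustration. The storage class names themselves (`STANDARD`, `GLACIER`, `DEEP_ARCHIVE`) are real S3 values.

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative sketch only; names here are assumptions, not Paimon's API. */
final class StorageClassMappingSketch {

    /** Hypothetical tiers; the PR's real StorageType values may differ. */
    enum Tier { STANDARD, ARCHIVE, COLD_ARCHIVE }

    // Assumed pairing of tiers with S3 storage class names.
    private static final Map<Tier, String> S3_CLASS = new EnumMap<>(Tier.class);

    static {
        S3_CLASS.put(Tier.STANDARD, "STANDARD");
        S3_CLASS.put(Tier.ARCHIVE, "GLACIER");
        S3_CLASS.put(Tier.COLD_ARCHIVE, "DEEP_ARCHIVE");
    }

    /** Returns the S3 storage class for a tier, or fails for unmapped tiers. */
    static String toS3StorageClass(Tier tier) {
        String storageClass = tier == null ? null : S3_CLASS.get(tier);
        if (storageClass == null) {
            throw new UnsupportedOperationException("Unsupported tier: " + tier);
        }
        return storageClass;
    }

    public static void main(String[] args) {
        for (Tier tier : Tier.values()) {
            System.out.println(tier + " -> " + toS3StorageClass(tier));
        }
    }
}
```

Failing loudly on an unmapped tier mirrors the stated plan to throw for anything unsupported rather than silently falling back to a default class.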

Tests

mvn -pl paimon-filesystems/paimon-s3-impl -am -Pfast-build -DfailIfNoTests=false -Dtest=S3ArchiveOperationsTest test

@JingsongLi

Contributor

I found one blocker in the S3 archive implementation: archive/unarchive currently change storage class by issuing a single CopyObject request for the same key. S3 single-copy only supports objects up to 5 GB, while Paimon data files can be larger than that, so this will fail for valid large data files. Please branch on the object size from HeadObjectResponse and use multipart copy / UploadPartCopy for large objects, with a test covering that path.
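A rough sketch of the size branch requested above, using only the JDK so it stays independent of the S3 client. `CopyPlanSketch`, `needsMultipartCopy`, and `planRanges` are hypothetical names, and the 5 GB limit is taken here as 5 GiB, which is an assumption. Each returned string is in the inclusive `bytes=first-last` form that a part-copy source range uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the PR's implementation. */
final class CopyPlanSketch {

    // Assumed threshold: single CopyObject handles up to 5 GB (taken as 5 GiB).
    static final long SINGLE_COPY_LIMIT = 5L * 1024 * 1024 * 1024;

    // S3 allows at most 10,000 parts per multipart upload.
    static final int MAX_PARTS = 10_000;

    /** True when the object is too large for a single copy request. */
    static boolean needsMultipartCopy(long objectSize) {
        return objectSize > SINGLE_COPY_LIMIT;
    }

    /** Splits [0, objectSize) into inclusive "bytes=first-last" ranges. */
    static List<String> planRanges(long objectSize, long partSize) {
        if (objectSize <= 0 || partSize <= 0 || partSize > SINGLE_COPY_LIMIT) {
            throw new IllegalArgumentException("invalid object or part size");
        }
        long parts = (objectSize + partSize - 1) / partSize;
        if (parts > MAX_PARTS) {
            throw new IllegalArgumentException("part size too small: " + parts + " parts");
        }
        List<String> ranges = new ArrayList<>();
        for (long start = 0; start < objectSize; start += partSize) {
            long end = Math.min(start + partSize, objectSize) - 1; // inclusive end
            ranges.add("bytes=" + start + "-" + end);
        }
        return ranges;
    }

    public static void main(String[] args) {
        long size = 12L * 1024 * 1024 * 1024; // a 12 GiB object
        long part = 1024L * 1024 * 1024;      // 1 GiB parts
        System.out.println("multipart: " + needsMultipartCopy(size));
        System.out.println("parts: " + planRanges(size, part).size());
    }
}
```

In a real implementation the object size would come from the `HeadObjectResponse` mentioned above, and each range would feed one `UploadPartCopy` request before the upload is completed.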

static UploadPartCopyRequest uploadPartCopyRequest(
        String bucket, String key, String uploadId, CopyPartRange range, String eTag) {
    UploadPartCopyRequest.Builder builder =
            UploadPartCopyRequest.builder()

Contributor

Large-object archive/unarchive bypasses S3A copy request preparation for each UploadPartCopy request. The single-copy path goes through RequestFactory.newCopyObjectRequestBuilder, which applies configured encryption settings, but this hand-built multipart-copy request never sets copySourceSSECustomerAlgorithm/key/MD5. For buckets configured with SSE-C, Paimon can write/read the object and small archives work, but any object above 5 GB will fail because S3 requires the source SSE-C headers on every UploadPartCopy. Please build these part-copy requests through the same S3A helper/factory path or propagate the configured copy-source encryption headers, and add coverage for that case.
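To make the missing pieces concrete, here is a minimal JDK-only sketch of the three copy-source SSE-C values that would have to accompany every part-copy request. `SseCCopyHeadersSketch` and `copySourceHeaders` are hypothetical names, and the 32-byte AES-256 key check is an assumption of this sketch. The header names are the S3 REST headers that correspond to the `copySourceSSECustomerAlgorithm`/key/MD5 fields mentioned above. This shows the values only, not how S3A or the PR wires them in.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; not the S3A request factory. */
final class SseCCopyHeadersSketch {

    static final String PREFIX = "x-amz-copy-source-server-side-encryption-customer-";

    /** Builds the copy-source SSE-C headers from the raw customer key. */
    static Map<String, String> copySourceHeaders(byte[] rawKey) {
        if (rawKey == null || rawKey.length != 32) {
            throw new IllegalArgumentException("expected a 32-byte AES-256 key");
        }
        Base64.Encoder b64 = Base64.getEncoder();
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put(PREFIX + "algorithm", "AES256");
        headers.put(PREFIX + "key", b64.encodeToString(rawKey));
        // The MD5 header carries the base64-encoded MD5 digest of the raw key.
        headers.put(PREFIX + "key-MD5", b64.encodeToString(md5(rawKey)));
        return headers;
    }

    private static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] demoKey = new byte[32]; // placeholder key, for demonstration only
        // Print header names only, never the key material.
        System.out.println(copySourceHeaders(demoKey).keySet());
    }
}
```

The same three values apply to every part of the copy, so computing them once and attaching them to each part request (or delegating to the shared factory path, as suggested above) keeps the single-copy and multipart paths consistent.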
