[format] Support writing and reading Arrow schema metadata for file formats by lxy-9602 · Pull Request #8321 · apache/paimon

lxy-9602 · 2026-06-22T10:57:42Z

Purpose

This PR is a sub-PR for shared-shredding.

Shared-shredding needs to attach dictionary and other field-level metadata before closing data files. To support that flow, this PR adds a generic metadata write/read path for file formats, so upper layers can provide raw key-value metadata during writing, and readers can parse the stored metadata back when opening files.

The metadata representation follows Arrow Parquet's existing schema metadata convention: Arrow stores the original serialized schema under the ARROW:schema file metadata key, base64-decodes it on read, and deserializes it with Arrow IPC schema reading. See Apache Arrow's Parquet schema implementation, where kArrowSchemaKey is ARROW:schema and the value is base64-decoded before ReadSchema.
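The read side of that convention can be sketched as follows. This is a minimal illustration, not the PR's code: the class `ArrowSchemaKeySketch` and the method `readSerializedSchema` are hypothetical names, the footer is modeled as a plain `Map<String, String>`, and the standard (RFC 4648) base64 alphabet is assumed. Turning the decoded bytes into an Arrow `Schema` still needs an Arrow IPC schema reader, which is left out here.

```java
import java.util.Base64;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration only; not the PR's FormatMetadataUtils. */
final class ArrowSchemaKeySketch {
    /** The fixed file-metadata key used by Arrow Parquet, as described above. */
    static final String ARROW_SCHEMA_KEY = "ARROW:schema";

    /**
     * Returns the raw serialized-schema bytes stored under ARROW:schema,
     * or an empty Optional when the file carries no such entry.
     */
    static Optional<byte[]> readSerializedSchema(Map<String, String> footerMetadata) {
        String encoded = footerMetadata.get(ARROW_SCHEMA_KEY);
        if (encoded == null) {
            return Optional.empty();
        }
        // The footer stores text values, so the binary payload is base64-encoded.
        // Assumption: standard base64 alphabet with padding.
        return Optional.of(Base64.getDecoder().decode(encoded));
    }
}
```

Returning an empty `Optional` for a missing key keeps files written without Arrow metadata readable; whether the PR treats that case the same way is not stated here.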

Related design:
https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar+Storage+Optimization+for+MAP+Type+in+Paimon

Brief change log

Add SupportsWriterMetadata so format writers can accept raw Map<String, byte[]> metadata before file close.
Add SupportsReaderArrowSchema so format readers can return the stored Arrow schema metadata.
Add FormatMetadataUtils for:
- base64 encoding metadata values before storing them in format footers;
- base64 decoding stored metadata values;
- reading the fixed ARROW:schema key into an Arrow Schema;
- extracting field-level metadata as Map<String, Map<String, String>>.
Support metadata writing for Parquet and ORC writers.
Support Arrow schema metadata reading from Parquet and ORC readers.
Add Parquet/ORC tests covering metadata write/read and field-level Arrow metadata roundtrip.
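The base64 encode/decode step listed for FormatMetadataUtils might look roughly like the sketch below. The class and method names (`MetadataCodecSketch`, `encodeValues`, `decodeValues`) are hypothetical and the real utility may differ in signatures and error handling; the only assumptions taken from the description are the `Map<String, byte[]>` input and base64-encoded string values in the footer.

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a metadata value codec; not the PR's implementation. */
final class MetadataCodecSketch {

    /** Encodes raw byte[] values as base64 strings so they can live in a text footer. */
    static Map<String, String> encodeValues(Map<String, byte[]> raw) {
        Map<String, String> encoded = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : raw.entrySet()) {
            encoded.put(e.getKey(), Base64.getEncoder().encodeToString(e.getValue()));
        }
        return encoded;
    }

    /** Reverses encodeValues: base64-decodes every stored value back to raw bytes. */
    static Map<String, byte[]> decodeValues(Map<String, String> stored) {
        Map<String, byte[]> decoded = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : stored.entrySet()) {
            decoded.put(e.getKey(), Base64.getDecoder().decode(e.getValue()));
        }
        return decoded;
    }
}
```

A `LinkedHashMap` is used only to keep insertion order stable in the sketch; nothing in the description requires ordered metadata.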

Compatibility

The metadata value is stored using Arrow-compatible base64 encoding. For Arrow schema metadata, the key is ARROW:schema, matching Arrow Parquet's convention.

Tests

mvn -pl paimon-format -am -Pfast-build -DfailIfNoTests=false \
  -Dtest=ParquetFormatReadWriteTest#testWriteMetadata,OrcFormatReadWriteTest#testWriteMetadata,FormatMetadataUtilsTest test

JingsongLi · 2026-06-23T03:52:28Z

+    default ParquetWriter<T> createWriter(
+            OutputFile out, String compression, Supplier<Map<String, byte[]>> metadataSupplier)
+            throws IOException {
+        return createWriter(out, compression);


This default makes metadata support silently disappear for any ParquetBuilder implementation that only implements the original two-argument createWriter. ParquetWriterFactory still returns a ParquetBulkWriter that implements SupportsWriterMetadata, so callers can call addMetadata(...) successfully, but finalizeWrite() will never see the map and the footer will not contain the entries. Please either make support explicit (for example, fail from this overload unless the builder wires the supplier into WriteSupport.finalizeWrite) or avoid exposing SupportsWriterMetadata for builders that cannot persist it.

lxy-9602 · 2026-06-23 (edited)

Thanks for pointing this out. I agree that silently falling back to the two-argument createWriter would make metadata support look available while the footer entries are actually dropped, which is misleading.

I updated the default metadata-aware createWriter overload to fail explicitly with UnsupportedOperationException unless a ParquetBuilder implementation wires the metadata supplier into the writer path. I also added a regression test to cover a builder that only implements the original two-argument createWriter, making sure it fails explicitly instead of silently ignoring metadata.
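The fail-fast default described in this reply can be illustrated with a simplified stand-in. Everything here is hypothetical (`SketchWriter`, `SketchBuilder`, `LegacyBuilder`, and the `String` output path replace Paimon's actual `ParquetWriter`, `ParquetBuilder`, and `OutputFile`); the sketch only shows the pattern of a default overload that throws instead of delegating to the two-argument method.

```java
import java.util.Map;
import java.util.function.Supplier;

/** Simplified stand-in for a writer; not Paimon's ParquetWriter. */
interface SketchWriter {}

/** Simplified stand-in for a builder interface with a metadata-aware overload. */
interface SketchBuilder {
    SketchWriter createWriter(String out, String compression);

    /**
     * Metadata-aware overload. The default fails explicitly, so a builder that
     * cannot persist metadata never appears to accept it.
     */
    default SketchWriter createWriter(
            String out, String compression, Supplier<Map<String, byte[]>> metadataSupplier) {
        throw new UnsupportedOperationException(
                getClass().getName() + " does not support writer metadata");
    }
}

/** A builder that only implements the original two-argument method. */
final class LegacyBuilder implements SketchBuilder {
    @Override
    public SketchWriter createWriter(String out, String compression) {
        return new SketchWriter() {};
    }
}
```

With this shape, calling the three-argument overload on `LegacyBuilder` throws `UnsupportedOperationException`, while the two-argument path keeps working. A builder that does support metadata would override the overload and pass the supplier into its writer's finalization step.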

lxy-9602 added 2 commits June 22, 2026 18:47

[format] Support writing and reading Arrow schema metadata for file formats

85740fb

fix doc and add tests

3a53358

JingsongLi reviewed Jun 23, 2026

fail explicitly for metadataSupplier

5993357

lxy-9602 wants to merge 3 commits into apache:master from lxy-9602:add-format-meta
