[format] Support writing and reading Arrow schema metadata for file formats#8321
lxy-9602 wants to merge 3 commits into
default ParquetWriter<T> createWriter(
        OutputFile out, String compression, Supplier<Map<String, byte[]>> metadataSupplier)
        throws IOException {
    return createWriter(out, compression);
This default makes metadata support silently disappear for any ParquetBuilder implementation that only implements the original two-argument createWriter. ParquetWriterFactory still returns a ParquetBulkWriter that implements SupportsWriterMetadata, so callers can call addMetadata(...) successfully, but finalizeWrite() will never see the map and the footer will not contain the entries. Please either make support explicit (for example, fail from this overload unless the builder wires the supplier into WriteSupport.finalizeWrite) or avoid exposing SupportsWriterMetadata for builders that cannot persist it.
Thanks for pointing this out. I agree that silently falling back to the two-argument createWriter would make the metadata support look available while the footer entries are actually dropped, which is quite misleading.
I updated the default metadata-aware createWriter overload to fail explicitly with UnsupportedOperationException unless a ParquetBuilder implementation wires the metadata supplier into the writer path. I also added a regression test to cover a builder that only implements the original two-argument createWriter, making sure it fails explicitly instead of silently ignoring metadata.
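The fail-fast default described above can be sketched as follows. This is a minimal illustration with simplified stand-in types, not the PR's actual code: `SketchWriter`, `SketchBuilder`, `LegacyBuilder`, and `MetadataAwareBuilder` are hypothetical names, and only the shape of the two createWriter overloads and the UnsupportedOperationException are taken from the discussion.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical stand-in for a writer; records the footer entries it would persist. */
final class SketchWriter {
    final Map<String, byte[]> footer = new HashMap<>();
    private final Supplier<Map<String, byte[]>> metadataSupplier;

    SketchWriter(Supplier<Map<String, byte[]>> metadataSupplier) {
        this.metadataSupplier = metadataSupplier;
    }

    /** Pulls the metadata lazily at close time, similar in spirit to a finalizeWrite hook. */
    void close() {
        if (metadataSupplier != null) {
            footer.putAll(metadataSupplier.get());
        }
    }
}

/** Hypothetical stand-in for the builder interface under review. */
interface SketchBuilder {
    SketchWriter createWriter(String out, String compression);

    /** Metadata-aware overload: fails loudly unless an implementation overrides it. */
    default SketchWriter createWriter(
            String out, String compression, Supplier<Map<String, byte[]>> metadataSupplier) {
        throw new UnsupportedOperationException(
                getClass().getName() + " does not support writer metadata");
    }
}

/** Implements only the original two-argument method, so metadata requests are rejected. */
final class LegacyBuilder implements SketchBuilder {
    @Override
    public SketchWriter createWriter(String out, String compression) {
        return new SketchWriter(null);
    }
}

/** Wires the supplier into the writer, so entries reach the footer on close. */
final class MetadataAwareBuilder implements SketchBuilder {
    @Override
    public SketchWriter createWriter(String out, String compression) {
        return new SketchWriter(null);
    }

    @Override
    public SketchWriter createWriter(
            String out, String compression, Supplier<Map<String, byte[]>> metadataSupplier) {
        return new SketchWriter(metadataSupplier);
    }
}
```

With this shape, a caller that passes a metadata supplier to `LegacyBuilder` gets an exception at writer creation instead of a file whose footer silently lacks the entries.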
Purpose
This PR is a sub-PR for shared-shredding.
Shared-shredding needs to attach dictionary and other field-level metadata before closing data files. To support that flow, this PR adds a generic metadata write/read path for file formats, so upper layers can provide raw key-value metadata during writing, and readers can parse the stored metadata back when opening files.
The metadata representation follows Arrow Parquet's existing schema metadata convention: Arrow stores the original serialized schema under the ARROW:schema file metadata key, base64-decodes it on read, and deserializes it with Arrow IPC schema reading. See Apache Arrow's Parquet schema implementation, where kArrowSchemaKey is ARROW:schema and the value is base64-decoded before ReadSchema.
Related design:
https://cwiki.apache.org/confluence/display/PAIMON/PIP-43%3A+Columnar+Storage+Optimization+for+MAP+Type+in+Paimon
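The base64 storage convention described in this section can be illustrated with a small stdlib-only sketch. `FooterMetadataSketch` and its methods are hypothetical and are not the PR's FormatMetadataUtils; the sketch assumes a string-valued footer map and stops at the decoded bytes, which a real reader would then hand to Arrow IPC schema deserialization.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper showing the base64 round trip; not the PR's implementation. */
final class FooterMetadataSketch {
    static final String ARROW_SCHEMA_KEY = "ARROW:schema";

    /** Encodes raw metadata values as base64 strings for a string-valued footer. */
    static Map<String, String> encode(Map<String, byte[]> raw) {
        Map<String, String> footer = new HashMap<>();
        for (Map.Entry<String, byte[]> e : raw.entrySet()) {
            footer.put(e.getKey(), Base64.getEncoder().encodeToString(e.getValue()));
        }
        return footer;
    }

    /** Returns the decoded bytes stored under ARROW:schema, or null if the key is absent. */
    static byte[] readArrowSchemaBytes(Map<String, String> footer) {
        String encoded = footer.get(ARROW_SCHEMA_KEY);
        return encoded == null ? null : Base64.getDecoder().decode(encoded);
    }
}
```

In this sketch the payload is opaque: the writer side only base64-encodes whatever bytes the upper layer supplies, and the reader side only reverses that step.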
Brief change log
- SupportsWriterMetadata so format writers can accept raw Map<String, byte[]> metadata before file close.
- SupportsReaderArrowSchema so format readers can return the stored Arrow schema metadata.
- FormatMetadataUtils for:
  - ARROW:schema key into an Arrow Schema;
  - Map<String, Map<String, String>>.
Compatibility
The metadata value is stored using Arrow-compatible base64 encoding. For Arrow schema metadata, the key is ARROW:schema, matching Arrow Parquet's convention.
Tests
mvn -pl paimon-format -am -Pfast-build -DfailIfNoTests=false \
  -Dtest=ParquetFormatReadWriteTest#testWriteMetadata,OrcFormatReadWriteTest#testWriteMetadata,FormatMetadataUtilsTest test