[parquet] Add regression test for TIMESTAMP(n<=3) reading MICROS-annotated INT64 files by q8webmaster · Pull Request #8238 · apache/paimon

q8webmaster · 2026-06-15T00:24:24Z

Problem

After PR #8230 lands, TIMESTAMP(n<=3) columns will be written as INT64 with a MICROS Parquet annotation and epoch-microsecond values. If the vectorized reader fails to respect the annotation's time unit when decoding those columns, it returns timestamps ~1000× too large (year ~58xxx) or throws ArithmeticException: Millis overflow.
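The size of the error can be checked with plain `java.time`, without any Paimon or Parquet dependency. This is only an illustrative sketch: the class and method names are hypothetical, and `Instant.ofEpochMilli` stands in for a reader that treats the stored INT64 as epoch-milliseconds.

```java
import java.time.Instant;
import java.time.ZoneOffset;

final class MicrosReadAsMillisDemo {

    /** Interprets a raw INT64 as epoch-milliseconds and returns the resulting UTC year. */
    static int yearIfReadAsMillis(long raw) {
        return Instant.ofEpochMilli(raw).atZone(ZoneOffset.UTC).getYear();
    }

    public static void main(String[] args) {
        long epochMillis = Instant.parse("2026-06-15T00:00:00Z").toEpochMilli();
        // What a MICROS-annotated column stores for the same instant.
        long epochMicros = Math.multiplyExact(epochMillis, 1_000L);

        System.out.println("read with correct unit: year " + yearIfReadAsMillis(epochMillis));
        System.out.println("micros read as millis:  year " + yearIfReadAsMillis(epochMicros));
    }
}
```

The second line prints a year tens of thousands of years in the future, which matches the symptom described above.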

This scenario had no test coverage.

Root cause

The fix is already present on master: LongTimestampUpdater.timestampUnit() (introduced in #7845 for NANOS support) reads the actual Parquet annotation and normalises the stored value to epoch-milliseconds before it reaches ParquetTimestampVector.getTimestamp(). Without that normalisation (e.g. paimon 1.4.1, which lacks timestampUnit()), the raw epoch-µs value is passed to Timestamp.fromEpochMillis(), which yields a timestamp 1000× too far from the epoch.
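A minimal sketch of that kind of unit-aware normalisation follows. It is not the LongTimestampUpdater code: the `Unit` enum is a hypothetical stand-in for the Parquet annotation's time unit, and the use of `Math.floorDiv` (so that pre-1970 values round toward negative infinity) is an assumption.

```java
final class TimestampUnitNormalizer {

    /** Hypothetical stand-in for the time unit carried by the Parquet annotation. */
    enum Unit { MILLIS, MICROS, NANOS }

    /** Converts a stored INT64 to epoch-milliseconds according to its annotated unit. */
    static long toEpochMillis(long stored, Unit unit) {
        switch (unit) {
            case MILLIS:
                return stored;
            case MICROS:
                return Math.floorDiv(stored, 1_000L);
            case NANOS:
                return Math.floorDiv(stored, 1_000_000L);
            default:
                throw new IllegalArgumentException("Unsupported unit: " + unit);
        }
    }

    public static void main(String[] args) {
        // 2024-01-01T00:00:00.123Z stored as epoch-microseconds.
        System.out.println(toEpochMillis(1_704_067_200_123_000L, Unit.MICROS));
    }
}
```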

Fix

No production code change: the fix is already in LongTimestampUpdater. This PR adds a test that would catch any future regression in this code path.

The test mirrors testReadTimestampNanosWrittenByParquet: it writes a Parquet file externally with INT64 MICROS annotation for a TIMESTAMP(3) column, reads it back via ParquetReaderFactory with a TimestampType(3) row type, and asserts the decoded values match Timestamp.fromMicros().

Prior art

PR #8230 introduced the MICROS writer path this test covers. PR #7845 introduced timestampUnit(), which is the reader fix this test guards.

Changes

ParquetReadWriteTest.java: testReadTimestampMicrosWrittenByParquetForLowPrecision — reads externally-written INT64 MICROS Parquet with a TIMESTAMP(3) schema and verifies correct decoding

…v2 compatibility)

Paimon emits TIMESTAMP(MILLIS) for precision <= 3 columns. The Iceberg v2 spec requires INT64 MICROS for timestamp/timestamptz; MILLIS is only valid under Iceberg v3. This causes Iceberg-aware engines (Athena, Trino, Spark) to reject Parquet files with a schema compatibility error.

- ParquetSchemaConverter.createTimestampWithLogicalType: emit MICROS for precision <= 3 instead of MILLIS.
- ParquetRowDataWriter.TimestampMillsWriter.writeTimestamp: call value.toMicros() so the stored INT64 matches the MICROS annotation unit.

The reader path (MILLIS -> precision=3, MICROS -> precision=6) is left unchanged so files written by older versions remain readable. Existing tables with precision<=3 columns should be rebuilt after upgrading.

Tests: testLowPrecisionTimestampUseMicrosAnnotation verifies MICROS annotation for precision 0-3; testPaimonParquetSchemaConvert updated for the widened round-trip precision.
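The writer-side conversion can be sketched as follows. This is an illustration, not Paimon's Timestamp.toMicros(): it assumes a timestamp is held as epoch-milliseconds plus a nanosecond-of-millisecond remainder, and the class and method names are hypothetical.

```java
final class LowPrecisionTimestampWriterSketch {

    /**
     * Converts (epochMillis, nanoOfMillisecond) to epoch-microseconds, the unit
     * that a MICROS-annotated INT64 column must contain.
     */
    static long toMicros(long epochMillis, int nanoOfMillisecond) {
        long wholeMicros = Math.multiplyExact(epochMillis, 1_000L);
        return Math.addExact(wholeMicros, nanoOfMillisecond / 1_000);
    }

    public static void main(String[] args) {
        // A TIMESTAMP(3) value has no sub-millisecond part, so the remainder is 0.
        System.out.println(toMicros(1_704_067_200_123L, 0));
    }
}
```

For a TIMESTAMP(3) value the stored number is simply the millisecond value multiplied by 1000, which is why an unaware reader is off by exactly that factor.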

…of micros

ParquetSimpleStatsExtractor.toTimestampStats called fromEpochMillis for precision <= 3, but footer statistics for those columns now contain INT64 microseconds (matching the MICROS annotation). Switch to fromMicros so that Parquet column bounds are decoded correctly.

VectorizedColumnReader has a lazy dictionary fast path for INT64/LongColumnVector: the raw Parquet dictionary is stored on the vector directly, bypassing LongTimestampUpdater.longTimestamp(), which normalises on-disk microseconds to the milliseconds that ParquetTimestampVector.getTimestamp expects. The result is timestamps ~1000x too far in the future for any dictionary-encoded page (triggered when rowGroupSize is large enough to activate dictionary encoding).

Exclude precision <= 3 timestamp types from lazy decoding via a new isLowPrecisionTimestamp helper so the eager path (decodeDictionaryIds) is always taken, applying the correct /1000 normalisation.
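The difference between the two paths can be modelled with plain arrays. This is a simplified sketch, not the VectorizedColumnReader code: the dictionary is a `long[]`, the page is an `int[]` of dictionary ids, and all names here are hypothetical.

```java
final class DictionaryDecodeSketch {

    /** Eager path: every id is resolved and normalised from micros to millis. */
    static long[] decodeEager(long[] dictionaryMicros, int[] ids) {
        long[] millis = new long[ids.length];
        for (int i = 0; i < ids.length; i++) {
            millis[i] = Math.floorDiv(dictionaryMicros[ids[i]], 1_000L);
        }
        return millis;
    }

    /** Lazy path: the raw dictionary entry is returned without normalisation. */
    static long readLazy(long[] dictionaryMicros, int[] ids, int row) {
        return dictionaryMicros[ids[row]];
    }

    public static void main(String[] args) {
        long[] dictionary = {1_704_067_200_123_000L};
        int[] ids = {0, 0};
        System.out.println("eager: " + decodeEager(dictionary, ids)[1]);
        System.out.println("lazy:  " + readLazy(dictionary, ids, 1));
    }
}
```

A consumer that expects milliseconds sees the correct value only from the eager path, which is why the commit forces that path for low-precision timestamps.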

…vs epoch_µs

After the MICROS annotation change, ParquetRowDataWriter stores TIMESTAMP(n<=3) values as epoch microseconds. ParquetFilters.convertLiteral was still using getMillisecond() (epoch_ms) for those columns, so the Parquet row-group statistics comparison always failed against the new epoch_µs statistics, causing WHERE predicates on low-precision timestamp columns to filter out all row groups and return empty results.

Fix: use toMicros() for all INT64 timestamp precisions (0-6) in ParquetFilters.convertLiteral, matching the storage unit written by the writer. Update ParquetFiltersTest assertions accordingly.
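The pruning failure can be reproduced with a toy min/max check. This sketch assumes the usual statistics rule for an equality predicate (a row group can be skipped when the literal lies outside its [min, max] range); the class and method names are hypothetical and no Parquet API is used.

```java
final class RowGroupPruningSketch {

    /** An equality predicate can skip a row group whose [min, max] excludes the literal. */
    static boolean canDropForEq(long literal, long min, long max) {
        return literal < min || literal > max;
    }

    public static void main(String[] args) {
        // Row-group statistics in epoch-microseconds: 2024-01-01 to 2024-01-02 (UTC).
        long minMicros = 1_704_067_200_000_000L;
        long maxMicros = 1_704_153_600_000_000L;

        long literalMillis = 1_704_067_200_000L;      // old convertLiteral unit
        long literalMicros = literalMillis * 1_000L;  // unit that matches the stats

        System.out.println("millis literal dropped: " + canDropForEq(literalMillis, minMicros, maxMicros));
        System.out.println("micros literal dropped: " + canDropForEq(literalMicros, minMicros, maxMicros));
    }
}
```

With the millisecond literal every row group looks out of range, so the query returns no rows even though matching data exists.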

JingsongLi · 2026-06-23T06:18:00Z

            return new SimpleColStats(
-                    Timestamp.fromEpochMillis(longStats.getMin()),
-                    Timestamp.fromEpochMillis(longStats.getMax()),
+                    Timestamp.fromMicros(longStats.getMin()),


After this change the extractor decodes TIMESTAMP(0..3) footer stats as micros solely from the Paimon field precision. Existing Paimon Parquet files written before this PR use TIMESTAMP_MILLIS and store min/max in epoch milliseconds, so extracting stats for those files (for example during migrate/clone or any metadata regeneration) would turn 2024-01-01T00:00:00.123 into a 1970 timestamp and write incorrect file stats. Can we derive the unit from stats.type().getLogicalTypeAnnotation() / the column metadata, like the reader does, and keep MILLIS for legacy files?
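One possible shape for unit-aware stats decoding is sketched below, using `java.time.Instant` in place of Paimon's Timestamp. The `Unit` enum is a hypothetical stand-in for whatever the column's logical type annotation reports; how that annotation is actually obtained is not shown.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class StatsUnitSketch {

    /** Hypothetical stand-in for the unit read from the file's column metadata. */
    enum Unit { MILLIS, MICROS }

    /** Decodes a footer min/max value according to the unit the file declares. */
    static Instant decodeStat(long raw, Unit unit) {
        return unit == Unit.MILLIS
                ? Instant.ofEpochMilli(raw)
                : Instant.EPOCH.plus(raw, ChronoUnit.MICROS);
    }

    public static void main(String[] args) {
        long legacyMin = 1_704_067_200_123L; // epoch-millis from a TIMESTAMP_MILLIS file
        System.out.println("by annotation:    " + decodeStat(legacyMin, Unit.MILLIS));
        System.out.println("forced to micros: " + decodeStat(legacyMin, Unit.MICROS));
    }
}
```

Decoding the legacy value as microseconds lands in January 1970, which is the incorrect file stat described in the comment above.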

JingsongLi · 2026-06-23T06:18:18Z

-            } else if (precision <= 6) {
-                // microseconds
+            if (precision <= 6) {
                return timestamp.toMicros();


This predicate literal is now always epoch micros for TIMESTAMP(0..6), but precision 0..3 files written by existing Paimon versions store epoch milliseconds. The filter is created in ParquetFileFormat before ParquetReaderFactory opens each file, so it cannot see whether a particular file schema is TIMESTAMP_MILLIS or TIMESTAMP_MICROS. For old files, ts = 2024-01-01 becomes 1704067200000000 while the row-group stats/data are around 1704067200000, and Parquet can incorrectly drop matching row groups. We should either build the timestamp filter after reading the file schema or disable/avoid this pushdown for legacy low-precision timestamp files.
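Both options could be combined in a per-file helper along these lines. This is a hedged sketch only: the `Unit` enum, the `UNKNOWN` case, and the method name are hypothetical, and an empty result is taken to mean "do not push this predicate down for this file".

```java
import java.util.OptionalLong;

final class PerFileTimestampLiteral {

    /** Hypothetical unit resolved from a single file's schema. */
    enum Unit { MILLIS, MICROS, UNKNOWN }

    /** Returns the literal in the file's storage unit, or empty to skip pushdown. */
    static OptionalLong literalFor(long epochMillis, Unit fileUnit) {
        switch (fileUnit) {
            case MILLIS:
                return OptionalLong.of(epochMillis);
            case MICROS:
                return OptionalLong.of(Math.multiplyExact(epochMillis, 1_000L));
            default:
                return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        long ts = 1_704_067_200_000L; // 2024-01-01T00:00:00Z
        System.out.println("legacy file: " + literalFor(ts, Unit.MILLIS));
        System.out.println("new file:    " + literalFor(ts, Unit.MICROS));
        System.out.println("unknown:     " + literalFor(ts, Unit.UNKNOWN));
    }
}
```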

Q8Webmaster added 6 commits June 14, 2026 02:27

[parquet] remove over-long comment that broke Spotless line-length check

5bc2d62

[parquet] Fix TIMESTAMP(n<=3) reader decoding epoch-microseconds as epoch-milliseconds

48475f0

q8webmaster marked this pull request as draft June 15, 2026 00:27

Q8Webmaster added 2 commits June 15, 2026 02:29

[parquet] fix spotless: flatten single-method assertThat chains

1c17c40

[parquet] revert: ParquetTimestampVector fix is wrong for master

56cf03e

q8webmaster closed this Jun 15, 2026

q8webmaster reopened this Jun 15, 2026

[parquet] Add regression test for TIMESTAMP(n<=3) reading MICROS-annotated Parquet files

698b3a2

q8webmaster changed the title ~~[parquet] Fix TIMESTAMP(n<=3) reader decoding epoch-microseconds as epoch-milliseconds~~ [parquet] Add regression test for TIMESTAMP(n<=3) reading MICROS-annotated INT64 files Jun 15, 2026

JingsongLi reviewed Jun 23, 2026

q8webmaster wants to merge 9 commits into apache:master from q8webmaster:fix/parquet-timestamp-vector-micros-reader

2 participants
