Parse-once for JSON-index ingestion: cache the parsed Map and flatten it directly #18756

Open: xiangfu0 wants to merge 1 commit into apache:master from xiangfu0:st-json-index-cache

xiangfu0 (Contributor) commented Jun 14, 2026

What

A dataType: JSON column with a JSON index currently pays a serialize -> re-parse round-trip per row: DataTypeTransformer serializes the parsed JSON value to the forward-index string, and the mutable JSON index then re-parses that same string before flattening it.

This PR keeps a transient parsed-value cache on the GenericRow for JSON columns that have Pinot's built-in json index enabled, then feeds that cached value directly to MutableJsonIndexImpl:

  • GenericRow stores a transient per-row parsed JSON cache. It is not part of row value/equality/copy/serialized state and is invalidated when the column value changes.
  • DataTypeTransformer populates the cache only for dataType: JSON columns with the built-in JSON index enabled.
  • MutableSegmentImpl passes the cached parsed value to MutableJsonIndexImpl.addParsed(...) when present; otherwise indexing uses the existing string path.
  • JsonUtils.flattenParsed(Object, JsonIndexConfig) flattens a parsed Map / List / JsonNode while preserving the old string-path output.
  • BenchmarkJsonFlatten now runs the same benchmark over multiple payload sizes (330, 2900, 8000 bytes).
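To make the idea of flattening an already-parsed value concrete, here is a minimal sketch that walks a parsed Map / List tree and emits path-to-leaf pairs without ever producing or tokenizing a JSON string. The class name `ParsedFlattenSketch`, the `$.a[0]` path syntax, and the single-map output are assumptions made for illustration; this is not the PR's `JsonUtils.flattenParsed` and it does not reproduce Pinot's flattened-record format or `JsonIndexConfig` handling.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: NOT Pinot's JsonUtils.flattenParsed and not its record format.
final class ParsedFlattenSketch {
  private ParsedFlattenSketch() {
  }

  /** Flattens a parsed JSON value into "path -> leaf as string" pairs, in traversal order. */
  static Map<String, String> flatten(Object parsed) {
    Map<String, String> out = new LinkedHashMap<>();
    walk("$", parsed, out);
    return out;
  }

  private static void walk(String path, Object node, Map<String, String> out) {
    if (node instanceof Map) {
      // Object: descend into each field, extending the path with the key.
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
        walk(path + "." + entry.getKey(), entry.getValue(), out);
      }
    } else if (node instanceof List) {
      // Array: descend into each element, extending the path with the index.
      List<?> list = (List<?>) node;
      for (int i = 0; i < list.size(); i++) {
        walk(path + "[" + i + "]", list.get(i), out);
      }
    } else {
      // Leaf (including null): record its string form. Empty objects/arrays emit nothing.
      out.put(path, String.valueOf(node));
    }
  }

  public static void main(String[] args) {
    Map<String, Object> payload = new LinkedHashMap<>();
    payload.put("user", Map.of("id", 7));
    payload.put("tags", List.of("a", "b"));
    System.out.println(flatten(payload));
  }
}
```

The point of the sketch is only the shape of the traversal: the input is the object graph the ingestion pipeline already holds, so no string is serialized or re-tokenized along the way.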

User Manual / Sample Config

No new table config is introduced. Existing realtime tables with JSON columns and Pinot's built-in JSON index benefit automatically.

{
  "tableName": "logs_REALTIME",
  "tableType": "REALTIME",
  "indexingConfig": {
    "jsonIndexColumns": ["payload"]
  }
}

The optimization is intentionally scoped to the built-in json index implementation. Plugin/custom JSON-like indexes continue to receive the serialized string path.

Behavior-preserving

  • flattenParsed is tested to produce the same flattened records as flatten(String, JsonIndexConfig) on the serialized form, including nested objects/arrays, top-level arrays, JsonNode, BigDecimal, BigInteger, floats, nulls, max levels, and JSON index matching.
  • For leaf values that do not round-trip identically through Jackson's tree model, flattenParsed falls back to serialize+reparse so the JSON index output remains identical to today's behavior.
  • Cached parsed values are invalidated when a row value is overwritten, removed, default-null-filled, bulk-updated, cleared, or sanitized after type transformation.
  • IndexingFailureTest#testJsonIndexUsesParsedCache covers the segment-level path where a cached JsonNode is indexed via MutableSegmentImpl.
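The invalidation rules listed above can be pictured with a small stand-in for a row that keeps a side cache next to its values. Everything in this sketch is hypothetical: the class `RowWithParsedCacheSketch`, its field names, and its accessor names are invented for illustration, since the actual `GenericRow` accessors are not shown in this description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; GenericRow's real fields and accessor names are not shown in this PR text.
final class RowWithParsedCacheSketch {
  private final Map<String, Object> values = new HashMap<>();
  // Transient side cache: excluded from equals/hashCode, so it is not part of row identity.
  private final Map<String, Object> parsedJsonCache = new HashMap<>();

  /** Plain write: any overwrite drops the now-stale parsed form for that column. */
  void putValue(String column, Object value) {
    values.put(column, value);
    parsedJsonCache.remove(column);
  }

  /** Stores the serialized string as the row value and keeps the parsed form beside it. */
  void putJsonValue(String column, String serialized, Object parsed) {
    values.put(column, serialized);
    parsedJsonCache.put(column, parsed);
  }

  Object removeValue(String column) {
    parsedJsonCache.remove(column);
    return values.remove(column);
  }

  /** Row reuse: clearing the row clears the cache too. */
  void clear() {
    values.clear();
    parsedJsonCache.clear();
  }

  Object getValue(String column) {
    return values.get(column);
  }

  /** Returns the cached parsed value, or null when absent or invalidated. */
  Object getParsedJson(String column) {
    return parsedJsonCache.get(column);
  }

  @Override
  public boolean equals(Object other) {
    return other instanceof RowWithParsedCacheSketch
        && values.equals(((RowWithParsedCacheSketch) other).values);
  }

  @Override
  public int hashCode() {
    return values.hashCode();
  }
}
```

Routing every mutation through methods that also touch the cache is what keeps a stale parsed value from reaching the index after the column's string value has changed.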

Performance (BenchmarkJsonFlatten)

Validated with the checked-in JMH benchmark on JDK 21:

-wi 3 -i 5 -f 1 -w 2s -r 2s

The benchmark isolates JSON flatten/index-input cost and separately measures serializeMap, which still remains because the forward index stores the serialized JSON string. The last column combines serializeMap with flattening as a harmonic throughput estimate for the full Map-input path.

| payload | re-parse flatten | parsed flatten | serialize map | flatten-only gain | estimated serialize+flatten gain |
|---|---|---|---|---|---|
| ~330 B | 314 ops/ms | 588 ops/ms | 872 ops/ms | 1.87x | 1.52x |
| ~2.9 KB | 83 ops/ms | 535 ops/ms | 265 ops/ms | 6.44x | 2.80x |
| ~8 KB | 69 ops/ms | 226 ops/ms | 260 ops/ms | 3.26x | 2.21x |

The local machine was noisy, but the payload-size trend is consistent: avoiding the string re-tokenization pays off more for the ~2.9 KB and ~8 KB payloads (6.44x and 3.26x flatten-only) than for the ~330 B payload (1.87x).
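For readers who want to check the last column, the following sketch shows one way to combine two per-row stages into a single throughput: per-row times add, so throughputs combine as 1 / (1/a + 1/b). The class and method names are hypothetical, and the assumption that the benchmark's estimate is computed exactly this way is inferred from the phrase "harmonic throughput estimate", not taken from the benchmark source.

```java
// Hypothetical helper for checking the benchmark table; not part of BenchmarkJsonFlatten.
final class HarmonicThroughputSketch {
  private HarmonicThroughputSketch() {
  }

  /** Combined throughput (ops/ms) of two stages run back to back on every row. */
  static double combined(double stageA, double stageB) {
    return 1.0 / (1.0 / stageA + 1.0 / stageB);
  }

  /** Ratio of (serialize + parsed flatten) throughput to (serialize + re-parse flatten) throughput. */
  static double gain(double serializeMap, double reparseFlatten, double parsedFlatten) {
    return combined(serializeMap, parsedFlatten) / combined(serializeMap, reparseFlatten);
  }

  public static void main(String[] args) {
    // Inputs are the rounded ops/ms figures from the table (serialize map, re-parse flatten, parsed flatten).
    System.out.printf("~330 B:  %.2fx%n", gain(872, 314, 588));
    System.out.printf("~2.9 KB: %.2fx%n", gain(265, 83, 535));
    System.out.printf("~8 KB:   %.2fx%n", gain(260, 69, 226));
  }
}
```

With the rounded table inputs this gives roughly 1.52x, 2.80x, and 2.22x, which agrees with the reported column to within rounding of the inputs.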

API Surface

No MutableJsonIndex SPI change remains after review simplification. The additive public surface is limited to:

  • GenericRow parsed JSON cache accessors.
  • JsonUtils.flattenParsed(Object, JsonIndexConfig).
  • MutableJsonIndexImpl.addParsed(Object) as an implementation method used by MutableSegmentImpl.
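As a rough picture of how the caller chooses between the two entry points, the sketch below uses a stand-in index with a string method and a parsed method and dispatches on whether a cached parsed value is present. `JsonIndexDispatchSketch`, `SketchIndex`, and `index(...)` are hypothetical names; the real call site in `MutableSegmentImpl` is not shown in this description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the caller-side dispatch; not the MutableSegmentImpl code.
final class JsonIndexDispatchSketch {
  private JsonIndexDispatchSketch() {
  }

  /** Stand-in for an index that offers both a string path and a parsed path. */
  static final class SketchIndex {
    final List<String> calls = new ArrayList<>();

    void add(String json) {
      calls.add("string");
    }

    void addParsed(Object parsed) {
      calls.add("parsed");
    }
  }

  /** Uses the cached parsed value when present; otherwise falls back to the serialized string. */
  static void index(SketchIndex index, String serialized, Object cachedParsed) {
    if (cachedParsed != null) {
      index.addParsed(cachedParsed);
    } else {
      index.add(serialized);
    }
  }
}
```

Keeping the string path as the default means an index that never sees a cached value behaves exactly as before.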

Validation

  • GITHUB_ACTIONS=true ./mvnw -pl pinot-spi,pinot-segment-local -am -Dtest=JsonUtilsTest,GenericRowTest,DataTypeTransformerTest,JsonIndexTest,IndexingFailureTest -Dsurefire.failIfNoSpecifiedTests=false test
  • GITHUB_ACTIONS=true ./mvnw -pl pinot-segment-local -am -Dtest=IndexingFailureTest -Dsurefire.failIfNoSpecifiedTests=false test
  • GITHUB_ACTIONS=true ./mvnw spotless:apply -pl pinot-segment-local,pinot-spi,pinot-perf
  • GITHUB_ACTIONS=true ./mvnw checkstyle:check -pl pinot-segment-local,pinot-spi,pinot-perf
  • GITHUB_ACTIONS=true ./mvnw license:format -pl pinot-segment-local,pinot-spi,pinot-perf
  • GITHUB_ACTIONS=true ./mvnw license:check -pl pinot-segment-local,pinot-spi,pinot-perf

Copilot AI (Contributor) left a comment

Pull request overview

This PR optimizes realtime ingestion for JSON columns with JSON-family indexes by avoiding a per-row JSON serialize → re-parse round-trip, reusing the already-parsed representation and flattening it directly for indexing.

Changes:

  • Add JsonUtils.flattenParsed(...) and a native-type check to flatten parsed JSON values (Map/List/JsonNode) without string tokenization, with a fallback to the existing string path for non-native leaf types.
  • Introduce a transient per-row parsed JSON cache on GenericRow, populate it in DataTypeTransformer for JSON-indexed JSON columns, and feed it to JSON mutable indexes when supportsParsedValue() is enabled.
  • Extend MutableJsonIndex with additive SPI defaults (addParsed, supportsParsedValue) and implement the parsed path in MutableJsonIndexImpl, with accompanying unit tests and a JMH benchmark.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
|---|---|
| pinot-spi/src/main/java/org/apache/pinot/spi/utils/JsonUtils.java | Adds flattenParsed and native-type detection to avoid re-tokenizing JSON strings during indexing. |
| pinot-spi/src/test/java/org/apache/pinot/spi/utils/JsonUtilsTest.java | Adds coverage ensuring flattenParsed matches the legacy serialize+reparse behavior. |
| pinot-spi/src/main/java/org/apache/pinot/spi/data/readers/GenericRow.java | Adds transient per-row parsed JSON cache accessors and clears it on clear(). |
| pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableJsonIndex.java | Adds additive SPI hooks for parsed JSON ingestion and capability gating. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformer.java | Computes which JSON columns should cache parsed values and populates the cache during transform. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java | Feeds cached parsed JSON to mutable JSON indexes when supported. |
| pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/json/MutableJsonIndexImpl.java | Implements addParsed + supportsParsedValue using flattenParsed. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/JsonIndexTest.java | Adds test asserting parsed-vs-string indexing produces identical matches. |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/recordtransformer/DataTypeTransformerTest.java | Adds test verifying caching is enabled only when JSON index is configured. |
| pinot-perf/src/main/java/org/apache/pinot/perf/BenchmarkJsonFlatten.java | Adds a JMH benchmark comparing flatten-from-string vs flatten-from-parsed-map costs. |

Comment on lines +119 to +120

    if (!_jsonCacheColumns.isEmpty() && _jsonCacheColumns.contains(column)
        && (value instanceof Map || value instanceof List)) {

Comment on lines 33 to 42

      default void add(Object value, int dictId, int docId) {
        try {
    -     if (value instanceof Map) {
    -       add(JsonUtils.objectToString(value));
    +     if (value instanceof Map || value instanceof List) {
    +       // Already-parsed JSON value (e.g. a Map cached on the GenericRow before it was serialized for the forward
    +       // index): flatten it directly, avoiding the serialize-then-reparse round-trip.
    +       addParsed(value);
          } else {
    +       // String (the common case) or, for any other unexpected type, fail fast with a ClassCastException as before.
            add((String) value);
          }


codecov-commenter commented Jun 14, 2026

Codecov Report

❌ Patch coverage is 75.92593% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.02%. Comparing base (cd6dd61) to head (ea3e9ec).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...t/local/recordtransformer/DataTypeTransformer.java | 79.48% | 5 Missing and 3 partials ⚠️ |
| ...ain/java/org/apache/pinot/spi/utils/JsonUtils.java | 81.81% | 6 Missing and 2 partials ⚠️ |
| ...local/realtime/impl/json/MutableJsonIndexImpl.java | 0.00% | 6 Missing ⚠️ |
| ...local/indexsegment/mutable/MutableSegmentImpl.java | 25.00% | 1 Missing and 2 partials ⚠️ |
| .../org/apache/pinot/spi/data/readers/GenericRow.java | 93.33% | 0 Missing and 1 partial ⚠️ |

❗ There is a different number of reports uploaded between BASE (cd6dd61) and HEAD (ea3e9ec).

HEAD has 4 fewer uploads than BASE.

| Flag | BASE (cd6dd61) | HEAD (ea3e9ec) |
|---|---|---|
| java-21 | 5 | 4 |
| unittests | 2 | 1 |
| temurin | 5 | 4 |
| unittests2 | 1 | 0 |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18756      +/-   ##
============================================
- Coverage     64.80%   57.02%   -7.78%     
+ Complexity     1322        7    -1315     
============================================
  Files          3393     2607     -786     
  Lines        211332   152809   -58523     
  Branches      33234    24849    -8385     
============================================
- Hits         136952    87143   -49809     
+ Misses        63329    58251    -5078     
+ Partials      11051     7415    -3636     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-21 | 57.02% <75.92%> (-7.78%) ⬇️ |
| temurin | 57.02% <75.92%> (-7.78%) ⬇️ |
| unittests | 57.02% <75.92%> (-7.78%) ⬇️ |
| unittests1 | 57.02% <75.92%> (+0.02%) ⬆️ |
| unittests2 | ? |

Flags with carried forward coverage won't be shown.

xiangfu0 force-pushed the st-json-index-cache branch 3 times, most recently from 577accd to 8fcae8f on June 15, 2026 09:49
xiangfu0 force-pushed the st-json-index-cache branch 9 times, most recently from 2837e35 to 0f23508 on June 27, 2026 12:03

Avoid the realtime JSON-index serialize -> re-parse round-trip by caching parsed JSON values on GenericRow for JSON columns with Pinot's built-in json index enabled.

DataTypeTransformer records the parsed Map/List/JsonNode while still writing the canonical string used by the forward index. MutableSegmentImpl feeds that cached value directly to MutableJsonIndexImpl.addParsed(), which flattens via JsonUtils.flattenParsed(Object, JsonIndexConfig). The generic MutableJsonIndex SPI is unchanged.

flattenParsed preserves the existing string-path output: DecimalNode leaves are normalized the same way the serialized path re-parses them, and unsafe leaf types fall back to serialize+reparse. GenericRow invalidates cached parsed values when the field value changes, including sanitization and row reuse paths.
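The "unsafe leaf types fall back to serialize+reparse" rule can be illustrated with a recursive type check that decides which path a value takes. This is a sketch under stated assumptions: the class `NativeLeafCheckSketch`, the specific set of leaf types treated as safe, and the whole-value (rather than per-leaf) decision are all chosen for illustration, because the description does not list which types the PR's native-type check accepts.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the set of "safe" leaf types here is an assumption, not the PR's list.
final class NativeLeafCheckSketch {
  private NativeLeafCheckSketch() {
  }

  /** True when every key is a String and every leaf is a type this sketch treats as safe. */
  static boolean isSafeToFlattenDirectly(Object node) {
    if (node == null || node instanceof String || node instanceof Boolean
        || node instanceof Integer || node instanceof Long || node instanceof Double) {
      return true;
    }
    if (node instanceof Map) {
      for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
        if (!(entry.getKey() instanceof String) || !isSafeToFlattenDirectly(entry.getValue())) {
          return false;
        }
      }
      return true;
    }
    if (node instanceof List) {
      for (Object element : (List<?>) node) {
        if (!isSafeToFlattenDirectly(element)) {
          return false;
        }
      }
      return true;
    }
    // Anything else (dates, custom objects, ...) might serialize differently than it flattens.
    return false;
  }

  /** Names the path a value would take in this sketch. */
  static String choosePath(Object parsed) {
    return isSafeToFlattenDirectly(parsed) ? "direct" : "serialize+reparse";
  }
}
```

For example, `choosePath(Map.of("a", List.of(1, "x")))` takes the direct path, while a map containing a `java.util.Date` leaf is routed through the string path so that the indexed output matches what the serialized form would have produced.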

BenchmarkJsonFlatten now covers 330 B, 2.9 KB, and 8 KB payloads. The latest JDK 21 JMH run shows flatten-only gains of 1.87x / 6.44x / 3.26x, and estimated serialize+flatten gains of 1.52x / 2.80x / 2.21x.
xiangfu0 force-pushed the st-json-index-cache branch from 0f23508 to ea3e9ec on June 29, 2026 12:06