feat: 100% Spark-compatible JSON support via codegen dispatcher#4305
Draft
andygrove wants to merge 8 commits into
Draft
feat: 100% Spark-compatible JSON support via codegen dispatcher#4305andygrove wants to merge 8 commits into
andygrove wants to merge 8 commits into
Conversation
Add `spark.comet.exec.json.engine` (default `rust`, experimental `java`) that routes the JSON expressions in scope through the JVM UDF framework introduced in apache#4232, delegating to Spark's own expression classes for byte-exact compatibility at the cost of JNI roundtrips per batch. Expressions in scope when `engine=java`: - `get_json_object` -> `GetJsonObjectUDF` - `from_json` -> `FromJsonUDF` - `to_json` -> `ToJsonUDF` A fresh Spark expression is built per `evaluate` call. Spark's JSON evaluators (`GetJsonObjectEvaluator`, `StructsToJsonEvaluator`, `JsonToStructsEvaluator`) hold mutable per-row state, and the JVM UDF framework shares one UDF instance across native worker threads, so a cached cross-thread expression races on its evaluator state. `from_json` / `to_json` use a serde-side `CometLambdaRegistry` to pass the configured Spark expression (schema, options, timezone) to the UDF. The serde rebinds the child to `BoundReference(0)` so the UDF can call `eval(row)` against a single-column wrapper row. `json_array_length` and `json_object_keys` are out of scope: both are `RuntimeReplaceable` in Spark 4.x and Catalyst's `ReplaceExpressions` rule rewrites them to `StaticInvoke` before Comet sees the plan, so `classOf[LengthOfJsonArray]` / `classOf[JsonObjectKeys]` serde registrations never match. Adding support requires recognizing the rewritten `StaticInvoke` form in Comet's serde dispatch. This PR was scaffolded with the project's brainstorming, writing-plans, and subagent-driven-development skills.
This was referenced May 14, 2026
Resolve conflicts: - The common module rename (PR apache#4325) moved the UDF files added by this branch under common/.../udf/ to spark/.../udf/. Git's location-conflict resolver guessed the wrong destination (.../shims/); fix manually by placing FromJsonUDF / GetJsonObjectUDF / ToJsonUDF under spark/src/main/scala/org/apache/comet/udf/. (Follow-up commit retires them in favor of the codegen dispatcher anyway.) - CometConf.scala: keep both COMET_JSON_ENGINE (this PR) and COMET_SCALA_UDF_CODEGEN_ENABLED (main). - serde/structs.scala: combine the JSON engine selector with main's improved native ignoreNullFields / options handling. engine=java keeps routing through convertViaJvmUdf; engine=rust uses main's tightened native path.
…f hand-written UDFs Replace the three hand-written `GetJsonObjectUDF` / `FromJsonUDF` / `ToJsonUDF` JVM UDF implementations and the `CometLambdaRegistry` indirection with the Arrow-direct codegen dispatcher introduced in PR apache#4417 (`CometScalaUDF.emitJvmCodegenDispatch`). The dispatcher Janino-compiles Spark's own `doGenCode` (or `eval(row)` for CodegenFallback expressions) so the JSON family inherits Spark-identical semantics with no per-expression glue. Changes: - Delete the three hand-written UDF files under `spark/src/main/scala/org/apache/comet/udf/` and their unit-test suites. The codegen dispatcher's per-task `kernelCache` provides the same per-thread isolation that `CometLambdaRegistry` was working around. - Rewrite the JSON serdes (`CometGetJsonObject` in `strings.scala`, `CometStructsToJson` and `CometJsonToStructs` in `structs.scala`) to go through a new `JsonRoute` helper. `engine=rust` keeps the native path; `engine=java` delegates to `CometScalaUDF.emitJvmCodegenDispatch` when `spark.comet.exec.scalaUDF.codegen.enabled=true`. - Generalize the codegen dispatcher to accept `CodegenFallback` expressions. `CodegenFallback.doGenCode` emits `references[N].eval(row)`, the same shape the `HigherOrderFunction` carve-out already relied on; lifting the rejection lets `JsonToStructs` and `StructsToJson` (which are `CodegenFallback` in Spark 4) ride the same path. - Unwrap `RuntimeReplaceable` expressions inside `CometScalaUDF.emitJvmCodegenDispatch` before binding. Spark 4's `StructsToJson` is `RuntimeReplaceable` and its `doGenCode` throws "Cannot generate code for expression"; calling `.replacement` gives the `Invoke(StructsToJsonEvaluator, ...)` form that does codegen. - Update the JSON compatibility doc and the `CometJsonJvmSuite` config to reference the codegen flag. Test plan: - `CometJsonJvmSuite`: 3/3 pass (get_json_object, from_json, to_json round-trip via the codegen dispatcher). - `CometJsonExpressionSuite`: 8/8 pass on the unchanged native path. - `CometStringExpressionSuite`: 33/33, `CometCodegenSuite`: 60/60, `CometCodegenSourceSuite`: 50/50, `CometSqlFileTestSuite`: 284/284. - `cargo clippy --all-targets --workspace -- -D warnings`: clean.
- Add CometJsonJvmSuite to pr_build_linux.yml so check-missing-suites passes. - Remove unused HigherOrderFunction, LambdaFunction, NamedLambdaVariable imports from CometBatchKernelCodegen.scala (referenced only in comments). - Remove unused serializeDataType import from strings.scala.
Make `java` (the codegen dispatcher) the default JSON engine and route the rust engine's unsupported cases to it instead of falling back to Spark. - Flip `spark.comet.exec.json.engine` default from `rust` to `java` and drop the "experimental" wording from the config doc and json.md. - Replace the bespoke `JsonRoute` helper with the `CometCodegenDispatch` serde base. `CometGetJsonObject`, `CometStructsToJson`, and `CometJsonToStructs` now extend it and override only to prefer the native rust path when that engine is selected and a native implementation exists. Any case the native path does not cover (to_json with options/array/map, from_json with an unsupported schema) falls through to the codegen dispatcher. - Pin the native-path tests to `engine=rust`, and update the to_json fallback assertions: the codegen dispatcher is enabled by default, so those cases now run in Comet via codegen rather than falling back to Spark.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Closes #.
Rationale for this change
The native Rust JSON expressions in Comet have known compatibility gaps and feature restrictions.
from_jsononly supports PERMISSIVE mode with simple schemas,to_jsondoes not handle map or array at the top level, andget_json_objectdiffers from Spark on single-quoted JSON and unescaped control characters.This PR makes the Spark-compatible path the default for the JSON family, and lets the native path opt in for speed where it is correct. Any case the native path does not cover falls through to the compatible path rather than back to Spark. The compatible path routes through the Arrow-direct codegen dispatcher, which runs Spark's own
doGenCodeinside the Comet pipeline for byte-exact results, at the cost of a JNI roundtrip per batch.Configs
spark.comet.exec.json.enginein{java, rust}, defaultjavajava(default): routes the JSON expressions through the codegen dispatcher so Spark's own implementation runs inside the Comet pipeline. This ridesspark.comet.exec.scalaUDF.codegen.enabled(enabled by default). If that dispatcher is disabled, the operator falls back to Spark.rust: native DataFusion implementation. Faster, but has known compatibility gaps. An expression or input case with no native implementation falls back to thejavaengine, not to Spark.What changes are included in this PR?
javathe default JSON engine and drop the "experimental" wording from the config doc and the compatibility guide.CometGetJsonObject,CometStructsToJson, andCometJsonToStructsextendCometCodegenDispatchand override only to prefer the native rust path when that engine is selected and a native implementation exists for the case. Everything else falls through to the dispatcher. Underrustthis coversto_jsonwith options or array/map types andfrom_jsonwith an unsupported schema.docs/source/user-guide/latest/compatibility/json.mdand wire the page into the compatibility navigation.json_array_lengthandjson_object_keysare intentionally out of scope. Both areRuntimeReplaceablein Spark 4.x and Catalyst'sReplaceExpressionsrewrites them toStaticInvokebefore Comet sees the plan, so theclassOf[LengthOfJsonArray]/classOf[JsonObjectKeys]registrations never match. Adding support requires recognizing the rewrittenStaticInvokeform in Comet's serde dispatch and is left to a follow-up.How are these changes tested?
CometJsonExpressionSuitepinsengine=rustand covers the native path. Theto_jsoncases the native path cannot handle now run through the codegen dispatcher, and the suite asserts they execute in Comet rather than falling back to Spark.CometJsonJvmSuitecoversget_json_object,from_json, and ato_json(from_json(...))round-trip on thejavaengine.CometSqlFileTestSuitecoversget_json_object,from_json, andto_jsonthrough SQL golden files. Theget_json_objectandto_jsonfiles pinengine=rust.CometExpressionSuiteto_jsontests pass on the defaultjavaengine.This PR was scaffolded with the project's brainstorming, writing-plans, and subagent-driven-development skills.