feat: 100% Spark-compatible JSON support via codegen dispatcher by andygrove · Pull Request #4305 · apache/datafusion-comet

andygrove · 2026-05-12T17:23:04Z

Which issue does this PR close?

Closes #.

Rationale for this change

The native Rust JSON expressions in Comet have known compatibility gaps and feature restrictions: `from_json` supports only PERMISSIVE mode with simple schemas, `to_json` does not handle a map or array at the top level, and `get_json_object` differs from Spark on single-quoted JSON and unescaped control characters.

This PR makes the Spark-compatible path the default for the JSON family, and lets the native path opt in for speed where it is correct. Any case the native path does not cover falls through to the compatible path rather than back to Spark. The compatible path routes through the Arrow-direct codegen dispatcher, which runs Spark's own `doGenCode` inside the Comet pipeline for byte-exact results, at the cost of a JNI roundtrip per batch.

Configs

`spark.comet.exec.json.engine` in {`java`, `rust`}, default `java`
- `java` (default): routes the JSON expressions through the codegen dispatcher so Spark's own implementation runs inside the Comet pipeline. This path depends on `spark.comet.exec.scalaUDF.codegen.enabled` (enabled by default). If that dispatcher is disabled, the operator falls back to Spark.
- `rust`: native DataFusion implementation. Faster, but has known compatibility gaps. An expression or input case with no native implementation falls back to the `java` engine, not to Spark.
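The routing rules above can be summarized as a small decision function. This is a minimal sketch, not Comet's serde code: `JsonRoutingSketch`, `Route`, and the `nativeSupported` flag are hypothetical names introduced for illustration. It also assumes that a `rust` case with no native implementation ends up in Spark when the codegen dispatcher is disabled, which follows from combining the two bullets but is not stated outright.

```java
/** Hypothetical model of the engine routing described above; not Comet's actual code. */
final class JsonRoutingSketch {

    /** Where a JSON expression ends up running. */
    enum Route { NATIVE_RUST, CODEGEN_DISPATCH, SPARK_FALLBACK }

    /**
     * @param engine          value of spark.comet.exec.json.engine ("java" or "rust")
     * @param codegenEnabled  value of spark.comet.exec.scalaUDF.codegen.enabled
     * @param nativeSupported hypothetical flag: a native implementation exists for this case
     */
    static Route route(String engine, boolean codegenEnabled, boolean nativeSupported) {
        if (!engine.equals("java") && !engine.equals("rust")) {
            throw new IllegalArgumentException("unknown engine: " + engine);
        }
        // The native path is taken only when it is selected and covers the case.
        if (engine.equals("rust") && nativeSupported) {
            return Route.NATIVE_RUST;
        }
        // Everything else goes to the codegen dispatcher; Spark is the last resort
        // (assumed to apply to the rust engine's uncovered cases as well).
        return codegenEnabled ? Route.CODEGEN_DISPATCH : Route.SPARK_FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(route("java", true, true));
        System.out.println(route("rust", true, true));
        System.out.println(route("rust", true, false));
        System.out.println(route("java", false, true));
    }
}
```

Under this model the only way to reach Spark is to disable the codegen dispatcher, which matches the description's claim that uncovered native cases stay inside Comet by default.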

What changes are included in this PR?

- Make `java` the default JSON engine and drop the "experimental" wording from the config doc and the compatibility guide.
- Use the codegen-dispatch serde base for the JSON family. `CometGetJsonObject`, `CometStructsToJson`, and `CometJsonToStructs` extend `CometCodegenDispatch` and override only to prefer the native rust path when that engine is selected and a native implementation exists for the case. Everything else falls through to the dispatcher. Under `rust` this covers `to_json` with options or array/map types and `from_json` with an unsupported schema.
- Document the model in `docs/source/user-guide/latest/compatibility/json.md` and wire the page into the compatibility navigation.

`json_array_length` and `json_object_keys` are intentionally out of scope. Both are `RuntimeReplaceable` in Spark 4.x, and Catalyst's `ReplaceExpressions` rule rewrites them to `StaticInvoke` before Comet sees the plan, so the `classOf[LengthOfJsonArray]` / `classOf[JsonObjectKeys]` registrations never match. Adding support requires recognizing the rewritten `StaticInvoke` form in Comet's serde dispatch and is left to a follow-up.

How are these changes tested?

- `CometJsonExpressionSuite` pins `engine=rust` and covers the native path. The `to_json` cases the native path cannot handle now run through the codegen dispatcher, and the suite asserts they execute in Comet rather than falling back to Spark.
- `CometJsonJvmSuite` covers `get_json_object`, `from_json`, and a `to_json(from_json(...))` round-trip on the `java` engine.
- `CometSqlFileTestSuite` covers `get_json_object`, `from_json`, and `to_json` through SQL golden files. The `get_json_object` and `to_json` files pin `engine=rust`.
- `CometExpressionSuite` `to_json` tests pass on the default `java` engine.

This PR was scaffolded with the project's brainstorming, writing-plans, and subagent-driven-development skills.

Add `spark.comet.exec.json.engine` (default `rust`, experimental `java`) that routes the JSON expressions in scope through the JVM UDF framework introduced in apache#4232, delegating to Spark's own expression classes for byte-exact compatibility at the cost of JNI roundtrips per batch.

Expressions in scope when `engine=java`:
- `get_json_object` -> `GetJsonObjectUDF`
- `from_json` -> `FromJsonUDF`
- `to_json` -> `ToJsonUDF`

A fresh Spark expression is built per `evaluate` call. Spark's JSON evaluators (`GetJsonObjectEvaluator`, `StructsToJsonEvaluator`, `JsonToStructsEvaluator`) hold mutable per-row state, and the JVM UDF framework shares one UDF instance across native worker threads, so a cached cross-thread expression races on its evaluator state.

`from_json` / `to_json` use a serde-side `CometLambdaRegistry` to pass the configured Spark expression (schema, options, timezone) to the UDF. The serde rebinds the child to `BoundReference(0)` so the UDF can call `eval(row)` against a single-column wrapper row.

`json_array_length` and `json_object_keys` are out of scope: both are `RuntimeReplaceable` in Spark 4.x and Catalyst's `ReplaceExpressions` rule rewrites them to `StaticInvoke` before Comet sees the plan, so `classOf[LengthOfJsonArray]` / `classOf[JsonObjectKeys]` serde registrations never match. Adding support requires recognizing the rewritten `StaticInvoke` form in Comet's serde dispatch.

This PR was scaffolded with the project's brainstorming, writing-plans, and subagent-driven-development skills.
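The reason this commit builds a fresh expression per `evaluate` call can be illustrated with a small model. This is a sketch under stated assumptions, not the code from the commit: `StatefulEvaluator` and `SharedUdf` are hypothetical stand-ins for a Spark evaluator with mutable per-row state and for a UDF instance shared across worker threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FreshEvaluatorSketch {

    /** Hypothetical stand-in for an evaluator that keeps mutable per-row scratch state. */
    static final class StatefulEvaluator {
        private final StringBuilder scratch = new StringBuilder(); // not thread-safe

        String eval(String row) {
            scratch.setLength(0);
            scratch.append('<').append(row).append('>');
            return scratch.toString();
        }
    }

    /** Hypothetical UDF: one instance is shared by all worker threads. */
    static final class SharedUdf {
        String evaluate(String row) {
            // Build the evaluator per call so no mutable state crosses threads.
            return new StatefulEvaluator().eval(row);
        }
    }

    /** Calls evaluate() from several threads and reports whether every result was correct. */
    static boolean allCorrect(int threads, int callsPerThread) {
        SharedUdf udf = new SharedUdf();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int id = t;
                results.add(pool.submit(() -> {
                    for (int i = 0; i < callsPerThread; i++) {
                        String row = id + ":" + i;
                        if (!udf.evaluate(row).equals("<" + row + ">")) {
                            return false;
                        }
                    }
                    return true;
                }));
            }
            boolean ok = true;
            for (Future<Boolean> f : results) {
                ok &= f.get();
            }
            return ok;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(allCorrect(4, 1000));
    }
}
```

Caching a single `StatefulEvaluator` in a field of `SharedUdf` would let concurrent calls interleave writes to `scratch`, which is the kind of race the commit message describes. Per-call construction trades some allocation for isolation.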

Resolve conflicts:
- The common module rename (PR apache#4325) moved the UDF files added by this branch under common/.../udf/ to spark/.../udf/. Git's location-conflict resolver guessed the wrong destination (.../shims/); fix manually by placing FromJsonUDF / GetJsonObjectUDF / ToJsonUDF under spark/src/main/scala/org/apache/comet/udf/. (Follow-up commit retires them in favor of the codegen dispatcher anyway.)
- CometConf.scala: keep both COMET_JSON_ENGINE (this PR) and COMET_SCALA_UDF_CODEGEN_ENABLED (main).
- serde/structs.scala: combine the JSON engine selector with main's improved native ignoreNullFields / options handling. engine=java keeps routing through convertViaJvmUdf; engine=rust uses main's tightened native path.

…f hand-written UDFs

Replace the three hand-written `GetJsonObjectUDF` / `FromJsonUDF` / `ToJsonUDF` JVM UDF implementations and the `CometLambdaRegistry` indirection with the Arrow-direct codegen dispatcher introduced in PR apache#4417 (`CometScalaUDF.emitJvmCodegenDispatch`). The dispatcher Janino-compiles Spark's own `doGenCode` (or `eval(row)` for CodegenFallback expressions) so the JSON family inherits Spark-identical semantics with no per-expression glue.

Changes:
- Delete the three hand-written UDF files under `spark/src/main/scala/org/apache/comet/udf/` and their unit-test suites. The codegen dispatcher's per-task `kernelCache` provides the same per-thread isolation that `CometLambdaRegistry` was working around.
- Rewrite the JSON serdes (`CometGetJsonObject` in `strings.scala`, `CometStructsToJson` and `CometJsonToStructs` in `structs.scala`) to go through a new `JsonRoute` helper. `engine=rust` keeps the native path; `engine=java` delegates to `CometScalaUDF.emitJvmCodegenDispatch` when `spark.comet.exec.scalaUDF.codegen.enabled=true`.
- Generalize the codegen dispatcher to accept `CodegenFallback` expressions. `CodegenFallback.doGenCode` emits `references[N].eval(row)`, the same shape the `HigherOrderFunction` carve-out already relied on; lifting the rejection lets `JsonToStructs` and `StructsToJson` (which are `CodegenFallback` in Spark 4) ride the same path.
- Unwrap `RuntimeReplaceable` expressions inside `CometScalaUDF.emitJvmCodegenDispatch` before binding. Spark 4's `StructsToJson` is `RuntimeReplaceable` and its `doGenCode` throws "Cannot generate code for expression"; calling `.replacement` gives the `Invoke(StructsToJsonEvaluator, ...)` form that does codegen.
- Update the JSON compatibility doc and the `CometJsonJvmSuite` config to reference the codegen flag.

Test plan:
- `CometJsonJvmSuite`: 3/3 pass (get_json_object, from_json, to_json round-trip via the codegen dispatcher).
- `CometJsonExpressionSuite`: 8/8 pass on the unchanged native path.
- `CometStringExpressionSuite`: 33/33, `CometCodegenSuite`: 60/60, `CometCodegenSourceSuite`: 50/50, `CometSqlFileTestSuite`: 284/284.
- `cargo clippy --all-targets --workspace -- -D warnings`: clean.
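The `RuntimeReplaceable` unwrapping step in this commit can be pictured with a toy expression model. This is a minimal sketch, not Spark's `Expression` hierarchy or the dispatcher's code: `Expr`, `Replaceable`, `Leaf`, and `Wrapper` are hypothetical types, and unwrapping in a loop (rather than once) is an assumption made here to handle nested replacements.

```java
final class UnwrapSketch {

    /** Hypothetical minimal expression model. */
    interface Expr {
        String describe();
    }

    /** Hypothetical analogue of an expression that is replaced before code generation. */
    interface Replaceable extends Expr {
        Expr replacement();
    }

    /** An expression that can generate code directly in this model. */
    static final class Leaf implements Expr {
        private final String text;

        Leaf(String text) {
            this.text = text;
        }

        public String describe() {
            return text;
        }
    }

    /** An expression that only works through its replacement in this model. */
    static final class Wrapper implements Replaceable {
        private final String text;
        private final Expr target;

        Wrapper(String text, Expr target) {
            this.text = text;
            this.target = target;
        }

        public String describe() {
            return text;
        }

        public Expr replacement() {
            return target;
        }
    }

    /** Follows replacement() until the expression is no longer replaceable. */
    static Expr unwrap(Expr expr) {
        Expr current = expr;
        while (current instanceof Replaceable) {
            current = ((Replaceable) current).replacement();
        }
        return current;
    }

    public static void main(String[] args) {
        Expr toJson = new Wrapper("StructsToJson", new Leaf("Invoke(StructsToJsonEvaluator, ...)"));
        System.out.println(unwrap(toJson).describe());
    }
}
```

Unwrapping before binding means the code generator only ever sees the form that supports codegen, so it does not need a special case per replaceable expression.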

- Add CometJsonJvmSuite to pr_build_linux.yml so check-missing-suites passes.
- Remove unused HigherOrderFunction, LambdaFunction, NamedLambdaVariable imports from CometBatchKernelCodegen.scala (referenced only in comments).
- Remove unused serializeDataType import from strings.scala.

… [skip ci]

Make `java` (the codegen dispatcher) the default JSON engine and route the rust engine's unsupported cases to it instead of falling back to Spark.
- Flip `spark.comet.exec.json.engine` default from `rust` to `java` and drop the "experimental" wording from the config doc and json.md.
- Replace the bespoke `JsonRoute` helper with the `CometCodegenDispatch` serde base. `CometGetJsonObject`, `CometStructsToJson`, and `CometJsonToStructs` now extend it and override only to prefer the native rust path when that engine is selected and a native implementation exists. Any case the native path does not cover (to_json with options/array/map, from_json with an unsupported schema) falls through to the codegen dispatcher.
- Pin the native-path tests to `engine=rust`, and update the to_json fallback assertions: the codegen dispatcher is enabled by default, so those cases now run in Comet via codegen rather than falling back to Spark.

…t [skip ci]

andygrove marked this pull request as draft May 12, 2026 17:23

andygrove changed the title ~~feat: add JVM UDF engine for Spark JSON expressions~~ feat: add experimental support fro JSON expressions using Comet JVM UDF Framework May 12, 2026

andygrove changed the title ~~feat: add experimental support fro JSON expressions using Comet JVM UDF Framework~~ feat: add experimental support for JSON expressions using Comet JVM UDF Framework May 12, 2026

andygrove modified the milestones: 0.18.0 (July 2026), 0.18.0, 0.17.0 May 13, 2026

This was referenced May 14, 2026

feat(experimental): ScalaUDF and Java UDF support via Janino codegen #4267

Merged

feat: support stateful CometUDFs #4345

Merged

kazuyukitanimura mentioned this pull request May 18, 2026

Implement JVM UDFs for JSON expressions #4313

Open

andygrove added 2 commits May 26, 2026 16:10

andygrove changed the title ~~feat: add experimental support for JSON expressions using Comet JVM UDF Framework~~ feat: experimental Spark JSON support via codegen dispatcher May 26, 2026

andygrove added 2 commits May 26, 2026 21:04

ci: register CometJsonJvmSuite in macOS workflow

ef4f749

andygrove changed the title ~~feat: experimental Spark JSON support via codegen dispatcher~~ feat: 100% Spark-compatible JSON support via codegen dispatcher May 30, 2026

andygrove added 3 commits May 30, 2026 09:48

Merge remote-tracking branch 'apache/main' into worktree-json-jvm-udf…

5bcbb88

… [skip ci]

docs: wire JSON compatibility page into nav and clarify engine defaul…

4cf3dd0

…t [skip ci]

andygrove wants to merge 8 commits into apache:main from andygrove:worktree-json-jvm-udf
