feat: 100% Spark-compatible JSON support via codegen dispatcher #4305

Draft
andygrove wants to merge 8 commits into apache:main from andygrove:worktree-json-jvm-udf

Member

@andygrove andygrove commented May 12, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

The native Rust JSON expressions in Comet have known compatibility gaps and feature restrictions. from_json only supports PERMISSIVE mode with simple schemas, to_json does not handle map or array at the top level, and get_json_object differs from Spark on single-quoted JSON and unescaped control characters.
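To make the two `get_json_object` input classes concrete, the sketch below feeds a single-quoted document and a string containing a raw tab to Python's standard `json` module. Python is only a stand-in for a strict JSON parser here. It does not show how Comet's native path behaves on these inputs; it only illustrates why a parser that tolerates them (as Spark's is understood to) and a strict one can disagree.

```python
import json

SINGLE_QUOTED = "{'a': 1}"        # single quotes instead of double quotes
RAW_CONTROL = '{"a": "x\ty"}'     # \t becomes a literal tab inside the JSON string


def strict_parse(text):
    """Parse with the default strict settings; return None if rejected."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def lenient_control_parse(text):
    """Same parser, but raw control characters inside strings are allowed."""
    return json.loads(text, strict=False)


print(strict_parse(SINGLE_QUOTED))         # rejected by a strict parser
print(strict_parse(RAW_CONTROL))           # rejected by a strict parser
print(lenient_control_parse(RAW_CONTROL))  # accepted once the check is relaxed
```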

This PR makes the Spark-compatible path the default for the JSON family and keeps the native path as an opt-in for speed where it is correct. Any case the native path does not cover falls through to the compatible path rather than back to Spark. The compatible path routes through the Arrow-direct codegen dispatcher, which runs Spark's own doGenCode inside the Comet pipeline for byte-exact results, at the cost of a JNI roundtrip per batch.

Configs

  • spark.comet.exec.json.engine in {java, rust}, default java
    • java (default): routes the JSON expressions through the codegen dispatcher so Spark's own implementation runs inside the Comet pipeline. This depends on spark.comet.exec.scalaUDF.codegen.enabled (enabled by default). If that dispatcher is disabled, the operator falls back to Spark.
    • rust: native DataFusion implementation. Faster, but has known compatibility gaps. An expression or input case with no native implementation falls back to the java engine, not to Spark.
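The routing described by these two settings can be summarized as a small decision function. This is a plain-Python model, not Comet code: the function name `route` and the three outcome labels are made up for illustration, and the last branch (rust engine, no native implementation, dispatcher disabled, so fall back to Spark) is an assumption inferred from the java-engine rule rather than something stated explicitly above.

```python
def route(engine, codegen_enabled=True, native_supported=False):
    """Return where a JSON expression runs: 'native', 'codegen', or 'spark'.

    engine           -- value of spark.comet.exec.json.engine
    codegen_enabled  -- value of spark.comet.exec.scalaUDF.codegen.enabled
    native_supported -- whether the native path covers this expression/input
    """
    if engine not in ("java", "rust"):
        raise ValueError(f"unknown engine: {engine}")
    if engine == "rust" and native_supported:
        return "native"
    # java engine, or rust with no native implementation for this case
    if codegen_enabled:
        return "codegen"
    # Assumption: with the dispatcher disabled there is nothing left in Comet
    return "spark"


for args in [("java", True, False), ("java", False, False),
             ("rust", True, True), ("rust", True, False)]:
    print(args, "->", route(*args))
```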

What changes are included in this PR?

  • Make java the default JSON engine and drop the "experimental" wording from the config doc and the compatibility guide.
  • Use the codegen-dispatch serde base for the JSON family. CometGetJsonObject, CometStructsToJson, and CometJsonToStructs extend CometCodegenDispatch and override only to prefer the native rust path when that engine is selected and a native implementation exists for the case. Everything else falls through to the dispatcher. Under rust this covers to_json with options or array/map types and from_json with an unsupported schema.
  • Document the model in docs/source/user-guide/latest/compatibility/json.md and wire the page into the compatibility navigation.
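The "extend the base, override only to prefer native" structure can be sketched as ordinary subclassing. The classes below are Python stand-ins, not the Scala serdes: `CodegenDispatch`, `StructsToJsonSerde`, `native_supports`, and the dictionary shape of `expr` are all hypothetical, and the coverage rule is deliberately simplified to "no options and a struct input".

```python
class CodegenDispatch:
    """Stand-in for the shared serde base: always routes to the dispatcher."""

    def convert(self, expr, engine):
        return "codegen"


class StructsToJsonSerde(CodegenDispatch):
    """Stand-in for a to_json serde that prefers the native path when it applies."""

    def native_supports(self, expr):
        # Simplified rule: options or array/map inputs are not covered natively.
        return not expr["options"] and expr["input_type"] == "struct"

    def convert(self, expr, engine):
        if engine == "rust" and self.native_supports(expr):
            return "native"
        # Everything else falls through to the base class, i.e. the dispatcher.
        return super().convert(expr, engine)


serde = StructsToJsonSerde()
print(serde.convert({"options": {}, "input_type": "struct"}, "rust"))
print(serde.convert({"options": {}, "input_type": "map"}, "rust"))
print(serde.convert({"options": {}, "input_type": "struct"}, "java"))
```

Keeping the fallthrough in the base class means a serde never needs its own "unsupported" branch; anything it does not claim for the native path is handled by the dispatcher.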

json_array_length and json_object_keys are intentionally out of scope. Both are RuntimeReplaceable in Spark 4.x and Catalyst's ReplaceExpressions rewrites them to StaticInvoke before Comet sees the plan, so the classOf[LengthOfJsonArray] / classOf[JsonObjectKeys] registrations never match. Adding support requires recognizing the rewritten StaticInvoke form in Comet's serde dispatch and is left to a follow-up.
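Why a class-keyed registration never matches after the rewrite can be shown with a toy model. Everything here is a Python stand-in: the two classes only borrow the names `LengthOfJsonArray` and `StaticInvoke`, and the function name string passed to `StaticInvoke` is a placeholder, not the name Spark actually uses.

```python
class LengthOfJsonArray:
    """Stand-in for the user-facing expression class."""

    def __init__(self, child):
        self.child = child


class StaticInvoke:
    """Stand-in for the rewritten form left in the plan."""

    def __init__(self, function_name, args):
        self.function_name = function_name
        self.args = args


def replace_expressions(expr):
    """Model of the rewrite rule: the original class is gone afterwards."""
    if isinstance(expr, LengthOfJsonArray):
        return StaticInvoke("placeholder_length_fn", [expr.child])
    return expr


# Registry keyed by class, analogous to a classOf[...] -> serde map.
SERDE_BY_CLASS = {LengthOfJsonArray: "json_array_length serde"}


def lookup(expr):
    """Return the registered serde for this expression's class, or None."""
    return SERDE_BY_CLASS.get(type(expr))


original = LengthOfJsonArray("json_col")
print(lookup(original))                       # matches before the rewrite
print(lookup(replace_expressions(original)))  # no match after the rewrite
```

Supporting these functions would mean matching on the rewritten node (for example by inspecting its target function) instead of on the original class.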

How are these changes tested?

  • CometJsonExpressionSuite pins engine=rust and covers the native path. The to_json cases the native path cannot handle now run through the codegen dispatcher, and the suite asserts they execute in Comet rather than falling back to Spark.
  • CometJsonJvmSuite covers get_json_object, from_json, and a to_json(from_json(...)) round-trip on the java engine.
  • CometSqlFileTestSuite covers get_json_object, from_json, and to_json through SQL golden files. The get_json_object and to_json files pin engine=rust.
  • CometExpressionSuite to_json tests pass on the default java engine.

This PR was scaffolded with the project's brainstorming, writing-plans, and subagent-driven-development skills.

Add `spark.comet.exec.json.engine` (default `rust`, experimental `java`)
that routes the JSON expressions in scope through the JVM UDF framework
introduced in apache#4232, delegating to Spark's own expression classes for
byte-exact compatibility at the cost of JNI roundtrips per batch.

Expressions in scope when `engine=java`:

- `get_json_object` -> `GetJsonObjectUDF`
- `from_json` -> `FromJsonUDF`
- `to_json` -> `ToJsonUDF`

A fresh Spark expression is built per `evaluate` call. Spark's JSON
evaluators (`GetJsonObjectEvaluator`, `StructsToJsonEvaluator`,
`JsonToStructsEvaluator`) hold mutable per-row state, and the JVM UDF
framework shares one UDF instance across native worker threads, so a
cached cross-thread expression races on its evaluator state.
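The "fresh expression per `evaluate` call" pattern can be sketched as follows. This is a Python model with invented names (`Evaluator`, `JsonUdf`, `scratch`); it only shows the shape of the fix, where the shared object holds immutable configuration and all mutable state lives in an object created inside each call.

```python
from concurrent.futures import ThreadPoolExecutor


class Evaluator:
    """Stand-in for an evaluator that keeps mutable per-row scratch state."""

    def __init__(self, key):
        self.key = key
        self.scratch = []  # overwritten on every row

    def eval_row(self, doc):
        self.scratch.clear()
        self.scratch.append(doc.get(self.key))
        return self.scratch[0]


class JsonUdf:
    """One instance shared by all worker threads; holds only immutable config."""

    def __init__(self, key):
        self.key = key

    def evaluate(self, batch):
        # Fresh evaluator per call, so no scratch state crosses threads.
        evaluator = Evaluator(self.key)
        return [evaluator.eval_row(doc) for doc in batch]


udf = JsonUdf("a")
batches = [[{"a": i}, {"a": i + 1}] for i in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(udf.evaluate, batches))
print(results)
```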

`from_json` / `to_json` use a serde-side `CometLambdaRegistry` to pass
the configured Spark expression (schema, options, timezone) to the UDF.
The serde rebinds the child to `BoundReference(0)` so the UDF can call
`eval(row)` against a single-column wrapper row.
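A toy version of the registry-plus-rebinding idea is sketched below. It is not the `CometLambdaRegistry` API: the string key, the `register` and `evaluate` helpers, and the `Upper` expression are all assumptions made for illustration. The point is only that once the child is bound to ordinal 0, the configured expression can be evaluated against a one-element row built from each input value.

```python
import uuid


class BoundReference:
    """Stand-in for a column reference resolved to a fixed ordinal."""

    def __init__(self, ordinal):
        self.ordinal = ordinal

    def eval(self, row):
        return row[self.ordinal]


class Upper:
    """Toy configured expression wrapping a single child."""

    def __init__(self, child):
        self.child = child

    def eval(self, row):
        value = self.child.eval(row)
        return None if value is None else value.upper()


_REGISTRY = {}


def register(expr):
    """Serde side: store the configured expression and hand back a key."""
    key = str(uuid.uuid4())
    _REGISTRY[key] = expr
    return key


def evaluate(key, column):
    """UDF side: look the expression up and evaluate it per single-column row."""
    expr = _REGISTRY[key]
    return [expr.eval((value,)) for value in column]


key = register(Upper(BoundReference(0)))
print(evaluate(key, ["a", None, "bc"]))
```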

`json_array_length` and `json_object_keys` are out of scope: both are
`RuntimeReplaceable` in Spark 4.x and Catalyst's `ReplaceExpressions`
rule rewrites them to `StaticInvoke` before Comet sees the plan, so
`classOf[LengthOfJsonArray]` / `classOf[JsonObjectKeys]` serde
registrations never match. Adding support requires recognizing the
rewritten `StaticInvoke` form in Comet's serde dispatch.

This PR was scaffolded with the project's brainstorming, writing-plans,
and subagent-driven-development skills.
@andygrove andygrove marked this pull request as draft May 12, 2026 17:23
@andygrove andygrove changed the title feat: add JVM UDF engine for Spark JSON expressions feat: add experimental support fro JSON expressions using Comet JVM UDF Framework May 12, 2026
@andygrove andygrove changed the title feat: add experimental support fro JSON expressions using Comet JVM UDF Framework feat: add experimental support for JSON expressions using Comet JVM UDF Framework May 12, 2026
andygrove added 2 commits May 26, 2026 16:10
Resolve conflicts:

- The common module rename (PR apache#4325) moved the UDF files added by this
  branch under common/.../udf/ to spark/.../udf/. Git's location-conflict
  resolver guessed the wrong destination (.../shims/); fix manually by
  placing FromJsonUDF / GetJsonObjectUDF / ToJsonUDF under
  spark/src/main/scala/org/apache/comet/udf/. (Follow-up commit retires
  them in favor of the codegen dispatcher anyway.)

- CometConf.scala: keep both COMET_JSON_ENGINE (this PR) and
  COMET_SCALA_UDF_CODEGEN_ENABLED (main).

- serde/structs.scala: combine the JSON engine selector with main's
  improved native ignoreNullFields / options handling. engine=java keeps
  routing through convertViaJvmUdf; engine=rust uses main's tightened
  native path.
…f hand-written UDFs

Replace the three hand-written `GetJsonObjectUDF` / `FromJsonUDF` /
`ToJsonUDF` JVM UDF implementations and the `CometLambdaRegistry`
indirection with the Arrow-direct codegen dispatcher introduced in
PR apache#4417 (`CometScalaUDF.emitJvmCodegenDispatch`). The dispatcher
Janino-compiles Spark's own `doGenCode` (or `eval(row)` for
CodegenFallback expressions) so the JSON family inherits Spark-identical
semantics with no per-expression glue.

Changes:

- Delete the three hand-written UDF files under
  `spark/src/main/scala/org/apache/comet/udf/` and their unit-test
  suites. The codegen dispatcher's per-task `kernelCache` provides the
  same per-thread isolation that `CometLambdaRegistry` was working
  around.
- Rewrite the JSON serdes (`CometGetJsonObject` in `strings.scala`,
  `CometStructsToJson` and `CometJsonToStructs` in `structs.scala`) to
  go through a new `JsonRoute` helper. `engine=rust` keeps the native
  path; `engine=java` delegates to
  `CometScalaUDF.emitJvmCodegenDispatch` when
  `spark.comet.exec.scalaUDF.codegen.enabled=true`.
- Generalize the codegen dispatcher to accept `CodegenFallback`
  expressions. `CodegenFallback.doGenCode` emits
  `references[N].eval(row)`, the same shape the `HigherOrderFunction`
  carve-out already relied on; lifting the rejection lets `JsonToStructs`
  and `StructsToJson` (which are `CodegenFallback` in Spark 4) ride the
  same path.
- Unwrap `RuntimeReplaceable` expressions inside
  `CometScalaUDF.emitJvmCodegenDispatch` before binding. Spark 4's
  `StructsToJson` is `RuntimeReplaceable` and its `doGenCode` throws
  "Cannot generate code for expression"; calling `.replacement` gives
  the `Invoke(StructsToJsonEvaluator, ...)` form that does codegen.
- Update the JSON compatibility doc and the `CometJsonJvmSuite` config
  to reference the codegen flag.
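The unwrap-before-binding step can be modeled in a few lines. The classes are Python stand-ins that only borrow the names `StructsToJson` and `Invoke`; `gen_code` and the returned string are placeholders. The loop in `unwrap` is an assumption (the text above only says `.replacement` is called), added so that nested replacements would also be handled.

```python
class Invoke:
    """Stand-in for the replacement form that supports code generation."""

    def gen_code(self):
        return "invoke(...)"


class StructsToJson:
    """Stand-in for a RuntimeReplaceable expression."""

    def __init__(self):
        self.replacement = Invoke()

    def gen_code(self):
        raise RuntimeError("Cannot generate code for expression")


def unwrap(expr):
    """Follow .replacement until reaching an expression that has none."""
    while hasattr(expr, "replacement"):
        expr = expr.replacement
    return expr


expr = StructsToJson()
try:
    expr.gen_code()
except RuntimeError as err:
    print("without unwrap:", err)
print("with unwrap:", unwrap(expr).gen_code())
```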

Test plan:
- `CometJsonJvmSuite`: 3/3 pass (get_json_object, from_json,
  to_json round-trip via the codegen dispatcher).
- `CometJsonExpressionSuite`: 8/8 pass on the unchanged native path.
- `CometStringExpressionSuite`: 33/33, `CometCodegenSuite`: 60/60,
  `CometCodegenSourceSuite`: 50/50, `CometSqlFileTestSuite`: 284/284.
- `cargo clippy --all-targets --workspace -- -D warnings`: clean.
@andygrove andygrove changed the title feat: add experimental support for JSON expressions using Comet JVM UDF Framework feat: experimental Spark JSON support via codegen dispatcher May 26, 2026
andygrove added 2 commits May 26, 2026 21:04
- Add CometJsonJvmSuite to pr_build_linux.yml so check-missing-suites
  passes.
- Remove unused HigherOrderFunction, LambdaFunction, NamedLambdaVariable
  imports from CometBatchKernelCodegen.scala (referenced only in
  comments).
- Remove unused serializeDataType import from strings.scala.
@andygrove andygrove changed the title feat: experimental Spark JSON support via codegen dispatcher feat: 100% Spark-compatible JSON support via codegen dispatcher May 30, 2026
andygrove added 3 commits May 30, 2026 09:48
Make `java` (the codegen dispatcher) the default JSON engine and route the rust
engine's unsupported cases to it instead of falling back to Spark.

- Flip `spark.comet.exec.json.engine` default from `rust` to `java` and drop the
  "experimental" wording from the config doc and json.md.
- Replace the bespoke `JsonRoute` helper with the `CometCodegenDispatch` serde
  base. `CometGetJsonObject`, `CometStructsToJson`, and `CometJsonToStructs` now
  extend it and override only to prefer the native rust path when that engine is
  selected and a native implementation exists. Any case the native path does not
  cover (to_json with options/array/map, from_json with an unsupported schema)
  falls through to the codegen dispatcher.
- Pin the native-path tests to `engine=rust`, and update the to_json fallback
  assertions: the codegen dispatcher is enabled by default, so those cases now
  run in Comet via codegen rather than falling back to Spark.