[improvement](regression) Use Spark thrift JDBC for external SQL helpers by zgxme · Pull Request #64886 · apache/doris

zgxme · 2026-06-26T08:30:06Z

What problem does this PR solve?

This PR speeds up the regression cases that rely on the external SQL helpers by running their Spark SQL through Spark ThriftServer JDBC.

Local validation shows the following cache-related cases are significantly faster after this change:

test_iceberg_table_cache: 3m20s -> 30s
test_paimon_table_meta_cache: 14m59s -> 40s

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

### What problem does this PR solve?
Issue Number: close #xxx
Related PR: apache#63719
Problem Summary: The regression Spark Iceberg and Paimon helpers executed SQL through docker exec and spark-sql, which required local Docker access and repeatedly started Spark SQL clients. This change follows the Spark Iceberg JDBC helper approach from PR apache#63719 and routes Spark Iceberg/Paimon helper execution through Spark ThriftServer with Hive JDBC. Multi-statement execution now reuses one JDBC connection.

### Release note
None

### Check List (For Author)
- Test: Manual test
    - mvn -q -DskipTests compile under regression-test/framework
    - git diff --check -- framework/src/main/groovy/org/apache/doris/regression/suite/Suite.groovy
- Behavior changed: Yes. spark_iceberg, spark_iceberg_multi, and spark_paimon now execute through Spark ThriftServer JDBC instead of docker exec spark-sql.
- Does this need documentation: No
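The single-connection behavior for multi-statement execution can be sketched as follows. This is a minimal illustration in Python, not the framework's Groovy code: `run_statements` and `connect` are hypothetical names, and `sqlite3` stands in for the Hive JDBC connection purely so that the example runs on its own.

```python
import sqlite3
from contextlib import closing
from typing import Callable, List, Sequence


def run_statements(connect: Callable[[], sqlite3.Connection],
                   statements: Sequence[str]) -> List[list]:
    """Run every statement on ONE connection and return each statement's rows."""
    results = []
    with closing(connect()) as conn:  # opened once, closed once
        for sql in statements:
            with closing(conn.cursor()) as cur:
                cur.execute(sql)
                # Statements without a result set (DDL/DML) yield an empty list.
                results.append(cur.fetchall() if cur.description else [])
    return results


opened = []


def connect() -> sqlite3.Connection:
    # Stand-in (assumption) for opening a Hive JDBC session; records each call.
    conn = sqlite3.connect(":memory:")
    opened.append(conn)
    return conn


rows = run_statements(connect, [
    "CREATE TABLE t (id INTEGER)",
    "INSERT INTO t VALUES (1), (2)",
    "SELECT id FROM t ORDER BY id",
])
```

All three statements share one session, so `connect` is called a single time. With one connection per statement, each call would pay the session setup cost again.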

### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Spark Iceberg helpers opened a new Hive JDBC connection for every spark_iceberg/spark_paimon call. This added repeated Spark ThriftServer session setup overhead in suites that issue many Spark SQL statements. The framework now keeps a Spark Iceberg JDBC connection in SuiteContext thread-local state, creates it on first use, reuses it for later calls in the same suite context thread, and closes it with other context thread-local resources.

### Release note
None

### Check List (For Author)
- Test: Manual test
    - Manual test: mvn package -B -DskipTests=true -Dmaven.javadoc.skip=true in regression-test/framework; git diff --check
- Behavior changed: Yes. Spark Iceberg/Paimon helper SQL reuses a SuiteContext-local Spark JDBC connection instead of opening one per call.
- Does this need documentation: No
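The create-on-first-use, reuse, then close lifecycle described above can be sketched like this. It is a hedged illustration in Python rather than the actual SuiteContext code: `ThreadLocalConnection` and `FakeConnection` are hypothetical names, and the fake connection only tracks whether it was closed.

```python
import threading
from typing import Callable


class FakeConnection:
    """Hypothetical stand-in for a Hive JDBC connection."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ThreadLocalConnection:
    """Lazily creates one connection per thread and reuses it until closed."""

    def __init__(self, factory: Callable[[], FakeConnection]) -> None:
        self._factory = factory
        self._local = threading.local()

    def get(self) -> FakeConnection:
        conn = getattr(self._local, "conn", None)
        if conn is None or conn.closed:
            conn = self._factory()  # first use on this thread
            self._local.conn = conn
        return conn

    def close(self) -> None:
        # Closes only the calling thread's connection, if it has one.
        conn = getattr(self._local, "conn", None)
        if conn is not None:
            conn.close()
            self._local.conn = None


holder = ThreadLocalConnection(FakeConnection)
first = holder.get()
second = holder.get()  # same thread: the existing connection is reused

seen_in_worker = []
worker = threading.Thread(target=lambda: seen_in_worker.append(holder.get()))
worker.start()
worker.join()

holder.close()
```

Here `first` and `second` are the same object, while the worker thread receives its own connection. In this sketch each thread must call `close()` for itself, which mirrors closing the connection together with the other thread-local resources.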

### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: The Iceberg docker entrypoint started Spark master and worker before Spark ThriftServer, but the thriftserver command did not specify a Spark master. Without an explicit master, Spark can fall back to local execution, so the standalone master and worker may not be used by Hive JDBC queries. This change starts Spark ThriftServer with --master spark://doris--spark-iceberg:7077 while keeping the Derby system home JVM option unchanged.

### Release note
None

### Check List (For Author)
- Test: Manual test
    - Manual test: bash -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl
- Behavior changed: Yes. Iceberg Spark ThriftServer now explicitly runs against the standalone Spark master in the docker environment.
- Does this need documentation: No

### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: The Iceberg Spark docker environment relied on Spark defaults for ThriftServer and spark-sql resource sizing. Those defaults can use too many CPU cores while leaving executor and driver heap at small defaults, and the default shuffle partition count is high for local regression data. This change caps the Spark app at 8 cores, uses 4-core executors with 8g heap, gives the driver 4g heap, disables dynamic allocation explicitly, and reduces default shuffle/parallelism settings for local regression stability.

### Release note
None

### Check List (For Author)
- Test: Manual test
    - Manual test: git diff --check -- docker/thirdparties/docker-compose/iceberg/spark-defaults.conf
- Behavior changed: Yes. Iceberg Spark docker jobs now use explicit resource and parallelism defaults.
- Does this need documentation: No
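To make the sizing concrete, the short Python sketch below works through the arithmetic those numbers imply. The property names are standard Spark configuration keys, but mapping them to this change is an assumption, since the commit message does not list the keys it sets. The shuffle and parallelism values are left out because the message does not state them.

```python
# Assumed mapping from the description to standard Spark properties.
settings = {
    "spark.cores.max": "8",                     # cap for the whole application
    "spark.executor.cores": "4",                # cores per executor
    "spark.executor.memory": "8g",              # heap per executor
    "spark.driver.memory": "4g",                # driver heap
    "spark.dynamicAllocation.enabled": "false",
}

# Most executors the application can hold under the core cap.
max_executors = int(settings["spark.cores.max"]) // int(settings["spark.executor.cores"])

# Executor heap the application can claim in total, in GiB.
total_executor_heap_gb = max_executors * int(settings["spark.executor.memory"].rstrip("g"))

# spark-defaults.conf uses whitespace-separated "key value" lines.
conf_text = "\n".join(f"{key} {value}" for key, value in settings.items())
```

Under these assumptions one application gets at most 2 executors and 16g of executor heap in total, plus 4g for the driver. That bounds how much of the docker host a single ThriftServer or spark-sql session can reserve.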

### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: The Iceberg docker entrypoint started Spark ThriftServer before running the preinstalled Spark SQL setup scripts. After moving ThriftServer onto the standalone master, that idle ThriftServer app can reserve executor resources while setup scripts are still running. The ThriftServer also did not receive Iceberg/Paimon SQL extensions, while regression helpers execute Spark SQL through Hive JDBC. This change runs the setup scripts first, then starts ThriftServer with Iceberg and Paimon extensions, and waits for Hive JDBC readiness before marking the container healthy.

### Release note
None

### Check List (For Author)
- Test: Manual test
    - Manual test: bash -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl; /bin/sh -n docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl; git diff --check -- docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl
- Behavior changed: Yes. Iceberg Spark ThriftServer starts after preinstalled data setup and waits for JDBC readiness before /mnt/SUCCESS.
- Does this need documentation: No

hello-stephen · 2026-06-26T08:30:13Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Spark 4 thriftserver rejects the previous noSasl JDBC URL and then fails to open sessions against the default Iceberg namespace because demo.default is not created. This makes the Iceberg docker startup loop on the thriftserver readiness check and prevents regression Spark Iceberg JDBC helpers from connecting. Create the default Iceberg namespace before starting thriftserver, use the normal HiveServer2 JDBC URL without auth=noSasl, and fail readiness with useful logs instead of looping forever.

### Release note
None

### Check List (For Author)
- Test: Manual test
    - Ran bash -n and /bin/sh -n for docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl
    - Ran git diff --check for modified files
    - Ran mvn package -B -DskipTests=true -Dmaven.javadoc.skip=true in regression-test/framework
- Behavior changed: No
- Does this need documentation: No
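A readiness check that gives up with a diagnostic instead of looping forever can be sketched as below. The real check lives in the shell entrypoint, so this Python version only illustrates the pattern: `wait_until_ready`, its parameters, and the fake clock are all hypothetical, and the probe stands in for an attempt to open a Hive JDBC session.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], None],
                     timeout_s: float = 300.0,
                     interval_s: float = 5.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call probe() until it succeeds and return the number of attempts.

    Raises TimeoutError, chained to the last probe error, once the deadline
    has passed, so the caller can log the cause and exit non-zero.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            probe()
            return attempts
        except Exception as exc:  # any probe failure means "not ready yet"
            if clock() >= deadline:
                raise TimeoutError(
                    f"not ready after {attempts} attempts: {exc}") from exc
            sleep(interval_s)


class FakeClock:
    """Deterministic clock so the demo finishes instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def always_refused() -> None:
    raise ConnectionError("connection refused")


fake = FakeClock()
try:
    wait_until_ready(always_refused, timeout_s=10, interval_s=5,
                     clock=fake.time, sleep=fake.sleep)
    outcome = "ready"
except TimeoutError as err:
    outcome = str(err)
```

With a 10-second budget and a 5-second interval the probe runs three times, and `outcome` then carries the last error text. That message is the kind of detail a startup log needs in order to explain why the container never became healthy.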

zgxme · 2026-06-26T10:03:53Z

run buildall

Gabriel39 · 2026-06-26T10:14:56Z

run buildall

github-actions · 2026-06-26T10:39:13Z

PR approved by at least one committer and no changes requested.

github-actions · 2026-06-26T10:39:15Z

PR approved by anyone and no changes requested.

### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Add P2 demo regression cases for Iceberg and Paimon. The cases write data through Spark SQL first, then query the same external table through both Doris and Spark, normalizing JDBC result values before comparison to avoid false failures caused by different Java number classes returned by the two JDBC drivers.

### Release note
None

### Check List (For Author)
- Test: Regression test
    - ./run-regression-test.sh --run -d external_table_p2/iceberg -s test_iceberg_spark_doris_consistency_demo
    - ./run-regression-test.sh --run -d external_table_p2/paimon -s test_paimon_spark_doris_consistency_demo
- Behavior changed: No
- Does this need documentation: No
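The normalization step can be illustrated with a small sketch. In Java, `Integer.valueOf(1).equals(Long.valueOf(1L))` is false, so two drivers that return different number classes for the same column can fail a naive row comparison. The Python analogue below maps every numeric value to one canonical `Decimal` before comparing. `normalize_value` and `normalize_rows` are hypothetical names, the sample rows are invented for illustration, and the actual regression cases are written in Groovy.

```python
from decimal import Decimal
from typing import Any, Iterable, List, Tuple


def normalize_value(value: Any) -> Any:
    """Map every numeric type to a canonical Decimal; leave other values alone."""
    if isinstance(value, bool):  # bool is an int subclass; keep it distinct
        return value
    if isinstance(value, (int, Decimal)):
        return Decimal(value).normalize()
    if isinstance(value, float):
        # repr() gives the shortest decimal text, avoiding binary-float noise.
        return Decimal(repr(value)).normalize()
    return value


def normalize_rows(rows: Iterable[Iterable[Any]]) -> List[Tuple[Any, ...]]:
    return [tuple(normalize_value(v) for v in row) for row in rows]


# Invented sample data: the same logical row as two drivers might type it.
rows_from_engine_a = [(1, 0.1, "x")]
rows_from_engine_b = [(Decimal("1"), Decimal("0.10"), "x")]

raw_equal = rows_from_engine_a == rows_from_engine_b
normalized_equal = (normalize_rows(rows_from_engine_a)
                    == normalize_rows(rows_from_engine_b))
```

Without normalization the rows differ, because the float `0.1` is not exactly the decimal `0.10`. After normalization both sides hold `Decimal('0.1')` and compare equal, while the string column passes through untouched.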

Gabriel39 · 2026-06-26T13:43:12Z

run buildall

### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Paimon preinstalled SQL scripts are executed in a shared Spark SQL session. run06.sql changes the session time zone to +08:00 for timestamp partition coverage, but did not restore it before subsequent scripts. This can make later Paimon bootstrap data depend on session state and change physical file metadata such as partition file size. Restore the session time zone to UTC at the end of run06.sql so later scripts start from the default time zone.

### Release note
None

### Check List (For Author)
- Test: Manual test
    - git diff --check -- docker/thirdparties/docker-compose/iceberg/scripts/create_preinstalled_scripts/paimon/run06.sql
- Behavior changed: No
- Does this need documentation: No

### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: The Iceberg docker bootstrap was changed to sort preinstalled SQL script paths before generating the Spark SQL source files, and run06.sql restored the session time zone after its timestamp partition setup. Revert those changes so the bootstrap ordering and Paimon setup SQL match the previous behavior while investigating Paimon partition file size differences.

### Release note
None

### Check List (For Author)
- Test: Manual test
    - git diff --check -- docker/thirdparties/docker-compose/iceberg/entrypoint.sh.tpl docker/thirdparties/docker-compose/iceberg/scripts/create_preinstalled_scripts/paimon/run06.sql
- Behavior changed: Yes. Iceberg docker preinstalled SQL path handling returns to the prior unsorted find output behavior, and run06.sql no longer restores session time zone.
- Does this need documentation: No

zgxme added 5 commits June 26, 2026 15:01

yiguolei previously approved these changes Jun 26, 2026

github-actions Bot added the approved Indicates a PR has been approved by one committer. label Jun 26, 2026

github-actions Bot added the reviewed label Jun 26, 2026

zgxme dismissed yiguolei’s stale review via a983fa3 June 26, 2026 11:11

zgxme requested a review from yiguolei June 26, 2026 11:11

github-actions Bot removed the approved Indicates a PR has been approved by one committer. label Jun 26, 2026

zgxme added 4 commits June 27, 2026 19:53

Merge branch 'master' into spark-sql-0626

bb914a0

fix

84f251b

zgxme force-pushed the spark-sql-0626 branch from e2aa7f7 to 84f251b Compare June 28, 2026 10:45

zgxme wants to merge 11 commits into apache:master from zgxme:spark-sql-0626


4 participants
