[fix](be) Avoid finalized pipeline task submit crash#64899
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 29197 ms |
TPC-DS: Total hot run time: 171345 ms |
ClickBench: Total hot run time: 25.07 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
I found one test coverage gap in the new scheduler lifecycle protection.
Critical checkpoint conclusions:
- Goal: the code adds a submit gate to avoid late HybridTaskScheduler::submit() calling is_blockable() after terminal task cleanup starts. The production path is small and focused, but the added test only proves post-finalize rejection.
- Concurrency/lifecycle: the relevant race is dependency wake-up submit versus terminal close/finalize. The new lock/gate is aimed at HybridTaskScheduler::submit(); no deadlock or normal scheduling-order regression was found in the inspected paths, but close-time behavior needs direct coverage.
- Config/compatibility/persistence/data writes: no new config, wire/storage compatibility, transaction, or persistence behavior is involved.
- Parallel paths: reviewed wake_up, dependency set_ready/is_blocked_by, self-resubmit, RevokableTask, and blockable join/aggregate/spill paths; no separate correctness issue found.
- Tests: TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT should also cover the close-time gate before finalize, otherwise the close() change can regress while this test still passes.
Validation:
- git diff --check for the four GitHub PR files passed.
- build-support/check-format.sh could not run to completion because the runner clang-format is not v16.
- ./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT failed during setup because thirdparty/installed/bin/protoc is missing.
User focus: no additional user-provided review focus was supplied.
Subagent conclusions: optimizer-rewrite reported no candidates. tests-session-config reported TSC-1, which became the single inline comment. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for this same final comment set.
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: A runtime filter dependency can wake a pipeline task while another thread is closing or finalizing the same task. HybridTaskScheduler::submit() synchronously calls PipelineTask::is_blockable() before enqueueing the task, and is_blockable() reads _sink and _operators. After close/finalize starts releasing terminal task resources, a late submit can therefore dereference cleared task resources and crash.
This change adds a small submit gate on PipelineTask. Terminal close/finalize closes that gate under a dedicated lock, and HybridTaskScheduler::submit() checks the gate under the same lock before calling is_blockable(). The lock only covers the blockable check and does not extend to the actual queue submit.
### Release note
None
### Check List (For Author)
- Test: Unit Test / Static check
- ./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT
- build-support/check-format.sh
- CLANG_TIDY_BINARY=/mnt/disk6/common/ldb_toolchain_028/bin/clang-tidy build-support/run-clang-tidy.sh --base upstream/master --build-dir be/ut_build_ASAN
- git diff --check
- Behavior changed: No
- Does this need documentation: No
0e148ec to
1dd3207
Compare
|
/review |
There was a problem hiding this comment.
Automated review summary for current head 1dd3207f04a7311ddec655990fca58d68b0b0b9d.
Opinion: I found no additional blocking issue to raise on the current head. The existing inline thread about the finalized-only test coverage should not be duplicated; the current head adds the close-before-finalize hybrid-submit coverage that thread requested.
Critical checkpoint conclusions:
- Goal and test proof: the PR prevents
HybridTaskScheduler::submit()from callingPipelineTask::is_blockable()after terminal close/finalize begins. The gate is closed fromclose(close_sink=true)andfinalize(), and hybrid submit checks it under_blockable_check_lockbefore reading_sink/_operators. The new close-before-finalize and finalized-task BE tests cover the intended submit rejection cases. - Scope: the change is small and focused on BE pipeline task lifecycle/scheduler handling plus BE unit coverage. I saw no new config, persistence, FE/BE protocol, storage-format, or transaction surface.
- Concurrency and lifecycle:
_blockable_check_lockserializes the accept-submit gate withHybridTaskScheduler::submit(), while the existing_dependency_lifecycle_lockstill protects forced dependency unblocking against close/finalize resource clearing. In the production close path,_runningremains true throughclose_task()/finalize(), so a task queued just before the gate closes is requeued until finalization and then skipped. - Parallel paths: the hybrid scheduler applies the gate before choosing its internal blocking/simple queue.
RevokableTask::is_blockable()remains a separate virtual path and does not dereference the underlying task's sink/operators. - Tests and validation:
git diff --checkis clean for the changed files. I did not run BE UT/build because this runner lacksthirdparty/installedandthirdparty/installed/bin/protoc; Doris format validation is also blocked here because the availableclang-formatis 18.1.3 while the repo requires clang-format 16. - User focus: no additional user-provided focus points were present.
Subagent conclusions: optimizer-rewrite reported no new valuable findings after checking that optimizer/Nereids risks are out of scope and reviewing analogous lifecycle/parallel submit paths. tests-session-config reported no new valuable findings after checking the added tests, duplicate context, and basic whitespace/tooling state. Final convergence round 1 ended with both subagents replying NO_NEW_VALUABLE_FINDINGS for the same ledger and empty proposed inline comment set.
|
run buildall |
TPC-H: Total hot run time: 29431 ms |
TPC-DS: Total hot run time: 172007 ms |
ClickBench: Total hot run time: 25.15 s |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
| _sink->is_blockable(_state); | ||
| } | ||
|
|
||
| void PipelineTask::_stop_accepting_submit() { |
There was a problem hiding this comment.
我们调用这个is block 方法的时候,一定有整个fragment context 的shared ptr,为啥会出现析构的问题?
There was a problem hiding this comment.
is blockable里调的是sink operator这些东西,PipelineTask在finalize/close里会释放相关资源
| } | ||
|
|
||
| Status PipelineTask::close(Status exec_status, bool close_sink) { | ||
| if (close_sink) { |
There was a problem hiding this comment.
在finalize 里加了之后,这里为什么还要加?
There was a problem hiding this comment.
op->close / sink->close 这些东西本质也是写操作,担心会让is blockable这个读操作不安全。而且理论上进close sink了也不需要再submit了
不过实际执行逻辑是 [close_task] -> [task->close] -> [task->finalize] 。可以只在close sink里加,finalize里不加
### What problem does this PR solve? Issue Number: None Related PR: apache#64899 Problem Summary: Backport apache#64899 to branch-4.0. A runtime filter dependency can submit a pipeline task after close/finalize has started releasing task resources. On branch-4.0, HybridTaskScheduler::submit() calls PipelineTask::is_blockable(), which reads _sink and _operators. This adds a submit gate so terminal close/finalize rejects late hybrid scheduler submits before is_blockable() can touch released resources. ### Release note None ### Check List (For Author) - Test: Unit Test / Static check - build-support/clang-format.sh be/src/pipeline/pipeline_task.cpp be/src/pipeline/pipeline_task.h be/src/pipeline/task_scheduler.cpp be/test/pipeline/pipeline_task_test.cpp && build-support/check-format.sh be/src/pipeline/pipeline_task.cpp be/src/pipeline/pipeline_task.h be/src/pipeline/task_scheduler.cpp be/test/pipeline/pipeline_task_test.cpp - ./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_CLOSED_TASK_REJECTS_HYBRID_SUBMIT_BEFORE_FINALIZE:PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT - Behavior changed: No - Does this need documentation: No (cherry picked from commit 1533359)
### What problem does this PR solve? Issue Number: None Related PR: apache#64899 Problem Summary: A runtime filter dependency can wake a pipeline task while another thread is closing or finalizing the same task. HybridTaskScheduler::submit() synchronously calls PipelineTask::is_blockable() before enqueueing the task, and is_blockable() reads _sink and _operators. After close/finalize starts releasing terminal task resources, a late submit can therefore dereference cleared task resources and crash. This change adds a small submit gate on PipelineTask. Terminal close/finalize closes that gate under a dedicated lock, and HybridTaskScheduler::submit() checks the gate under the same lock before calling is_blockable(). The lock only covers the blockable check and does not extend to the actual queue submit. (cherry picked from commit 1dd3207) ### Release note None ### Check List (For Author) - Test: Unit Test / Static check - build-support/clang-format.sh - build-support/check-format.sh - git diff --check - Reused the compile command from be/ut_build_ASAN/compile_commands.json to run -fsyntax-only for be/test/exec/pipeline/pipeline_task_test.cpp. - Attempted: ./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_CLOSED_TASK_REJECTS_HYBRID_SUBMIT_BEFORE_FINALIZE:PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMIT. It was interrupted after confirming the changed pipeline_task.cpp and task_scheduler.cpp objects compiled, because branch-4.1 BE UT uses one large doris_be_test binary and the filtered run expanded into a near full UT build. - Attempted: ninja -C be/ut_build_ASAN -j1 test/CMakeFiles/doris_be_test.dir/exec/pipeline/pipeline_task_test.cpp.o. It was stopped after it began rebuilding a large OpenBLAS dependency chain; the test-file compile error was verified instead with the compile_commands.json -fsyntax-only command above. - Behavior changed: No - Does this need documentation: No
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: A runtime filter dependency can wake a pipeline task while another thread is closing or finalizing the same task.
HybridTaskScheduler::submit()synchronously callsPipelineTask::is_blockable()before enqueueing, andis_blockable()reads_sinkand_operators. After terminal close/finalize starts releasing those resources, a late submit can dereference cleared task resources and crash.This PR adds a small submit gate on
PipelineTask. Terminal close/finalize closes that gate under a dedicated lock, andHybridTaskScheduler::submit()checks it under the same lock before callingis_blockable(). The lock only covers the blockable check; it does not cover the actual queue submit.Release note
None
Check List (For Author)
./run-be-ut.sh --run --filter=PipelineTaskTest.TEST_FINALIZED_TASK_REJECTS_HYBRID_SUBMITbuild-support/check-format.shCLANG_TIDY_BINARY=/mnt/disk6/common/ldb_toolchain_028/bin/clang-tidy build-support/run-clang-tidy.sh --base upstream/master --build-dir be/ut_build_ASANgit diff --check