Search before asking
Version
doris-4.0.5-rc01-59de8c4c524
Git commit: 59de8c4
JDK: OpenJDK 17.0.18+8 (/opt/jdk-17.0.18+8/bin/java)
OS: Linux x86_64
What's Wrong?
Our BE node (BackendId: 1780885838226, Host: 10.66.7.1) crashes repeatedly with SIGSEGV on version 4.0.5-rc01. The crashes are NOT caused by the Linux OOM killer (dmesg has no matching records).
Two distinct crash patterns observed in be.out:
Pattern 1 – Query execution (most frequent)
SIGSEGV at address @0x38 in bthread worker thread:
bthread::TaskGroup::sched_to → run_main_task → worker_thread
Pattern 2 – Background cache management (2026-06-25 00:20)
SIGSEGV at address @0x8 during Page Cache capacity adjustment:
Daemon::cache_adjust_capacity_thread
→ CacheManager::for_each_cache_refresh_capacity
→ LRUCache::set_capacity
→ SegmentFooterPB::~SegmentFooterPB
Most recent production incident (2026-06-25):
- 08:11:34 BE started (PID 419300)
- ~13:40:22 BE crashed (Aborted at unix time 1782366022)
- 13:41:10 Client ETL job failed: (1105, 'errCode = 2, detailMessage = backend 1780885838226 is down')
- 13:41:55 BE restarted automatically
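The unix timestamp in the abort line can be cross-checked against this timeline. A minimal sketch, assuming the wall-clock times in this report are CST (UTC+8), as stated for the ETL task runtime; the helper name `abort_time_cst` is ours and purely illustrative:

```python
from datetime import datetime, timedelta, timezone

# Assumption: the wall-clock times in this report are China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))

def abort_time_cst(unix_seconds: int) -> str:
    """Convert the 'Aborted at unix time' value from be.out to a CST wall-clock string."""
    return datetime.fromtimestamp(unix_seconds, tz=CST).strftime("%Y-%m-%d %H:%M:%S")

print(abort_time_cst(1782366022))  # 2026-06-25 13:40:22
```

The result matches the ~13:40:22 crash time listed above, about 48 seconds before the client saw "backend is down".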
Crash frequency on this single BE (same version):
| Date | Query id | Stack |
|------------|---------------------------------------|--------------------------------|
| 2026-06-22 | 6c0e5ba4c3ab45f2-a3734229ec2d08ba | bthread SIGSEGV @0x38 |
| 2026-06-23 | 0-0 | bthread SIGSEGV @0x38 |
| 2026-06-24 | 7133b64a2c24c1d-942bd6304aa27668 | bthread SIGSEGV @0x38 |
| 2026-06-24 | 594b7141026d19bf-7a05c24fda60e79f | bthread SIGSEGV @0x38 |
| 2026-06-25 | 0-0 | Page Cache/SegmentFooterPB @0x8 |
| 2026-06-25 | 0-0 | bthread SIGSEGV @0x38 |
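To make the distribution explicit, the Stack column of the table can be tallied by faulting address. This is only a sketch: the input list is copied from the table above, and the function name `tally_fault_addresses` is hypothetical. The same regex could be run over be.out lines, assuming the fault address appears there in the `@0x...` form used in this report.

```python
import re
from collections import Counter

# Stack column of the crash table above, one entry per crash, in table order.
CRASHES = [
    "bthread SIGSEGV @0x38",
    "bthread SIGSEGV @0x38",
    "bthread SIGSEGV @0x38",
    "bthread SIGSEGV @0x38",
    "Page Cache/SegmentFooterPB @0x8",
    "bthread SIGSEGV @0x38",
]

# Matches a faulting address written as "@0x<hex digits>".
FAULT_ADDR = re.compile(r"@(0x[0-9a-fA-F]+)")

def tally_fault_addresses(lines):
    """Count crashes per faulting address found in the given lines."""
    counts = Counter()
    for line in lines:
        match = FAULT_ADDR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

print(tally_fault_addresses(CRASHES))
```

Five of the six recorded crashes fault at 0x38 and one at 0x8. Both are small offsets from a null address, which is why we read them as two separate code paths each dereferencing an invalid pointer rather than as random memory corruption; that interpretation is our assumption, not a confirmed diagnosis.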
BE environment:
- CPU: 32 cores, Memory: ~125 GB
- TabletNum: 17860, DataUsedCapacity: ~585 GB, Disk UsedPct: ~35%
- Workload mode: mix
Stack trace – most recent crash (2026-06-25 ~13:40, before restart at 13:41:55):
What You Expected?
BE should remain stable under heavy query/load pressure. If resource limits are exceeded, Doris should fail the query gracefully (e.g., memory limit exceeded, query cancelled), not crash the entire BE process with SIGSEGV.
After a failure, clients should receive a query-level error, not "backend is down" due to process crash.
How to Reproduce?
We have not isolated a minimal reproducible case yet. Crashes appear under sustained production load on a single BE node running 4.0.5-rc01.
Observed trigger pattern:
- BE runs for several hours under mixed query/load workload
- A heavy INSERT INTO ... SELECT ETL job starts
- BE crashes with SIGSEGV in bthread::TaskGroup::sched_to
- FE reports "backend is down" to clients
- BE is restarted (~1 minute later)
Recent failing workload (high level, no full SQL):
- Job: ETL via DolphinScheduler (pre environment)
- SQL pattern:
- TRUNCATE TABLE dws.dws_finance_sku_order_time_expense_stats_h
- Two large INSERT INTO ... SELECT statements (is_combination=0 and is_combination=1)
- Source: large DWD fact tables (dwd_finance_order_expense_*)
- Operations: GROUP BY on 20+ columns, UNION ALL, multiple LEFT JOINs (dim tables + ods table)
- Task runtime before failure: ~3 minutes (13:37:58 → 13:41:10 CST)
Cluster info:
- Single affected BE: 10.66.7.1 (BackendId 1780885838226)
- 17860 tablets on this BE (relatively heavy)
- No disk pressure (UsedPct ~35%)
- No Linux OOM kill records in dmesg
What we ruled out:
- Linux OOM killer: dmesg shows no OOM/kill records around crash time
- Disk full: UsedPct ~35%
- Manual shutdown: isShutdown=false in SHOW BACKENDS
- One-off incident: the same BE crashed at least 6 times between 2026-06-22 and 2026-06-25, with the same two stack patterns
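For anyone who wants to repeat the OOM-killer check on their own node, the sketch below shows the kind of filter we mean. The phrases in the pattern are the ones the Linux kernel typically logs when the OOM killer fires, but exact wording varies by kernel version, so treat the pattern as an assumption. The function name `find_oom_lines` and the sample strings are hypothetical and were not taken from the affected host.

```python
import re

# Phrases commonly logged by the Linux kernel on OOM-killer activity.
# Assumption: wording differs between kernel versions; extend the pattern as needed.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(?:ed)? process",
    re.IGNORECASE,
)

def find_oom_lines(dmesg_text):
    """Return the lines of a dmesg dump that look like OOM-killer activity."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]

# Hypothetical inputs for illustration only.
clean_log = "usb 1-1: new high-speed USB device\neth0: link up"
oom_log = "eth0: link up\nOut of memory: Killed process 4242 (example_proc)"

print(find_oom_lines(clean_log))  # []
print(len(find_oom_lines(oom_log)))  # 1
```

In practice the input would be the saved output of `dmesg -T` covering the crash window. An empty result around 13:40 is what we mean by "no OOM/kill records".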
To help reproduce:
We can provide full be.out crash logs and SQL privately if needed. The crash also occurs without a specific query id (Query id: 0-0), suggesting it may be triggered by general load rather than one specific SQL shape.
Anything Else?
Impact:
- Production ETL jobs fail with "backend is down"
- BE requires repeated manual/automatic restarts
- Data risk: jobs use TRUNCATE before INSERT, leaving target table empty/incomplete on failure
Questions for maintainers:
- Are there known fixes for bthread::TaskGroup::sched_to SIGSEGV (@0x38) in 4.0.x after commit 59de8c4?
- Is the SegmentFooterPB destructor crash during cache_adjust_capacity_thread() a known issue in 4.0.5-rc01?
- Is 4.0.5-rc01 recommended for production? Which stable version should we upgrade to?
- Any workaround besides version upgrade (cache/memory config, query-side mitigation)?
Attachments we can provide:
- Full be.out crash sections
- be.WARNING / be.INFO around 13:40-13:42 CST on 2026-06-25
- SHOW BACKENDS output
- Full SQL text (privately if needed)
Are you willing to submit PR?
Code of Conduct