[Bug] BE SIGSEGV crash in bthread::TaskGroup::sched_to and Page Cache on 4.0.5-rc01 #64826

Description

@iversonleo587-cpu

Search before asking

  • I had searched in the issues and found no similar issues.

Version

doris-4.0.5-rc01-59de8c4c524
Git commit: 59de8c4
JDK: OpenJDK 17.0.18+8 (/opt/jdk-17.0.18+8/bin/java)
OS: Linux x86_64

What's Wrong?

Our BE node (BackendId: 1780885838226, Host: 10.66.7.1) crashes repeatedly with SIGSEGV on version 4.0.5-rc01. The Linux OOM killer is not the cause: dmesg contains no matching records.
Two distinct crash patterns appear in be.out:

Pattern 1 – Query execution (most frequent)
SIGSEGV at address @0x38 in a bthread worker thread:
bthread::TaskGroup::sched_to → run_main_task → worker_thread

Pattern 2 – Background cache management (2026-06-25 00:20)
SIGSEGV at address @0x8 during Page Cache capacity adjustment:
Daemon::cache_adjust_capacity_thread
→ CacheManager::for_each_cache_refresh_capacity
→ LRUCache::set_capacity
→ SegmentFooterPB::~SegmentFooterPB

Most recent production incident (2026-06-25):

  • 08:11:34 BE started (PID 419300)
  • ~13:40:22 BE crashed (Aborted at unix time 1782366022)
  • 13:41:10 Client ETL job failed: (1105, 'errCode = 2, detailMessage = backend 1780885838226 is down')
  • 13:41:55 BE restarted automatically
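
The timeline above can be cross-checked against the "Aborted at unix time" value from be.out. The following is a minimal sketch, assuming the wall-clock times in this report are China Standard Time (UTC+8); the function name `abort_time_to_local` is ours and is not part of Doris.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the timestamps in this report are China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8), "CST")


def abort_time_to_local(unix_seconds: int) -> datetime:
    """Convert the 'Aborted at unix time' value from be.out to a CST datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=CST)


# Values taken from the incident timeline in this report.
crash = abort_time_to_local(1782366022)
client_failure = datetime(2026, 6, 25, 13, 41, 10, tzinfo=CST)
restart = datetime(2026, 6, 25, 13, 41, 55, tzinfo=CST)

print(crash.strftime("%Y-%m-%d %H:%M:%S %Z"))    # 2026-06-25 13:40:22 CST
print((client_failure - crash).total_seconds())  # 48.0
print((restart - crash).total_seconds())         # 93.0
```

Under that assumption the abort timestamp matches the reported crash time, the client saw "backend is down" 48 seconds after the crash, and the BE was back 93 seconds after it.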

Crash frequency on this single BE (same version):

| Date       | Query id                          | Stack                           |
|------------|-----------------------------------|---------------------------------|
| 2026-06-22 | 6c0e5ba4c3ab45f2-a3734229ec2d08ba | bthread SIGSEGV @0x38           |
| 2026-06-23 | 0-0                               | bthread SIGSEGV @0x38           |
| 2026-06-24 | 7133b64a2c24c1d-942bd6304aa27668  | bthread SIGSEGV @0x38           |
| 2026-06-24 | 594b7141026d19bf-7a05c24fda60e79f | bthread SIGSEGV @0x38           |
| 2026-06-25 | 0-0                               | Page Cache/SegmentFooterPB @0x8 |
| 2026-06-25 | 0-0                               | bthread SIGSEGV @0x38           |
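
To make the distribution in the table easier to read, here is a small sketch that tallies the rows by crash signature, by date, and by whether a query id was attached. The rows are copied from the table; the variable names are illustrative only.

```python
from collections import Counter

# Rows copied from the crash table in this report: (date, query id, crash signature).
CRASHES = [
    ("2026-06-22", "6c0e5ba4c3ab45f2-a3734229ec2d08ba", "bthread SIGSEGV @0x38"),
    ("2026-06-23", "0-0", "bthread SIGSEGV @0x38"),
    ("2026-06-24", "7133b64a2c24c1d-942bd6304aa27668", "bthread SIGSEGV @0x38"),
    ("2026-06-24", "594b7141026d19bf-7a05c24fda60e79f", "bthread SIGSEGV @0x38"),
    ("2026-06-25", "0-0", "Page Cache/SegmentFooterPB @0x8"),
    ("2026-06-25", "0-0", "bthread SIGSEGV @0x38"),
]

by_signature = Counter(signature for _, _, signature in CRASHES)
by_date = Counter(date for date, _, _ in CRASHES)
# Crashes that be.out reported with no query attached (Query id: 0-0).
without_query = [row for row in CRASHES if row[1] == "0-0"]

print(by_signature.most_common())
print(dict(by_date))
print(len(without_query), "of", len(CRASHES), "crashes have Query id 0-0")
```

Five of the six crashes carry the bthread @0x38 signature, and three of the six were logged with Query id 0-0.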

BE environment:
  • CPU: 32 cores, Memory: ~125 GB
  • TabletNum: 17860, DataUsedCapacity: ~585 GB, Disk UsedPct: ~35%
  • Workload mode: mix

Stack trace – most recent crash (2026-06-25 ~13:40, before restart at 13:41:55):

What You Expected?

BE should remain stable under heavy query/load pressure. If resource limits are exceeded, Doris should fail the query gracefully (e.g., memory limit exceeded, query cancelled), not crash the entire BE process with SIGSEGV.
After a failure, clients should receive a query-level error, not "backend is down" due to process crash.

How to Reproduce?

We have not isolated a minimal reproducible case yet. Crashes appear under sustained production load on a single BE node running 4.0.5-rc01.
Observed trigger pattern:

  1. BE runs for several hours under mixed query/load workload
  2. A heavy INSERT INTO ... SELECT ETL job starts
  3. BE crashes with SIGSEGV in bthread::TaskGroup::sched_to
  4. FE reports "backend is down" to clients
  5. BE is restarted (~1 minute later)

Recent failing workload (high level, no full SQL):
  • Job: ETL via DolphinScheduler (pre environment)
  • SQL pattern:
    • TRUNCATE TABLE dws.dws_finance_sku_order_time_expense_stats_h
    • Two large INSERT INTO ... SELECT statements (is_combination=0 and is_combination=1)
    • Source: large DWD fact tables (dwd_finance_order_expense_*)
    • Operations: GROUP BY on 20+ columns, UNION ALL, multiple LEFT JOINs (dim tables + ods table)
  • Task runtime before failure: ~3 minutes (13:37:58 → 13:41:10 CST)

Cluster info:
  • Single affected BE: 10.66.7.1 (BackendId 1780885838226)
  • 17860 tablets on this BE (relatively heavy)
  • No disk pressure (UsedPct ~35%)
  • No Linux OOM kill records in dmesg

What we ruled out:
  • Linux OOM killer: dmesg shows no OOM/kill records around crash time
  • Disk full: UsedPct ~35%
  • Manual shutdown: isShutdown=false in SHOW BACKENDS
  • One-off incident: same BE crashed 6+ times in 3 days with identical stack patterns
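
For anyone repeating the OOM-killer check on another node, the dmesg scan could be scripted roughly as below. This is a sketch under assumptions: the phrase list reflects messages commonly logged by Linux kernels when the OOM killer runs and may not be exhaustive, and `find_oom_lines` is a hypothetical helper, not an existing tool.

```python
import re

# Phrases the Linux kernel commonly logs when the OOM killer runs.
# Assumption: this list covers typical kernel versions; it is not exhaustive.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|out of memory|oom-kill|killed process",
    re.IGNORECASE,
)


def find_oom_lines(dmesg_text: str) -> list:
    """Return the lines of a dmesg dump that look like OOM-killer activity."""
    return [line for line in dmesg_text.splitlines() if OOM_PATTERN.search(line)]
```

Passing it the saved output of `dmesg -T` and getting an empty list back corresponds to the "no OOM/kill records" result reported above.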

To help reproduce:
We can provide full be.out crash logs and SQL privately if needed. The crash also occurs without a specific query id (Query id: 0-0), suggesting it may be triggered by general load rather than one specific SQL shape.

Anything Else?

Impact:

  • Production ETL jobs fail with "backend is down"
  • BE requires repeated manual/automatic restarts
  • Data risk: jobs use TRUNCATE before INSERT, leaving target table empty/incomplete on failure

Questions for maintainers:
  1. Are there known fixes for bthread::TaskGroup::sched_to SIGSEGV (@0x38) in 4.0.x after commit 59de8c4?
  2. Is the SegmentFooterPB destructor crash during cache_adjust_capacity_thread() a known issue in 4.0.5-rc01?
  3. Is 4.0.5-rc01 recommended for production? Which stable version should we upgrade to?
  4. Any workaround besides version upgrade (cache/memory config, query-side mitigation)?

Attachments we can provide:
  • Full be.out crash sections
  • be.WARNING / be.INFO around 13:40-13:42 CST on 2026-06-25
  • SHOW BACKENDS output
  • Full SQL text (privately if needed)

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct
