CASSANDRA-21134: Direct I/O for background SSTable writes#4815

Open
samueldlightfoot wants to merge 1 commit into apache:cassandra-6.0 from samueldlightfoot:CASSANDRA-21134-direct-compaction-writes

Contributor

@samueldlightfoot samueldlightfoot commented May 17, 2026

CASSANDRA-21134: Direct I/O for background SSTable writes

Summary

Opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for write-once read-never data. Memtable flushes remain buffered (hot data benefits from the cache).

background_write_disk_access_mode: direct    # default: standard
direct_write_buffer_size: 1MiB                # aligned up to FS block size; auto-grows to chunk_length
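
As a rough illustration of the sizing rule in the comment above, the sketch below grows the configured size to at least one chunk and then rounds up to the filesystem block size. The class and method names are hypothetical, and the order of the two steps is an assumption; the actual logic lives in the patch's config wiring.

```java
/** Hypothetical helper; names are illustrative and not taken from the patch. */
final class DirectBufferSizing
{
    /** Rounds value up to the next multiple of blockSize, which must be a power of two. */
    static int alignUp(int value, int blockSize)
    {
        if (blockSize <= 0 || Integer.bitCount(blockSize) != 1)
            throw new IllegalArgumentException("Block size must be a power of two: " + blockSize);
        // -blockSize equals ~(blockSize - 1), i.e. a mask clearing the low bits.
        return Math.addExact(value, blockSize - 1) & -blockSize;
    }

    /** Assumed rule: grow to at least one chunk, then align up to the block size. */
    static int effectiveBufferSize(int configuredBytes, int chunkLengthBytes, int fsBlockSize)
    {
        return alignUp(Math.max(configuredBytes, chunkLengthBytes), fsBlockSize);
    }

    public static void main(String[] args)
    {
        // 1 MiB configured, 16 KiB chunks, 4 KiB blocks: already aligned and large enough.
        System.out.println(effectiveBufferSize(1 << 20, 16 << 10, 4096));
        // An undersized buffer is grown to the chunk length.
        System.out.println(effectiveBufferSize(1000, 16 << 10, 4096));
    }
}
```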

Gated by (1) config, (2) table compression enabled, (3) an OperationType allowlist (DataComponent#DIRECT_WRITE_SUPPORT). Selection is central in DataComponent.buildWriter; producers are unchanged.
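
A minimal sketch of the three-way gate described above. The enums are reduced stand-ins (Cassandra's real OperationType has many more constants), and `useDirectWriter` is a hypothetical name rather than the signature of `DataComponent.buildWriter`.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative only; not the patch's DataComponent code. */
final class DirectWriteGate
{
    enum DiskAccessMode { STANDARD, DIRECT }

    // Reduced stand-in for OperationType; only a few constants are shown.
    enum Op { FLUSH, COMPACTION, SCRUB, STREAM }

    // Stand-in for the allowlist the description places in DataComponent.
    static final Set<Op> ALLOWLIST = EnumSet.of(Op.COMPACTION, Op.STREAM);

    /** All three gates must pass: config mode, table compression, allowlisted operation. */
    static boolean useDirectWriter(DiskAccessMode mode, boolean compressionEnabled, Op op)
    {
        return mode == DiskAccessMode.DIRECT && compressionEnabled && ALLOWLIST.contains(op);
    }

    public static void main(String[] args)
    {
        System.out.println(useDirectWriter(DiskAccessMode.DIRECT, true, Op.COMPACTION));
        System.out.println(useDirectWriter(DiskAccessMode.DIRECT, true, Op.FLUSH));
    }
}
```

Keeping the decision in one predicate is what lets producers stay unchanged: they ask for a writer and never inspect the mode themselves.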

Performance

Benchmark results are attached to the JIRA. They show significant p99 read latency improvements under throttled compaction.

Operations covered (DIO eligible)

| OperationType | End-to-end test |
| --- | --- |
| WRITE | CQLSSTableWriterDaemonTest (parameterised on disk mode) |
| COMPACTION | CompactionsTest (parameterised on disk mode) |
| MAJOR_COMPACTION | CompactionsTest.testCompactionWithSizeLimitedRewriter |
| CLEANUP, GARBAGE_COLLECT, TOMBSTONE_COMPACTION, UPGRADE_SSTABLES | CompactionsTest (transitive) |
| ANTICOMPACTION | AntiCompactionTest.testAntiCompactionWithCompressedTableAndDirectWrites |
| STREAM | StreamingDirectWriteTest |

The allowlist is exhaustive: any new OperationType with writesData == true that is not classified fails static initialization (AssertionError).
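
The exhaustiveness check could look roughly like the sketch below. The enums are simplified stand-ins for OperationType and the eligibility enum, and `checkExhaustive` is a hypothetical name; the point is only that an unclassified data-writing operation throws during class initialization.

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative only; not the patch's DataComponent code. */
final class ExhaustiveAllowlist
{
    enum Support { SUPPORTED, UNSUPPORTED_CORRECTNESS, UNSUPPORTED_POLICY }

    // Reduced stand-in for OperationType and its writesData flag.
    enum Op
    {
        FLUSH(true), COMPACTION(true), SCRUB(true), VERIFY(false);

        final boolean writesData;

        Op(boolean writesData) { this.writesData = writesData; }
    }

    static final Map<Op, Support> CLASSIFICATION = new EnumMap<>(Op.class);

    static
    {
        CLASSIFICATION.put(Op.COMPACTION, Support.SUPPORTED);
        CLASSIFICATION.put(Op.FLUSH, Support.UNSUPPORTED_POLICY);
        CLASSIFICATION.put(Op.SCRUB, Support.UNSUPPORTED_CORRECTNESS);
        // Runs once at class load; a missing entry aborts initialization.
        checkExhaustive(CLASSIFICATION);
    }

    /** Throws if any operation that writes data has no classification. */
    static void checkExhaustive(Map<Op, Support> classification)
    {
        for (Op op : Op.values())
            if (op.writesData && !classification.containsKey(op))
                throw new AssertionError("Unclassified data-writing operation: " + op);
    }
}
```

Throwing `AssertionError` explicitly, rather than using the `assert` keyword, keeps the check active even when the JVM runs without `-ea`.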

Operations NOT covered

| Path | Classification | Reason | Coverage |
| --- | --- | --- | --- |
| FLUSH (memtable) | UNSUPPORTED_POLICY | Just-flushed data is hot; keep it in the page cache. | DataComponentDirectWriteSelectionTest |
| SCRUB | UNSUPPORTED_CORRECTNESS | tryAppend needs mark() / resetAndTruncate(), which DIO cannot satisfy. | DataComponentDirectWriteSelectionTest |
| Zero-Copy Streaming | n/a (path bypass) | Entire-SSTable streaming bypasses DataComponent.buildWriter. | StreamingDirectWriteTest (disables ZCS) |
| Uncompressed writers | n/a (path bypass) | Only CompressedSequentialWriter has a DIO subclass. | DataComponentDirectWriteSelectionTest (compression gate) |

Removing an UNSUPPORTED_CORRECTNESS entry requires code changes; UNSUPPORTED_POLICY is a policy decision.

Key code

  • io/DirectIoSupport.java — eligibility enum (SUPPORTED / UNSUPPORTED_CORRECTNESS /
    UNSUPPORTED_POLICY / NOT_APPLICABLE).
  • io/sstable/format/DataComponent.java — selection, allowlist, exhaustiveness check;
    per-op first-activation log.
  • io/compress/DirectCompressedSequentialWriter.java — new writer; aligned buffers,
    mark()/resetAndTruncate() unsupported.
  • io/compress/CompressedSequentialWriter.java — refactored so the DIO subclass can
    override the write-chunk path; writeChunk contract documented and asserted.
  • config/Config.java, config/DatabaseDescriptor.java — new knobs, validation, startup
    wiring; buffer size aligned to FS block size, auto-grown to chunk length.
  • service/StartupChecks.java — fails fast if direct is requested on a platform/FS
    that does not support O_DIRECT.
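
For readers unfamiliar with O_DIRECT from Java, here is a hedged sketch of a fail-fast probe using the JDK's `ExtendedOpenOption.DIRECT` and `ByteBuffer.alignedSlice` (both available since JDK 10). `DirectIoProbe` and `supportsDirectIo` are hypothetical names; this is not the code in StartupChecks, only one way such a check can be written.

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative only; not the patch's StartupChecks code. */
final class DirectIoProbe
{
    /** Returns true if one aligned block can be written with O_DIRECT inside dir. */
    static boolean supportsDirectIo(Path dir)
    {
        Path probe = null;
        try
        {
            probe = Files.createTempFile(dir, "dio-probe", ".tmp");
            int blockSize = Math.toIntExact(Files.getFileStore(probe).getBlockSize());
            // Over-allocate, then slice so the buffer's start address is block-aligned.
            ByteBuffer block = ByteBuffer.allocateDirect(blockSize * 2).alignedSlice(blockSize);
            block.limit(blockSize); // write exactly one block
            try (FileChannel channel = FileChannel.open(probe, StandardOpenOption.WRITE, ExtendedOpenOption.DIRECT))
            {
                while (block.hasRemaining())
                    channel.write(block);
            }
            return true;
        }
        catch (IOException | RuntimeException e)
        {
            // Any failure (unsupported option, filesystem refusal, odd block size) counts as "no".
            return false;
        }
        finally
        {
            if (probe != null)
            {
                try { Files.deleteIfExists(probe); } catch (IOException ignored) { }
            }
        }
    }

    public static void main(String[] args)
    {
        Path dir = Path.of(System.getProperty("java.io.tmpdir"));
        System.out.println("O_DIRECT usable in " + dir + ": " + supportsDirectIo(dir));
    }
}
```

The result depends on the filesystem backing the directory (tmpfs, for example, commonly rejects O_DIRECT), so a real check should probe the configured data directories rather than a temporary directory.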

Tests introduced

  • Property-based DIO-writer sweep — write/read integrity and on-disk byte-identity
    vs. the buffered writer over compressors × chunk lengths × random payload sizes;
    seed-logged for repro (DirectCompressedSequentialWriterTest).
  • Parameterised buffer-size tests — three regimes pinning the distinct branches of
    flushCompleteBlocks (DirectCompressedSequentialWriterTest).
  • Selection-matrix tests — per-OperationType eligibility, allowlist exhaustiveness,
    compression gate, config-mode gate (DataComponentDirectWriteSelectionTest).
  • End-to-end coverage per allowlist arm: WRITE (CQLSSTableWriterDirectWriteTest),
    STREAM (in-JVM dtest, StreamingDirectWriteTest), and the compaction family +
    ANTICOMPACTION (extended CompactionsTest, AntiCompactionTest).
  • Regression guards — constructor channel-leak protection, non-power-of-two
    block-size rejection, once-per-JVM undersized-buffer warn, SCRUB-gating canaries
    (DirectCompressedSequentialWriterTest).
  • Resource-leak detection: BufferPoolMXBean check that the off-heap aligned
    buffer is returned on close (DirectCompressedSequentialWriterTest).
  • Config validation — new YAML knobs (mode parsing, buffer-size bounds, defaults)
    in DatabaseDescriptorTest.
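
To show the mechanism behind the BufferPoolMXBean check, the sketch below reads the JVM's "direct" buffer pool before and after an off-heap allocation. `DirectPoolUsage` is a hypothetical name, and how the patch's test frees the writer's buffer is not shown here; plain `ByteBuffer.allocateDirect` memory is only released once the buffer is garbage collected.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

/** Illustrative only; not the patch's test code. */
final class DirectPoolUsage
{
    /** Bytes currently accounted to the JVM's "direct" buffer pool. */
    static long directMemoryUsed()
    {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class))
            if ("direct".equals(pool.getName()))
                return pool.getMemoryUsed();
        throw new IllegalStateException("No 'direct' buffer pool exposed by this JVM");
    }

    public static void main(String[] args)
    {
        long before = directMemoryUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        long during = directMemoryUsed();
        // A leak test would close the writer here and compare the pool against 'before'.
        System.out.println("direct pool grew by " + (during - before)
                           + " bytes while holding a buffer of capacity " + buffer.capacity());
    }
}
```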

Not in scope

  • Uncompressed SSTable writers.
  • ZCS streaming.

Reviewer notes

Findings from the Cassandra bug-hunting skills (Opus 4.7 xhigh & kimi-k2.6:cloud) were addressed prior to
review.

@samueldlightfoot samueldlightfoot force-pushed the CASSANDRA-21134-direct-compaction-writes branch from 3349322 to 005f4e1 Compare May 17, 2026 09:25
@samueldlightfoot samueldlightfoot marked this pull request as ready for review May 17, 2026 09:25
@samueldlightfoot samueldlightfoot force-pushed the CASSANDRA-21134-direct-compaction-writes branch 4 times, most recently from b95f990 to 3d95be1 Compare May 18, 2026 09:58
Adds an opt-in O_DIRECT write path for background SSTable producers,
bypassing the OS page cache for data that is unlikely to be re-read
soon after being written. Memtable flushes remain buffered.

Enabled via two new YAML knobs:
 - background_write_disk_access_mode: standard (default) | direct
 - direct_write_buffer_size: 1MiB (default; aligned up to FS block
   size, auto-grown to chunk_length)

The path is gated by config, table compression being enabled, and an
OperationType allowlist in DataComponent. The allowlist is exhaustive:
any new OperationType with writesData=true that is not classified will
fail static initialization.

Operations on the DIO path: COMPACTION, MAJOR_COMPACTION,
TOMBSTONE_COMPACTION, ANTICOMPACTION, GARBAGE_COLLECT, CLEANUP,
UPGRADE_SSTABLES, WRITE, STREAM (chunked receiver only).

Operations off the DIO path:
 - FLUSH (policy: just-flushed data is hot, keep in page cache)
 - SCRUB (correctness: tryAppend needs mark/resetAndTruncate)
 - Zero-Copy Streaming (bypasses DataComponent.buildWriter)
 - Uncompressed writers (only CompressedSequentialWriter has a DIO
   subclass in this change)

StartupChecks fails fast if 'direct' is requested on a platform/FS
that does not support O_DIRECT.

patch by Sam Lightfoot; reviewed by <reviewers> for CASSANDRA-21134
@samueldlightfoot samueldlightfoot force-pushed the CASSANDRA-21134-direct-compaction-writes branch from a49806c to ca8ef09 Compare May 19, 2026 17:23