CASSANDRA-21134: Direct I/O for background SSTable writes #4815
Open
samueldlightfoot wants to merge 1 commit into
Adds an opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for data that is unlikely to be re-read soon after being written. Memtable flushes remain buffered.

Enabled via two new YAML knobs:
- background_write_disk_access_mode: standard (default) | direct
- direct_write_buffer_size: 1MiB (default; aligned up to FS block size, auto-grown to chunk_length)

The path is gated by config, table compression being enabled, and an OperationType allowlist in DataComponent. The allowlist is exhaustive: any new OperationType with writesData=true that is not classified will fail static initialization.

Operations on the DIO path: COMPACTION, MAJOR_COMPACTION, TOMBSTONE_COMPACTION, ANTICOMPACTION, GARBAGE_COLLECT, CLEANUP, UPGRADE_SSTABLES, WRITE, STREAM (chunked receiver only).

Operations off the DIO path:
- FLUSH (policy: just-flushed data is hot, keep in page cache)
- SCRUB (correctness: tryAppend needs mark/resetAndTruncate)
- Zero-Copy Streaming (bypasses DataComponent.buildWriter)
- Uncompressed writers (only CompressedSequentialWriter has a DIO subclass in this change)

StartupChecks fails fast if 'direct' is requested on a platform/FS that does not support O_DIRECT.

patch by Sam Lightfoot; reviewed by <reviewers> for CASSANDRA-21134
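The gating and sizing rules in the commit message can be illustrated with a small standalone sketch. This is not the patch's code: `DirectWriteSelectionSketch`, the trimmed `Op` enum, `useDirectWrite`, and `effectiveBufferSize` are hypothetical names invented for illustration, and the order of the sizing steps (grow to the chunk length first, then align up to the block size) is an assumption. Only the eligibility value names and the three gates come from the description.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative sketch only; names and structure are assumptions, not the patch.
final class DirectWriteSelectionSketch {
    // Hypothetical, trimmed stand-in for Cassandra's OperationType.
    enum Op {
        COMPACTION(true), FLUSH(true), SCRUB(true), STREAM(true), VERIFY(false);
        final boolean writesData;
        Op(boolean writesData) { this.writesData = writesData; }
    }

    // Eligibility values as named in the PR description.
    enum Support { SUPPORTED, UNSUPPORTED_CORRECTNESS, UNSUPPORTED_POLICY, NOT_APPLICABLE }

    static final Map<Op, Support> DIRECT_WRITE_SUPPORT = new EnumMap<>(Op.class);
    static {
        DIRECT_WRITE_SUPPORT.put(Op.COMPACTION, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(Op.STREAM, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(Op.FLUSH, Support.UNSUPPORTED_POLICY);
        DIRECT_WRITE_SUPPORT.put(Op.SCRUB, Support.UNSUPPORTED_CORRECTNESS);
        // Fails class initialization if a data-writing operation is unclassified.
        checkExhaustive(DIRECT_WRITE_SUPPORT);
    }

    // Every operation that writes data must have an explicit classification.
    static void checkExhaustive(Map<Op, Support> support) {
        for (Op op : Op.values())
            if (op.writesData && !support.containsKey(op))
                throw new AssertionError("Unclassified data-writing operation: " + op);
    }

    // The three gates: config mode, table compression, operation allowlist.
    static boolean useDirectWrite(boolean directMode, boolean compressed, Op op) {
        return directMode && compressed && DIRECT_WRITE_SUPPORT.get(op) == Support.SUPPORTED;
    }

    // Assumed order: grow to at least one chunk, then round up to a block multiple.
    static int effectiveBufferSize(int configured, int fsBlockSize, int chunkLength) {
        int atLeastChunk = Math.max(configured, chunkLength);
        return Math.addExact(atLeastChunk, fsBlockSize - 1) / fsBlockSize * fsBlockSize;
    }
}
```

Under these assumptions, `useDirectWrite(true, true, Op.COMPACTION)` selects the direct path, while `FLUSH`, `SCRUB`, an uncompressed table, or `standard` mode all fall back to the buffered writer. Keeping the classification in one map means a newly added data-writing operation cannot silently default to either path.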
CASSANDRA-21134: Direct I/O for background SSTable writes
Summary
Opt-in O_DIRECT write path for background SSTable producers, bypassing the OS page cache for write-once, read-never data. Memtable flushes remain buffered (hot data benefits from the cache).
Gated by (1) config, (2) table compression enabled, (3) an OperationType allowlist (DataComponent#DIRECT_WRITE_SUPPORT). Selection is central in DataComponent.buildWriter; producers are unchanged.
Performance
Benchmark results are attached to the JIRA. They show significant p99 read latency improvements under throttled compaction.
Operations covered (DIO eligible)
- WRITE: CQLSSTableWriterDaemonTest (parameterised on disk mode)
- COMPACTION: CompactionsTest (parameterised on disk mode)
- MAJOR_COMPACTION: CompactionsTest.testCompactionWithSizeLimitedRewriter
- CLEANUP, GARBAGE_COLLECT, TOMBSTONE_COMPACTION, UPGRADE_SSTABLES: CompactionsTest (transitive)
- ANTICOMPACTION: AntiCompactionTest.testAntiCompactionWithCompressedTableAndDirectWrites
- STREAM: StreamingDirectWriteTest
The allowlist is exhaustive: any new OperationType with writesData == true that is not classified fails static initialization (AssertionError).
Operations NOT covered
- FLUSH (memtable): UNSUPPORTED_POLICY. Just-flushed data is hot, so it is kept in the page cache. Test: DataComponentDirectWriteSelectionTest.
- SCRUB: UNSUPPORTED_CORRECTNESS. tryAppend needs mark()/resetAndTruncate(), which DIO cannot satisfy. Test: DataComponentDirectWriteSelectionTest.
- Zero-Copy Streaming: bypasses DataComponent.buildWriter. Test: StreamingDirectWriteTest (disables ZCS).
- Uncompressed writers: only CompressedSequentialWriter has a DIO subclass. Test: DataComponentDirectWriteSelectionTest (compression gate).
Removing an UNSUPPORTED_CORRECTNESS entry requires code changes; UNSUPPORTED_POLICY is a policy decision.
Key code
- io/DirectIoSupport.java: eligibility enum (SUPPORTED / UNSUPPORTED_CORRECTNESS / UNSUPPORTED_POLICY / NOT_APPLICABLE).
- io/sstable/format/DataComponent.java: selection, allowlist, exhaustiveness check; per-op first-activation log.
- io/compress/DirectCompressedSequentialWriter.java: new writer; aligned buffers, mark()/resetAndTruncate() unsupported.
- io/compress/CompressedSequentialWriter.java: refactored so the DIO subclass can override the write-chunk path; writeChunk contract documented and asserted.
- config/Config.java, config/DatabaseDescriptor.java: new knobs, validation, startup wiring; buffer size aligned to FS block size, auto-grown to chunk length.
- service/StartupChecks.java: fails fast if direct is requested on a platform/FS that does not support O_DIRECT.
Tests introduced
- … vs. the buffered writer over compressors × chunk lengths × random payload sizes; seed-logged for repro (DirectCompressedSequentialWriterTest).
- … flushCompleteBlocks (DirectCompressedSequentialWriterTest).
- OperationType eligibility, allowlist exhaustiveness, compression gate, config-mode gate (DataComponentDirectWriteSelectionTest).
- WRITE (CQLSSTableWriterDirectWriteTest), STREAM (in-JVM dtest, StreamingDirectWriteTest), and the compaction family + ANTICOMPACTION (extended CompactionsTest, AntiCompactionTest).
- block-size rejection, once-per-JVM undersized-buffer warn, SCRUB-gating canaries (DirectCompressedSequentialWriterTest).
- BufferPoolMXBean check that the off-heap aligned buffer is returned on close (DirectCompressedSequentialWriterTest).
- … in DatabaseDescriptorTest.
Not in scope
Reviewer notes
Findings from the Cassandra bug-hunting skills (Opus 4.7 xhigh & kimi-k2.6:cloud) were addressed prior to
review.