feat(health): forward Redfish diagnostic payloads from health logs#2831
feat(health): forward Redfish diagnostic payloads from health logs#2831jayzhudev wants to merge 8 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds optional Redfish diagnostic payload forwarding across health log collection and sink emission. It introduces diagnostic record construction, per-sink ChangesRedfish Diagnostic Payload Forwarding
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/health/src/sink/otlp.rs (1)
121-138: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftJoin/supervise drain tasks instead of detached spawns.
At Line 127 and Line 138, both drain tasks are spawned and their handles are dropped. If either task panics/exits, export can silently stop with no propagation path.
As per coding guidelines, “Avoid spawning background tasks without joining them via
JoinHandle::join()or adding them to aJoinSetwhich is later awaited withJoinSet::join_all()to ensure panics propagate.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/sink/otlp.rs` around lines 121 - 138, The spawned handles for both the OtlpDrainTask and OtlpMetricsDrainTask drain tasks are being dropped without being joined or supervised, which means panics in these background tasks will not propagate. Capture the JoinHandle returned by each handle.spawn() call for drain.run() and metrics_drain.run(), and either store them in a JoinSet that is later awaited with join_all() or explicitly join them to ensure panic propagation. This ensures that if either drain task panics or fails, the error will be properly propagated instead of silently stopping export.Source: Coding guidelines
🧹 Nitpick comments (1)
crates/health/src/collectors/logs/diagnostic.rs (1)
158-220: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdopt table-driven tests for the diagnostic mapping cases.
Lines 158-220 exercise one mapping operation with multiple input variants; converting these into
scenarios!/check_caseswould better match repository test conventions and make edge-case growth cheaper.As per coding guidelines, “Prefer table-driven tests using the
carbide-test-supportcrate withscenarios!for fallible operations andvalue_scenarios!for total operations.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/collectors/logs/diagnostic.rs` around lines 158 - 220, The test functions redfish_enum_string_uses_serde_wire_spelling, diagnostic_payload_fields_preserve_opaque_payload, and diagnostic_payload_fields_skip_absent_payload_and_uri are currently written as individual test cases but should be consolidated into a single table-driven test following repository conventions. Refactor these tests to use the scenarios! macro from the carbide-test-support crate, creating a table of test input variants (payloads, expected attributes, etc.) and iterate through them in a single test function to verify each case, which will make the test structure more maintainable and easier to extend with additional edge cases.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/health/src/sink/otlp.rs`:
- Around line 208-227: The diagnostic record handling in the CollectorEvent::Log
match arm bypasses mapper-based deduplication when has_included_diagnostics is
true by using only a sequence-based key instead of calling mapper.queue_key().
This causes parent log events to lose deduplication. Modify the logic to ensure
that mapper.queue_key() is always called on the parent log for deduplication
purposes (using context.endpoint_key and record.attributes), and separately
track the diagnostic sequence if needed for diagnostic-specific queuing. Ensure
the parent log's dedup key is preserved even when diagnostic enrichment is
applied, so repeated parent events continue to deduplicate correctly through the
mapper.
---
Outside diff comments:
In `@crates/health/src/sink/otlp.rs`:
- Around line 121-138: The spawned handles for both the OtlpDrainTask and
OtlpMetricsDrainTask drain tasks are being dropped without being joined or
supervised, which means panics in these background tasks will not propagate.
Capture the JoinHandle returned by each handle.spawn() call for drain.run() and
metrics_drain.run(), and either store them in a JoinSet that is later awaited
with join_all() or explicitly join them to ensure panic propagation. This
ensures that if either drain task panics or fails, the error will be properly
propagated instead of silently stopping export.
---
Nitpick comments:
In `@crates/health/src/collectors/logs/diagnostic.rs`:
- Around line 158-220: The test functions
redfish_enum_string_uses_serde_wire_spelling,
diagnostic_payload_fields_preserve_opaque_payload, and
diagnostic_payload_fields_skip_absent_payload_and_uri are currently written as
individual test cases but should be consolidated into a single table-driven test
following repository conventions. Refactor these tests to use the scenarios!
macro from the carbide-test-support crate, creating a table of test input
variants (payloads, expected attributes, etc.) and iterate through them in a
single test function to verify each case, which will make the test structure
more maintainable and easier to extend with additional edge cases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a63ac0d5-1ac5-4f62-ac73-4450662dc15d
📒 Files selected for processing (17)
crates/health/benches/collector_pipeline.rscrates/health/benches/sink_pipeline.rscrates/health/example/config.example.tomlcrates/health/src/collectors/logs/diagnostic.rscrates/health/src/collectors/logs/mod.rscrates/health/src/collectors/logs/periodic.rscrates/health/src/collectors/logs/sse.rscrates/health/src/config.rscrates/health/src/discovery/context.rscrates/health/src/discovery/spawn.rscrates/health/src/lib.rscrates/health/src/otlp/convert.rscrates/health/src/sink/events.rscrates/health/src/sink/log_file.rscrates/health/src/sink/mod.rscrates/health/src/sink/otlp.rscrates/health/src/sink/tracing.rs
b579628 to
ac30bd1
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/health/src/sink/otlp.rs`:
- Around line 203-214: In the CollectorEvent::Log branch (and similarly in the
branches mentioned at lines 507-540), the current implementation uses a single
dedup key for both the parent log record and any diagnostic records. To fix
this, separate the enqueuing logic: enqueue the parent log record with the
existing queue_key from self.mapper.queue_key to maintain deduplication of
parent events, then enqueue diagnostic records separately with distinct keys
that bypass deduplication so they are appended rather than replaced. Update the
return statement to handle appending multiple records instead of returning a
single key-event pair, and update the test to assert that diagnostic records are
preserved and appended across bursty duplicate parent events rather than being
dropped.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 778c2b9c-b7cf-43fe-815d-e3c7ab57ca1a
📒 Files selected for processing (17)
crates/health/benches/collector_pipeline.rscrates/health/benches/sink_pipeline.rscrates/health/example/config.example.tomlcrates/health/src/collectors/logs/diagnostic.rscrates/health/src/collectors/logs/mod.rscrates/health/src/collectors/logs/periodic.rscrates/health/src/collectors/logs/sse.rscrates/health/src/config.rscrates/health/src/discovery/context.rscrates/health/src/discovery/spawn.rscrates/health/src/lib.rscrates/health/src/otlp/convert.rscrates/health/src/sink/events.rscrates/health/src/sink/log_file.rscrates/health/src/sink/mod.rscrates/health/src/sink/otlp.rscrates/health/src/sink/tracing.rs
✅ Files skipped from review due to trivial changes (3)
- crates/health/benches/collector_pipeline.rs
- crates/health/src/collectors/logs/mod.rs
- crates/health/benches/sink_pipeline.rs
🚧 Files skipped from review as they are similar to previous changes (13)
- crates/health/src/discovery/context.rs
- crates/health/src/sink/mod.rs
- crates/health/src/lib.rs
- crates/health/src/discovery/spawn.rs
- crates/health/src/config.rs
- crates/health/src/collectors/logs/sse.rs
- crates/health/src/otlp/convert.rs
- crates/health/src/sink/events.rs
- crates/health/src/collectors/logs/diagnostic.rs
- crates/health/example/config.example.toml
- crates/health/src/collectors/logs/periodic.rs
- crates/health/src/sink/tracing.rs
- crates/health/src/sink/log_file.rs
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/health/src/sink/otlp.rs`:
- Around line 170-177: Diagnostic records are currently given unique queue keys
in the OTLP sink, which bypasses DedupQueue’s bounded latest-wins behavior and
can grow memory unbounded during slowdowns. Update the diagnostic enqueue path
in the otlp sink so it reuses the existing endpoint key in save_latest, or add
an explicit drop/cap policy for diagnostics before enqueueing. Keep the fix
localized around the diagnostic_sequence / diagnostic_key logic in the otlp sink
implementation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: baa935e3-2025-4a8f-8836-9ef8a21446cd
📒 Files selected for processing (8)
crates/health/src/collectors/logs/diagnostic.rscrates/health/src/collectors/logs/sse.rscrates/health/src/config.rscrates/health/src/otlp/convert.rscrates/health/src/sink/events.rscrates/health/src/sink/log_file.rscrates/health/src/sink/otlp.rscrates/health/src/sink/tracing.rs
🚧 Files skipped from review as they are similar to previous changes (7)
- crates/health/src/otlp/convert.rs
- crates/health/src/collectors/logs/sse.rs
- crates/health/src/sink/tracing.rs
- crates/health/src/config.rs
- crates/health/src/sink/events.rs
- crates/health/src/sink/log_file.rs
- crates/health/src/collectors/logs/diagnostic.rs
49b980f to
86d437a
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
crates/health/src/sink/otlp.rs (1)
166-174: 🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy liftDiagnostic enqueue key still drops distinct payloads per endpoint.
At Line 168, using a single
"{endpoint_key}|diagnostic"key makes diagnostics latest-wins per endpoint, so multiple different diagnostic records overwrite each other before drain. This contradicts the OTLP requirement to preserve parent-log dedup while appending diagnostics, and Line 535+ now locks in the dropping behavior via test expectations.Also applies to: 535-569
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/sink/otlp.rs` around lines 166 - 174, The diagnostic enqueue key in `crates/health/src/sink/otlp.rs` is too coarse because `save_latest` uses a single `diagnostic_key` per `context.endpoint_key`, causing distinct diagnostic payloads to overwrite each other before drain. Update the enqueue logic in the OTLP sink so diagnostics are still bounded by endpoint but are keyed in a way that preserves multiple distinct diagnostic records alongside the parent log dedup behavior, and then adjust the related expectations in `diagnostic` handling/tests near the existing `save_latest` usage and the `535+` test block to match the new behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@crates/health/src/sink/otlp.rs`:
- Around line 166-174: The diagnostic enqueue key in
`crates/health/src/sink/otlp.rs` is too coarse because `save_latest` uses a
single `diagnostic_key` per `context.endpoint_key`, causing distinct diagnostic
payloads to overwrite each other before drain. Update the enqueue logic in the
OTLP sink so diagnostics are still bounded by endpoint but are keyed in a way
that preserves multiple distinct diagnostic records alongside the parent log
dedup behavior, and then adjust the related expectations in `diagnostic`
handling/tests near the existing `save_latest` usage and the `535+` test block
to match the new behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ec73a8a5-bc02-4769-84ff-1e85ef8cfefa
📒 Files selected for processing (4)
crates/health/example/config.example.tomlcrates/health/src/config.rscrates/health/src/sink/events.rscrates/health/src/sink/otlp.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- crates/health/src/sink/events.rs
- crates/health/src/config.rs
- crates/health/example/config.example.toml
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/health/src/sink/otlp.rs (1)
124-135: 🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy liftJoin or lifecycle-manage spawned OTLP drain tasks.
handle.spawn(drain.run())andhandle.spawn(metrics_drain.run())drop theirJoinHandles, so panics won’t propagate and task shutdown cannot be coordinated.Please route these through a lifecycle-managed
JoinSet(or retain handles and join/cancel explicitly).As per coding guidelines, “Avoid spawning background tasks without joining them via
JoinHandle::join()or adding them to aJoinSet…”. As per path instructions, Rust reviews should enforce “joined/cancellable background tasks.”🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/sink/otlp.rs` around lines 124 - 135, The OTLP drain tasks are spawned without any lifecycle management, so their JoinHandles are dropped and shutdown/panic handling can’t be coordinated. Update the spawning logic around drain.run() and OtlpMetricsDrainTask::run() to keep them joined or cancellable, preferably by adding both tasks to a JoinSet or retaining the JoinHandle values and explicitly joining/canceling them during shutdown.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/health/src/sink/otlp.rs`:
- Around line 124-135: The OTLP drain tasks are spawned without any lifecycle
management, so their JoinHandles are dropped and shutdown/panic handling can’t
be coordinated. Update the spawning logic around drain.run() and
OtlpMetricsDrainTask::run() to keep them joined or cancellable, preferably by
adding both tasks to a JoinSet or retaining the JoinHandle values and explicitly
joining/canceling them during shutdown.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: eebcfd8e-ab34-4c37-9a30-3a4366b041b5
📒 Files selected for processing (7)
crates/health/example/config.example.tomlcrates/health/src/collectors/logs/periodic.rscrates/health/src/collectors/logs/sse.rscrates/health/src/config.rscrates/health/src/sink/events.rscrates/health/src/sink/otlp.rscrates/health/src/sink/tracing.rs
🚧 Files skipped from review as they are similar to previous changes (5)
- crates/health/example/config.example.toml
- crates/health/src/sink/tracing.rs
- crates/health/src/collectors/logs/periodic.rs
- crates/health/src/sink/events.rs
- crates/health/src/collectors/logs/sse.rs
|
@coderabbitai review |
✅ Action performedReview finished.
|
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Document the new diagnostic helpers and tests so doc coverage stays above the required threshold. Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Preserve every diagnostic payload while keeping parent log records on mapper-based deduplication. Removed diagnostic_sequence. Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
Keep diagnostic OTLP records on the existing latest-wins endpoint key while preserving parent log mapper dedup. Remove unrelated OTLP zero-value validation from this diagnostic cleanup and fill doc comments for touched health types. Added docstrings. Signed-off-by: Jay Zhu <jayzhu@nvidia.com>
0167816 to
9010ca9
Compare
Added opt-in support to include Redfish diagnostic payloads in selected log sinks. When enabled,
healthcarries CPER/diagnostic data through as opaque log data. Parsing and interpretation are intentionally kept outside this crate.Implementation:
DiagnosticData,DiagnosticDataType, OEM type,AdditionalDataURI, size metadata, and parent correlation fields from SSE and periodic log records.DiagnosticLogRecordfield toLogRecord.include_diagnostics = falseconfig for tracing, log file, and OTLP sinks.Related issues
Closes #2817
Type of Change
No emitted behavior change when diagnostic payloads are disabled.
Breaking Changes
Testing