
feat: resume interrupted dataset generation runs (sync + async engine)#526

Open
przemekboruta wants to merge 33 commits into NVIDIA-NeMo:main from przemekboruta:main

Conversation


przemekboruta (Contributor) commented Apr 13, 2026

Summary

Closes #525

Adds resume: ResumeMode = ResumeMode.NEVER to DataDesigner.create() and DatasetBuilder.build(). Generation picks up from where the interrupted run left off — for both the sync and async engines.

from data_designer import DataDesigner, ResumeMode

dd = DataDesigner(...)
dd.add_column(...)

# First run — interrupted mid-way
results = dd.create(config_builder, num_records=10_000)

# After restart — picks up from the last completed batch/row-group
results = dd.create(config_builder, num_records=10_000, resume=ResumeMode.ALWAYS)

# Or: resume only if the config has not changed, otherwise start fresh
results = dd.create(config_builder, num_records=10_000, resume=ResumeMode.IF_POSSIBLE)

Changes

  • ArtifactStorage: new ResumeMode(StrEnum) enum (NEVER/ALWAYS/IF_POSSIBLE); new resume: ResumeMode = ResumeMode.NEVER field; resolved_dataset_name skips timestamp logic on ALWAYS/IF_POSSIBLE; new clear_partial_results(); new refresh_media_storage_path() re-points _media_storage.base_path after an IF_POSSIBLE → NEVER downgrade
  • DatasetBatchManager.start(): new start_batch, initial_actual_num_records, num_records_list, and original_target_num_records params (all defaulted, no breakage); num_records_list lets callers supply the exact per-batch sizes; original_target_num_records is persisted in each incremental metadata write so extension resumes can always recover the immutable original group boundaries
  • DatasetBuilder.build(): new resume: ResumeMode param; _load_resume_state() reads and validates metadata.json; _build_with_resume() skips completed batches (sync); _build_async() skips completed row groups (async); _check_resume_config_compatibility() returns a _ConfigCompatibility enum (COMPATIBLE/INCOMPATIBLE/NO_PRIOR_DATASET) and is called for both ALWAYS and IF_POSSIBLE: ALWAYS raises on INCOMPATIBLE, while IF_POSSIBLE starts fresh on INCOMPATIBLE or NO_PRIOR_DATASET; the resolved_dataset_name cache is invalidated on the IF_POSSIBLE downgrade; a partial-completion warning is emitted before the return in _build_async; _ResumeState carries original_target_num_records (immutable across extension writes) so _rg_size() always uses the correct original group boundaries
  • RowGroupBufferManager.__init__(): new initial_actual_num_records and initial_total_num_batches params to seed counters on resume
  • DatasetBuilder._find_completed_row_group_ids(): new helper that scans parquet-files/ for batch_*.parquet to determine which async row groups are already done
  • DatasetBuilder._build_async(): pre-computes the full row-group list with correct per-group sizes before passing it to _prepare_async_run; original groups use min(bs, target - rg*bs), extension groups use min(bs, ext_records - ext_idx*bs), replacing the skip_row_groups approach that incorrectly deducted buffer_size for non-aligned skipped groups; the total row-group count is now num_original_groups + ceil(extension_records/bs) instead of ceil(num_records/bs), fixing a false "already complete" when the extension fits within the last original group's slack
  • DatasetBuilder._build_with_resume(): passes num_records_list = original_sizes + extension_sizes to batch_manager.start(), giving the correct total batch count for non-aligned extensions; fixes the same false "already complete" on the sync path
  • finalize_row_group closure: now writes incremental metadata.json after every row-group checkpoint (not just at the end), making all async runs resumable if interrupted
  • DataDesigner.create(): exposes resume: ResumeMode and passes it through to ArtifactStorage and builder.build()
  • CLI create: new --resume / -r option (ResumeMode, default NEVER); GenerationController.run_create() accepts and forwards resume
  • bool return in _build_with_resume / _build_async: build() gates run_after_generation on the return value so processors are never re-run on an already-complete dataset
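
The per-group sizing rules above (original groups bounded by the original target, extension groups carved from the extension remainder) can be sketched as a standalone helper. The name row_group_sizes and the free-function form are illustrative, not the actual implementation:

```python
import math

def row_group_sizes(original_target: int, num_records: int, buffer_size: int) -> list[int]:
    """Sketch: per-group sizes for a resumed/extended run.

    Original groups keep the original target's boundaries; extension
    groups are sized from the extension remainder only.
    """
    num_original = math.ceil(original_target / buffer_size)
    sizes = [min(buffer_size, original_target - rg * buffer_size) for rg in range(num_original)]
    ext_records = num_records - original_target
    num_ext = math.ceil(ext_records / buffer_size) if ext_records > 0 else 0
    sizes += [min(buffer_size, ext_records - i * buffer_size) for i in range(num_ext)]
    return sizes
```

For the slack example from the PR (original_target=5, buffer_size=2, extended to num_records=6), this yields four groups of sizes [2, 2, 1, 1], so the total count is num_original_groups + ceil(extension_records/bs) = 4 rather than ceil(6/2) = 3, which is what prevents the false "already complete".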

ResumeMode semantics

  • NEVER (default): always start a fresh run; an existing dataset gets a timestamped directory
  • ALWAYS: resume from the last checkpoint; raise DatasetGenerationError if the config changed or the run is otherwise incompatible
  • IF_POSSIBLE: resume if the current config fingerprint matches the stored one; silently start fresh otherwise, including when no prior dataset exists or when the directory is empty (no error)

Validation and error cases

  • Missing metadata.json (interrupted before first batch): restarts from scratch (both engines)
  • num_records less than already-generated records → DatasetGenerationError
  • num_records between already-generated records and original target (i.e. actual <= num_records < target) → DatasetGenerationError; prevents negative extension_records in the async path which would silently truncate the dataset and corrupt metadata
  • num_records greater than original target is allowed (extends the dataset)
  • Extension crash window — pre-write_metadata (async): crash after an extension row group's parquet file is on disk but before the metadata write — initial_actual_num_records is derived from the filesystem using the correct per-group sizes, avoiding a negative size for extension groups
  • Extension crash window — post-write_metadata (async): crash after a successful incremental metadata write during an extension flips target_num_records to the extension target; original_target_num_records is a separate immutable field written once at the start of the original run and carried unchanged through all subsequent writes, so _rg_size() always uses the correct original group boundaries on any future resume
  • Non-aligned extension row-group sizes (async): when the original run was not aligned to buffer_size (last group has fewer than buffer_size records), extension row groups now receive correct sizes — the old skip_row_groups loop deducted buffer_size for every skipped group, leaving remaining short and causing extension groups to be undersized, silently producing a dataset shorter than requested
  • False "already complete" on non-aligned extension (both engines): original_target=5, buffer_size=2 produces 3 groups/batches; extending to num_records=6 gave ceil(6/2)=3 == len(completed)=3, triggering the already-complete branch and returning 5 records silently — fixed by computing the total as num_original_groups + ceil(extension_records/bs) on both paths
  • buffer_size mismatch → DatasetGenerationError
  • Column/model config changed + ALWAYS → DatasetGenerationError; with IF_POSSIBLE → silent fresh start, resolved_dataset_name cache invalidated so the fresh run gets a timestamped directory, and _media_storage re-initialised to the same path so image-column runs write to the correct directory
  • No existing dataset directory, or directory exists but is empty → IF_POSSIBLE starts fresh (no error); ALWAYS continues (handled by metadata check)
  • Missing builder_config.json → warning logged, config compatibility check skipped (treated as compatible)
  • Dataset already complete → warning logged, returns existing path without re-running processors (both engines)
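
The num_records rules in the list above condense into a small validation sketch (the function name and error messages are illustrative; only the inequalities come from the PR):

```python
class DatasetGenerationError(Exception):
    pass

def validate_resume_num_records(num_records: int, actual: int, original_target: int) -> None:
    # Fewer records than already generated: nothing sensible to resume into.
    if num_records < actual:
        raise DatasetGenerationError("num_records is below already-generated records")
    # actual <= num_records < original_target would make extension_records
    # negative on the async path, silently truncating the dataset.
    if num_records < original_target:
        raise DatasetGenerationError("num_records is below the original target")
    # num_records >= original_target is allowed: equal resumes, greater extends.
```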

Test plan

  • test_resolved_dataset_name_resume_uses_existing_folder
  • test_resolved_dataset_name_resume_raises_when_no_existing_folder
  • test_resolved_dataset_name_resume_raises_when_folder_is_empty
  • test_resolved_dataset_name_if_possible_uses_existing_folder
  • test_resolved_dataset_name_if_possible_uses_clean_name_when_no_existing_folder
  • test_clear_partial_results_removes_partial_folder
  • test_clear_partial_results_is_noop_when_no_partial_folder
  • test_start_with_start_batch
  • test_start_with_initial_actual_num_records
  • test_start_with_start_batch_and_initial_actual_num_records
  • test_start_default_values_unchanged
  • test_build_resume_starts_fresh_without_metadata
  • test_build_resume_raises_when_num_records_below_actual
  • test_build_resume_raises_when_num_records_below_original_target
  • test_build_resume_allows_larger_num_records
  • test_build_resume_raises_on_buffer_size_mismatch
  • test_build_resume_always_raises_on_config_mismatch
  • test_build_resume_runs_remaining_batches
  • test_build_resume_logs_warning_when_already_complete
  • test_build_resume_already_complete_does_not_run_after_generation_processors
  • test_build_resume_not_already_complete_when_extension_fits_in_slack
  • test_build_async_resume_logs_warning_when_already_complete
  • test_build_async_resume_starts_fresh_without_metadata
  • test_build_async_resume_already_complete_does_not_run_after_generation_processors
  • test_find_completed_row_group_ids_used_for_initial_total_batches
  • test_initial_actual_num_records_from_filesystem_in_crash_window
  • test_build_async_resume_skip_row_groups_contains_completed_ids
  • test_build_async_resume_initial_actual_num_records_uses_original_target
  • test_build_async_resume_initial_actual_num_records_extension_crash_window
  • test_build_async_resume_extension_non_aligned_row_group_sizes
  • test_build_async_resume_not_already_complete_when_extension_fits_in_slack
  • test_if_possible_incompatible_config_does_not_overwrite_existing_dataset
  • test_if_possible_incompatible_config_refreshes_media_storage_path
  • test_build_async_resume_stale_original_target_after_incremental_metadata_write
  • test_if_possible_starts_fresh_when_no_existing_directory
  • test_if_possible_starts_fresh_when_directory_is_empty
  • test_create_command_passes_resume_always
  • test_create_command_passes_resume_if_possible
  • test_run_create_passes_resume_always
  • test_run_create_passes_resume_if_possible

przemekboruta requested a review from a team as a code owner, April 13, 2026 11:15

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

This PR adds a resume: ResumeMode parameter to DataDesigner.create() and DatasetBuilder.build(), allowing interrupted dataset generation runs to be resumed from the last completed batch (sync) or row group (async), with three modes: NEVER (default, always fresh), ALWAYS (resume or raise), and IF_POSSIBLE (resume if config matches, else start fresh silently).

  • Config compatibility is checked before any artifact directory access to prevent premature caching of resolved_dataset_name; the IF_POSSIBLE → NEVER downgrade invalidates the cache and calls refresh_media_storage_path() to keep MediaStorage in sync.
  • Async path derives initial_actual_num_records from the filesystem (completed IDs) rather than stale metadata, handles non-aligned original/extension row-group sizes via a precomputed list, and writes incremental metadata.json after every row-group checkpoint.
  • Sync path correctly computes per-batch sizes for extension runs using original_target_num_records so non-aligned extensions get accurate batch sizes; run_after_generation processors are gated on whether generation actually ran (the generated bool), preventing re-processing an already-complete dataset.

Confidence Score: 5/5

Safe to merge — all previously identified correctness issues (crash-window counters, non-aligned extension sizes, false already-complete detection, processor re-run on complete dataset, IF_POSSIBLE overwrite) are addressed in this commit.

The resume logic correctly handles every edge case documented in the PR: filesystem-based initial counters for the async crash window, precomputed per-group sizes for non-aligned extensions, the original_target/new_target distinction when extending, config compatibility checks for both ALWAYS and IF_POSSIBLE, and the generated bool that gates run_after_generation. All previously flagged issues in the review thread are resolved, the test plan is thorough, and the new code is well-partitioned between the two engine paths.

No files require special attention — dataset_builder.py carries the most complexity but the logic has been thoroughly vetted across multiple review rounds.

Important Files Changed

  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py — Core resume logic: adds the config compatibility check, per-path resume branching, precomputed row-group lists for non-aligned extensions, filesystem-based initial counters, and incremental metadata writes; all previously flagged crash-window and extension-alignment issues appear addressed.
  • packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py — Adds the ResumeMode enum, resume field, resolved_dataset_name resume semantics, clear_partial_results(), and refresh_media_storage_path(); all edge cases (empty dir, missing dir, IF_POSSIBLE downgrade) handled correctly.
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py — start() gains start_batch, initial_actual_num_records, num_records_list, and original_target_num_records params, all backward-compatible with defaults; incremental metadata now includes original_target_num_records for multi-run tracking.
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py — RowGroupBufferManager gains initial_actual_num_records and initial_total_num_batches constructor params to seed counters on resume; write_metadata gains an original_target_num_records param; straightforward and correct.
  • packages/data-designer/src/data_designer/interface/data_designer.py — create() and _create_resource_provider() thread resume through to ArtifactStorage and builder.build(); ResumeMode is exported in init.py; clean pass-through with no logic added at this layer.
  • packages/data-designer/src/data_designer/cli/commands/create.py — Adds the --resume / -r CLI option with case-insensitive ResumeMode parsing; default is NEVER; forwarded correctly through GenerationController.
  • packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py — Comprehensive test coverage for all resume modes, crash windows, extension alignment, already-complete detection, config compatibility, and processor gating; matches all scenarios described in the PR.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["build(resume=X)"] --> B["_check_resume_config_compatibility()"]
    B --> C{compat?}
    C -- "ALWAYS + INCOMPATIBLE" --> D["raise DatasetGenerationError"]
    C -- "IF_POSSIBLE + INCOMPATIBLE/NO_PRIOR" --> E["resume=NEVER\npop cache\nrefresh_media_storage_path()"]
    C -- "IF_POSSIBLE + COMPATIBLE" --> F["resume=ALWAYS\npop cache"]
    C -- "ALWAYS + COMPATIBLE" --> G["continue with ALWAYS"]
    E --> H["_write_builder_config()"]
    F --> H
    G --> H
    H --> I{metadata.json\nexists?}
    I -- "No (ALWAYS)" --> J["clear_partial_results()\nresume=NEVER"]
    I -- "Yes" --> K{async engine?}
    J --> K
    K -- "Yes" --> L["_build_async()"]
    K -- "No + ALWAYS" --> M["_build_with_resume()"]
    K -- "No + NEVER" --> N["sync batch loop"]
    L --> O{resume\n== ALWAYS?}
    O -- "Yes" --> P["_load_resume_state()\n_find_completed_row_group_ids()\nprecompute row groups\nclear_partial_results()"]
    O -- "No" --> Q["fresh async run"]
    P --> R{already\ncomplete?}
    R -- "Yes" --> S["return False"]
    R -- "No" --> T["schedule remaining row groups\nwrite incremental metadata"]
    M --> U["_load_resume_state()\ncompute original+extension sizes\nbatch_manager.start()"]
    U --> V{already\ncomplete?}
    V -- "Yes" --> S
    V -- "No" --> W["run remaining batches"]
    S --> X["skip run_after_generation"]
    T --> Y["return True"]
    W --> Y
    N --> Y
    Q --> Y
    Y --> Z["run_after_generation\nreturn final_dataset_path"]


przemekboruta changed the title from "feat: resume interrupted dataset generation runs (sync engine)" to "feat: resume interrupted dataset generation runs (sync + async engine)", Apr 13, 2026
przemekboruta added a commit to przemekboruta/DataDesigner that referenced this pull request Apr 13, 2026
…set already complete

_build_with_resume and _build_async now return False when the dataset is already
complete (early-return path), True otherwise. build() skips
_processor_runner.run_after_generation() on False, preventing processors from
calling shutil.rmtree and rewriting an already-finalized dataset.

Fixes the issue raised in review: greptile P1 comment on PR NVIDIA-NeMo#526.
@github-actions

Issue #525 has been triaged. The linked issue check is being re-evaluated.

andreatgretel added and then removed the agent-review (Trigger agentic CI review) label, Apr 13, 2026
andreatgretel added the agent-review (Trigger agentic CI review) label, Apr 16, 2026
@github-actions

Code Review: PR #526 — Resume interrupted dataset generation runs (sync + async engine)

Summary

This PR adds a resume: bool = False parameter to DataDesigner.create() and DatasetBuilder.build(), enabling users to resume interrupted dataset generation from the last completed batch (sync) or row group (async). The implementation touches 5 source files and 4 test files across the data-designer-engine and data-designer packages.

Scope: ~860 additions, ~16 deletions across 10 files (including a plan doc and comprehensive tests).

The feature is well-designed: it leverages existing metadata.json checkpoints, validates run-parameter compatibility, handles edge cases (already-complete, no-metadata, parameter mismatch), and correctly separates the sync and async resume paths. The plan diverged from implementation in a positive way — the async engine now supports resume (the plan initially deferred it).

Findings

High Severity

(H1) _load_resume_state return value discarded in async resume path
dataset_builder.py:411 — In _build_async, when resume=True, the call self._load_resume_state(num_records, buffer_size) is made for validation only — the returned _ResumeState is discarded. This is intentional (the async path derives state from the filesystem instead), but it's confusing. The validation-only intent should be made explicit, e.g. by extracting a _validate_resume_params() method or assigning to _ with a comment. As-is, a future maintainer might remove the "unused" call and break parameter validation for async resume.

Medium Severity

(M1) _find_completed_row_group_ids parses batch filenames with split("_", 1)[1]
dataset_builder.py:381 — The glob pattern is batch_*.parquet and the ID is extracted via p.stem.split("_", 1)[1]. This works for batch_00000: the split yields "00000" and int("00000") = 0. However, if a file like batch_00000_extra.parquet appeared (e.g., from a future format change), split("_", 1)[1] would yield "00000_extra" and int() would raise ValueError, which is caught. This is acceptable but fragile. Consider using a regex r"^batch_(\d+)$" on the stem for robustness.
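
A regex-based variant of the helper, as suggested, might look like this (a sketch; the real helper is DatasetBuilder._find_completed_row_group_ids and may differ):

```python
import re
from pathlib import Path

_BATCH_STEM = re.compile(r"^batch_(\d+)$")

def completed_row_group_ids(parquet_dir: Path) -> set[int]:
    """Collect IDs only from strictly numeric batch stems; a stem like
    batch_00000_extra is ignored instead of tripping int()."""
    ids: set[int] = set()
    for p in parquet_dir.glob("batch_*.parquet"):
        m = _BATCH_STEM.match(p.stem)
        if m:
            ids.add(int(m.group(1)))
    return ids
```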

(M2) initial_actual_num_records calculation assumes uniform batch sizes
dataset_builder.py:418-420 — The async resume path computes initial_actual_num_records as:

sum(min(buffer_size, num_records - rg_id * buffer_size) for rg_id in completed_ids)

This formula assumes each row group was written with exactly min(buffer_size, remaining) rows, ignoring dropped rows. If the original run dropped rows within a row group (e.g., due to generation failures), the actual count would be lower. However, actual_num_records in the sync path also counts written records (not requested), and the metadata from write_metadata stores the true post-drop count. This means the filesystem-derived count may overestimate vs. what was actually written. The comment at line 414 acknowledges metadata may lag, but the formula's assumption about no drops could lead to inflated actual_num_records in the final metadata when some rows were dropped in completed groups.
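
A concrete illustration of the concern, with assumed numbers (buffer_size=100, num_records=250, three completed groups, five rows dropped in group 1):

```python
buffer_size = 100
num_records = 250
completed_ids = [0, 1, 2]

# Filesystem-derived estimate: assumes every completed group holds exactly
# min(buffer_size, remaining) rows, i.e. that no rows were ever dropped.
estimated = sum(min(buffer_size, num_records - rg * buffer_size) for rg in completed_ids)

# What is actually on disk if group 1 dropped 5 rows during generation.
actual = 100 + 95 + 50

# estimated (250) overstates actual (245) by exactly the dropped rows.
```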

(M3) batch_manager.start() calls reset(), zeroing state on the resume path
dataset_batch_manager.py:177 — start() calls self.reset(), which sets _current_batch_number = 0 and _actual_num_records = 0, then immediately overrides them. The reset(delete_files=False) call is harmless here (it doesn't delete files), but it does zero out internal state that's immediately overwritten. While functionally correct, this coupling is subtle — if reset() ever gains side effects beyond zeroing counters, the resume path would break silently.

Low Severity

(L1) Plan/implementation divergence: async engine support
The plan document (plans/525/resume-interrupted-runs.md) states in the Design Decisions table: "Async engine: Raise DatasetGenerationError if DATA_DESIGNER_ASYNC_ENGINE=1 with resume=True" and in Trade-offs: "Resume support for async engine: deferred to a follow-up." The implementation fully supports async resume. The plan should be updated to reflect the actual implementation.

(L2) _ResumeState.buffer_size field is redundant
dataset_builder.py:91 — _ResumeState stores buffer_size, but it's always set to the same buffer_size parameter that was already validated. The field is never read after construction in _build_with_resume — the method uses the buffer_size parameter directly. The field could be removed to avoid confusion.

(L3) Incremental metadata writes add I/O overhead to async engine
dataset_builder.py:443 — write_metadata is now called after every row-group checkpoint in finalize_row_group. For large datasets with many small row groups, this adds per-row-group disk I/O. The trade-off (resumability vs. performance) is reasonable, but worth noting in documentation or the PR description. The final write_metadata call at line 478 is documented as redundant ("overwrites the last incremental write with identical content") — good.

(L4) Test file has mid-file imports
test_dataset_builder.py:927-429 — The resume test section re-imports json, Path, and ArtifactStorage with underscore-prefixed aliases (_json, _Path, _ArtifactStorage) mid-file. While this works, it's unconventional and potentially confusing. Standard practice is to add imports at the top of the file.

(L5) No validation of start_batch or initial_actual_num_records bounds
dataset_batch_manager.py:165-166 — The new start_batch and initial_actual_num_records parameters have no validation (e.g., start_batch >= 0, start_batch <= num_batches, initial_actual_num_records >= 0). Since these are only called from internal resume code that validates upstream, this is acceptable — but defensive checks would prevent misuse if the method is called from new paths in the future.

Positive Observations

  • Comprehensive test coverage: 20+ new test cases covering validation errors, already-complete detection, async/sync paths, filesystem-vs-metadata crash window scenarios, and processor non-invocation on skip.
  • Clean separation of sync/async resume: The sync path uses _build_with_resume with DatasetBatchManager, while the async path extends _build_async with skip_row_groups and filesystem-based counters. No shared mutable state between the two paths.
  • Filesystem as source of truth for async: The decision to derive initial_actual_num_records from the filesystem rather than potentially-stale metadata (lines 414-420) handles the crash window correctly and is well-documented.
  • Graceful degradation for missing metadata: The build() method at line 188 handles the case where metadata.json is missing (interrupted before any batch completed) by logging and restarting fresh, rather than raising an error. This is a UX improvement over the plan's original "raise error" approach.
  • No breaking changes: All new parameters default to their pre-existing behavior (resume=False, start_batch=0, initial_actual_num_records=0).
  • Incremental metadata writes enable async resumability — a meaningful improvement over the plan's deferred-async-resume decision.

Verdict

Approve with suggestions. The implementation is solid, well-tested, and handles edge cases thoughtfully. The high-severity finding (H1) is a readability/maintainability concern rather than a correctness bug — the discarded return value works because _load_resume_state raises on validation failure. The medium-severity findings (M1-M3) are minor robustness concerns. None of these block merging, but H1 and M2 are worth addressing before or shortly after merge.

github-actions Bot removed the agent-review (Trigger agentic CI review) label, Apr 16, 2026


nabinchha commented Apr 28, 2026

cc @johnnygreco @andreatgretel

Suggestion: add an IF_POSSIBLE mode to resume for idempotent retry workflows

First, thanks for landing this — being able to resume interrupted runs at all is a big quality-of-life win for long jobs. This suggestion is about extending the API one step further so it composes cleanly with automated retry/orchestration workflows.

Dependencies

This suggestion depends on #584 (deterministic hash to uniquely identify a workflow config), which is being shipped soon. The behavior matrix below uses that hash as its definition of "compatible" — i.e., whether the on-disk run was produced by an equivalent workflow config. #584 should land first.

Motivation

The current API is binary:

  • resume=False → start fresh (default; collisions get a timestamp suffix)
  • resume=True → resume; raise if there's no resumable state

This works well for interactive use, where the caller knows up front whether they're starting a new run or resuming an existing one.

It's awkward for automated workflows where the same invocation may be a first run or a retry:

  • A wrapper that resubmits the same job script after infra failures
  • A CI/cron pipeline that re-runs a generation job on a fixed schedule
  • Any orchestrator that doesn't track per-job "have I run this before?" state externally

In all of these, the caller has to:

  1. Stat the output directory
  2. Decide whether resume is appropriate
  3. Pass the right value to create()

That logic ends up reimplemented in every wrapper, and it has to know about DD's storage layout (where metadata lives, what counts as "resumable") to do it correctly. As DD's storage layout evolves, every wrapper breaks.

Proposal

A ResumeMode enum with three values

from enum import StrEnum

class ResumeMode(StrEnum):
    NEVER = "never"              # current resume=False
    ALWAYS = "always"            # current resume=True
    IF_POSSIBLE = "if_possible"  # new
def create(
    self,
    *,
    dataset_name: str | None = None,
    resume: ResumeMode = ResumeMode.NEVER,
    ...
) -> DatasetResult: ...

The key property of IF_POSSIBLE: the caller passes the same value on every invocation and DD does the right thing based on what's actually on disk. The caller no longer needs to reason about prior state.

Behavior matrix

"Compatible" below means the persisted config_hash (#584) matches the current invocation's hash — i.e., the on-disk run was produced by an equivalent workflow config.

State on disk                        | ResumeMode.NEVER                          | ResumeMode.ALWAYS | ResumeMode.IF_POSSIBLE
Folder missing or empty              | create (timestamp on collision elsewhere) | raise             | create in dataset_name
metadata.json present, compatible    | timestamp-suffix new folder               | resume            | resume
metadata.json present, incompatible  | timestamp-suffix new folder               | raise             | raise
Folder has data but no metadata.json | timestamp-suffix new folder               | raise             | raise

The crucial line is the third one: under IF_POSSIBLE, an incompatible-config case must raise, not silently start fresh. Silently overwriting a folder that belongs to a different config is worse than failing loudly. The whole point of IF_POSSIBLE is "I might be a retry of myself" — if the hash says it isn't, the folder belongs to an unrelated run that happened to land on the same dataset_name, and the right response is to surface that collision rather than paper over it.

Implementation sketch

  • Define ResumeMode (probably in data_designer.config) and re-export from the public package.
  • ArtifactStorage: change resume: bool to resume: ResumeMode. In resolved_dataset_name, the IF_POSSIBLE branch returns dataset_name unchanged whether or not the folder currently exists, and never raises on missing/empty folders.
  • DatasetBuilder._load_resume_state:
  • Tests should cover:
    • All four state-on-disk cases × all three ResumeMode values
    • The hash-mismatch error path for ALWAYS and IF_POSSIBLE
    • String coercion via StrEnum (resume="if_possible" resolves to ResumeMode.IF_POSSIBLE) so config-driven callers stay ergonomic

Other considerations

  • Concurrency. IF_POSSIBLE plus a shared dataset_name across two concurrent processes is a race. Worth documenting that resume assumes a single writer per dataset_name. A lockfile in the dataset folder would make this enforceable, but is probably a separate piece of work.
  • Cleanup semantics. clear_partial_results() should fire in IF_POSSIBLE mode the same way it does in ALWAYS — partial results from a previous interrupted run shouldn't leak into the resumed (or fresh) run.

Related cleanups (separate from the API change)

While reading the PR, two small things stood out that are worth a follow-up regardless of the tri-state proposal:

  • The partial-completion warning at the end of _build_async is unreachable because of the return True immediately above it. Moving the warning above the return restores user-visible feedback for incomplete async runs.
  • _load_resume_state raises DatasetGenerationError from a FileNotFoundError without from exc, dropping the original traceback. Chaining it would help future debugging.
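
The second cleanup is a one-line change; this is the shape of the chained raise (the function name and message are illustrative):

```python
class DatasetGenerationError(Exception):
    pass

def load_resume_state(metadata_path: str) -> str:
    try:
        with open(metadata_path) as f:
            return f.read()
    except FileNotFoundError as exc:
        # "from exc" stores the original error in __cause__, preserving
        # the traceback instead of dropping it.
        raise DatasetGenerationError(f"no resumable metadata at {metadata_path}") from exc
```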

@andreatgretel

Thank you for taking this on, the plan in plans/525/ made the trade-offs easy to follow. The main asks before merge are in the comments above: the no-metadata fallback running before sync/async split (#1), the async already-complete check not surviving run_after_generation (#2), the dead-code warning at line 511 (greptile already flagged), and a few smaller follow-throughs (builder_config validation, happy-path test).

One thought is that we could try doing checkpointing on a task level already. However, that would need a sidecar format (parquet only wants whole row groups), concurrency-safe writes from many parallel asyncio tasks, and a CompletionTracker replay path on resume, probably 3-5x the code here plus new edge cases for skipped/dropped cells. At the row-group level the lost-work blast radius is bounded by buffer_size LLM calls per crash, which is fine for the common case. The architecture here (per-cell update_cell + CompletionTracker) is well-shaped to add task-level later by intercepting cell writes, so this PR doesn't paint anyone into a corner.

@przemekboruta (author)

Thanks everyone for the thorough review — really useful catches across the board. Here's what was addressed:

@johnnygreco / @nabinchha

  • Added _GenerationOutcome enum (GENERATED / ALREADY_COMPLETE) replacing the bare bool returns — build() now gates run_after_generation on status is _GenerationOutcome.GENERATED, so processors are never re-run on an already-complete dataset
  • num_records validation changed from exact match to < actual_num_records — you can now resume with a larger or equal target, e.g. resume=True, num_records=6000 after a run interrupted at 5000 records
  • No-metadata fallback moved inside the sync-only branch; async handles it internally via _find_completed_row_group_ids() so the crash-window parquet files are not silently discarded

@andreatgretel

  • Async already-complete check now uses max(metadata_count, filesystem_count) — metadata count is authoritative after AFTER_GENERATION processors have rewritten parquet-files/, filesystem count is authoritative in the crash window when metadata lags; taking the max covers both cases
  • Added _check_resume_config_compatibility(): reads builder_config.json before _write_builder_config() overwrites it, compares the data_designer section (ignoring library_version), and raises DatasetGenerationError when it differs — prevents silently mixing batches generated with incompatible configs
  • Added test_build_resume_runs_remaining_batches: 3 batches total, 1 already done → asserts _run_batch is called with current_batch_number=1 and 2 only, not 0
  • Also fixed the dead-code regression: the "Surface partial completion" warning block was sitting after return True in _build_async — moved it before the return so it actually executes
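The max(metadata_count, filesystem_count) reconciliation rule reduces to a one-liner; a sketch with hypothetical names:

```python
def is_already_complete(target: int, metadata_count: int, filesystem_count: int) -> bool:
    # Metadata is authoritative after AFTER_GENERATION processors rewrite
    # parquet-files/ (the filesystem may then hold fewer files); the
    # filesystem is authoritative in the crash window when metadata lags
    # behind completed writes. Taking the max covers both cases.
    return max(metadata_count, filesystem_count) >= target
```

Either count alone would misfire in one of the two windows; the max is correct in both.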

PR description updated to reflect the changed semantics.

Empty directory (crash between mkdir and first file write) was treated as
compatible — _check_resume_config_compatibility returned True, IF_POSSIBLE
upgraded to ALWAYS, which then raised ArtifactStorageError.

Fix: treat empty directory the same as missing — return False from
_check_resume_config_compatibility when any(dir.iterdir()) is False.

Test: test_if_possible_starts_fresh_when_directory_is_empty
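The empty-vs-missing check described above can be sketched as (function name hypothetical):

```python
import tempfile
from pathlib import Path


def has_prior_results(dataset_dir: Path) -> bool:
    # An empty directory (crash between mkdir and the first file write)
    # counts the same as a missing one: nothing to resume from
    return dataset_dir.exists() and any(dataset_dir.iterdir())


empty = Path(tempfile.mkdtemp())
populated = Path(tempfile.mkdtemp())
(populated / "batch_00000.parquet").touch()
```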
…int mismatch

ResumeMode.ALWAYS was documented to raise when column/model config changed, but
_check_resume_config_compatibility() was only called in the IF_POSSIBLE branch.
A user resuming with ALWAYS after changing the config would silently mix records
from two different configs.

Fix:
- Refactor _check_resume_config_compatibility() to return _ConfigCompatibility
  enum (COMPATIBLE / INCOMPATIBLE / NO_PRIOR_DATASET) instead of bool so callers
  can distinguish 'no prior run' from 'configs differ'
- Call the check for both ALWAYS and IF_POSSIBLE before _write_builder_config()
- ALWAYS + INCOMPATIBLE → DatasetGenerationError
- IF_POSSIBLE + INCOMPATIBLE → silent fresh start (existing behaviour)
- IF_POSSIBLE + NO_PRIOR_DATASET → silent fresh start (existing behaviour)

Test: test_build_resume_always_raises_on_config_mismatch
@przemekboruta przemekboruta requested a review from nabinchha May 4, 2026 21:01
Contributor

@nabinchha nabinchha left a comment

Thanks for grinding through this one, @przemekboruta — the ResumeMode enum, crash-window reconciliation, and the _ConfigCompatibility tri-state are all real improvements over the earlier rounds. Most of what's left is small follow-through from the recent iterations (stale strings/docstrings) plus a couple of behavior gaps that are worth catching before merge.

Summary

Adds resume: ResumeMode (NEVER / ALWAYS / IF_POSSIBLE) to DataDesigner.create() and DatasetBuilder.build(). Sync resume reads metadata.json for batch progress; async resume reconciles metadata.json with a filesystem scan of parquet-files/ and seeds the row-group buffer with the original target so non-aligned runs extend correctly. IF_POSSIBLE uses DataDesignerConfig.fingerprint() to silently fall back to a fresh run on config drift; ALWAYS raises. The implementation matches the PR description.
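The fingerprint-based drift check mentioned above boils down to hashing a normalized config with the version field stripped. A minimal sketch, assuming a flat config dict — the real DataDesignerConfig.fingerprint() may normalize differently:

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    # Ignore library_version so a pure dependency upgrade doesn't
    # invalidate an otherwise-identical checkpoint
    stable = {k: v for k, v in config.items() if k != "library_version"}
    payload = json.dumps(stable, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


a = config_fingerprint({"columns": ["name"], "library_version": "1.0"})
b = config_fingerprint({"columns": ["name"], "library_version": "2.0"})
c = config_fingerprint({"columns": ["name", "age"], "library_version": "1.0"})
```

`sort_keys=True` keeps the hash stable across dict orderings; only a change to the generation-relevant fields (as in `c`) produces a new fingerprint.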

Findings

Critical — Let's fix these before merge

Direct collision with #540 (feat(results): add export() method and --output-format CLI flag)

  • What: Commit 0bdf24ab on this branch introduces an early version of DatasetCreationResults.export() plus the --output-format / -f CLI flag. That feature has its own dedicated PR — #540, also by @przemekboruta, branch feat/dataset-export, currently open — which has already been through review rounds with @andreatgretel and @nabinchha and contains a meaningfully more developed implementation:

    • Streaming export (memory proportional to one batch, not the full dataset) — #526's commit calls self.load_dataset() which materializes everything in memory.
    • count_records() helper using pq.read_metadata so the CLI doesn't OOM just to print a row count — #526 has none of this.
    • Schema unification with pa.unify_schemas(promote_options="permissive") for parquet batches with type drift — #526 doesn't handle this.
    • InvalidFileFormatError (project-canonical) instead of ValueError.
    • click.Choice(SUPPORTED_EXPORT_FORMATS) for parse-time CLI validation, better --help, and tab completion.
    • Extension-based format inference (.jsonl/.csv/.parquet) with explicit override.
    • ~15 tests covering streaming, schema unification, extension inference, error paths, controller-level happy/sad paths, etc.

    Both PRs touch the same files (packages/data-designer/src/data_designer/interface/results.py, cli/commands/create.py, cli/controllers/generation_controller.py, plus their tests) — whichever merges second will conflict, and if #526 lands first it will overwrite the older shape into main, forcing #540 into a redo just to ship its memory-safety improvements.

  • Why: This isn't just scope creep — it's a guaranteed merge conflict and a regression risk. The version of export() shipping in #526 has none of the OOM-safety, schema-unification, or canonical-error-type work that #540 has been iterating on for weeks. Users on the eventual release would land on the inferior implementation purely because of merge ordering.

  • Suggestion: Drop commit 0bdf24ab from this PR — git rebase -i origin/main and removing that single commit (and its companion test edits) should be clean since the resume work doesn't depend on it. Let #540 ship the export feature on its own merits. Coordinate with @nabinchha / @andreatgretel on the sequencing if needed, but #540 is the right home for those changes and #526 should focus solely on resume (#525).

packages/data-designer/src/data_designer/cli/commands/create.py:13-49 and packages/data-designer/src/data_designer/cli/controllers/generation_controller.py:113-151 — CLI create doesn't surface resume

  • What: DataDesigner.create() accepts resume: ResumeMode and ResumeMode is exported from data_designer.interface, but data-designer create has no --resume flag and GenerationController.run_create() never passes resume= through. CLI users get no way to resume an interrupted run.
  • Why: Resume's whole value proposition is "you don't have to redo work after a crash" — but the CLI is exactly the surface where crashes happen most (long-running jobs killed by preemption / OOM / SIGTERM where the user only has a shell handy). We're shipping the API but stopping one layer short.
  • Suggestion: Add a --resume option to create_command (e.g. typer.Option(ResumeMode.NEVER, "--resume", case_sensitive=False) so --resume always / --resume if_possible / --resume never all work via StrEnum), thread it through GenerationController.run_create(), and pass resume=resume into data_designer.create(...). Worth a parametrize test in tests/cli/commands/test_create_command.py.
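Because ResumeMode is a StrEnum (Python 3.11+), the case-insensitive CLI parsing suggested above needs only a `.lower()` before the enum lookup — member lookup by value is itself case-sensitive. A sketch:

```python
from enum import StrEnum


class ResumeMode(StrEnum):
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"


def parse_resume(value: str) -> ResumeMode:
    # --resume ALWAYS and --resume always both resolve to the same member
    return ResumeMode(value.lower())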

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:393 and :593 — Stale "Remove resume=True" warning text

  • What: Both already-complete warnings still say "Remove resume=True if you want to generate a new dataset." The public API is now a ResumeMode enum, and resume=True isn't even a valid value anymore.
  • Why: It's the operator-facing log line that fires exactly when a user is mid-recovery and reading carefully. Sending them looking for a flag that no longer exists is the wrong note to hit.
  • Suggestion:
"⚠️ Dataset is already complete — all batches were found in the existing artifact directory. "
"Nothing to resume. Use resume=ResumeMode.NEVER if you want to generate a new dataset."

(same swap in _build_async).

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:230-232 and packages/data-designer/src/data_designer/interface/data_designer.py:226-230 — Docstring overstates when ALWAYS raises

  • What: Both docstrings say ALWAYS "Raises if no prior progress exists or … incompatible." The actual behavior in build() lines 290-299 is that a missing metadata.json (interrupted before the first batch / row group completed) silently starts a fresh run with an info log — it doesn't raise. The only "no prior progress" case that still raises is "directory itself doesn't exist" (via ArtifactStorage.resolved_dataset_name).
  • Why: That gap is exactly what users will hit after their first crashed run on a small dataset (records < buffer_size → only one row group → if it didn't finish, no metadata). They'll plan around an exception that doesn't fire.
  • Suggestion: Tighten to "ALWAYS: resume from the last completed batch / row group. If there's a checkpoint but its parameters are incompatible (buffer_size mismatch, num_records < what was already generated), raises DatasetGenerationError. If no checkpoint exists yet (interrupted before the first batch finished), silently restarts from the beginning. If the dataset directory itself doesn't exist, raises."

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:290-299 — artifact_storage.resume not synced when the no-metadata fallback fires

  • What: When IF_POSSIBLE + COMPATIBLE upgrades to ALWAYS (lines 274-277), both the local resume and self.artifact_storage.resume are flipped to ALWAYS and the cached resolved_dataset_name is popped. If the no-metadata branch then fires (line 290), only the local resume is downgraded to NEVER; self.artifact_storage.resume stays at ALWAYS. Today this is benign because resolved_dataset_name was already cached one line earlier, but it leaves the two state holders disagreeing.
  • Why: Same shape of bug that motivated test_if_possible_incompatible_config_does_not_overwrite_existing_datasetArtifactStorage.resume and the local resume getting out of sync. That test only covers the IF_POSSIBLE → incompatible path; the no-metadata path leaves the same trap unprotected.
  • Suggestion: Mirror the IF_POSSIBLE downgrade:
if resume == ResumeMode.ALWAYS and not self.artifact_storage.metadata_file_path.exists():
    logger.info(...)
    self.artifact_storage.clear_partial_results()
    resume = ResumeMode.NEVER
    self.artifact_storage.resume = ResumeMode.NEVER  # keep both in sync

packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py — Resume tests reach pervasively into private internals

  • What: DEVELOPMENT.md is explicit: "Test public APIs only. Tests should exercise public interfaces, not _-prefixed functions or classes. If something is hard to test without reaching into private internals, consider refactoring the code to expose a public entry point." The new resume tests violate this in several distinct ways:
    • Three tests exercise _find_completed_row_group_ids() as the system under test (lines 1606, 1620, 1636) — test_find_completed_row_group_ids_empty_dir, test_find_completed_row_group_ids_with_files, test_find_completed_row_group_ids_ignores_non_batch_files. The function is a private filesystem scanner; its behavior is already covered by the end-to-end async-resume tests via build().
    • Resume behavior is driven by patching private methods rather than going through build(): _check_resume_config_compatibility (1549, 1907), _build_with_resume (1523), _build_async (1688), _prepare_async_run used as a capture hook (1786, 1830, 1873).
    • Coverage relies on a wall of private-method patches per test — most of the IF_POSSIBLE / async-resume tests stack 5-7 patch.object(builder, "_…") calls (_run_model_health_check_if_needed, _run_mcp_tool_check_if_needed, _write_builder_config, _initialize_generators_and_graph, _processor_runner.run_after_generation, …). Some of that scaffolding pre-dates this PR, but the new tests double down on the pattern.
    • The "stale return type" issue at line 1907 is a downstream symptom: patch.object(builder, "_check_resume_config_compatibility", return_value=False) happens to behave correctly only because False != _ConfigCompatibility.COMPATIBLE is true. If anyone later branches on a specific enum member, the test will silently keep "passing" on the wrong path — the logger.info("Config has changed …") branch on line 268 isn't being exercised the way the test name implies because False == _ConfigCompatibility.INCOMPATIBLE is also false.
  • Why: This is exactly the failure mode DEVELOPMENT.md warns about — tests that ratchet implementation details into the test suite. Today's symptoms: (a) the stale return_value=False test is silently miscategorized; (b) refactoring resume logic (e.g. extracting _resolve_async_resume_state per the previous review's suggestion, or reshaping _build_with_resume / _build_async into a single dispatcher) will cascade into ~20 test edits even when behavior is unchanged; (c) the senior review at the top of this thread already asked for "an end-to-end variant with real parquet files on disk" to lock down the filesystem reconciliation invariant — the existing tests instrument private internals instead of asserting on observable behavior.
  • Suggestion: A few concrete moves:
    • Drop the three direct _find_completed_row_group_ids tests. That logic is already exercised by test_find_completed_row_group_ids_used_for_initial_total_batches and the async crash-window tests, which assert on observable downstream effects.
    • Convert the _prepare_async_run capture-hook tests into end-to-end tests that drive builder.build(num_records=…, resume=ResumeMode.ALWAYS) against a real seeded dataset directory (small num_records, real parquet files written via _write_parquet_files you already have), then assert on metadata.json contents (actual_num_records, num_completed_batches) after build() returns. That's how test_find_completed_row_group_ids_used_for_initial_total_batches already works — extend the same shape to the crash-window and skip-set tests.
    • For the IF_POSSIBLE-downgrade tests (test_if_possible_incompatible_config_does_not_overwrite_existing_dataset etc.), the assertions on storage.resume == ResumeMode.NEVER and storage.resolved_dataset_name != "dataset" are observable; the patches on _write_builder_config / _initialize_generators_and_graph are scaffolding to short-circuit generation. If we extract a public seam (e.g. a dry_run=True argument on build() that runs the resume-decision logic and returns the resolved dataset name without generating) those tests collapse to a 3-line invocation.
    • At minimum, fix the return_value=False to return_value=_ConfigCompatibility.INCOMPATIBLE — the enum is already imported on line 24 of the test file. Even if the broader refactor lands as a follow-up, this one stops being a silently miscategorized test.

Warnings — Worth addressing

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:519-548 — Missing-config case silently returns COMPATIBLE, but docstring promises a warning

  • What: Docstring says "COMPATIBLE — fingerprints match, or stored config is unreadable (warning logged)". In practice config_path absent returns COMPATIBLE silently (no log), unlike the unreadable case which warns.
  • Why: A user who deletes builder_config.json by mistake and reruns with IF_POSSIBLE gets a silent resume whose generation parameters can no longer be validated. Worse, this means IF_POSSIBLE will then upgrade to ALWAYS and proceed — bypassing the fingerprint check the user explicitly opted into. That's a silent correctness footgun on a feature whose entire job is to detect config drift.
  • Suggestion: Log a warning in the not config_path.exists() branch — a missing builder_config.json next to a populated dataset directory is genuinely anomalous and worth surfacing. Update the docstring to match if you go that route.

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:543 — except Exception is broader than needed

  • What: _check_resume_config_compatibility catches any Exception to recover from a corrupt builder_config.json. Realistic failure modes are OSError, json.JSONDecodeError, and pydantic.ValidationError.
  • Why: STYLEGUIDE.md is explicit: "Prefer specific exception types over bare except. Never catch Exception or BaseException without re-raising." As written, this also swallows programming bugs — e.g. an AttributeError from a future schema change to BuilderConfig, or a TypeError from a refactor — under the "unreadable config — assume compatible" log line. That's exactly the silent-fallback shape the style guide is trying to prevent: a real bug becomes "user sees a warning, run continues with fingerprint check skipped, config drift goes undetected."
  • Suggestion: except (OSError, json.JSONDecodeError, ValidationError): (import ValidationError from pydantic). Anything outside that set is a programming error and should propagate.

packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py — Format check fails

  • What: ruff format --check flags test_if_possible_starts_fresh_when_directory_is_empty at line 1959 for an unnecessary multi-line signature.
  • Why: CI will fail this on the next push and pre-commit would have caught it. Worth fixing before merge so the green-CI baseline is honest.
  • Suggestion: make check-all-fix and re-commit.

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:602-609 — Nested finalize_row_group closure

  • What: _build_async defines finalize_row_group(rg_id) as a nested function inside the body of an already ~120-line method. It captures buffer_manager, num_records, buffer_size, and on_batch_complete from the enclosing scope.
  • Why: STYLEGUIDE.md is explicit: "Avoid nested functions. Define helpers at module level or as private methods on the class. Nested functions hide logic, make testing harder, and complicate stack traces. The only acceptable use is closures that genuinely need to capture local state." This one closes over instance + parameter state, which a method + functools.partial handles equivalently. The senior review at the top of this thread already flagged that _build_async is dense at 120 lines and worth decomposing — this nested function is one of the contributors to that density and one of the cheaper pieces to extract.
  • Suggestion: Promote to a private method _finalize_row_group(self, rg_id, *, buffer_manager, num_records, buffer_size, on_batch_complete) and pass functools.partial(self._finalize_row_group, buffer_manager=buffer_manager, num_records=num_records, buffer_size=buffer_size, on_batch_complete=on_batch_complete) into _prepare_async_run. Same behavior, but stack traces and tests can address it directly.
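The closure-to-method move suggested above keeps call sites unchanged; a toy sketch (the real method's captured state is richer than this):

```python
from functools import partial


class Builder:
    def __init__(self) -> None:
        self.finalized: list[int] = []

    def _finalize_row_group(self, rg_id: int, *, buffer_size: int) -> None:
        # Formerly a nested function in _build_async capturing buffer_size
        # (et al.) from the enclosing scope; now addressable in stack
        # traces and tests
        self.finalized.append(rg_id)


builder = Builder()
finalize = partial(builder._finalize_row_group, buffer_size=200)
finalize(3)
finalize(4)
```

`functools.partial` binds the formerly-captured state once, so callers still invoke `finalize(rg_id)` exactly as they did with the closure.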

What Looks Good

  • _ConfigCompatibility tri-state cleanly disambiguates the three resume scenarios. The earlier bool return type forced IF_POSSIBLE to conflate "no prior dataset" with "configs differ"; the new enum lets ALWAYS raise on INCOMPATIBLE only, while IF_POSSIBLE collapses both INCOMPATIBLE and NO_PRIOR_DATASET into a silent fresh start.
  • Filesystem-vs-metadata reconciliation in _build_async is the right move. Sourcing both initial_total_num_batches and initial_actual_num_records from _find_completed_row_group_ids() (with state.target_num_records for per-group sizing) covers the full crash-window matrix without needing a separate flag — and test_initial_actual_num_records_from_filesystem_in_crash_window plus test_build_async_resume_initial_actual_num_records_uses_original_target lock that down.
  • if generated: run_after_generation(...) gating is exactly the right shape. Resuming an already-complete dataset no longer destroys post-processed parquet by re-running AFTER_GENERATION processors on top of itself. Both engines covered.
  • test_build_async_resume_skip_row_groups_contains_completed_ids locks down the "only the missing row groups get scheduled" invariant the previous review thread asked for.
  • Test coverage hits the failure modes that matter — ~30 new tests across config-mismatch, buffer-mismatch, num_records-too-small, no-metadata fallback, crash-window FS reconciliation, IF_POSSIBLE downgrade paths (no dir / empty dir / incompatible config), and the async skip-set invariant.

Verdict

Needs changes — Six Critical findings and four Warnings, all worth addressing before merge. The collision with #540 is the highest-priority blocker (regression risk on a feature with its own already-reviewed PR); the CLI --resume gap and the test-private-API issue are the next biggest because they shape what users and future maintainers interact with. The Warnings are all small but each is a STYLEGUIDE.md violation or a silent-fallback footgun, so they're cheaper to fold into this PR than to chase in a follow-up.


This review was generated by an AI assistant.

…I flag, fix edge cases

C1: drop commit 0bdf24a — remove export() / --output-format from this PR; that feature
    belongs to NVIDIA-NeMo#540 which has a superior streaming implementation
C2: add --resume / -r flag to data-designer create CLI, thread ResumeMode through
    GenerationController.run_create() into DataDesigner.create()
C3: fix already-complete warning text — replace stale "Remove resume=True" with
    "Use resume=ResumeMode.NEVER" in _build_with_resume and _build_async
C4: fix docstrings — ALWAYS does NOT raise when no checkpoint exists (silently
    restarts from scratch); clarify num_records >= actual semantics
C5: sync artifact_storage.resume = NEVER when no-metadata fallback fires so both
    state holders agree after the downgrade
C6: fix return_value=False → _ConfigCompatibility.INCOMPATIBLE in IF_POSSIBLE test;
    drop 3 direct _find_completed_row_group_ids tests (private API, covered by build())
W1: add logger.warning when builder_config.json is absent (silent COMPATIBLE was footgun)
W2: narrow except Exception → (OSError, json.JSONDecodeError, ValidationError)
W3: run make check-all-fix — ruff reformatted test_if_possible_starts_fresh_when_directory_is_empty
…s formula on async resume

When extending an async run (num_records > state.target_num_records) and a crash
occurs after an extension row group is written to disk but before write_metadata,
the formula `min(buffer_size, state.target_num_records - rg_id * buffer_size)` yields
a negative value for any extension row group (rg_id * buffer_size >= target), making
initial_actual_num_records silently undercount. The RowGroupBufferManager then starts
at the wrong offset, and the final metadata reports an incorrect actual_num_records
with a false partial-completion warning.

Fix: use state.target_num_records for original row groups and num_records for extension
row groups (guarded by rg_id * buffer_size < state.target_num_records). Covers the
scenario with a new regression test.
… on non-aligned extension resume

The partitioning loop in _prepare_async_run decremented remaining by
min(buffer_size, remaining) for every row group, including skipped ones.
For a non-aligned original run (e.g. target=5, buffer_size=2, last group
has 1 record), the loop deducted 2 for the skipped last group, leaving
remaining one short.  Extension row groups received smaller sizes than
intended, so the generated dataset was silently short by the deficit and
a false partial-completion warning fired.

Fix: pre-compute the full row-group list with correct per-group sizes in
_build_async where state.target_num_records is available, then pass it to
_prepare_async_run as precomputed_row_groups (replacing the skip_row_groups
param). Original groups use min(buffer_size, target - rg*bs); extension
groups use min(buffer_size, extension_records - ext_idx*bs).

Also updates the skip_row_groups test to assert on precomputed_row_groups
and adds a regression test for the non-aligned extension case.
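The two sizing rules described in this commit can be checked with a small worked example (function and parameter names hypothetical, not the engine's actual code):

```python
import math


def row_group_sizes(num_records: int, *, original_target: int, buffer_size: int) -> list[int]:
    # Original groups: sized against the original target
    num_original = math.ceil(original_target / buffer_size)
    sizes = [
        min(buffer_size, original_target - rg * buffer_size)
        for rg in range(num_original)
    ]
    # Extension groups: sized against the extension only, so a
    # non-aligned last original group does not eat into them
    extension = num_records - original_target
    num_ext = math.ceil(extension / buffer_size)
    sizes += [
        min(buffer_size, extension - i * buffer_size)
        for i in range(num_ext)
    ]
    return sizes
```

For the example in the commit message (target=5, buffer_size=2, extended to 6) this yields [2, 2, 1, 1], summing to the full 6 records, where the old decrement-by-min(buffer_size, remaining) loop came up one short.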
@przemekboruta
Contributor Author

Thanks for the thorough review, @nabinchha — the second pass in particular caught real gaps worth fixing before merge.

Two more fixes landed since your last review, both surfaced by Greptile on the updated code:

Negative row-group sizes in the extension crash window — when an async run is extended (new num_records > state.target_num_records) and crashes after an extension row group is written to disk but before write_metadata, the formula min(buffer_size, state.target_num_records - rg_id * buffer_size) yielded a negative value for any extension row group, silently undercounting initial_actual_num_records. Fixed in b8c633c1 — original row groups now use state.target_num_records, extension row groups use the new num_records.

Non-aligned extension row-group sizes — the partitioning loop in _prepare_async_run decremented remaining by min(buffer_size, remaining) for every row group including skipped ones. For a non-aligned original run (e.g. target=5, buffer_size=2 → last group has 1 record on disk), the loop deducted 2 instead of 1, leaving remaining short and causing extension row groups to receive smaller sizes than intended — the dataset ended up silently one record short with a false partial-completion warning. Fixed in 0fef8d41 — _build_async now pre-computes the full row-group list with correct per-group sizes and passes it to _prepare_async_run as precomputed_row_groups, replacing the skip_row_groups approach.

@przemekboruta przemekboruta requested a review from nabinchha May 6, 2026 20:03
Comment thread plans/525/resume-interrupted-runs.md Outdated
The plan described the initial resume: bool design which has since been
replaced by the full ResumeMode enum (NEVER/ALWAYS/IF_POSSIBLE), async
engine support, filesystem reconciliation, and config compatibility checks.
The PR description is the authoritative record of what shipped.
Resolved conflicts in CLI create command and tests — kept both
--resume (PR NVIDIA-NeMo#526) and --output-format (PR NVIDIA-NeMo#540) parameters
throughout create.py, generation_controller.py, and all test files.
… group's slack

original_target=5, buffer_size=2 produces 3 groups [2,2,1]. Extending to
num_records=6: ceil(6/2)=3 equalled len(completed_ids)=3, triggering the
already-complete branch on both the async and sync paths — returning the
5-record dataset silently.

Fix (async): replace ceil(num_records/bs) with
  num_original_groups + ceil(extension_records/bs)
so any extension always adds new groups beyond num_original_groups.

Fix (sync): add num_records_list param to DatasetBatchManager.start() and
pass the correct per-batch sizes in _build_with_resume, giving the batch
manager the right total batch count (4 instead of 3 in the example).
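The off-by-slack bug in this commit is easy to demonstrate numerically (helper names hypothetical):

```python
import math


def total_groups_naive(num_records: int, buffer_size: int) -> int:
    # Old formula: ceil over the new target, which can collide with the
    # original group count when the last original group is non-full
    return math.ceil(num_records / buffer_size)


def total_groups_fixed(num_records: int, original_target: int, buffer_size: int) -> int:
    # Fixed: any extension always adds groups beyond the original ones
    num_original = math.ceil(original_target / buffer_size)
    extension = max(0, num_records - original_target)
    return num_original + math.ceil(extension / buffer_size)
```

With original_target=5, buffer_size=2 there are 3 groups on disk; extending to 6, the naive ceil(6/2)=3 equals the completed count and falsely reports completion, while the fixed formula gives 3 + ceil(1/2) = 4.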
@nabinchha
Contributor

Also, @przemekboruta do you have any interest in writing a short dev note about this feature once it goes out?

przemekboruta and others added 3 commits May 8, 2026 11:58
… resume

Prevents negative extension_records in async path which silently truncated
the dataset and corrupted metadata without triggering a partial-completion warning.
…ngrade

When build() detected an incompatible config and downgraded resume from
IF_POSSIBLE to NEVER, _media_storage.base_path remained bound to the
original directory while all other path properties resolved to the new
timestamped directory — causing broken image references in image-column runs.
…sume writes

After finalize_row_group successfully wrote incremental metadata during an
extension run, target_num_records in metadata was updated to the extension
target. A subsequent resume would read this as the original target, making
_rg_size() incorrect for all row groups and silently corrupting actual_num_records.

Stores original_target_num_records as an immutable field in metadata so the
original group boundaries are always recoverable regardless of how many
incremental writes have occurred.
@przemekboruta
Contributor Author

Hey @nabinchha! Got a bit carried away with the Greptile review and landed a few more commits — mainly fixing edge cases in the async extension crash windows (stale original_target_num_records after a successful incremental metadata write) and a couple of other P1 suggestions. The PR should be in a much better shape now.

And yes, I'd love to write a dev note about this feature! Just let me know — should I add it to this PR, or would you prefer a separate one?

@przemekboruta przemekboruta requested a review from nabinchha May 8, 2026 11:15

Development

Successfully merging this pull request may close these issues.

feat: resume interrupted dataset generation runs