Skip to content

add COPY FROM SKIP_DUPLICATE_PK feature#601

Open
ericyuanhui wants to merge 1 commit into
LadybugDB:mainfrom
ericyuanhui:v17_1
Open

add COPY FROM SKIP_DUPLICATE_PK feature#601
ericyuanhui wants to merge 1 commit into
LadybugDB:mainfrom
ericyuanhui:v17_1

Conversation

@ericyuanhui

Copy link
Copy Markdown

COPY FROM SKIP_DUPLICATE_PK Requirements

1. Background

Ladybug currently supports COPY FROM into node tables with primary-key validation. When duplicate primary keys are encountered:

  • with IGNORE_ERRORS=false, the query fails
  • with IGNORE_ERRORS=true, the query continues and the duplicated rows are skipped

The existing IGNORE_ERRORS=true behavior is already close to the desired outcome, but it has two practical limitations for repeated-PK-heavy imports:

  1. duplicate-PK details are written into the warning pipeline and become visible through CALL show_warnings() RETURN *
  2. large warning volumes can increase session memory usage unnecessarily when the caller only needs the conflicting PK values

This proposal introduces a focused option for file-based COPY FROM:

SKIP_DUPLICATE_PK=true

The new option should behave similarly to IGNORE_ERRORS=true for file imports, but with a special treatment for duplicate primary-key conflicts:

  • duplicated PK rows are skipped and import continues
  • duplicate-PK details are not written into the warning system
  • duplicate PK values are returned directly in the COPY result
  • all other errors continue to follow the existing behavior

2. Goal

Add SKIP_DUPLICATE_PK=true for COPY FROM on CSV and Parquet files targeting node tables, so that:

  • duplicate PK rows do not abort the import
  • duplicate PK rows are skipped
  • duplicate PK values are returned in a structured result
  • duplicate PK details do not enter WarningContext
  • non-duplicate-PK behavior remains aligned with existing file-import behavior

3. Non-Goals

This work does not include:

  • COPY FROM subquery support
  • relationship-table COPY
  • changing CALL show_warnings() behavior
  • changing generic IGNORE_ERRORS semantics
  • changing INSERT, MERGE, or non-COPY write paths
  • introducing new dedicated Python API methods
  • first-phase support for no-index PK validation paths

4. Scope

Included:

  • COPY <node_table> FROM "<file>.csv" (SKIP_DUPLICATE_PK=true)
  • COPY <node_table> FROM "<file>.parquet" (SKIP_DUPLICATE_PK=true)
  • multi-file CSV / Parquet imports
  • imports into node tables with primary-key hash-index support

Excluded:

  • COPY <node_table> FROM (<subquery>)
  • COPY <rel_table> ...
  • COPY FROM npy
  • node tables without a usable primary-key hash-index path for duplicate skip

5. Syntax

New option:

SKIP_DUPLICATE_PK=true

Examples:

COPY User FROM "user.csv" (SKIP_DUPLICATE_PK=true);
COPY User FROM "user.parquet" (SKIP_DUPLICATE_PK=true);
COPY User FROM ["a.csv", "b.csv"] (SKIP_DUPLICATE_PK=true);

6. Semantics

6.1 Default Behavior

If SKIP_DUPLICATE_PK is not specified, existing behavior remains unchanged.

6.2 Behavior When Enabled

When SKIP_DUPLICATE_PK=true is specified for file-based node copy:

  • if a row conflicts with an already existing PK in the target table:

    • skip the row
    • continue import
    • record the conflicting PK value for the final result
    • do not write the duplicate-PK detail into warnings
  • if a row conflicts with an earlier row from the same COPY execution:

    • keep the first successful row
    • skip the later conflicting row
    • record the conflicting PK value for the final result
    • do not write the duplicate-PK detail into warnings
  • if the error is not a duplicate-PK conflict:

    • follow the existing flow
    • do not convert the new option into a generic "ignore everything" mode

6.3 Interaction With Other Errors

The intent is:

  • duplicate PK gets special handling
  • everything else remains on the existing path

In particular:

  • NULL PK remains on the existing path
  • CSV parsing errors remain on the existing path
  • Parquet schema / decode errors remain on the existing path
  • cast / conversion errors remain on the existing path

That means the new option is not a replacement for all current error handling. It is a focused extension on top of current file-copy behavior, specifically for duplicated primary keys.

6.4 Warning Behavior

When SKIP_DUPLICATE_PK=true:

  • duplicate-PK conflicts must not be appended to WarningContext
  • duplicate-PK conflicts must not appear in:
CALL show_warnings() RETURN *;

All non-duplicate-PK warnings remain unchanged and continue to use the existing warning pipeline.

7. Result Contract

For compatibility:

  • if SKIP_DUPLICATE_PK is not enabled, COPY keeps the existing single-string result schema
  • if SKIP_DUPLICATE_PK=true, COPY returns an extended schema

Recommended extended schema:

  • result: STRING
  • skipped_duplicate_pk_count: INT64
  • skipped_duplicate_pks: STRING[]

Example:

result skipped_duplicate_pk_count skipped_duplicate_pks
997 tuples have been copied to the User table. 3 ["u01","u15","u99"]

Notes:

  • PK values should be returned as strings regardless of physical PK type
  • returning a fixed STRING[] is preferred for Python API usability

8. Python API Expectations

No dedicated Python API change is required if the result is exposed as a normal query result set.

Expected usage:

res = conn.execute("COPY User FROM 'user.csv' (SKIP_DUPLICATE_PK=true)")
row = res.get_next()
row["skipped_duplicate_pks"]

Implication:

  • existing client code that assumes COPY always returns exactly one string column may need to handle the extended schema when the new option is enabled
  • client code that reads by column name should remain straightforward

9. Compatibility Requirements

  • disabled by default
  • no change to existing IGNORE_ERRORS behavior when the new option is absent
  • no change to COPY FROM subquery behavior, because subquery support is out of scope
  • no change to rel-table copy semantics
  • no change to show_warnings() contract

10. Validation Requirements

The feature should be considered complete when all of the following are true:

  1. CSV node copy with duplicated PKs can continue with SKIP_DUPLICATE_PK=true
  2. Parquet node copy with duplicated PKs can continue with SKIP_DUPLICATE_PK=true
  3. duplicate PK values are returned in the final result
  4. duplicate PK values do not appear in show_warnings()
  5. non-duplicate errors still follow the existing path
  6. subquery copy does not accept the new option
  7. rel copy does not accept the new option

11. Current Test Coverage

Covered by current tests:

  • CSV duplicate PK skip for integer PKs
  • CSV duplicate PK skip for string PKs
  • repeated duplicate PK values returned in result
  • duplicate conflicts against already existing rows
  • warning suppression for duplicate PKs
  • rel copy rejection
  • subquery copy rejection
  • Parquet option validation
  • Parquet NULL PK still failing on the existing path

Recommended future additions:

  • multi-file SKIP_DUPLICATE_PK=true result verification
  • an explicit test for stable result column names in CLI / query-result metadata

Signed-off-by: ericyuanhui <285521263@qq.com>
@adsharma

Copy link
Copy Markdown
Contributor

Thank you. Two major comments:

  • Unbounded pks vector + global mutex per duplicate (recordSkippedDuplicatePK): every duplicate takes duplicatePKSkipResult->mtx. On PARALLEL=true CSV with heavy duplication this serializes the insert pipeline on one mutex. Combined with the unbounded growth, this is both a perf and a memory hazard for the exact workload the feature targets.

  • Fragile count math in finalizeInternal (node_batch_insert.cpp):

      getNumRows() - getNumErroredRows() - skippedCount

    This assumes skipped dup-PK rows are counted in getNumRows() (inserted-then-deleted) but not in getNumErroredRows(), and that skippedCount corrects for the over-count. That's a three-way invariant spread across the error handler, the index builder, and finalize. If any future change makes deleteCurrentErroneousRow decrement the row counter, or makes the dup path increment numErroredRows, this silently double-subtracts and reports a wrong copied-tuple count. Worth a comment at minimum, ideally an assertion.

For the first issue, if there are a million dup pks, returning the first N (say 10) should suffice.

Also the PR was against 0.17.1. I'll rebase it to main to check if the tests still pass.

@ericyuanhui

Copy link
Copy Markdown
Author
  • Unbounded pks vector + global mutex per duplicate (recordSkippedDuplicatePK): every duplicate takes duplicatePKSkipResult->mtx. On PARALLEL=true CSV with heavy duplication this serializes the insert pipeline on one mutex. Combined with the unbounded growth, this is both a perf and a memory hazard for the exact workload the feature targets.

Sorry, I didn’t quite get this. Copying from the node table—why would there be heavy duplication? If this feature isn’t enabled, then copy from will throw an error if there are duplicate PKs. Even if you enable IGNORE_ERRORS=false, duplicates still won’t be inserted; the error just gets put into WarningContext, which also takes up memory. What we expect is that user data has as few duplicate PKs as possible, or none at all. If there are duplicates, we want to return them directly to the user, so they can handle it in the next steps. This change is mostly to better support Parquet files. Right now, Parquet’s WarningContext content is also limited, so to make it more usable, we’ve made changes to copy from for both CSV and Parquet. This won’t affect the original execution logic, and it only takes effect if you enable the new parameter SKIP_DUPLICATE_PK=true.

@ericyuanhui

Copy link
Copy Markdown
Author

Thank you. Two major comments:

  • Unbounded pks vector + global mutex per duplicate (recordSkippedDuplicatePK): every duplicate takes duplicatePKSkipResult->mtx. On PARALLEL=true CSV with heavy duplication this serializes the insert pipeline on one mutex. Combined with the unbounded growth, this is both a perf and a memory hazard for the exact workload the feature targets.

  • Fragile count math in finalizeInternal (node_batch_insert.cpp):

      getNumRows() - getNumErroredRows() - skippedCount

    This assumes skipped dup-PK rows are counted in getNumRows() (inserted-then-deleted) but not in getNumErroredRows(), and that skippedCount corrects for the over-count. That's a three-way invariant spread across the error handler, the index builder, and finalize. If any future change makes deleteCurrentErroneousRow decrement the row counter, or makes the dup path increment numErroredRows, this silently double-subtracts and reports a wrong copied-tuple count. Worth a comment at minimum, ideally an assertion.

For the first issue, if there are a million dup pks, returning the first N (say 10) should suffice.

Also the PR was against 0.17.1. I'll rebase it to main to check if the tests still pass.

Thanks for the detailed review. I addressed both points in this round.
For the duplicate-PK collection path, I kept the existing product behavior of returning all duplicate PK values, but changed the implementation to avoid taking the shared result mutex on every duplicate. Duplicate PKs are now collected in thread-local state during execution and merged once per worker into the shared final result. I also moved PK-to-string conversion out of the shared critical path. This keeps the full result contract intact while reducing mutex contention under highly duplicated parallel imports.
For the copied-tuple count, I kept the current formula but made the invariant explicit in code. In finalizeInternal, I added a comment documenting that duplicate-PK rows are counted in getNumRows(), do not contribute to getNumErroredRows(), and are subtracted explicitly via skippedCount. I also added a debug assertion to guard that invariant so future changes are more likely to fail loudly instead of silently misreporting the copied row count.
I also added coverage for:
multi-file SKIP_DUPLICATE_PK=true
stable result column names in query-result metadata
I did not change the user-visible result contract in this round: skipped_duplicate_pks still returns the full list rather than a truncated sample.
If you want, I can also turn this into a shorter PR comment version.

@adsharma

Copy link
Copy Markdown
Contributor

I addressed both points in this round.

Where is the code? Did you forget to push?

Your comments look reasonable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants