add COPY FROM SKIP_DUPLICATE_PK feature by ericyuanhui · Pull Request #601 · LadybugDB/ladybug

ericyuanhui · 2026-06-18T15:10:14Z

COPY FROM `SKIP_DUPLICATE_PK` Requirements

1. Background

Ladybug currently supports COPY FROM into node tables with primary-key validation. When duplicate primary keys are encountered:

with IGNORE_ERRORS=false, the query fails
with IGNORE_ERRORS=true, the query continues and the duplicated rows are skipped

The existing IGNORE_ERRORS=true behavior is already close to the desired outcome, but it has two practical limitations for repeated-PK-heavy imports:

duplicate-PK details are written into the warning pipeline and become visible through CALL show_warnings() RETURN *
large warning volumes can increase session memory usage unnecessarily when the caller only needs the conflicting PK values

This proposal introduces a focused option for file-based COPY FROM:

SKIP_DUPLICATE_PK=true

The new option should behave similarly to IGNORE_ERRORS=true for file imports, but with a special treatment for duplicate primary-key conflicts:

duplicated PK rows are skipped and import continues
duplicate-PK details are not written into the warning system
duplicate PK values are returned directly in the COPY result
all other errors continue to follow the existing behavior

2. Goal

Add SKIP_DUPLICATE_PK=true for COPY FROM on CSV and Parquet files targeting node tables, so that:

duplicate PK rows do not abort the import
duplicate PK rows are skipped
duplicate PK values are returned in a structured result
duplicate PK details do not enter WarningContext
non-duplicate-PK behavior remains aligned with existing file-import behavior

3. Non-Goals

This work does not include:

COPY FROM subquery support
relationship-table COPY
changing CALL show_warnings() behavior
changing generic IGNORE_ERRORS semantics
changing INSERT, MERGE, or non-COPY write paths
introducing new dedicated Python API methods
first-phase support for no-index PK validation paths

4. Scope

Included:

COPY <node_table> FROM "<file>.csv" (SKIP_DUPLICATE_PK=true)
COPY <node_table> FROM "<file>.parquet" (SKIP_DUPLICATE_PK=true)
multi-file CSV / Parquet imports
imports into node tables with primary-key hash-index support

Excluded:

COPY <node_table> FROM (<subquery>)
COPY <rel_table> ...
COPY FROM npy
node tables without a usable primary-key hash-index path for duplicate skip

5. Syntax

New option:

SKIP_DUPLICATE_PK=true

Examples:

COPY User FROM "user.csv" (SKIP_DUPLICATE_PK=true);
COPY User FROM "user.parquet" (SKIP_DUPLICATE_PK=true);
COPY User FROM ["a.csv", "b.csv"] (SKIP_DUPLICATE_PK=true);

6. Semantics

6.1 Default Behavior

If SKIP_DUPLICATE_PK is not specified, existing behavior remains unchanged.

6.2 Behavior When Enabled

When SKIP_DUPLICATE_PK=true is specified for file-based node copy:

if a row conflicts with an already existing PK in the target table:
- skip the row
- continue import
- record the conflicting PK value for the final result
- do not write the duplicate-PK detail into warnings
if a row conflicts with an earlier row from the same COPY execution:
- keep the first successful row
- skip the later conflicting row
- record the conflicting PK value for the final result
- do not write the duplicate-PK detail into warnings
if the error is not a duplicate-PK conflict:
- follow the existing flow
- do not convert the new option into a generic "ignore everything" mode

6.3 Interaction With Other Errors

The intent is:

duplicate PK gets special handling
everything else remains on the existing path

In particular:

NULL PK remains on the existing path
CSV parsing errors remain on the existing path
Parquet schema / decode errors remain on the existing path
cast / conversion errors remain on the existing path

That means the new option is not a replacement for all current error handling. It is a focused extension on top of current file-copy behavior, specifically for duplicated primary keys.

6.4 Warning Behavior

When SKIP_DUPLICATE_PK=true:

duplicate-PK conflicts must not be appended to WarningContext
duplicate-PK conflicts must not appear in:

CALL show_warnings() RETURN *;

All non-duplicate-PK warnings remain unchanged and continue to use the existing warning pipeline.

7. Result Contract

For compatibility:

if SKIP_DUPLICATE_PK is not enabled, COPY keeps the existing single-string result schema
if SKIP_DUPLICATE_PK=true, COPY returns an extended schema

Recommended extended schema:

result: STRING
skipped_duplicate_pk_count: INT64
skipped_duplicate_pks: STRING[]

Example:

result	skipped_duplicate_pk_count	skipped_duplicate_pks
`997 tuples have been copied to the User table.`	3	`["u01","u15","u99"]`

Notes:

PK values should be returned as strings regardless of physical PK type
returning a fixed STRING[] is preferred for Python API usability

8. Python API Expectations

No dedicated Python API change is required if the result is exposed as a normal query result set.

Expected usage:

res = conn.execute("COPY User FROM 'user.csv' (SKIP_DUPLICATE_PK=true)")
row = res.get_next()
row["skipped_duplicate_pks"]

Implication:

existing client code that assumes COPY always returns exactly one string column may need to handle the extended schema when the new option is enabled
client code that reads by column name should remain straightforward

9. Compatibility Requirements

disabled by default
no change to existing IGNORE_ERRORS behavior when the new option is absent
no change to COPY FROM subquery behavior, because subquery support is out of scope
no change to rel-table copy semantics
no change to show_warnings() contract

10. Validation Requirements

The feature should be considered complete when all of the following are true:

CSV node copy with duplicated PKs can continue with SKIP_DUPLICATE_PK=true
Parquet node copy with duplicated PKs can continue with SKIP_DUPLICATE_PK=true
duplicate PK values are returned in the final result
duplicate PK values do not appear in show_warnings()
non-duplicate errors still follow the existing path
subquery copy does not accept the new option
rel copy does not accept the new option

11. Current Test Coverage

Covered by current tests:

CSV duplicate PK skip for integer PKs
CSV duplicate PK skip for string PKs
repeated duplicate PK values returned in result
duplicate conflicts against already existing rows
warning suppression for duplicate PKs
rel copy rejection
subquery copy rejection
Parquet option validation
Parquet NULL PK still failing on the existing path

Recommended future additions:

multi-file SKIP_DUPLICATE_PK=true result verification
an explicit test for stable result column names in CLI / query-result metadata

Signed-off-by: ericyuanhui <285521263@qq.com>

adsharma · 2026-06-18T22:57:47Z

Thank you. Two major comments:

Unbounded pks vector + global mutex per duplicate (recordSkippedDuplicatePK): every duplicate takes duplicatePKSkipResult->mtx. On PARALLEL=true CSV with heavy duplication this serializes the insert pipeline on one mutex. Combined with the unbounded growth, this is both a perf and a memory hazard for the exact workload the feature targets.
Fragile count math in finalizeInternal (node_batch_insert.cpp):
```
  getNumRows() - getNumErroredRows() - skippedCount
```
This assumes skipped dup-PK rows are counted in getNumRows() (inserted-then-deleted) but not in getNumErroredRows(), and that skippedCount corrects for the over-count. That's a three-way invariant spread across the error handler, the index builder, and finalize. If any future change makes deleteCurrentErroneousRow decrement the row counter, or makes the dup path increment numErroredRows, this silently double-subtracts and reports a wrong copied-tuple count. Worth a comment at minimum, ideally an assertion.

For the first issue, if there are a million dup pks, returning the first N (say 10) should suffice.

Also the PR was against 0.17.1. I'll rebase it to main to check if the tests still pass.

ericyuanhui · 2026-06-19T12:34:58Z

Unbounded pks vector + global mutex per duplicate (recordSkippedDuplicatePK): every duplicate takes duplicatePKSkipResult->mtx. On PARALLEL=true CSV with heavy duplication this serializes the insert pipeline on one mutex. Combined with the unbounded growth, this is both a perf and a memory hazard for the exact workload the feature targets.

Sorry, I didn’t quite get this. Copying from the node table—why would there be heavy duplication? If this feature isn’t enabled, then copy from will throw an error if there are duplicate PKs. Even if you enable IGNORE_ERRORS=false, duplicates still won’t be inserted; the error just gets put into WarningContext, which also takes up memory. What we expect is that user data has as few duplicate PKs as possible, or none at all. If there are duplicates, we want to return them directly to the user, so they can handle it in the next steps. This change is mostly to better support Parquet files. Right now, Parquet’s WarningContext content is also limited, so to make it more usable, we’ve made changes to copy from for both CSV and Parquet. This won’t affect the original execution logic, and it only takes effect if you enable the new parameter SKIP_DUPLICATE_PK=true.

ericyuanhui · 2026-06-19T13:15:15Z

Thank you. Two major comments:
Unbounded pks vector + global mutex per duplicate (recordSkippedDuplicatePK): every duplicate takes duplicatePKSkipResult->mtx. On PARALLEL=true CSV with heavy duplication this serializes the insert pipeline on one mutex. Combined with the unbounded growth, this is both a perf and a memory hazard for the exact workload the feature targets.
Fragile count math in finalizeInternal (node_batch_insert.cpp):
  getNumRows() - getNumErroredRows() - skippedCount
This assumes skipped dup-PK rows are counted in getNumRows() (inserted-then-deleted) but not in getNumErroredRows(), and that skippedCount corrects for the over-count. That's a three-way invariant spread across the error handler, the index builder, and finalize. If any future change makes deleteCurrentErroneousRow decrement the row counter, or makes the dup path increment numErroredRows, this silently double-subtracts and reports a wrong copied-tuple count. Worth a comment at minimum, ideally an assertion.
For the first issue, if there are a million dup pks, returning the first N (say 10) should suffice.

Also the PR was against 0.17.1. I'll rebase it to main to check if the tests still pass.

Thanks for the detailed review. I addressed both points in this round.
For the duplicate-PK collection path, I kept the existing product behavior of returning all duplicate PK values, but changed the implementation to avoid taking the shared result mutex on every duplicate. Duplicate PKs are now collected in thread-local state during execution and merged once per worker into the shared final result. I also moved PK-to-string conversion out of the shared critical path. This keeps the full result contract intact while reducing mutex contention under highly duplicated parallel imports.
For the copied-tuple count, I kept the current formula but made the invariant explicit in code. In finalizeInternal, I added a comment documenting that duplicate-PK rows are counted in getNumRows(), do not contribute to getNumErroredRows(), and are subtracted explicitly via skippedCount. I also added a debug assertion to guard that invariant so future changes are more likely to fail loudly instead of silently misreporting the copied row count.
I also added coverage for:
multi-file SKIP_DUPLICATE_PK=true
stable result column names in query-result metadata
I did not change the user-visible result contract in this round: skipped_duplicate_pks still returns the full list rather than a truncated sample.
If you want, I can also turn this into a shorter PR comment version.

adsharma · 2026-06-19T15:58:41Z

I addressed both points in this round.

Where is the code? Did you forget to push?

Your comments look reasonable.

ericyuanhui force-pushed the v17_1 branch from 3480b9c to e748d42 Compare June 18, 2026 15:19

add COPY FROM SKIP_DUPLICATE_PK feature

41b2df8

Signed-off-by: ericyuanhui <285521263@qq.com>

adsharma force-pushed the v17_1 branch from e748d42 to 41b2df8 Compare June 18, 2026 22:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add COPY FROM SKIP_DUPLICATE_PK feature#601

add COPY FROM SKIP_DUPLICATE_PK feature#601
ericyuanhui wants to merge 1 commit into
LadybugDB:mainfrom
ericyuanhui:v17_1

ericyuanhui commented Jun 18, 2026

Uh oh!

adsharma commented Jun 18, 2026

Uh oh!

ericyuanhui commented Jun 19, 2026

Uh oh!

ericyuanhui commented Jun 19, 2026

Uh oh!

adsharma commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ericyuanhui commented Jun 18, 2026

COPY FROM SKIP_DUPLICATE_PK Requirements

1. Background

2. Goal

3. Non-Goals

4. Scope

5. Syntax

6. Semantics

6.1 Default Behavior

6.2 Behavior When Enabled

6.3 Interaction With Other Errors

6.4 Warning Behavior

7. Result Contract

8. Python API Expectations

9. Compatibility Requirements

10. Validation Requirements

11. Current Test Coverage

Uh oh!

adsharma commented Jun 18, 2026

Uh oh!

ericyuanhui commented Jun 19, 2026

Uh oh!

ericyuanhui commented Jun 19, 2026

Uh oh!

adsharma commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

COPY FROM `SKIP_DUPLICATE_PK` Requirements