add COPY FROM SKIP_DUPLICATE_PK feature #601
Thank you. Two major comments:
For the first issue, if there are a million duplicate PKs, returning the first N (say 10) should suffice. Also, the PR was against 0.17.1. I'll rebase it onto main to check whether the tests still pass.
Sorry, I didn't quite get this. When copying into a node table, why would there be heavy duplication? If this feature isn't enabled, COPY FROM throws an error when there are duplicate PKs. Even with IGNORE_ERRORS=true, duplicates still aren't inserted; the error just gets put into WarningContext, which also takes up memory. What we expect is that user data has as few duplicate PKs as possible, or none at all. If there are duplicates, we want to return them directly to the user so they can handle them in the next steps. This change is mostly to better support Parquet files. Right now, the WarningContext content for Parquet is also limited, so to make it more usable, we've changed COPY FROM for both CSV and Parquet. This doesn't affect the original execution logic, and it only takes effect when the new parameter SKIP_DUPLICATE_PK=true is set.
Thanks for the detailed review. I addressed both points in this round.
Where is the code? Did you forget to push? Your comments look reasonable.
Pushed it just now. Thanks.
Signed-off-by: ericyuanhui <285521263@qq.com>
COPY FROM SKIP_DUPLICATE_PK Requirements

1. Background
Ladybug currently supports COPY FROM into node tables with primary-key validation. When duplicate primary keys are encountered:
- with IGNORE_ERRORS=false, the query fails
- with IGNORE_ERRORS=true, the query continues and the duplicated rows are skipped

The existing IGNORE_ERRORS=true behavior is already close to the desired outcome, but it has two practical limitations for repeated-PK-heavy imports:
- CALL show_warnings() RETURN *

This proposal introduces a focused option for file-based COPY FROM: SKIP_DUPLICATE_PK=true.

The new option should behave similarly to IGNORE_ERRORS=true for file imports, but with a special treatment for duplicate primary-key conflicts:
- COPY result

2. Goal
Add SKIP_DUPLICATE_PK=true for COPY FROM on CSV and Parquet files targeting node tables, so that:
- WarningContext

3. Non-Goals
This work does not include:
- COPY FROM subquery support
- COPY
- CALL show_warnings() behavior
- IGNORE_ERRORS semantics
- INSERT, MERGE, or non-COPY write paths

4. Scope
Included:
- COPY <node_table> FROM "<file>.csv" (SKIP_DUPLICATE_PK=true)
- COPY <node_table> FROM "<file>.parquet" (SKIP_DUPLICATE_PK=true)

Excluded:
- COPY <node_table> FROM (<subquery>)
- COPY <rel_table> ...
- COPY FROM npy

5. Syntax
New option:
Examples:
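The statement forms listed under Scope can also be composed programmatically. The following is a minimal sketch, assuming a hypothetical helper named `build_copy_statement` (it is not part of Ladybug) that attaches the option only for the two file types this proposal covers:

```python
from pathlib import PurePosixPath

# File extensions that the proposal's Scope section lists as included.
SUPPORTED_SUFFIXES = {".csv", ".parquet"}


def build_copy_statement(node_table: str, file_path: str,
                         skip_duplicate_pk: bool = True) -> str:
    """Compose a COPY FROM statement string (hypothetical helper, illustration only)."""
    statement = f'COPY {node_table} FROM "{file_path}"'
    if not skip_duplicate_pk:
        # Without the option, the statement is left in its existing form.
        return statement
    suffix = PurePosixPath(file_path).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(
            f"SKIP_DUPLICATE_PK is only proposed for CSV and Parquet, got {suffix!r}"
        )
    return statement + " (SKIP_DUPLICATE_PK=true)"


print(build_copy_statement("User", "users.csv"))
print(build_copy_statement("User", "users.parquet"))
```

The table name `User` and the file names are placeholders. Rejecting other file types on the client side is a design choice of this sketch, not something the proposal specifies.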
6. Semantics
6.1 Default Behavior
If SKIP_DUPLICATE_PK is not specified, existing behavior remains unchanged.

6.2 Behavior When Enabled
When SKIP_DUPLICATE_PK=true is specified for file-based node copy:
- if a row conflicts with an already existing PK in the target table:
- if a row conflicts with an earlier row from the same COPY execution:
- if the error is not a duplicate-PK conflict:
6.3 Interaction With Other Errors
The intent is:
In particular:
- NULL PK remains on the existing path

That means the new option is not a replacement for all current error handling. It is a focused extension on top of current file-copy behavior, specifically for duplicated primary keys.
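The split between duplicate-PK conflicts and other errors can be illustrated with a toy model. This is a sketch under stated assumptions, not Ladybug's implementation: the function name `copy_with_skip_duplicate_pk` is hypothetical, primary keys are plain strings, a NULL PK is modelled as `None` and simply raises, and a key that repeats several times is recorded once per skipped row.

```python
from typing import Iterable, List, Optional, Set, Tuple


def copy_with_skip_duplicate_pk(
    existing_pks: Set[str],
    rows: Iterable[Tuple[Optional[str], dict]],
) -> Tuple[List[Tuple[str, dict]], List[str]]:
    """Toy model of the proposed semantics (hypothetical, illustration only).

    Returns (inserted_rows, skipped_duplicate_pks).
    """
    seen = set(existing_pks)  # PKs already present in the target table
    inserted: List[Tuple[str, dict]] = []
    skipped: List[str] = []
    for pk, props in rows:
        if pk is None:
            # Not a duplicate-PK conflict: stays on the existing error path.
            raise ValueError("NULL primary key")
        if pk in seen:
            # Conflict with the table or with an earlier row of this COPY.
            skipped.append(pk)
            continue
        seen.add(pk)
        inserted.append((pk, props))
    return inserted, skipped


inserted, skipped = copy_with_skip_duplicate_pk(
    {"u01"},
    [("u01", {}), ("u02", {}), ("u02", {}), ("u03", {})],
)
print(len(inserted), skipped)  # 2 ['u01', 'u02']
```

Here `u01` collides with a key already in the table and the second `u02` collides with an earlier row of the same import, so both are skipped and reported, while the remaining rows are inserted.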
6.4 Warning Behavior
When SKIP_DUPLICATE_PK=true:
- WarningContext

All non-duplicate-PK warnings remain unchanged and continue to use the existing warning pipeline.
7. Result Contract
For compatibility:
- if SKIP_DUPLICATE_PK is not enabled, COPY keeps the existing single-string result schema
- with SKIP_DUPLICATE_PK=true, COPY returns an extended schema

Recommended extended schema:
- result: STRING
- skipped_duplicate_pk_count: INT64
- skipped_duplicate_pks: STRING[]

Example:
- 997 tuples have been copied to the User table.
- ["u01","u15","u99"]

Notes:
- STRING[] is preferred for Python API usability

8. Python API Expectations
No dedicated Python API change is required if the result is exposed as a normal query result set.
Expected usage:
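A sketch of how calling code might accept either result schema. It assumes the result row arrives as a plain Python sequence; no Ladybug client call is shown, and `CopyResult` and `parse_copy_row` are hypothetical names. The count of 3 in the sample row is an assumption chosen to match the three keys in the example list.

```python
from typing import List, NamedTuple, Sequence


class CopyResult(NamedTuple):
    message: str
    skipped_duplicate_pk_count: int
    skipped_duplicate_pks: List[str]


def parse_copy_row(row: Sequence) -> CopyResult:
    """Normalize a COPY result row under either schema (hypothetical helper)."""
    if len(row) == 1:
        # Existing single-string schema: the option was not enabled.
        return CopyResult(str(row[0]), 0, [])
    if len(row) == 3:
        # Extended schema: result, skipped_duplicate_pk_count, skipped_duplicate_pks.
        message, count, pks = row
        return CopyResult(str(message), int(count), list(pks))
    raise ValueError(f"unexpected COPY result width: {len(row)}")


# Sample row; the count value 3 is assumed for illustration.
row = ["997 tuples have been copied to the User table.", 3, ["u01", "u15", "u99"]]
result = parse_copy_row(row)
print(result.skipped_duplicate_pk_count, result.skipped_duplicate_pks)
```

Branching on the row width keeps callers working whether or not the option is enabled, which is the case the implication below is concerned with.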
Implication:
- code that assumes COPY always returns exactly one string column may need to handle the extended schema when the new option is enabled

9. Compatibility Requirements
- IGNORE_ERRORS behavior when the new option is absent
- COPY FROM subquery behavior, because subquery support is out of scope
- show_warnings() contract

10. Validation Requirements
The feature should be considered complete when all of the following are true:
- SKIP_DUPLICATE_PK=true
- SKIP_DUPLICATE_PK=true
- show_warnings()

11. Current Test Coverage
Covered by current tests:
- NULL PK still failing on the existing path

Recommended future additions:
- SKIP_DUPLICATE_PK=true result verification