[fix](variant) allow inverted index pushdown for cast predicates on variant subcolumns#63118
wuguowei1994 wants to merge 2 commits into
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
…ariant subcolumns
e75111a to 904d4c0
run buildall
/review
eldenmoon left a comment
I found a correctness blocker in the relaxed variant predicate compatibility check. The current regression covers only the same-width CAST(... AS INT) case, but this change also enables cross-width integer casts and same-family string casts without normalizing the predicate value to the segment storage encoding.
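To make the concern concrete, here is a toy sketch of why a probe value that is not normalized to the storage width can miss. It assumes a hypothetical index whose terms are the raw fixed-width bytes of the stored integer; `encode_term` and `probe_matches` are illustrative names, not Doris functions, and the real index encoding may differ.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

// Hypothetical term encoder: the raw bytes of a fixed-width integer.
// This is NOT Doris's actual index encoding; it only shows width sensitivity.
template <typename T>
std::string encode_term(T value) {
    static_assert(std::is_integral<T>::value, "integer types only");
    std::string term(sizeof(T), '\0');
    std::memcpy(&term[0], &value, sizeof(T));
    return term;
}

// True when a probe encoded at the predicate's width equals the term
// that was written at the storage width.
template <typename StorageT, typename PredicateT>
bool probe_matches(StorageT stored, PredicateT probe) {
    return encode_term(stored) == encode_term(probe);
}
```

Under this assumption, the value 13 stored as a 1-byte integer and probed as a 4-byte integer yields terms of different lengths, so the lookup misses even though the values are numerically equal.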
TPC-H: Total hot run time: 29643 ms
Thank you very much for the patient and detailed feedback. I have reconsidered the scope and decided to address the type compatibility issue more comprehensively in this PR. The implementation has been updated to:
This ensures both correctness and broader usability without introducing unsafe behavior. I have also strengthened the regression tests:
This change makes the CAST predicate pushdown on VARIANT subcolumns work reliably in the common usage patterns recommended for VARIANT. PS:
TPC-DS: Total hot run time: 170611 ms
43fb305 to d4d86b8
…ariant subcolumns
d4d86b8 to 6c2b533
Thanks for updating the patch to avoid unsafe cross-width index encoding. I still think this needs changes because the latest exact-type rule no longer fixes the original CAST(v["int_key"] AS INT) scenario described in the PR body. Critical checkpoints: goal/test coverage is not satisfied because only same-width TINYINT pushdown is now proven while the documented INT reproduction remains non-pushdown; the code change is small and focused; no new concurrency, lifecycle, config, persistence, FE-BE protocol, or storage-format compatibility concerns were introduced; the main correctness risk is now an incomplete fix rather than wrong-result pushdown; observability is unchanged and adequate for this path through the existing debug/profile checks. User focus: no additional user-provided focus was specified.
} else {
    return false;
}
auto normalized_storage_type = remove_nullable(storage_column_type);
This exact-type check avoids the unsafe cross-width index encoding that was raised earlier, but it also means the PR no longer fixes the reproduction in the PR body. For the inserted values 1..20, the new regression test itself documents the inferred storage type as TINYINT; when the query is the original CAST(v["int_key"] AS INT) = 13, target_cast_type_for_variants is INT, storage_column_type is TINYINT, and this returns false, so the common expression remains outside the inverted-index path and the scan still reads all rows. The added positive test was changed to CAST AS TINYINT, so it does not prove the stated CAST AS INT behavior. Please either implement a safe conversion of predicate values to the segment storage type with range/overflow checks, or narrow the PR/test expectations so they no longer claim to fix the INT cast case.
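As a rough sketch of the first option (converting the predicate value to the segment storage type with a range check), something along these lines could work. `narrow_literal` is a hypothetical helper written for illustration, not an existing Doris function, and it assumes the literal has already been parsed into a signed 64-bit integer.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <type_traits>

// Hypothetical helper (not a Doris API): convert a predicate literal to the
// storage integer type, or report that it cannot be represented there.
template <typename StorageT>
std::optional<StorageT> narrow_literal(std::int64_t literal) {
    static_assert(std::is_integral<StorageT>::value && std::is_signed<StorageT>::value,
                  "sketch covers signed integer storage types only");
    const std::int64_t lo = std::numeric_limits<StorageT>::min();
    const std::int64_t hi = std::numeric_limits<StorageT>::max();
    if (literal < lo || literal > hi) {
        return std::nullopt;  // out of range: do not encode at the storage width
    }
    return static_cast<StorageT>(literal);
}
```

With TINYINT storage, `narrow_literal<std::int8_t>(13)` yields 13 and could be encoded at the storage width, while `narrow_literal<std::int8_t>(300)` yields no value. For an equality predicate, an out-of-range literal cannot equal any value held in that storage type, so the caller could either fall back to the non-index path or treat the predicate as matching no rows in that segment; which is appropriate is a design decision for the patch.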
@eldenmoon After reconsidering it, I believe we should strive for higher standards ourselves. I’ve revised the approach described in the comment above. Please give me one week to come back with a better implementation.
Currently only BIGINT, among the integer types, will be inferred.
Summary
On the current master branch, inverted index predicate pushdown does not work correctly when querying VARIANT fields with explicit CAST.
This is a serious issue because it is not limited to a single target type. In our testing, predicates in the form of CAST(variant_field["key"] AS <type>) = ... fail to leverage the inverted index properly across casted VARIANT access patterns.
This is especially problematic because the recommended usage for VARIANT fields is to explicitly use CAST when extracting typed values. In our internal production workloads, VARIANT is heavily used, and all business teams are required to query VARIANT subfields through explicit CAST. As a result, these queries cannot benefit from inverted index filtering and end up scanning significantly more rows than expected, causing severe performance degradation.
For production workloads with large VARIANT columns, this effectively makes the inverted index unusable for the officially recommended query pattern, which has a major impact on query latency and resource consumption.
Reproduction
Expected Behavior
The predicate CAST(v["int_key"] AS INT) = 13 should be pushed down to the inverted index on the VARIANT column, and the query should use the inverted index to filter rows before data scanning.
Only the matching row should need to be read after index filtering.
Actual Behavior
The query result is correct, but the query profile shows that the inverted index does not effectively filter the data.
Instead of being pruned by the inverted index, all 20 rows are still read/scanned. This indicates that the predicate involving CAST on the VARIANT subfield is not correctly handled by inverted index predicate pushdown.
Please check the query profile after running the reproduction SQL. The key point is that the inverted index does not successfully reduce the scanned rows for the casted VARIANT predicate.
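The gap between expected and actual behavior can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not Doris internals: `ToyInvertedIndex`, `rows_read_without_pushdown`, and `make_column` are hypothetical names, and the column simply holds the values 1..20 mentioned in the review.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy model (not Doris internals): maps each value to the row ids holding it.
struct ToyInvertedIndex {
    std::map<std::int64_t, std::vector<std::size_t>> postings;

    explicit ToyInvertedIndex(const std::vector<std::int64_t>& column) {
        for (std::size_t row = 0; row < column.size(); ++row) {
            postings[column[row]].push_back(row);
        }
    }

    // Row ids an equality predicate still needs to read after index filtering.
    std::vector<std::size_t> rows_for(std::int64_t value) const {
        auto it = postings.find(value);
        return it == postings.end() ? std::vector<std::size_t>{} : it->second;
    }
};

// Rows read when the predicate is not pushed down: every row is scanned.
std::size_t rows_read_without_pushdown(const std::vector<std::int64_t>& column) {
    return column.size();
}

// Builds the values 1..n, mirroring the 20-row example discussed in this PR.
std::vector<std::int64_t> make_column(std::int64_t n) {
    std::vector<std::int64_t> column;
    for (std::int64_t v = 1; v <= n; ++v) column.push_back(v);
    return column;
}
```

In this model, an equality lookup for 13 over 20 rows returns a single row id, whereas the non-pushdown path touches all 20 rows. That is the difference the query profile is expected to show once pushdown works.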