fix(aggregate): show aliased expr in explain#21739
Conversation
|
cc @pepijnve |
|
It has much more cases than I realized initially 😢 |
|
I updated the code with new changes along with pr body |
|
I tried two approaches for this. The first approach was by hiding the bit in alias metadata. It fixed the display problem, but it also mixed planner-only state with user metadata. That made the behavior harder to reason about and forced extra handling in places like equality, hashing, serialization, and rewrite logic. In practice, a simple display fix startedaffecting alias identity and metadata flow in too many places. The current approach adds an explicit internal flag to alias. This is a small API break, but it makes the model much clearer: whether an alias is user written or planner generated is now part of the type itself, not hidden in metadata. That keeps the display logic direct, avoids hidden state leaking into unrelated code paths, and makes future maintenance safer because the intent is visible and compiler checked. Would love to hear your thoughts @pepijnve |
|
@kumarUjjawal could you give an example of where/when the internal flag is necessary? |
| .map(|expr| expr.human_display()) | ||
| .map(|expr| { | ||
| let human_display = expr.human_display(); | ||
| if human_display.is_empty() { |
There was a problem hiding this comment.
Should human_display be Option<String>?
There was a problem hiding this comment.
Yeah I could do that.
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn test_aggregate_explain_shows_aliased_expression() -> Result<()> { |
There was a problem hiding this comment.
These might be more concise as SLTs
I think the current behavior is intentional. My thinking:
count() is in the second group, so keeping aggr=[count()] is expected. The bug here is about user aliases like sum(...) as agg disappearing in physical explain, not about exposing every internal rewrite. So while the physical plan does execute the lowered count(1) form, showing count(1) as count() in explain would expose planner internals and would regress the compact count() output we already preserve elsewhere. If we want physical explain to show lowered forms more generally, I think that should be doable but will require some changes. |
A good example could be: Internally, COUNT() is lowered to count(1) so it can use the normal aggregate path. But the user did not write count(1), they wrote count(). So the planner needs a way to preserve that user-facing name without treating it like a real user alias. That is what the internal flag is for: it marks aliases that exist only because the planner rewrote the expression. Without that bit, physical explain cannot tell these two cases apart:
Those should display differently. For example, with: SELECT COUNT(*) AS total_rows FROM t there are really two alias layers:
The internal flag lets explain show count() as total_rows, instead of either exposing the lowered form count(1) as count() as total_rows or collapsing everything to just total_rows. |
|
This might just be personal preference speaking, but in a physical explain plan I'm looking for what the engine is actually doing (how is it being executed), not what I wrote as query. It doesn't make sense for me that the logical plan (which is more declarative in nature) shows the lowered version, while the physical plan (which is more imperative) does not. If anything it should be the other way around that logical hides this detail, and physical shows it. So if I got to choose I would prefer the left side plan of the diff image above rather than the right one. |
I think I like this approach too, showing lowered expressions in physical explain can be more useful. Let me see how the code looks for this. |
21c1690 to
2580bc5
Compare
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
| relation: None, | ||
| name: field.name().to_string(), | ||
| metadata: None, | ||
| is_internal: false, |
There was a problem hiding this comment.
Do we still need the is_internal flag? I may have overlooked it, but it doesn't seem to actually be used anymore. It is passed on everywhere, but in the end setting it to true vs false doesn't seem to have an impact on the explain output.
There was a problem hiding this comment.
No we don't need it now. I will remove it. Thanks you!
| 03)----CoalescePartitionsExec | ||
| 04)------AggregateExec: mode=Partial, gby=[], aggr=[sum(alias1), sum(alias2)] | ||
| 05)--------AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[alias2] | ||
| 05)--------AggregateExec: mode=FinalPartitioned, gby=[alias1@0 as alias1], aggr=[sum(__common_expr_1) as alias2] |
There was a problem hiding this comment.
Perfect! This is an example of exactly the kind of issue I was struggling with in the explain plan output. The left hand side here which only contains aggr=[alias2] makes the plan unusable.
| 01)AggregateExec: mode=Final, gby=[], aggr=[max(aggregate_test_100.c2) as percentile_cont(aggregate_test_100.c2,Float64(1))] | ||
| 02)--CoalescePartitionsExec | ||
| 03)----AggregateExec: mode=Partial, gby=[], aggr=[percentile_cont(aggregate_test_100.c2,Float64(1))] | ||
| 03)----AggregateExec: mode=Partial, gby=[], aggr=[max(aggregate_test_100.c2) as percentile_cont(aggregate_test_100.c2,Float64(1))] |
There was a problem hiding this comment.
This is a nice improvement since it makes the optimisation that happened visible in the plan.
pepijnve
left a comment
There was a problem hiding this comment.
If possible, I think it would help to revert the 'internal' portions of this PR. That would reduce the number of changes and improve the signal/noise ratio.
| - internal aggregate aliases may now show the underlying expression instead of | ||
| only the alias name | ||
|
|
||
| If you have tooling that parses the `aggr=[...]` text from physical `EXPLAIN`, |
There was a problem hiding this comment.
Are people actually doing this? I guess it might be necessary sometimes, but that seems extremely brittle. The more general is if the explain output is considered part of the stable API of DataFusion or not.
|
|
||
| #[doc(hidden)] | ||
| pub fn has_aliased_human_display(&self) -> bool { | ||
| self.human_display_is_aliased |
There was a problem hiding this comment.
Could this be a derived property of human_display or is it necessary for reliability to make it explicit? Just wondering if it's worth memoizing this eagerly.
There was a problem hiding this comment.
I kept it explicit because I was thinking of it as display metadata.
We could derive it by checking whether human_display ends with as <name>, but that puts us back into parsing display strings. The explicit flag lets the builder validate the invariant once, and the formatting/reverse code can use it without guessing.
What do you think?
There was a problem hiding this comment.
The downside is that I can now create a conflicting situation where the display string is aliased but the boolean indicates it isn't and vice versa. strip_alias_suffix is already kind of parsing the display string as well, so maybe the fact that we need to inspect the string value isn't that big of a problem?
The main reason I asked is because the addition of this field triggers a protobuf change as well and I was wondering if that was desirable.
There was a problem hiding this comment.
One solution to this could be to make human_display (String, Option<String>) (or an equivalent explicit type). In other words make the alias an explicit and optional part. That way you wouldn't have to strip it either.
There was a problem hiding this comment.
Let me try this and see how it turn out.
f25fb1a to
1275f19
Compare
|
There are few more cleanups left. |
pepijnve
left a comment
There was a problem hiding this comment.
For me these plan format changes are exactly what I was hoping for and I think we were able to eliminate some rough edges in the code along the way. At this point we probably need input from a maintainer.
| 02)--SortExec: expr=[l_returnflag@0 ASC NULLS LAST, l_linestatus@1 ASC NULLS LAST], preserve_partitioning=[true] | ||
| 03)----ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, sum(lineitem.l_quantity)@2 as sum_qty, sum(lineitem.l_extendedprice)@3 as sum_base_price, sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, avg(lineitem.l_quantity)@6 as avg_qty, avg(lineitem.l_extendedprice)@7 as avg_price, avg(lineitem.l_discount)@8 as avg_disc, count(Int64(1))@9 as count_order] | ||
| 04)------AggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[sum(lineitem.l_quantity), sum(lineitem.l_extendedprice), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), avg(lineitem.l_quantity), avg(lineitem.l_extendedprice), avg(lineitem.l_discount), count(Int64(1))] | ||
| 04)------AggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[sum(lineitem.l_quantity), sum(lineitem.l_extendedprice), sum(__common_expr_1) as sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), sum(__common_expr_1 * Some(1),20,0 + lineitem.l_tax) as sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), avg(lineitem.l_quantity), avg(lineitem.l_extendedprice), avg(lineitem.l_discount), count(Int64(1))] |
There was a problem hiding this comment.
Is this intentional? The __comon_expr_ part seems like unfortunate noise. Same with Some(1). I think what we'd want is sum(lineitem.l_extendedprice * 1 - lineitem.l_discount)
There was a problem hiding this comment.
In the current state of this MR the __common_expr_1 part is certainly intentional. My argumentation above was that this exposes the fact that line 4 is not actually executing sum(lineitem.l_extendedprice * 1 - lineitem.l_discount), that's the logical view of things. Instead sum(__common_expr_1) is being computed and __common_expr_1 is calculated in line 7. For the physical plan I don't think you would want to obscure this.
The Some(1) bit is unfortunate indeed, but I think that's a separate issue. The physical explain output doesn't format literals as nicely as the logical plan output does.
There was a problem hiding this comment.
The
Some(1)bit is unfortunate indeed, but I think that's a separate issue. The physical explain output doesn't format literals as nicely as the logical plan output does.
Yeah I noticed this but left as is for now. @adriangb what's your overall thoughts on the current approach.
There was a problem hiding this comment.
Based on @adriangb's comment I think this change may be controversial. Do we want more input on this?
There was a problem hiding this comment.
Fwiw: I don't think it's a big deal either way. There's always going to be some tension between verbosity and readability. Maybe we need a flag for this (#21768 ?) or maybe if no one complains it's fine. Either way, I don't think there's a strong contract we promise with this, we can always revert the change.
| name: Option<String>, | ||
| human_displan: String, | ||
| human_display: Option<String>, | ||
| human_display_alias: Option<String>, |
There was a problem hiding this comment.
I explored this change a bit with Claude and came up with what I think is a good proposal. Here's the transcript / summary.
One concern about the public API surface here: this changes the signature of create_aggregate_expr_with_name_and_maybe_filter (a pub fn) by adding a new human_display_alias: Option<String> parameter and changing human_display from String to Option<String>. That's a silent break for any downstream code that builds aggregate physical expressions through this entry point — and it isn't called out in docs/source/library-user-guide/upgrading/54.0.0.md.
Stepping back: I think the existence of create_aggregate_expr_with_name_and_maybe_filter and create_aggregate_expr_and_maybe_filter as parallel public free functions is already a smell. They're really one thing — "lower a logical aggregate Expr into a physical AggregateFunctionExpr plus its filter and order-by" — split awkwardly across two signatures. We already have AggregateExprBuilder for the pure physical-construction half. What's missing is a builder for the logical→physical lowering half.
Could we kill two birds with one stone in this PR by introducing a LoweredAggregateBuilder?
// datafusion-physical-expr/src/aggregate.rs
pub struct LoweredAggregateBuilder<'a> { /* expr + schemas + execution_props + overrides */ }
pub struct LoweredAggregate {
pub aggregate: Arc<AggregateFunctionExpr>,
pub filter: Option<Arc<dyn PhysicalExpr>>,
pub order_bys: Vec<PhysicalSortExpr>,
}
impl<'a> LoweredAggregateBuilder<'a> {
pub fn new(expr: &'a Expr, logical_schema: &'a DFSchema, physical_schema: &'a Schema, execution_props: &'a ExecutionProps) -> Self;
pub fn with_name(mut self, name: impl Into<String>) -> Self;
pub fn build(self) -> Result<LoweredAggregate>;
}Internally, build() does what these two functions do today: unwrap aliases, derive name / human_display / human_display_alias, lower args / filter / order-by via the existing create_physical_* helpers, then hand off to AggregateExprBuilder. All the new aliased-display logic from this PR lives here.
The proposed migration shape:
- Add
LoweredAggregateBuildernext toAggregateExprBuilderindatafusion-physical-expr/src/aggregate.rs(thecreate_physical_*helpers it needs already live in that crate, so no new dependencies). - Revert the signature change to
create_aggregate_expr_with_name_and_maybe_filter. Keep both free functions on their pre-PR signatures. - Mark them
#[deprecated(note = "use LoweredAggregateBuilder")]. They become thin delegations — the deprecated_with_name_variant passes its singlehuman_displaystraight through with no alias, matching pre-PR behavior. The new aliased-display path is only reachable viaLoweredAggregateBuilder. - Migrate the planner's own call site (the only in-tree caller) over to
LoweredAggregateBuilderin this same PR, so deprecation warnings don't fire on internal code.
Why I'm suggesting this in this PR rather than a follow-up: this PR is already breaking the public surface. If we land it as-is and clean up later, downstream users pay the migration cost twice — once now for the new parameter, once again when we deprecate the function. Doing it in one shot means:
- Zero public signature breaks (
AggregateFunctionExpr::human_display() -> Option<&str>is still a break and still needs the upgrade-guide entry, but the free-function break disappears). - The upgrade guide gets a "prefer
LoweredAggregateBuilder" pointer instead of a parameter-list diff. - Future additions to aggregate construction stop touching public function signatures.
I realize this enlarges the PR and is partly scope creep on what was supposed to be an EXPLAIN fix — happy to help with the refactor (or pair with a committer who can) if it would be useful. What do you think?
There was a problem hiding this comment.
If this makes sense it might be nice to break out as it's own PR (at least introducing LoweredAggregateBuilder
adriangb
left a comment
There was a problem hiding this comment.
I don't have strong opinions on the new display vs. old display. There's always going to be a tradeoff between pretty vs. has all of the info / reflects the ugly reality. More importantly we don't have a strict API contract with that, we can change it and later revert or tweak if there are complaints.
I flagged the one public API change I saw, I think we should discuss that before merging.
|
I'm going to try to do this and see how it looks:
|
adriangb
left a comment
There was a problem hiding this comment.
Some minor nits but overall LGTM!
| self | ||
| } | ||
|
|
||
| #[doc(hidden)] |
| } | ||
| } | ||
|
|
||
| pub fn with_name(mut self, name: impl Into<String>) -> Self { |
There was a problem hiding this comment.
We should add docstrings to all of these. Ideally I can read the docstrings on these functions / structs with general DataFusion knowledge but without fully understanding how planning works and get some idea of what these do.
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod tests { |
There was a problem hiding this comment.
I feel like we can probably port other unit tests to test the new builder directly? It seems like a good unit of code to unit test. Can be a followup / when we remove the old methods, but especially any tests testing the now deprecated methods are good candidates.
|
@pepijnve would you like to take another look |
| self.expr.hash(state); | ||
| self.relation.hash(state); | ||
| self.name.hash(state); | ||
| self.metadata.hash(state); |
There was a problem hiding this comment.
Does including metadata in equality and hash calculations risk any unintended side effects? If it's not required for the plan change itself, it might be safer to not include this in this PR.
|
|
||
| fn encode_human_display_alias(human_display: &str, alias: &str) -> String { | ||
| format!( | ||
| "{HUMAN_DISPLAY_ALIAS_PREFIX}{}:{alias}{human_display}", |
There was a problem hiding this comment.
An alternative here that may be kinder to other consumers would be to encode using {human_display} as {alias} kind of like the code was doing before. The split logic is less robust, but maybe good enough?
| 02)--SortExec: expr=[l_returnflag@0 ASC NULLS LAST, l_linestatus@1 ASC NULLS LAST], preserve_partitioning=[true] | ||
| 03)----ProjectionExec: expr=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus, sum(lineitem.l_quantity)@2 as sum_qty, sum(lineitem.l_extendedprice)@3 as sum_base_price, sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount)@4 as sum_disc_price, sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax)@5 as sum_charge, avg(lineitem.l_quantity)@6 as avg_qty, avg(lineitem.l_extendedprice)@7 as avg_price, avg(lineitem.l_discount)@8 as avg_disc, count(Int64(1))@9 as count_order] | ||
| 04)------AggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[sum(lineitem.l_quantity), sum(lineitem.l_extendedprice), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), avg(lineitem.l_quantity), avg(lineitem.l_extendedprice), avg(lineitem.l_discount), count(Int64(1))] | ||
| 04)------AggregateExec: mode=FinalPartitioned, gby=[l_returnflag@0 as l_returnflag, l_linestatus@1 as l_linestatus], aggr=[sum(lineitem.l_quantity), sum(lineitem.l_extendedprice), sum(__common_expr_1) as sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount), sum(__common_expr_1 * Some(1),20,0 + lineitem.l_tax) as sum(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount * Int64(1) + lineitem.l_tax), avg(lineitem.l_quantity), avg(lineitem.l_extendedprice), avg(lineitem.l_discount), count(Int64(1))] |
There was a problem hiding this comment.
Based on @adriangb's comment I think this change may be controversial. Do we want more input on this?
|
@kumarUjjawal if you wanted to reduce the diff it might be nice to split #21739 (comment) into a refactor only PR since it's mostly addressing an existing code smell. |

Which issue does this PR close?
Rationale for this change
Physical explain output only showed the alias for aliased aggregates. That made it hard to understand the plan, especially when the aggregate had a filter, explicit RESPECT NULLS, or a custom UDAF display.
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
Yes.
This is a public API change for users who build or pattern match Alias directly. The upgrade guide has been updated with the needed changes.