Add StatisticsContext parameter to partition_statistics#21815

Open
asolimando wants to merge 8 commits into
apache:mainfrom
asolimando:asolimando/partition-statistics-context

Conversation

@asolimando
Member

@asolimando asolimando commented Apr 23, 2026

Which issue does this PR close?

Closes #20184

Rationale for this change

ExecutionPlan::partition_statistics forces each operator to re-fetch child statistics internally, causing redundant subtree walks in deep plans.

What changes are included in this PR?

  • Deprecate partition_statistics in favor of statistics_with_args(&self, args: &StatisticsArgs), an extensible signature that won't require downstream churn when new parameters are added
  • StatisticsArgs carries the partition index and a shared per-call StatsCache, eliminating redundant subtree walks within a single compute_statistics call
  • Child stats are pre-computed with partition=None and cached; operators look them up via args.child_stats_of(child) (overall) or args.child_stats_for(child) (partition-aware)
  • Criterion micro-benchmark on three plan shapes from [EPIC] Improve query planning speed #19795
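
To make the difference between the two lookups concrete, here is a minimal sketch under simplifying assumptions. `Stats`, `Plan`, `Scan`, `Passthrough`, and `Coalesce` are hypothetical toy types (a single row count instead of the real `Statistics`), and the cache layout is a guess for illustration, not the code in this PR.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

/// Toy stand-in for `Statistics`: just a row count.
#[derive(Debug, Clone, PartialEq)]
pub struct Stats {
    pub num_rows: usize,
}

/// Toy stand-in for `ExecutionPlan` (hypothetical, heavily simplified).
pub trait Plan {
    fn statistics_with_args(&self, args: &StatisticsArgs<'_>) -> Arc<Stats>;
}

/// Per-call cache keyed by (node address, partition).
#[derive(Default)]
pub struct StatsCache {
    entries: RefCell<HashMap<(usize, Option<usize>), Arc<Stats>>>,
}

/// Arguments handed to each operator: requested partition plus the shared cache.
pub struct StatisticsArgs<'a> {
    partition: Option<usize>,
    cache: &'a StatsCache,
}

impl StatisticsArgs<'_> {
    /// Overall (partition = None) statistics of `child`.
    pub fn child_stats_of(&self, child: &Arc<dyn Plan>) -> Arc<Stats> {
        compute(child, None, self.cache)
    }

    /// Statistics of `child` for the partition this call was made with.
    pub fn child_stats_for(&self, child: &Arc<dyn Plan>) -> Arc<Stats> {
        compute(child, self.partition, self.cache)
    }
}

/// Memoized lookup: each (node, partition) pair is computed at most once.
fn compute(plan: &Arc<dyn Plan>, partition: Option<usize>, cache: &StatsCache) -> Arc<Stats> {
    let key = (Arc::as_ptr(plan) as *const () as usize, partition);
    let hit = cache.entries.borrow().get(&key).cloned();
    if let Some(stats) = hit {
        return stats;
    }
    let stats = plan.statistics_with_args(&StatisticsArgs { partition, cache });
    cache.entries.borrow_mut().insert(key, Arc::clone(&stats));
    stats
}

/// Entry point: one fresh cache per call.
pub fn compute_statistics(plan: &Arc<dyn Plan>, partition: Option<usize>) -> Arc<Stats> {
    compute(plan, partition, &StatsCache::default())
}

/// Leaf with a known row count per partition.
pub struct Scan(pub Vec<usize>);
impl Plan for Scan {
    fn statistics_with_args(&self, args: &StatisticsArgs<'_>) -> Arc<Stats> {
        let num_rows = match args.partition {
            Some(p) => self.0[p],
            None => self.0.iter().sum(),
        };
        Arc::new(Stats { num_rows })
    }
}

/// Partition-preserving operator: asks for the child's stats at the same partition.
pub struct Passthrough(pub Arc<dyn Plan>);
impl Plan for Passthrough {
    fn statistics_with_args(&self, args: &StatisticsArgs<'_>) -> Arc<Stats> {
        args.child_stats_for(&self.0)
    }
}

/// Partition-merging operator: always needs the child's overall stats.
pub struct Coalesce(pub Arc<dyn Plan>);
impl Plan for Coalesce {
    fn statistics_with_args(&self, args: &StatisticsArgs<'_>) -> Arc<Stats> {
        args.child_stats_of(&self.0)
    }
}
```

In this toy model, a `Coalesce` queried at `Some(0)` still reports the sum over all child partitions, while a `Passthrough` queried at `Some(1)` reports only that partition's rows.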

Tests

Existing tests pass unchanged. New unit test verifies the caching contract.

Test plan

  • cargo fmt --all
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --profile ci --all-features on affected crates
  • Criterion benchmark: ~26x (coalesce chain), ~5x (cross-join tree), ~25x (filter chain) speedup

Disclaimer: I used AI to assist in the code generation. I have manually reviewed the output, and it matches my intention and understanding.

@github-actions github-actions Bot added the documentation, optimizer, core, datasource, and physical-plan labels Apr 23, 2026
@asolimando
Member Author

Hi @xudong963, I have opened the PR as a prerequisite for #21122, as discussed.

This is a breaking change, so I added a section under .../library-user-guide/upgrading/54.0.0.md. I checked what usually goes there, but I'd appreciate it if you could take a closer look and confirm that I captured what's expected for the upgrade guide.

Looking forward to your feedback!

@xudong963
Member

@asolimando thanks, I'll review it next Monday! /cc @jonathanc-n

@asolimando
Member Author

Gentle reminder @xudong963 :)

Member

@xudong963 xudong963 left a comment

@asolimando thanks! I'm sorry that I'm busy with others this week.

This PR doesn't fully solve the problem it claims to. The stated goal in the PR description and #20184 is to eliminate exponential recomputation. But for any plan containing a CoalescePartitionsExec, SortPreservingMergeExec, RepartitionExec, HashJoinExec (CollectLeft/Auto), CrossJoinExec, or NestedLoopJoinExec — which is most non-trivial plans — the operator restarts a fresh bottom-up walk from inside its own partition_statistics IIUC. So the recomputation isn't gone;

Caching sounds good. How about making caching part of StatisticsContext from day one? Then we can run some benchmarks to show the gains, which will make it easier for the community to accept the PR. WDYT?

///
/// [`StatisticsContext`]: crate::statistics_context::StatisticsContext
/// [`compute_statistics`]: crate::statistics_context::compute_statistics
fn partition_statistics(
Member

Member Author

Noted and I will make sure to keep both APIs in the future! I will address this in the next iteration on the code and will resolve the discussion at that point.

let child_stats = plan
    .children()
    .iter()
    .map(|child| compute_statistics(child.as_ref(), partition))
Member

compute_statistics always recurses with the same partition. For partition-merging operators this is wasted work, because they discard the context and recompute with None anyway.

@asolimando
Member Author

Thank you for your input @xudong963, no need to apologize, it's understandable!

You raise a fair point: we fully avoid the recomputation only for linear plans, and operators that call compute_statistics(child, None) internally don't benefit. This is noted in the "What remains for follow-up" section, but I agree it might not be enough for the first iteration, and in any case I should have marked the PR as "partially closes #20184".

Re. the cache, I identified the need for the StatisticsRegistry already, and we discussed with @kosiew in the related PR (#21483, comment, branch asolimando/statistics-planner-with-statscache-v2). We agreed to defer it to limit scope, but this is the right place to discuss it.

One limitation I identified in the StatsCache (as I called it there) concerns the cache key. The key should identify an ExecutionPlan, which has no stable ID other than its memory address, so the cache key is effectively (Arc::as_ptr, partition). My concern is that a node could be dropped and its address re-used.

Cache lifecycle/scope:

  1. single invocation of compute_statistics (as described in Let partition_statistics accept pre-computed children statistics #20184): if we agree on this, then the concern is not valid, as the plan tree is "stable" during the lifetime. When e.g. CoalescePartitionsExec calls compute_statistics(child, None) internally, the cache already has the subtree results, fully eliminating redundant walks.

  2. multiple invocations of compute_statistics (same rule or cross-rules): here we necessarily need a stable node ID and we can't rely on the pointer, since nodes can be dropped/recreated

The scope of #20184 is, in my understanding, option 1 (single walk). If you agree, I plan to use (Arc::as_ptr, partition) as the cache key. Introducing node IDs and extending the cache lifetime can IMO be tackled as a follow-up (I can create issues for that if the direction is confirmed), since this solution should already bring computational benefits.
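
As a rough illustration of why a pointer-based key is acceptable for option 1, the sketch below ties the cache's lifetime to a borrow of the plan root. `Node`, `Op`, `cache_key`, and `PerCallCache` are hypothetical names invented for this example; only `Arc::as_ptr` is a real standard-library call.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Toy plan node (hypothetical stand-in for `ExecutionPlan`).
pub trait Node {
    fn name(&self) -> &str;
}

pub struct Op(pub &'static str);

impl Node for Op {
    fn name(&self) -> &str {
        self.0
    }
}

/// (node address, partition): the address is only meaningful while the node is alive.
pub type CacheKey = (usize, Option<usize>);

pub fn cache_key(node: &Arc<dyn Node>, partition: Option<usize>) -> CacheKey {
    // Drop the vtable half of the fat pointer and keep the data address.
    (Arc::as_ptr(node) as *const () as usize, partition)
}

/// Cache that borrows the plan root for its whole lifetime. While the borrow is
/// held, the root cannot be dropped, so (barring interior mutability in the
/// tree) the addresses used as keys cannot be recycled.
pub struct PerCallCache<'plan> {
    _root: &'plan Arc<dyn Node>,
    rows: HashMap<CacheKey, usize>,
}

impl<'plan> PerCallCache<'plan> {
    pub fn new(root: &'plan Arc<dyn Node>) -> Self {
        Self { _root: root, rows: HashMap::new() }
    }

    /// Returns the cached row count, computing it with `f` only on a miss.
    pub fn get_or_compute(
        &mut self,
        node: &Arc<dyn Node>,
        partition: Option<usize>,
        f: impl FnOnce() -> usize,
    ) -> usize {
        *self.rows.entry(cache_key(node, partition)).or_insert_with(f)
    }
}
```

Two clones of the same `Arc` produce the same key, while two live nodes never do. A cache that outlives the plan (option 2) loses that guarantee, which is why it would need stable node IDs.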

Re. benchmarks, do you have a specific workload in mind (e.g., TPC-DS, Q99)? Also, could I be added to the allowlist to trigger benchmark runs so I can iterate without requiring manual re-runs, in case I need multiple iterations?

WDYT?

@xudong963
Member

Thanks for the thoughtful response @asolimando — the framing is exactly right, and the prior discussion with @kosiew in #21483 is helpful context.

On scope: agreed, let's land per-call caching in this PR (your Option 1) and treat cross-call caching with stable node IDs as a follow-up. Could you open an issue for Option 2 so we don't lose track?

On the cache key: (Arc::as_ptr, partition) is safe within a single synchronous compute_statistics walk — the Arcs are held by the plan tree and can't be dropped during the call, so pointer reuse isn't a concern. Good call.

On benchmarks: I'd avoid full TPC-DS Q99 — statistics computation is a small fraction of total query time and will get lost in noise. A targeted micro-bench is more informative:

  • Build a deeply nested plan (e.g., a 10+ deep UnionExec chain, or a chain of hash joins + repartitions) and time compute_statistics(plan, None) before/after this PR.
  • Optionally reuse a reproducer from [EPIC] Improve query planning speed #19795 (planning-speed EPIC) since deep plans are exactly that issue's pain point.

That should cleanly demonstrate the gain.
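
Before timing anything, the expected shape of the gain can be estimated by counting node visits. The sketch below is a toy model, not the Criterion benchmark: it assumes a linear chain in which every operator's child is walked twice (once by the framework, once by the operator's own internal re-walk), and it uses the node's depth as a stand-in for its cache key. All function names are hypothetical.

```rust
use std::collections::HashMap;

/// Walk a chain of `depth` operators above a leaf with no cache.
/// Every operator causes its child subtree to be walked twice.
pub fn walk_uncached(depth: usize, visits: &mut u64) -> u64 {
    *visits += 1;
    if depth == 0 {
        return 1_000; // leaf row count (arbitrary)
    }
    // The framework pre-computes the child's statistics...
    let _precomputed = walk_uncached(depth - 1, visits);
    // ...and the operator restarts its own walk with partition = None.
    walk_uncached(depth - 1, visits)
}

/// Same walk, but results are memoized per node for the duration of the call.
pub fn walk_cached(depth: usize, cache: &mut HashMap<usize, u64>, visits: &mut u64) -> u64 {
    if let Some(&rows) = cache.get(&depth) {
        return rows; // cache hit: no visit counted
    }
    *visits += 1;
    let rows = if depth == 0 {
        1_000
    } else {
        let _precomputed = walk_cached(depth - 1, cache, visits);
        walk_cached(depth - 1, cache, visits)
    };
    cache.insert(depth, rows);
    rows
}

/// Returns (visits without cache, visits with cache) for a chain of `depth`.
pub fn count_visits(depth: usize) -> (u64, u64) {
    let mut uncached = 0;
    walk_uncached(depth, &mut uncached);
    let mut cached = 0;
    walk_cached(depth, &mut HashMap::new(), &mut cached);
    (uncached, cached)
}
```

Under these assumptions the uncached walk visits 2^(depth+1) - 1 nodes and the cached walk visits depth + 1, so `count_visits(10)` yields `(2047, 11)`. Real operators do less redundant work than this worst case, so measured speedups will be smaller.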

@asolimando
Member Author

Thanks for the confirmation and the clarifications. I will hopefully get to it early next week, and I will ping you as soon as I have updates!

@asolimando asolimando force-pushed the asolimando/partition-statistics-context branch from e135e8a to a8a3d6c Compare May 3, 2026 18:48
@github-actions

github-actions Bot commented May 3, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  76.116s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  75.549s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.031s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.688s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 154.546s] datafusion
    Building datafusion-datasource v53.1.0 (current)
       Built [  33.508s] (current)
     Parsing datafusion-datasource v53.1.0 (current)
      Parsed [   0.027s] (current)
    Building datafusion-datasource v53.1.0 (baseline)
       Built [  33.404s] (baseline)
     Parsing datafusion-datasource v53.1.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-datasource v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.295s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  68.725s] datafusion-datasource
    Building datafusion-physical-optimizer v53.1.0 (current)
       Built [  34.199s] (current)
     Parsing datafusion-physical-optimizer v53.1.0 (current)
      Parsed [   0.020s] (current)
    Building datafusion-physical-optimizer v53.1.0 (baseline)
       Built [  34.057s] (baseline)
     Parsing datafusion-physical-optimizer v53.1.0 (baseline)
      Parsed [   0.020s] (baseline)
    Checking datafusion-physical-optimizer v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.130s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  69.840s] datafusion-physical-optimizer
    Building datafusion-physical-plan v53.1.0 (current)
       Built [  30.050s] (current)
     Parsing datafusion-physical-plan v53.1.0 (current)
      Parsed [   0.117s] (current)
    Building datafusion-physical-plan v53.1.0 (baseline)
       Built [  30.215s] (baseline)
     Parsing datafusion-physical-plan v53.1.0 (baseline)
      Parsed [   0.116s] (baseline)
    Checking datafusion-physical-plan v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.631s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure trait_method_marked_deprecated: trait method #[deprecated] added ---

Description:
A trait method is now #[deprecated]. Downstream crates will get a compiler warning when using this method.
        ref: https://doc.rust-lang.org/reference/attributes/diagnostics.html#the-deprecated-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/trait_method_marked_deprecated.ron

Failed in:
  method partition_statistics in trait datafusion_physical_plan::execution_plan::ExecutionPlan in /home/runner/work/datafusion/datafusion/datafusion/physical-plan/src/execution_plan.rs:97
  method partition_statistics in trait datafusion_physical_plan::ExecutionPlan in /home/runner/work/datafusion/datafusion/datafusion/physical-plan/src/execution_plan.rs:97

     Summary semver requires new minor version: 0 major and 1 minor checks failed
    Finished [  62.744s] datafusion-physical-plan
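
The `trait_method_marked_deprecated` failure reported above is the expected consequence of keeping the old method while steering implementors to the new one. A minimal sketch of that pattern follows, assuming the new method's default body forwards to the old one. `Operator`, `Args`, `LegacyScan`, and `NewScan` are hypothetical names, and the return type is simplified to a row count.

```rust
/// Hypothetical, simplified argument struct.
pub struct Args {
    pub partition: Option<usize>,
}

pub trait Operator {
    /// Old entry point, kept so existing implementations still compile.
    #[deprecated(note = "implement `statistics_with_args` instead")]
    fn partition_statistics(&self, _partition: Option<usize>) -> usize {
        0 // stand-in for "unknown statistics"
    }

    /// New entry point. The default forwards to the old method, so an
    /// implementation that only overrides `partition_statistics` keeps working.
    #[allow(deprecated)]
    fn statistics_with_args(&self, args: &Args) -> usize {
        self.partition_statistics(args.partition)
    }
}

/// Downstream operator written against the old API.
pub struct LegacyScan;

impl Operator for LegacyScan {
    fn partition_statistics(&self, _partition: Option<usize>) -> usize {
        42
    }
}

/// Operator migrated to the new API.
pub struct NewScan;

impl Operator for NewScan {
    fn statistics_with_args(&self, args: &Args) -> usize {
        match args.partition {
            Some(_) => 7,
            None => 21,
        }
    }
}
```

Callers that switch to `statistics_with_args` get correct results from both kinds of implementation, while code that still calls `partition_statistics` directly sees a deprecation warning, which is what the lint flags as requiring a minor version bump.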

@asolimando
Member Author

Hey @xudong963, I've pushed new commits implementing what we discussed (force-pushed to rebase on latest main, but the first two commits (f36ef32, 12a2fc1) are unchanged from the previous push).

A walkthrough of the new commits:

  • f36ef32 adds StatisticsContext parameter to partition_statistics, keeping the old method as deprecated, as required by the API health guidelines
  • b380893 adds partition_statistics_with_context as the new entry point
  • bb09951 adds StatsCache to StatisticsContext, shared across the entire compute_statistics walk
  • 2f843ef adds a Criterion micro-benchmark on two plan shapes from [EPIC] Improve query planning speed #19795:
    • CoalescePartitionsExec chain (depth 50): ~25x speedup over non-shared-cache baseline
    • CrossJoinExec binary tree (depth 7, 128 leaves): ~3x speedup, mirrors physical_many_self_joins from sql_planner.rs
  • a8a3d6c addresses the wasted partition forwarding: compute_statistics_inner now always pre-computes children with partition=None. Partition-preserving operators request per-partition stats on demand via compute_child_statistics, so partition-merging operators use child_stats() directly instead of triggering re-walks

Re. the benchmark: the numbers are the average of 5 local runs, and they are conservative. The baseline still benefits from an ephemeral per-walk cache within each re-walk; the true baseline would be no caching at all, which would show a larger gap. Since this benchmark is new, I couldn't find a better way to show a before/after run. The improvement is clear anyway, but I wanted to mention it for completeness.

I will open a follow-up issue for cross-call caching with stable node IDs (Option 2) once this lands. Since StatsCache does not exist anywhere yet, I am afraid the issue would be confusing if filed now.

Looking forward to your review!

@asolimando
Member Author

(rebased on latest main for conflict resolution, mechanical fixes only)

Comment thread datafusion/physical-plan/src/filter.rs Outdated
Comment on lines +585 to +590
let input_stats = match partition {
    Some(_) => Arc::unwrap_or_clone(
        ctx.compute_child_statistics(self.input.as_ref(), partition)?,
    ),
    None => Arc::unwrap_or_clone(Arc::clone(&ctx.child_stats()[0])),
};
Member

Per-operator boilerplate is repetitive and bug-prone. Almost every partition-preserving operator now contains:

let stats = match partition {
    Some(_) => ctx.compute_child_statistics(self.input.as_ref(), partition)?,
    None => Arc::clone(&ctx.child_stats()[0]),
};

This should be a helper on the context: ctx.child_stats_for(0, self.input.as_ref(), partition) or similar. Five identical match blocks across FilterExec, CoalesceBatchesExec, BufferExec, CooperativeExec, OutputRequirementExec is five places to make the same mistake when the contract evolves.
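
One possible shape for such a helper is sketched here, purely as an illustration. `Ctx`, its fields, and the `compute_child` callback (standing in for `compute_child_statistics`) are hypothetical, and `Stats` carries only a row count.

```rust
use std::sync::Arc;

/// Toy statistics: a row count only.
#[derive(Debug, PartialEq)]
pub struct Stats {
    pub num_rows: usize,
}

/// Hypothetical context: overall stats per child, pre-computed by the
/// framework, plus a callback standing in for `compute_child_statistics`.
pub struct Ctx<'a> {
    child_stats: Vec<Arc<Stats>>,
    compute_child: &'a dyn Fn(usize, usize) -> Arc<Stats>,
}

impl Ctx<'_> {
    /// The single place that encodes the "Some(p) vs None" contract.
    pub fn child_stats_for(&self, child_idx: usize, partition: Option<usize>) -> Arc<Stats> {
        match partition {
            Some(p) => (self.compute_child)(child_idx, p),
            None => Arc::clone(&self.child_stats[child_idx]),
        }
    }
}

/// What the body of a partition-preserving operator shrinks to.
pub fn passthrough_stats(ctx: &Ctx<'_>, partition: Option<usize>) -> Arc<Stats> {
    ctx.child_stats_for(0, partition)
}

/// Small driver: one child with three partitions of 10, 20, and 30 rows.
pub fn demo(partition: Option<usize>) -> usize {
    let per_partition = [10usize, 20, 30];
    let compute = |_child: usize, p: usize| Arc::new(Stats { num_rows: per_partition[p] });
    let ctx = Ctx {
        child_stats: vec![Arc::new(Stats { num_rows: 60 })],
        compute_child: &compute,
    };
    passthrough_stats(&ctx, partition).num_rows
}
```

With the match living in one method, a change to the contract touches `child_stats_for` only, rather than every operator that forwards its input's statistics.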

Contributor

If we added a StatisticsArgs structure as I proposed above, we could perhaps have this as a method on StatisticsArgs

Member Author

Makes total sense and, as suggested, StatisticsArgs proved to be a good location for this. We now have:

  • args.child_stats_for(self.input.as_ref()), which replaces the match block across all partition-preserving operators
  • args.child_stats_of(child), for partition-merging operators

Addressed in bc32cf2

Member

Also worth measuring on a deep FilterExec chain queried at Some(0). The context: if you ask compute_statistics(plan, Some(0)) on a deep filter chain, the framework first walks the entire tree computing None stats, then each filter turns around and asks for Some(0) stats on demand, which triggers another cached walk. The shared cache makes the second walk cheap, but for partition-preserving plans we end up populating both None and Some(p) entries for every node.

Member Author

Covered in bf43bc7: I have added a FilterExec chain at depths 10/20/50. It shows ~2x cost of per-partition vs overall, and ~25x speedup over non-shared-cache baseline at depth 50.

The 2x cost is expected due to the second walk, and as you were anticipating, the cache still makes it cheap enough.
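
The roughly 2x factor can be reproduced by counting cache entries in a toy model of the walk. The sketch assumes a linear chain of partition-preserving operators, uses the node's depth as a stand-in for its address, and is not derived from the benchmark code; `walk` and `cache_entries` are hypothetical names.

```rust
use std::collections::HashSet;

/// (node depth as a stand-in for its address, requested partition)
type Key = (usize, Option<usize>);

/// Visit `node` for `partition`, recording one cache entry per distinct key.
fn walk(node: usize, partition: Option<usize>, seen: &mut HashSet<Key>) {
    if !seen.insert((node, partition)) {
        return; // already cached
    }
    if node == 0 {
        return; // leaf
    }
    // The framework pre-computes the child's overall statistics...
    walk(node - 1, None, seen);
    // ...and the partition-preserving operator asks for its own partition.
    walk(node - 1, partition, seen);
}

/// Number of cache entries after one call on a chain of `depth` operators.
pub fn cache_entries(depth: usize, partition: Option<usize>) -> usize {
    let mut seen = HashSet::new();
    walk(depth, partition, &mut seen);
    seen.len()
}
```

For a chain of depth 50, an overall query fills 51 entries and a `Some(0)` query fills 101 in this model: every node below the root gets both a `None` and a `Some(0)` entry.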

@xudong963 xudong963 requested review from alamb and gabotechs May 11, 2026 06:59
@alamb
Contributor

alamb commented May 12, 2026

Thanks for the ping -- I will try and review this shortly. I am totally swamped trying to review multiple 1000+ line PRs (and trying to give them thoughtful reviews and understand the implications)

Contributor

@alamb alamb left a comment

Thank you @asolimando and @xudong963 -- this is looking like good progress. I left some thoughts.

Ok(Arc::new(Statistics::new_unknown(&self.schema())))
}

/// Returns statistics for a specific partition of this `ExecutionPlan` node.
Contributor

If we are going to add (yet another) statistics API, I suggest we add one that will be easier to extend in the future -- specifically one to which we can add new parameters without a major API change.

For example, what do you think about something like

trait ExecutionPlan {
    // ...
    fn statistics_with_args(
        &self,
        args: StatisticsArgs<'_>,
    ) -> Result<Arc<Statistics>>; // ...
    // ...
}

/// Arguments passed to an [`ExecutionPlan::statistics_with_args`] call
struct StatisticsArgs<'a> {
    partition: Option<usize>,
    // A borrowed field needs an explicit lifetime on the struct.
    ctx: &'a StatisticsContext,
}

That way we can add new parameters without major downstream churn again. This is similar to what we have done with other APIs like call_with_args

pub struct TableFunctionArgs<'e, 's> {
    /// Call arguments.
    exprs: &'e [Expr],
    /// Session within which the function is called.
    session: &'s dyn Session,
}
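
To illustrate why an argument struct limits downstream churn, here is a small sketch in the same spirit. It assumes private fields with a constructor and accessors; `StatisticsContext` is reduced to a placeholder, `Operator` and `Scan` are hypothetical, and the return type is simplified to a row count.

```rust
/// Placeholder for the shared context (its contents do not matter here).
#[derive(Default)]
pub struct StatisticsContext {
    pub default_rows: usize,
}

/// Argument bundle with private fields: callers go through `new` and the
/// accessors, so fields can be added later without changing any signature.
pub struct StatisticsArgs<'a> {
    partition: Option<usize>,
    ctx: &'a StatisticsContext,
}

impl<'a> StatisticsArgs<'a> {
    pub fn new(ctx: &'a StatisticsContext) -> Self {
        Self { partition: None, ctx }
    }

    pub fn with_partition(mut self, partition: Option<usize>) -> Self {
        self.partition = partition;
        self
    }

    pub fn partition(&self) -> Option<usize> {
        self.partition
    }

    pub fn ctx(&self) -> &'a StatisticsContext {
        self.ctx
    }
}

pub trait Operator {
    fn statistics_with_args(&self, args: StatisticsArgs<'_>) -> usize;
}

pub struct Scan {
    pub rows_per_partition: Vec<usize>,
}

impl Operator for Scan {
    fn statistics_with_args(&self, args: StatisticsArgs<'_>) -> usize {
        match args.partition() {
            // Fall back to the context's default for an unknown partition.
            Some(p) => self
                .rows_per_partition
                .get(p)
                .copied()
                .unwrap_or(args.ctx().default_rows),
            None => self.rows_per_partition.iter().sum(),
        }
    }
}
```

If a new parameter is needed later, it becomes another private field with a `with_*` builder and an accessor, and neither the trait signature nor existing `Operator` implementations have to change.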

Member Author

Thanks a lot @alamb, I was aiming at something similar with StatisticsContext, having already in mind the integration of ExpressionAnalyzer and StatisticsRegistry from #21122 and #21483, respectively, but your proposal reads better.

I hope I have captured your suggestion correctly in bc32cf2.

use std::rc::Rc;
use std::sync::Arc;

/// Per-call memoization cache for [`compute_statistics`].
Contributor

this is nice

/// [`ExecutionPlan::partition_statistics_with_context`] with the pre-computed
/// child statistics.
///
/// Results are memoized within a single call: operators that internally
Contributor

I think putting this caching mechanism into StatisticsArgs could potentially make the API cleaner and more encapsulated

Member Author

Done in bc32cf2

pub struct StatisticsContext {
    /// Pre-computed statistics for each child of the current node,
    /// in the same order as [`ExecutionPlan::children`].
    child_stats: Vec<Arc<Statistics>>,
Contributor

If we have a StatsCache maybe we could always use that (rather than also having Vec of statistics) 🤔

Member Author

I have implemented the suggestion in 53bbf5e. I think it was the right call, but I have left it as a separate commit so it's simpler to check.

@asolimando
Member Author

Thank you @xudong963 and @alamb for your feedback and reviews!

I am off until early next week with limited connectivity but I will get back to you soon, here and in related PRs/issues around statistics.

@alamb
Contributor

alamb commented May 14, 2026

Sounds good -- thank you.

It will probably be good timing -- we'll get the 54 release out and then we can add these new APIs in 55

Introduce StatisticsContext that carries pre-computed child statistics
and external context for statistics computation. Change the
ExecutionPlan::partition_statistics signature to accept it, and add
compute_statistics() utility for bottom-up computation with automatic
child stats threading.

Update all ~35 in-tree ExecutionPlan implementations and ~40 call
sites. Passthrough operators return ctx.child_stats() directly,
transform operators use it instead of re-fetching from children,
and operators that always need overall child stats (RepartitionExec,
CoalescePartitionsExec, SortPreservingMergeExec, SortExec non-preserving,
HashJoinExec CollectLeft/Auto, CrossJoinExec, NestedLoopJoinExec)
call compute_statistics with None internally.

Non-breaking change per API health policy: existing impls continue
to work via default delegation. Fixes missed ScalarSubqueryExec.

Memoize results within a single compute_statistics invocation using
pointer-based cache keys. Operators now use ctx.compute_child_statistics
instead of calling compute_statistics directly, so partition-merging
and asymmetric join operators hit the cache for subtrees already walked.

Coalesce chain and cross-join tree (apache#19795) benchmarks comparing
cached vs non-shared-cache statistics computation.

Introduce StatisticsArgs struct combining partition, child_stats, and
cache. Adds child_stats_for helper to eliminate per-operator boilerplate.
Removes StatisticsContext (flattened into StatisticsArgs).

@asolimando asolimando force-pushed the asolimando/partition-statistics-context branch from 3d66565 to 53bbf5e Compare May 19, 2026 18:57
@asolimando
Member Author

@alamb @xudong963, thanks again for your reviews.

I have pushed 3 new commits addressing your latest feedback (force-pushed to rebase on latest main):

  • bc32cf2: StatisticsArgs struct, child_stats_for/child_stats_of helpers, removed StatisticsContext
  • 53bbf5e: child_stats Vec removed, all lookups go through the cache
  • bf43bc7: FilterExec chain benchmark

Benchmark numbers unchanged (~26x coalesce chain, ~5x cross-join tree, ~25x filter chain at depth 50), and the code reads much better with your suggestions.

@asolimando asolimando requested review from alamb and xudong963 May 19, 2026 19:19
@asolimando asolimando force-pushed the asolimando/partition-statistics-context branch from 53bbf5e to b5484eb Compare May 19, 2026 20:10

Labels

  • auto detected api change: Auto detected API change
  • core: Core DataFusion crate
  • datasource: Changes to the datasource crate
  • documentation: Improvements or additions to documentation
  • optimizer: Optimizer rules
  • physical-plan: Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Let partition_statistics accept pre-computed children statistics

3 participants