fix(isthmus): std_dev, variance function mappings#780

Draft
nielspardon wants to merge 3 commits into
substrait-io:mainfrom
nielspardon:par-stddev

@nielspardon nielspardon commented Mar 26, 2026

Member

This PR implements the bidirectional conversion between Calcite and Substrait for the
statistical aggregate functions standard deviation and variance (STDDEV_POP,
STDDEV_SAMP, VAR_POP, VAR_SAMP).

Problem

Substrait uses a single function name for both the population and sample variants:

  • std_dev for both STDDEV_POP and STDDEV_SAMP
  • variance for both VAR_POP and VAR_SAMP

The population (n denominator) vs. sample (n-1 denominator) distinction is captured by a
distribution value (POPULATION or SAMPLE). Previously these functions had no mapping of
their own. Because Calcite represents them with SqlAvgAggFunction, the existing TPC-DS
test cases that use them were silently mis-mapped to AVG.
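To make the n vs. n-1 distinction concrete, here is a minimal Java sketch of the two variance variants and the standard deviations derived from them. The class and method names are hypothetical and are not part of this PR; the sketch also ignores the empty-input and single-row cases, which SQL engines treat specially.

```java
final class VarianceSketch {
    /** Sum of squared deviations from the mean, shared by both variants. */
    static double sumSquaredDeviations(double[] xs) {
        double mean = 0.0;
        for (double x : xs) {
            mean += x;
        }
        mean /= xs.length;

        double sum = 0.0;
        for (double x : xs) {
            sum += (x - mean) * (x - mean);
        }
        return sum;
    }

    /** Population variance (VAR_POP): divide by n. */
    static double varPop(double[] xs) {
        return sumSquaredDeviations(xs) / xs.length;
    }

    /** Sample variance (VAR_SAMP): divide by n - 1. */
    static double varSamp(double[] xs) {
        return sumSquaredDeviations(xs) / (xs.length - 1);
    }

    /** Population standard deviation (STDDEV_POP). */
    static double stddevPop(double[] xs) {
        return Math.sqrt(varPop(xs));
    }

    /** Sample standard deviation (STDDEV_SAMP). */
    static double stddevSamp(double[] xs) {
        return Math.sqrt(varSamp(xs));
    }
}
```

For the input {2, 4, 4, 4, 5, 5, 7, 9} the sum of squared deviations is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7. Both come from the same function family and differ only in the denominator, which is why a single Substrait name plus a distribution value is enough.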

Solution

This uses the non-deprecated signatures introduced in substrait-io/substrait#1011
(Substrait v0.87.0), which carry the distinction as an enum function argument (the
leading distribution argument, e.g. std_dev:req_fp64), rather than the deprecated
function-option form (substrait-io/substrait#1019). Resolves #803.

Since the input arguments are cast to FP64 when necessary, the integer-based signatures
proposed in substrait-io/substrait#1012 are not required.

Calcite → Substrait:

  • Added function mappings in FunctionMappings for all four statistical operators.
  • AggregateFunctionConverter synthesizes the leading distribution enum operand based on
    the Calcite SqlKind, so the generic function matcher resolves the enum-arg variant and
    builds the EnumArg automatically (no bespoke option plumbing).
  • Statistical inputs are cast to FP64 where necessary.
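The operand synthesis described in the list above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `Kind` is a local stand-in for the four relevant Calcite `SqlKind` values, the nested `StatisticalDistribution` is a stand-in for the shared enum, and `withLeadingDistribution` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.List;

final class DistributionOperandSketch {
    /** Local stand-in for the four relevant Calcite SqlKind values (assumption). */
    enum Kind { STDDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP }

    /** Local stand-in for the shared enum; the real class may look different. */
    enum StatisticalDistribution { POPULATION, SAMPLE }

    /** Both variants of a function share one Substrait function name. */
    static String substraitName(Kind kind) {
        switch (kind) {
            case STDDEV_POP:
            case STDDEV_SAMP:
                return "std_dev";
            default:
                return "variance";
        }
    }

    /** The population/sample distinction moves into the distribution value. */
    static StatisticalDistribution distributionFor(Kind kind) {
        switch (kind) {
            case STDDEV_POP:
            case VAR_POP:
                return StatisticalDistribution.POPULATION;
            default:
                return StatisticalDistribution.SAMPLE;
        }
    }

    /** Prepends the synthesized distribution operand to the value operands. */
    static List<Object> withLeadingDistribution(Kind kind, List<?> valueOperands) {
        List<Object> operands = new ArrayList<>();
        operands.add(distributionFor(kind));
        operands.addAll(valueOperands);
        return operands;
    }
}
```

With the distribution in the leading position, a generic matcher sees the same operand shape as the enum-arg signature, so no special-case option handling is needed in this sketch.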

Substrait → Calcite:

  • FunctionConverter.getSqlOperatorFromSubstraitFunc disambiguates the population/sample
    operator from the distribution enum argument.
  • SubstraitRelNodeConverter and PreCalciteAggregateValidator skip the non-value enum
    argument when building Calcite aggregate operands.
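The reverse direction can be sketched in the same spirit. The names below are hypothetical and operators are represented as plain strings rather than Calcite `SqlOperator` instances, so this only illustrates the two ideas in the list above: picking the operator from the distribution value, and dropping the enum argument before building the aggregate operands.

```java
import java.util.ArrayList;
import java.util.List;

final class ReverseMappingSketch {
    /** Local stand-in for the shared enum; the real class may look different. */
    enum StatisticalDistribution { POPULATION, SAMPLE }

    /** Chooses the Calcite operator name from the Substrait name and distribution. */
    static String calciteOperator(String substraitName, StatisticalDistribution distribution) {
        boolean population = distribution == StatisticalDistribution.POPULATION;
        switch (substraitName) {
            case "std_dev":
                return population ? "STDDEV_POP" : "STDDEV_SAMP";
            case "variance":
                return population ? "VAR_POP" : "VAR_SAMP";
            default:
                throw new IllegalArgumentException("not a statistical function: " + substraitName);
        }
    }

    /** Keeps only value arguments; the distribution enum is not an aggregate operand. */
    static List<Object> valueOperands(List<?> arguments) {
        List<Object> values = new ArrayList<>();
        for (Object argument : arguments) {
            if (!(argument instanceof StatisticalDistribution)) {
                values.add(argument);
            }
        }
        return values;
    }
}
```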

DSL & shared enum:

  • A shared StatisticalDistribution enum (in :core) is the single source of truth for the
    SAMPLE / POPULATION values used by both the DSL builder and isthmus.
  • SubstraitBuilder gains stddevPopulation, stddevSample, variancePopulation, and
    varianceSample convenience methods.

Testing

  • AggregationFunctionsTest exercises full round trips (POJO ⇄ proto and Substrait ⇄
    Calcite) for all four functions, with and without grouping.
  • New StatisticalFunctionTest verifies the SQL round trip and asserts that each SQL
    operator maps to the enum-arg signature (std_dev:req_fp64 etc.) with the correct
    distribution EnumArg and no function options.

🤖 Generated with AI

@mbwhite mbwhite left a comment

Contributor

LGTM - it will be good to get these cleaned up.

@bestbeforetoday

Member

For similar special-case conversion handling with scalar functions, we made a special effort to extract the special case handling to separate classes outside of the main ScalarFunctionConverter. Perhaps we should do something similar for special case function handling in AggregateFunctionConverter? See PR #401.

I certainly do not see the implementation in #401 as perfect. Ideally function conversion would be refactored to better support special case function handling. Perhaps by making the building blocks of function and type mapping more easily composable and reusable by different converters. That doesn't need to happen in one step though; more likely a refactor once things are working.

@nielspardon

Member Author

> For similar special-case conversion handling with scalar functions, we made a special effort to extract the special case handling to separate classes outside of the main ScalarFunctionConverter. Perhaps we should do something similar for special case function handling in AggregateFunctionConverter? See PR #401.
>
> I certainly do not see the implementation in #401 as perfect. Ideally function conversion would be refactored to better support special case function handling. Perhaps by making the building blocks of function and type mapping more easily composable and reusable by different converters. That doesn't need to happen in one step though; more likely a refactor once things are working.

I actually started with an AggregateFunctionMapper similar to ScalarFunctionMapper but ran into some challenges with that approach. In AggregateFunctionConverter I only have access to the Calcite AggregateCall and cannot modify the input of the Aggregate relation node, which I need to do to cast the types of the input fields. I also feel that a bigger redesign might be worth exploring.

@nielspardon nielspardon deleted the par-stddev branch June 11, 2026 09:50
@nielspardon nielspardon restored the par-stddev branch June 11, 2026 09:50
@nielspardon nielspardon reopened this Jun 11, 2026
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Use the non-deprecated std_dev/variance signatures that carry the
SAMPLE/POPULATION distinction as a leading "distribution" enum argument
(std_dev:req_fp64 etc.) instead of the now-deprecated function option.

During Calcite -> Substrait conversion the distribution enum operand is
synthesized so the generic function matcher resolves the enum-arg variant
and builds the EnumArg; the reverse direction disambiguates the Calcite
operator from that argument. A shared StatisticalDistribution enum is
added in :core so the DSL builder and isthmus share one source of truth.

Resolves substrait-io#803

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Collapse the three-way branch in getSqlOperatorFromSubstraitFunc into a
single conditional: when a distribution enum argument is present, narrow
the candidate operators by it (falling back to all operators when output
type filtering yielded none). Behavior is unchanged.

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
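One possible reading of the single conditional described in the last commit message, sketched in Java. This is an interpretation of the commit text, not the actual `getSqlOperatorFromSubstraitFunc` code: operators are plain strings, the `_POP` / `_SAMP` suffix check is a stand-in for however the real code relates an operator to a distribution, and all names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class CandidateNarrowingSketch {
    /** Local stand-in for the shared enum; the real class may look different. */
    enum StatisticalDistribution { POPULATION, SAMPLE }

    /** Stand-in check: does this operator name belong to the given distribution? */
    static boolean matches(String operator, StatisticalDistribution distribution) {
        String suffix = distribution == StatisticalDistribution.POPULATION ? "_POP" : "_SAMP";
        return operator.endsWith(suffix);
    }

    /**
     * Without a distribution argument, the type-filtered candidates are returned as-is.
     * With one, candidates are narrowed by it, starting from all operators when
     * output type filtering yielded none.
     */
    static List<String> narrow(
            List<String> allOperators,
            List<String> typeFiltered,
            Optional<StatisticalDistribution> distribution) {
        if (!distribution.isPresent()) {
            return typeFiltered;
        }
        StatisticalDistribution value = distribution.get();
        List<String> pool = typeFiltered.isEmpty() ? allOperators : typeFiltered;
        return pool.stream()
                .filter(operator -> matches(operator, value))
                .collect(Collectors.toList());
    }
}
```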
@nielspardon nielspardon marked this pull request as draft June 15, 2026 18:17

Development

Successfully merging this pull request may close these issues.

use enum arg function signatures for std_dev and variance functions instead of function option signatures

3 participants