fix(isthmus): std_dev, variance function mappings#780
Conversation
mbwhite
left a comment
There was a problem hiding this comment.
LGTM - will be good to get these cleaned-up
|
For similar special-case conversion handling with scalar functions, we made a special effort to extract the special case handling to separate classes outside of the main ScalarFunctionConverter. Perhaps we should do something similar for special case function handling in AggregateFunctionConverter? See PR #401. I certainly do not see the implementation in #401 as perfect. Ideally function conversion would be refactored to better support special case function handling. Perhaps by making the building blocks of function and type mapping more easily composable and reusable by different converters. That doesn't need to happen in one step though; more likely a refactor once things are working. |
I actually started with an |
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Use the non-deprecated std_dev/variance signatures that carry the SAMPLE/POPULATION distinction as a leading "distribution" enum argument (std_dev:req_fp64 etc.) instead of the now-deprecated function option. During Calcite -> Substrait conversion the distribution enum operand is synthesized so the generic function matcher resolves the enum-arg variant and builds the EnumArg; the reverse direction disambiguates the Calcite operator from that argument. A shared StatisticalDistribution enum is added in :core so the DSL builder and isthmus share one source of truth. Resolves substrait-io#803 Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Collapse the three-way branch in getSqlOperatorFromSubstraitFunc into a single conditional: when a distribution enum argument is present, narrow the candidate operators by it (falling back to all operators when output type filtering yielded none). Behavior is unchanged. Signed-off-by: Niels Pardon <par@zurich.ibm.com>
This PR implements the bidirectional conversion between Calcite and Substrait for the
statistical aggregate functions standard deviation and variance (
STDDEV_POP,STDDEV_SAMP,VAR_POP,VAR_SAMP).Problem
Substrait uses a single function name for both the population and sample variants:
std_devfor bothSTDDEV_POPandSTDDEV_SAMPvariancefor bothVAR_POPandVAR_SAMPThe population (n denominator) vs. sample (n-1 denominator) distinction is captured by a
distributionvalue (POPULATIONorSAMPLE). Previously these functions were not mapped,so the existing TPC-DS test cases using them were silently mis-mapped to
AVG(Calciterepresents these statistical functions with
SqlAvgAggFunction).Solution
This uses the non-deprecated signatures introduced in substrait-io/substrait#1011
(Substrait v0.87.0), which carry the distinction as an enum function argument (the
leading
distributionargument, e.g.std_dev:req_fp64), rather than the deprecatedfunction-option form (substrait-io/substrait#1019). Resolves #803.
Since the input arguments are cast to
FP64when necessary, the integer-based signaturesproposed in substrait-io/substrait#1012 are not required.
Calcite → Substrait:
FunctionMappingsfor all four statistical operators.AggregateFunctionConvertersynthesizes the leadingdistributionenum operand based onthe Calcite
SqlKind, so the generic function matcher resolves the enum-arg variant andbuilds the
EnumArgautomatically (no bespoke option plumbing).FP64where necessary.Substrait → Calcite:
FunctionConverter.getSqlOperatorFromSubstraitFuncdisambiguates the population/sampleoperator from the
distributionenum argument.SubstraitRelNodeConverterandPreCalciteAggregateValidatorskip the non-value enumargument when building Calcite aggregate operands.
DSL & shared enum:
StatisticalDistributionenum (in:core) is the single source of truth for theSAMPLE/POPULATIONvalues used by both the DSL builder and isthmus.SubstraitBuildergainsstddevPopulation,stddevSample,variancePopulation, andvarianceSampleconvenience methods.Testing
AggregationFunctionsTestexercises full round trips (POJO ⇄ proto and Substrait ⇄Calcite) for all four functions, with and without grouping.
StatisticalFunctionTestverifies the SQL round trip and asserts that each SQLoperator maps to the enum-arg signature (
std_dev:req_fp64etc.) with the correctdistributionEnumArgand no function options.🤖 Generated with AI