Refactor and fix variance functions #1051

Merged
sgrif merged 1 commit into main from sg-fix-variance
Jun 10, 2026
sgrif commented Jun 10, 2026
Contributor

This refactor has some more significant behavior changes than the
earlier ones, as the existing implementation of variance functions was
actually extremely incorrect. On the data sets in the integration tests
I've added, its answer was off by several trillion (which is more than
50% off). I believe the old code was trying to use the same formula
we're using here, but was subtly wrong. I've included what the actual
mathematical formula is, along with a citation to a math textbook with
more details so that you can verify the implementation is correct even
if you don't know how to compute the variance of a population without
computing the mean off the top of your head (but that's something we all
know by heart, right?).
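
For reference, the mean-free identity this description relies on is usually written as below. The notation is an assumption of this note rather than something taken from the patch: $n$ is the row count, $\sum x_i$ the sum of the values, and $\sum x_i^2$ the sum of their squares, which presumably line up with the `count`, `sum`, and `sumsq` fields of the `Variance` struct.

$$
\sigma^2 = \frac{1}{n}\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right),
\qquad
s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right)
$$

Here $\sigma^2$ is the population variance (`var_pop`) and $s^2$ the sample variance (`var_samp`). Every term is a plain sum or count, so each shard can report its own partial and the totals can be added before the final division.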

The old integration tests weren't doing a good job testing. They only
used small, uniform data sets, and used complex logic to determine the
expected value instead of testing against a known value or what
unsharded PG says. I've replaced them with tests that check against
two large, statistically significant data sets, with one test for each
supported data type that changes the output type of SUM.
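
The testing approach described above (compare the sharded answer against an independently computed known value, on data that is neither small nor uniform) can be sketched roughly as follows. This is not the test from the PR: `reference_var_pop`, `sharded_var_pop`, and `skewed_data` are hypothetical helpers, the generator constants are arbitrary, and a two-pass computation stands in for "what unsharded PG says".

```rust
/// Two-pass population variance, used here as the independent "known value".
fn reference_var_pop(data: &[f64]) -> f64 {
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    data.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n
}

/// Population variance rebuilt from per-shard (sumsq, sum, count) totals.
fn sharded_var_pop(shards: &[&[f64]]) -> f64 {
    let (sumsq, sum, count) = shards.iter().fold((0.0, 0.0, 0u64), |(sq, s, c), shard| {
        (
            sq + shard.iter().map(|x| x * x).sum::<f64>(),
            s + shard.iter().sum::<f64>(),
            c + shard.len() as u64,
        )
    });
    let n = count as f64;
    (sumsq - sum * sum / n) / n
}

/// Deterministic, non-uniform data from a small linear congruential generator.
fn skewed_data(len: usize) -> Vec<f64> {
    let mut state: u64 = 42;
    (0..len)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            // Top 53 bits mapped to [0, 1).
            let u = (state >> 11) as f64 / (1u64 << 53) as f64;
            // Squaring skews the distribution toward small values.
            (u * u * 1_000_000.0).round()
        })
        .collect()
}
```

A test built on these would split `skewed_data(10_000)` into unevenly sized slices, pass them to `sharded_var_pop`, and assert that the result matches `reference_var_pop` on the whole data set within a relative tolerance rather than exactly.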

The formula we're using is the standard way to approximate the variance
of a population without calculating the mean. Notably, it is an
approximation, and any more precise formula would either require us to
get the entire data set from the shards, or calculate the mean and do
multiple round trips. The latter isn't necessarily an unreasonable
solution, but it would definitely be a larger structural change.
This Wikipedia page (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)
has more details on the problem, and specifically notes the precision
issue with the algorithm we're using.
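
To make the single-round-trip property concrete, here is a minimal sketch of combining per-shard partials. It is an illustration under stated assumptions, not the code in this PR: `Partial` and its methods are hypothetical, everything is `f64` for simplicity (the real implementation tracks its sums through a `Sum` type), and returning `None` mirrors PostgreSQL yielding NULL for `var_pop` over zero rows and `var_samp` over fewer than two.

```rust
/// Hypothetical per-shard partial aggregate. Field names follow the
/// `sumsq` / `sum` / `count` fields of `Variance`; the types are assumed.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Partial {
    sumsq: f64,
    sum: f64,
    count: u64,
}

impl Partial {
    /// What a single shard would compute over its own rows.
    fn from_values(values: &[f64]) -> Self {
        values.iter().fold(Self::default(), |acc, &x| {
            acc.merge(Self { sumsq: x * x, sum: x, count: 1 })
        })
    }

    /// Combining two partials is plain addition, so shard order is irrelevant.
    fn merge(self, other: Self) -> Self {
        Self {
            sumsq: self.sumsq + other.sumsq,
            sum: self.sum + other.sum,
            count: self.count + other.count,
        }
    }

    /// Sum of squared deviations, computed without ever knowing the mean.
    fn numerator(self) -> f64 {
        self.sumsq - self.sum * self.sum / self.count as f64
    }

    /// Population variance; `None` when there are no rows.
    fn var_pop(self) -> Option<f64> {
        (self.count > 0).then(|| self.numerator() / self.count as f64)
    }

    /// Sample variance; `None` when there are fewer than two rows.
    fn var_samp(self) -> Option<f64> {
        (self.count > 1).then(|| self.numerator() / (self.count - 1) as f64)
    }
}
```

Merging the partials for `[2, 4, 4, 4]` and `[5, 5, 7, 9]` gives `sumsq = 232`, `sum = 40`, `count = 8`, so `var_pop` is `(232 - 1600 / 8) / 8 = 4`. The subtraction in `numerator` is where the precision issue lives: when the two terms are large and nearly equal, most of the significant digits cancel.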

The approximation does not lose precision on numeric, or types for which
sum returns numeric such as int8. The precision loss is fairly
reasonable for double precision, and big yikes for single precision
floats. I think we might be able to solve this in the future by always
casting to numeric on the shards, and passing the expected output type
from the schema rather than determining it based on the input we get
from the rows. Either way, this implementation is certainly not less
precise than what we were doing before, so I've left that for future
work.
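
The difference between float widths is easy to reproduce outside the codebase. The sketch below is an assumption-laden illustration, not the PR's code: all three function names are hypothetical, and the input (four values with a spread of a few units sitting on an offset of 100,000) is chosen to provoke cancellation. The true population variance of that input is 22.5.

```rust
/// Sum-of-squares formula evaluated entirely in single precision.
fn var_pop_f32(values: &[f32]) -> f32 {
    let n = values.len() as f32;
    let sum: f32 = values.iter().sum();
    let sumsq: f32 = values.iter().map(|x| x * x).sum();
    (sumsq - sum * sum / n) / n
}

/// The same formula evaluated in double precision.
fn var_pop_f64(values: &[f64]) -> f64 {
    let n = values.len() as f64;
    let sum: f64 = values.iter().sum();
    let sumsq: f64 = values.iter().map(|x| x * x).sum();
    (sumsq - sum * sum / n) / n
}

/// Two-pass reference: compute the mean first, then the squared deviations.
/// This is the alternative that needs the data (or a second round trip).
fn var_pop_two_pass(values: &[f64]) -> f64 {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    values.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n
}
```

For `[100004, 100007, 100013, 100016]`, both double-precision functions return 22.5, because every intermediate value fits in an `f64` exactly. In single precision each square is near 10^10, where adjacent `f32` values are 1024 apart, so the spread of 90 in the numerator is below the rounding granularity and the result cannot come out close to 22.5.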

With this, the refactor of aggregate functions is almost done, and can
be quickly followed up by some fun cleanup PRs where we delete the
structural shit that was only needed while we were halfway done. Leaving
that for another PR though, as this one is large enough by itself.

sgrif force-pushed the sg-fix-variance branch from 2b9e7bb to ff65e45 (June 10, 2026 21:03)
sgrif changed the title from "wip" to "Refactor and fix variance functions" (Jun 10, 2026)

sgrif commented Jun 10, 2026

Contributor Author

I highly recommend reviewing this in split diff mode, not unified.

sgrif requested a review from levkk (June 10, 2026 21:05)
sgrif marked this pull request as ready for review (June 10, 2026 21:05)

pub(super) struct Variance {
    sumsq: Sum,
    sum: Sum,
    count: von::Count,

Collaborator

My favorite of all the counts.

Contributor Author

I giggle every time I get to write it.

codecov Bot commented Jun 10, 2026

Codecov Report

❌ Patch coverage is 58.79121% with 75 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pgdog/src/backend/pool/connection/aggregate.rs | 9.75% | 37 Missing ⚠️ |
| .../src/backend/pool/connection/aggregate/variance.rs | 72.16% | 27 Missing ⚠️ |
| pgdog/src/frontend/router/parser/aggregate.rs | 0.00% | 11 Missing ⚠️ |

levkk left a comment
Collaborator

Yes. A thousand times yes.

This refactor has some more significant behavior changes than the
earlier ones, as the existing implementation of variance functions was
actually extremely incorrect. On the data sets in the integration tests
I've added, its answer was off by several trillion (which is more than
50% off). I believe the old code was trying to use the same formula
we're using here, but was subtly wrong. I've included what the actual
mathematical formula is, along with a citation to a math textbook with
more details so that you can verify the implementation is correct even
if you don't know how to compute the variance of a population without
computing the mean off the top of your head (but that's something we all
know by heart, right?).

The old integration tests weren't doing a good job testing. They only
used small, uniform data sets, and used complex logic to determine the
expected value instead of testing against a known value or what
unsharded PG says. I've replaced them with tests that check against
two large, statistically significant data sets, with one test for each
supported data type that changes the output type of `SUM`.

The formula we're using is the standard way to approximate the variance
of a population without calculating the mean. Notably, it is an
*approximation*, and any more precise formula would either require us to
get the entire data set from the shards, or calculate the mean and do
multiple round trips. The latter isn't necessarily an unreasonable
solution, but it would definitely be a larger structural change.
[This Wikipedia article](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)
has more details on the tradeoffs, and specifically notes the precision
issue with the formula we're using.

The approximation does not lose precision on numeric, or types for which
sum returns numeric such as int8. The precision loss is fairly
reasonable for double precision, and big yikes for single precision
floats. I *think* we might be able to solve this in the future by always
casting to numeric on the shards, and passing the expected output type
from the schema rather than determining it based on the input we get
from the rows. Either way, this implementation is certainly not *less*
precise than what we were doing before, so I've left that for future
work.

With this, the refactor of aggregate functions is almost done, and can
be quickly followed up by some fun cleanup PRs where we delete the
structural shit that was only needed while we were halfway done. Leaving
that for another PR though, as this one is large enough by itself.

Fix an incorrect comment
sgrif force-pushed the sg-fix-variance branch from 04ac7d0 to 80c0e33 (June 10, 2026 21:31)
sgrif merged commit 5a3c945 into main (Jun 10, 2026)
4 checks passed
sgrif deleted the sg-fix-variance branch (June 10, 2026 21:31)