Refactor and fix variance functions by sgrif · Pull Request #1051 · pgdogdev/pgdog

sgrif · 2026-06-10T16:29:09Z

This refactor has some more significant behavior changes than the
earlier ones, as the existing implementation of variance functions was
actually extremely incorrect. On the data sets in the integration tests
I've added, its answer was off by several trillion (which is more than
50% off). I believe the old code was trying to use the same formula
we're using here, but was subtly wrong. I've included what the actual
mathematical formula is, along with a citation to a math textbook with
more details so that you can verify the implementation is correct even
if you don't know how to compute the variance of a population without
computing the mean off the top of your head (but that's something we all
know by heart, right?).
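
For anyone who doesn't have it memorized, the identity usually meant by "variance without computing the mean" is sketched below. This is the standard textbook form, and it is an assumption that it matches the formula cited in the PR's code, since that comment is not quoted in this description. Here $N$ is the row count and $x_i$ are the values.

```latex
\begin{align*}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2,
  \qquad \mu = \frac{1}{N}\sum_{i=1}^{N}x_i \\
 &= \frac{1}{N}\sum_{i=1}^{N}x_i^2 \;-\; 2\mu\cdot\frac{1}{N}\sum_{i=1}^{N}x_i \;+\; \mu^2 \\
 &= \frac{1}{N}\sum_{i=1}^{N}x_i^2 \;-\; \mu^2 \\
 &= \frac{\sum_{i=1}^{N}x_i^2 \;-\; \frac{\left(\sum_{i=1}^{N}x_i\right)^2}{N}}{N} \\[6pt]
s^2 &= \frac{\sum_{i=1}^{N}x_i^2 \;-\; \frac{\left(\sum_{i=1}^{N}x_i\right)^2}{N}}{N-1}
\end{align*}
```

Only $\sum x_i^2$, $\sum x_i$, and $N$ appear on the right-hand side, which lines up with the `sumsq`, `sum`, and `count` fields of the `Variance` struct quoted in review. The identity is algebraically exact; any error comes from rounding when the two large terms are subtracted.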

The old integration tests weren't doing a good job testing. They only
used small, uniform data sets, and used complex logic to determine the
expected value instead of testing against a known value or what
unsharded PG says. I've replaced them with a test that checks against
two large, statistically significant data sets, with one test for each
supported data type that changes the output type of SUM.

The formula we're using is the standard way to approximate the variance
of a population without calculating the mean. Notably, it is an
approximation, and any more precise formula would either require us to
get the entire data set from the shards, or calculate the mean and do
multiple round trips. The latter isn't necessarily an unreasonable
solution, but it would definitely be a larger structural change.
This Wikipedia page
(https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)
has more details on the problem, and specifically notes the precision
issue with the algorithm we're using.
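
To make the single-round-trip property concrete, here is a minimal sketch of how per-shard partials could be merged and turned into a variance. It is not the code from this PR: the `Partial` type, its methods, and the use of `f64` throughout are all hypothetical, chosen only to illustrate the sum-of-squares approach.

```rust
/// Hypothetical per-shard partial aggregates: what each shard would return
/// for SUM(x * x), SUM(x), and COUNT(x). Not the PR's actual types.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Partial {
    sumsq: f64,
    sum: f64,
    count: u64,
}

impl Partial {
    /// Build a partial from the rows that live on one shard.
    fn from_rows(rows: &[f64]) -> Self {
        Partial {
            sumsq: rows.iter().map(|x| x * x).sum(),
            sum: rows.iter().sum(),
            count: rows.len() as u64,
        }
    }

    /// Merging is plain addition, so one result row per shard is enough.
    fn merge(self, other: Partial) -> Partial {
        Partial {
            sumsq: self.sumsq + other.sumsq,
            sum: self.sum + other.sum,
            count: self.count + other.count,
        }
    }

    /// Population variance: (sumsq - sum^2 / n) / n. None for an empty set.
    fn var_pop(self) -> Option<f64> {
        if self.count == 0 {
            return None;
        }
        let n = self.count as f64;
        Some((self.sumsq - self.sum * self.sum / n) / n)
    }

    /// Sample variance: (sumsq - sum^2 / n) / (n - 1). None when n < 2.
    fn var_samp(self) -> Option<f64> {
        if self.count < 2 {
            return None;
        }
        let n = self.count as f64;
        Some((self.sumsq - self.sum * self.sum / n) / (n - 1.0))
    }
}

fn main() {
    // Two made-up shards; the combined data has mean 5 and var_pop 4.
    let shard_a = Partial::from_rows(&[2.0, 4.0, 4.0, 4.0]);
    let shard_b = Partial::from_rows(&[5.0, 5.0, 7.0, 9.0]);
    let total = shard_a.merge(shard_b);
    println!("var_pop = {:?}", total.var_pop());
    println!("var_samp = {:?}", total.var_samp());
}
```

The mean never crosses the wire: each shard contributes three numbers, and the router only adds them up before applying the formula once.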

The approximation does not lose precision on numeric, or types for which
sum returns numeric such as int8. The precision loss is fairly
reasonable for double precision, and big yikes for single precision
floats. I think we might be able to solve this in the future by always
casting to numeric on the shards, and passing the expected output type
from the schema rather than determining it based on the input we get
from the rows. Either way, this implementation is certainly not less
precise than what we were doing before, so I've left that for future
work.
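
The precision gap between float widths is easy to reproduce in isolation. The sketch below is illustrative only: the function names and the four-row data set are made up, not taken from the PR or its tests. It runs the same sum-of-squares formula in `f32` and `f64` on values with a large mean and a small spread, which is the case where subtracting two nearly equal large numbers hurts most.

```rust
/// Sum-of-squares population variance computed entirely in single precision.
/// Assumes `rows` is non-empty.
fn var_pop_f32(rows: &[f32]) -> f32 {
    let n = rows.len() as f32;
    let sum: f32 = rows.iter().sum();
    let sumsq: f32 = rows.iter().map(|x| x * x).sum();
    (sumsq - sum * sum / n) / n
}

/// The same formula in double precision. Assumes `rows` is non-empty.
fn var_pop_f64(rows: &[f64]) -> f64 {
    let n = rows.len() as f64;
    let sum: f64 = rows.iter().sum();
    let sumsq: f64 = rows.iter().map(|x| x * x).sum();
    (sumsq - sum * sum / n) / n
}

fn main() {
    // Deviations from the mean are -1.5, -0.5, 0.5, 1.5, so the exact
    // population variance is (2.25 + 0.25 + 0.25 + 2.25) / 4 = 1.25.
    let rows32: [f32; 4] = [10_000.0, 10_001.0, 10_002.0, 10_003.0];
    let rows64: [f64; 4] = [10_000.0, 10_001.0, 10_002.0, 10_003.0];

    // The squares are around 1e8, beyond what f32 can represent exactly,
    // so the f32 subtraction operates on already-rounded inputs.
    println!("exact: 1.25");
    println!("f64:   {}", var_pop_f64(&rows64));
    println!("f32:   {}", var_pop_f32(&rows32));
}
```

With `f64` every intermediate value here is an integer small enough to be represented exactly, so the result matches the exact answer, while the `f32` result cannot.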

With this, the refactor of aggregate functions is almost done, and can
be quickly followed up by some fun cleanup PRs where we delete the
structural shit that was only needed while we were halfway done. Leaving
that for another PR though, as this one is large enough by itself.

sgrif · 2026-06-10T21:05:03Z

I highly recommend reviewing this in split diff mode, not unified.

levkk · 2026-06-10T21:08:30Z

+pub(super) struct Variance {
+    sumsq: Sum,
+    sum: Sum,
+    count: von::Count,


My favorite of all the counts.

I giggle every time I get to write it.

codecov · 2026-06-10T21:14:28Z

Codecov Report

❌ Patch coverage is 58.79121% with 75 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
pgdog/src/backend/pool/connection/aggregate.rs	9.75%	37 Missing ⚠️
.../src/backend/pool/connection/aggregate/variance.rs	72.16%	27 Missing ⚠️
pgdog/src/frontend/router/parser/aggregate.rs	0.00%	11 Missing ⚠️


levkk

Yes. A thousand times yes.


sgrif force-pushed the sg-fix-variance branch from 2b9e7bb to ff65e45 Compare June 10, 2026 21:03

sgrif changed the title ~~wip~~ Refactor and fix variance functions Jun 10, 2026

sgrif requested a review from levkk June 10, 2026 21:05

sgrif marked this pull request as ready for review June 10, 2026 21:05

levkk reviewed Jun 10, 2026


levkk approved these changes Jun 10, 2026


sgrif force-pushed the sg-fix-variance branch from 04ac7d0 to 80c0e33 Compare June 10, 2026 21:31

sgrif merged commit 5a3c945 into main Jun 10, 2026
4 checks passed

sgrif deleted the sg-fix-variance branch June 10, 2026 21:31

sgrif merged 1 commit into main from sg-fix-variance
