Refactor and fix variance functions#1051
Merged
Merged
Conversation
Contributor
Author
|
I highly recommend reviewing this in split diff mode, not unified. |
levkk
reviewed
Jun 10, 2026
| pub(super) struct Variance { | ||
| sumsq: Sum, | ||
| sum: Sum, | ||
| count: von::Count, |
Collaborator
There was a problem hiding this comment.
My favorite of all the counts.
Contributor
Author
There was a problem hiding this comment.
I giggle every time I get to write it.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
levkk
approved these changes
Jun 10, 2026
levkk
left a comment
Collaborator
There was a problem hiding this comment.
Yes. A thousand times yes.
This refactor has some more significant behavior changes than the earlier ones, as the existing implementation of variance functions was actually extremely incorrect. On the data sets in the integration tests I've added, its answer was off by several trillion (which is more than 50% off). I believe the old code was trying to use the same formula we're using here, but was subtly wrong. I've included what the actual mathematical formula is, along with a citation to a math textbook with more details so that you can verify the implementation is correct even if you don't know how to compute the variance of a population without computing the mean off the top of your head (but that's something we all know by heart, right?). The old integration tests weren't doing a good job testing. They only used small, uniform data sets, and used complex logic to determine the expected value instead of testing against a known value or what unsharded PG says. I've replaced them with a test that checks against two large, statistically significant data sets, with one test for each data type which changes the output type of `SUM` that we can support. The formula we're using is the standard way to approximate the variance of a population without calculating the mean. Notably, it is an *approximation*, and any more precise formula would either require us to get the entire data set from the shards, or calculate the mean and do multiple round trips. The latter isn't necessarily an unreasonable solution, but it would definitely be a larger structural change. [This Wikipedia article]([200~https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance) has more details on the tradeoffs, and specifically notes the precision issue with the formula we're using. The approximation does not lose precision on numeric, or types for which sum returns numeric such as int8. The precision loss is fairly reasonable for double precision, and big yikes for single precision floats. I *think* we might be able to solve this in the future by always casting to numeric on the shards, and passing the expected output type from the schema rather than determining it based on the input we get from the rows. Either way, this implementation is certainly not *less* precise than what we were doing before, so I've left that for future work. With this, the refactor of aggregate functions is almost done, and can be quickly followed up by some fun cleanup PRs where we delete the structural shit that was only needed while we were halfway done. Leaving that for another PR though, as this one is large enough by itself. Fix an incorrect comment
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This refactor has some more significant behavior changes than the
earlier ones, as the existing implementation of variance functions was
actually extremely incorrect. On the data sets in the integration tests
I've added, its answer was off by several trillion (which is more than
50% off). I believe the old code was trying to use the same formula
we're using here, but was subtly wrong. I've included what the actual
mathematical formula is, along with a citation to a math textbook with
more details so that you can verify the implementation is correct even
if you don't know how to compute the variance of a population without
computing the mean off the top of your head (but that's something we all
know by heart, right?).
The old integration tests weren't doing a good job testing. They only
used small, uniform data sets, and used complex logic to determine the
expected value instead of testing against a known value or what
unsharded PG says. I've replaced them with a test that checks against
two large, statistically significant data sets, with one test for each
data type which changes the output type of
SUMthat we can support.The formula we're using is the standard way to approximate the variance
of a population without calculating the mean. Notably, it is an
approximation, and any more precise formula would either require us to
get the entire data set from the shards, or calculate the mean and do
multiple round trips. The latter isn't necessarily an unreasonable
solution, but it would definitely be a larger structural change.
This wikipedia page
has more details on the problem, and specifically notes the precsion
issue with the algorithm we're using
The approximation does not lose precision on numeric, or types for which
sum returns numeric such as int8. The precision loss is fairly
reasonable for double precision, and big yikes for single precision
floats. I think we might be able to solve this in the future by always
casting to numeric on the shards, and passing the expected output type
from the schema rather than determining it based on the input we get
from the rows. Either way, this implementation is certainly not less
precise than what we were doing before, so I've left that for future
work.
With this, the refactor of aggregate functions is almost done, and can
be quickly followed up by some fun cleanup PRs where we delete the
structural shit that was only needed while we were halfway done. Leaving
that for another PR though, as this one is large enough by itself.