Feat/bot leaderboard/v2.3 #4435

Draft
lsabor wants to merge 20 commits into main from feat/bot-leaderboard/v2.3

@lsabor lsabor commented Feb 26, 2026

Contributor

No description provided.

coderabbitai Bot commented Feb 26, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 61c05057-700c-45ec-8cf9-95b6088710ec

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


github-actions Bot commented Feb 26, 2026

Contributor

🚀 Preview Environment

Your preview environment is ready!

| Resource | Details |
|---|---|
| 🌐 Preview URL | https://metaculus-pr-4435-feat-bot-leaderboard-v2-3-preview.mtcl.cc |
| 📦 Docker Image | ghcr.io/metaculus/metaculus:feat-bot-leaderboard-v2.3-72a65ac |
| 🗄️ PostgreSQL | NeonDB branch preview/pr-4435-feat-bot-leaderboard-v2-3 |
| Redis | Fly Redis mtc-redis-pr-4435-feat-bot-leaderboard-v2-3 |

Details

  • Commit: 3fd3ebcb237f35a258e5d5671b198b2c25537a31
  • Branch: feat/bot-leaderboard/v2.3
  • Fly App: metaculus-pr-4435-feat-bot-leaderboard-v2-3

ℹ️ Preview Environment Info

Isolation:

  • PostgreSQL and Redis are fully isolated from production
  • Each PR gets its own database branch and Redis instance
  • Changes pushed to this PR will trigger a new deployment

Limitations:

  • Background workers and cron jobs are not deployed in preview environments
  • If you need to test background jobs, use Heroku staging environments

Cleanup:

  • This preview will be automatically destroyed when the PR is closed

Metac bots with metac_bot metadata but used as internal agents (metac-azimuth, metac-agent) should not be included in leaderboard calculations.

Co-authored-by: Cursor <cursoragent@cursor.com>
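
For reviewers, the exclusion rule in this commit could be sketched roughly as below. This is only an illustration: the helper name, its arguments, and the `metac-example-bot` username are hypothetical, and only the two internal-agent usernames come from the commit message.

```python
# Usernames taken from the commit message; everything else here is hypothetical.
INTERNAL_AGENT_USERNAMES = frozenset({"metac-azimuth", "metac-agent"})


def is_leaderboard_metac_bot(username: str, has_metac_bot_metadata: bool) -> bool:
    """Return True only for Metac bots that should enter leaderboard calculations."""
    # Internal agents carry metac_bot metadata but are not leaderboard contestants.
    return has_metac_bot_metadata and username not in INTERNAL_AGENT_USERNAMES


print(is_leaderboard_metac_bot("metac-example-bot", True))
print(is_leaderboard_metac_bot("metac-agent", True))
```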

@lsabor lsabor left a comment

Contributor Author

Nothing looked out of place to me in what @colesussmeier added. I'm the initial author, so I can't approve, but feel free to merge as is or address the tiny nit I added.

Comment thread scoring/management/commands/update_global_bot_leaderboard.py Outdated
* Make estimate_variances_from_head_to_head return a uniform tuple

Always return tuple[float, float | None] instead of conditionally
returning either a bare float or a tuple, so callers have a single
shape to unpack. The second element stays None unless
include_discrimination is set.

Co-authored-by: Cursor <cursoragent@cursor.com>
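
A minimal sketch of the uniform return shape described in this commit. The function name, its `scores` argument, and both placeholder statistics are assumptions made for illustration; the real estimator's inputs and math are not shown in this PR text.

```python
from __future__ import annotations

from statistics import pvariance


def estimate_variances_sketch(
    scores: list[float],
    include_discrimination: bool = False,
) -> tuple[float, float | None]:
    """Always return (variance, discrimination_or_None)."""
    # Placeholder statistic standing in for the real variance estimate.
    variance = pvariance(scores)
    # Placeholder for the optional second value; None unless requested.
    discrimination = max(scores) - min(scores) if include_discrimination else None
    return variance, discrimination


# Callers unpack the same shape regardless of the flag.
variance, discrimination = estimate_variances_sketch([1.0, 2.0, 3.0])
```

With a single shape, callers no longer need an `isinstance` check before unpacking.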

* Extract AIB_PROJECT_IDS module constant

Replace the AIB project-id list duplicated inside gather_data with a
single module-level AIB_PROJECT_IDS constant.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Add aib_minibench_only question-selection mode

Factor the question project filter into a project_filter Q object and
add an aib_minibench_only flag that restricts the leaderboard to AIB
and Minibench questions. Tag CSV output with the _AIBMiniB suffix.

Co-authored-by: Cursor <cursoragent@cursor.com>
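
A rough sketch of the selection mode in this commit. The PR builds a Django Q object; this version swaps in a plain-Python filter so it runs without Django. All project ids are placeholders, and `MINIBENCH_PROJECT_IDS` and both helper names are hypothetical.

```python
# Placeholder ids: the real AIB and Minibench project ids are not shown in this PR text.
AIB_PROJECT_IDS = [101, 102]
MINIBENCH_PROJECT_IDS = [201]  # hypothetical constant name


def allowed_project_ids(all_project_ids, aib_minibench_only=False):
    """Plain-Python stand-in for the project_filter Q object."""
    if not aib_minibench_only:
        return list(all_project_ids)
    allowed = set(AIB_PROJECT_IDS) | set(MINIBENCH_PROJECT_IDS)
    return [pid for pid in all_project_ids if pid in allowed]


def csv_suffix(aib_minibench_only=False):
    """Suffix appended to the CSV name when the mode is active."""
    return "_AIBMiniB" if aib_minibench_only else ""


print(allowed_project_ids([101, 201, 999], aib_minibench_only=True))
print(csv_suffix(aib_minibench_only=True))
```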

* Drop the community aggregate on low-human and minibench questions

Add a min_human_forecasters threshold: on community questions with
fewer than that many distinct human forecasters, keep the question but
drop the Community Aggregate head-to-head matches. Do the same for
minibench questions, which have no real human crowd (also skip building
the aggregate for them in gather_data). Tag CSV output with _MinHF.

Co-authored-by: Cursor <cursoragent@cursor.com>
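
The match-dropping rule in this commit could look roughly like the sketch below. The tuple layout for matches, the `human_counts` mapping, and the function name are assumptions for illustration only.

```python
COMMUNITY_AGGREGATE = "Community Aggregate"


def drop_aggregate_matches(matches, human_counts, minibench_question_ids,
                           min_human_forecasters):
    """Keep every question, but drop aggregate matches where the crowd is thin.

    matches: list of (question_id, player_a, player_b) tuples (assumed shape).
    human_counts: question_id -> number of distinct human forecasters.
    """
    kept = []
    for question_id, player_a, player_b in matches:
        involves_aggregate = COMMUNITY_AGGREGATE in (player_a, player_b)
        low_human = human_counts.get(question_id, 0) < min_human_forecasters
        is_minibench = question_id in minibench_question_ids
        if involves_aggregate and (low_human or is_minibench):
            continue  # drop only the aggregate's head-to-head match
        kept.append((question_id, player_a, player_b))
    return kept
```

Bot-versus-bot matches on the same question survive, so the question still contributes to the fit.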

* Implement non_metac_bots_by_year

Replace the NotImplementedError with the per-year split for third-party
bots: rewrite their head-to-head ids to year-tagged strings ("name
(YYYY)"), parallel to the cp/pro aggregate split. This also drops them
from non_metac_bot_ids membership so the per-year history bypasses the
recency filter. Guard with an assert that include_non_metac_bots is set.

Co-authored-by: Cursor <cursoragent@cursor.com>
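
A small sketch of the year tagging in this commit. The match tuple layout, the function names, and the example bot names are hypothetical; only the "name (YYYY)" format and the assert on include_non_metac_bots come from the commit message.

```python
def year_tagged_id(name: str, year: int) -> str:
    """Build the year-tagged string id, e.g. 'name (2025)'."""
    return f"{name} ({year})"


def split_non_metac_bots_by_year(matches, non_metac_bot_names,
                                 include_non_metac_bots=True):
    """Rewrite third-party bot ids in (year, player_a, player_b) tuples."""
    assert include_non_metac_bots, (
        "non_metac_bots_by_year requires include_non_metac_bots"
    )

    def tag(player, year):
        # Only third-party bots are rewritten; other players keep their id.
        return year_tagged_id(player, year) if player in non_metac_bot_names else player

    return [(year, tag(a, year), tag(b, year)) for year, a, b in matches]
```

Because the rewritten ids are new strings, they no longer match the original bot ids, which is how the per-year history would bypass a membership-based recency filter.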

* Apply the participation threshold per parent across year splits

Add participation_parent_key to map year-split player ids
("... (YYYY)") to their parent, and apply min_participation_count to
the parent's combined question set. This keeps an established
aggregate/bot from being dropped just because individual per-year
slices are sparse.

Co-authored-by: Cursor <cursoragent@cursor.com>
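
One way the parent mapping in this commit might work, as a hedged sketch. The regular expression, the `questions_by_player` shape, and the `players_meeting_threshold` helper are assumptions; only participation_parent_key and min_participation_count are names from the commit message.

```python
import re
from collections import defaultdict

# Matches ids of the form "<parent> (YYYY)".
_YEAR_SUFFIX = re.compile(r"^(?P<parent>.+) \(\d{4}\)$")


def participation_parent_key(player_id):
    """Map a year-split id to its parent; leave other ids unchanged."""
    if isinstance(player_id, str):
        match = _YEAR_SUFFIX.match(player_id)
        if match:
            return match.group("parent")
    return player_id


def players_meeting_threshold(questions_by_player, min_participation_count):
    """Apply the threshold to each parent's combined question set."""
    combined = defaultdict(set)
    for player, questions in questions_by_player.items():
        combined[participation_parent_key(player)] |= set(questions)
    return {
        player
        for player in questions_by_player
        if len(combined[participation_parent_key(player)]) >= min_participation_count
    }
```

In this sketch a bot with two questions in each of two years passes a threshold of three, even though neither yearly slice would pass on its own.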

* Combine year-split players into one leaderboard entry per model

Add combine_year_split_players, which collapses per-year community/pro
aggregates and non_metac_bots_by_year bots into a single combined entry
(contribution-count-weighted mean skill, CI via SE propagation, summed
counts), mirroring the front-end re-aggregation. Apply it to the
leaderboard DB save and CSV output while keeping the per-year fit
intact for the discrimination and distribution diagnostics.

Co-authored-by: Cursor <cursoragent@cursor.com>
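
A sketch of how the combination in this commit could be computed. It assumes symmetric 95% intervals, independent per-year estimates, and a positive total count; the dict keys and function name are hypothetical, and the PR's actual formula may differ.

```python
import math

Z_95 = 1.96  # assumed normal quantile for a 95% interval


def combine_entries(entries):
    """Collapse per-year entries into one combined entry.

    entries: list of dicts with skill, ci_lower, ci_upper, count (assumed shape).
    """
    total = sum(e["count"] for e in entries)
    # Contribution-count-weighted mean skill.
    skill = sum(e["count"] * e["skill"] for e in entries) / total
    # Propagate standard errors through the weighted mean.
    variance = 0.0
    for e in entries:
        se = (e["ci_upper"] - e["ci_lower"]) / (2 * Z_95)
        variance += (e["count"] / total) ** 2 * se ** 2
    half_width = Z_95 * math.sqrt(variance)
    return {
        "skill": skill,
        "ci_lower": skill - half_width,
        "ci_upper": skill + half_width,
        "count": total,  # summed counts
    }
```

The per-year entries themselves are left untouched, so diagnostics that need the per-year fit can still read them.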

* Update default run config and tidy parameter comments

Set the Command.handle() run configuration to the current v2.3 defaults
(include_minibench, min_human_forecasters, non_metac_bots_by_year, bot
recency/score windows, ALS off, etc.) and wire the new
aib_minibench_only / min_human_forecasters kwargs through the call.
Move the explanatory comments off the function signature.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Remove low-human participation questions prior to gather_data step

* Add docstring note for combine_year_split_players

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

Development

Successfully merging this pull request may close these issues.

FutureEval Leaderboard Additions
