feat: dockerhub-sync worker for repo_docker pull counts (CM-1213)#4163
feat: dockerhub-sync worker for repo_docker pull counts (CM-1213)#4163joanreyero wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new long-running dockerhub-sync worker under services/apps/packages_worker to (1) discover Docker Hub images for GitHub repos and (2) refresh/snapshot Docker Hub pull counts daily, backed by new repos.docker_checked_at and a partitioned repo_docker_pulls_daily table.
Changes:
- Introduces Docker Hub discovery + refresh loop with per-token GitHub parking and per-IP Hub parking.
- Adds Docker Hub fetch + Dockerfile-detection utilities, persistence helpers, and initial unit tests.
- Extends packages-db schema with
docker_checked_at, backlog/staleness indexes, andrepo_docker_pulls_daily(range-partitioned).
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| services/apps/packages_worker/src/dockerhub/index.ts | Core discovery/refresh loop, rate-limit parking, page processing |
| services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts | Docker Hub API client + error classification |
| services/apps/packages_worker/src/dockerhub/detectDockerfile.ts | GitHub GraphQL probe for Dockerfile presence |
| services/apps/packages_worker/src/dockerhub/upsertRepoDocker.ts | Upserts into repo_docker and daily snapshot table |
| services/apps/packages_worker/src/dockerhub/types.ts | Shared types + FetchError |
| services/apps/packages_worker/src/dockerhub/candidates.ts | Candidate image-name generation + validation |
| services/apps/packages_worker/src/dockerhub/tests/fetchDockerhub.test.ts | Unit tests for Hub fetch behavior |
| services/apps/packages_worker/src/dockerhub/tests/candidates.test.ts | Unit tests for candidate generation |
| services/apps/packages_worker/src/bin/dockerhub-sync.ts | Worker entrypoint and shutdown wiring |
| services/apps/packages_worker/src/config.ts | Adds getDockerhubConfig() env parsing |
| services/apps/packages_worker/package.json | Adds start/dev scripts for dockerhub-sync (protected file) |
| scripts/services/dockerhub-sync.yaml | Compose service for running dockerhub-sync |
| backend/src/osspckgs/migrations/V1779710880__initial_schema.sql | Schema: docker_checked_at, indexes, repo_docker_pulls_daily table |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| await qx.result( | ||
| ` | ||
| INSERT INTO repo_docker (repo_id, image_name, pulls, stars, last_synced_at) | ||
| VALUES ($(repoId), $(imageName), $(pulls), $(stars), NOW()) | ||
| ON CONFLICT (image_name) DO UPDATE SET | ||
| repo_id = COALESCE(repo_docker.repo_id, EXCLUDED.repo_id), | ||
| pulls = EXCLUDED.pulls, | ||
| stars = EXCLUDED.stars, | ||
| last_synced_at = NOW() | ||
| `, | ||
| { repoId, imageName: r.imageName, pulls: r.pulls, stars: r.stars }, | ||
| ) | ||
|
|
||
| await qx.result( | ||
| ` | ||
| INSERT INTO repo_docker_pulls_daily (image_name, date, pulls_total) | ||
| VALUES ($(imageName), CURRENT_DATE, $(pulls)) | ||
| ON CONFLICT (image_name, date) DO UPDATE SET pulls_total = EXCLUDED.pulls_total | ||
| `, | ||
| { imageName: r.imageName, pulls: r.pulls }, | ||
| ) |
| // eslint-disable-next-line @typescript-eslint/no-explicit-any | ||
| const json = (await response.json()) as any | ||
|
|
| const remaining = parseInt(response.headers.get('x-ratelimit-remaining') ?? '', 10) | ||
| const resetSec = parseInt(response.headers.get('x-ratelimit-reset') ?? '0', 10) | ||
| const resetMs = resetSec ? resetSec * 1000 + 5_000 : Date.now() + 65_000 | ||
|
|
||
| if (response.status === 429 || (Number.isFinite(remaining) && remaining <= 0)) { | ||
| throw new FetchError('RATE_LIMIT', `Rate limited on ${imageName}`, resetMs) | ||
| } |
| it('classifies x-ratelimit-remaining: 0 as RATE_LIMIT even on 200', async () => { | ||
| mockFetch(200, { pull_count: 1 }, { 'x-ratelimit-remaining': '0' }) | ||
| await expect(fetchDockerhub(BASE, 'a/b')).rejects.toMatchObject({ kind: 'RATE_LIMIT' }) | ||
| }) |
| if (response.status === 404) throw new FetchError('NOT_FOUND', `404 for ${imageName}`) | ||
| if (response.status >= 500) { | ||
| throw new FetchError('TRANSIENT', `${response.status} for ${imageName}`) | ||
| } | ||
| if (!response.ok) { | ||
| // 400/401/403 etc — treat as a miss; Hub sometimes 400s on malformed slugs. | ||
| throw new FetchError('NOT_FOUND', `${response.status} for ${imageName}`) | ||
| } |
| if (!(err instanceof FetchError)) throw err | ||
|
|
||
| if (err.kind === 'NOT_FOUND') return null | ||
| if (err.kind === 'MALFORMED') { | ||
| log.warn({ imageName }, err.message) | ||
| return null | ||
| } | ||
| if (err.kind === 'RATE_LIMIT') { | ||
| hubParkedUntil = err.resetAt ?? Date.now() + 60_000 | ||
| continue | ||
| } |
| try { | ||
| const outcome = await discoverRepo(qx, row, token, config) | ||
| if (outcome === 'hit') hits++ | ||
| else if (outcome === 'miss') misses++ | ||
| else skipped++ | ||
| } catch (err) { | ||
| if (err instanceof FetchError && err.kind === 'RATE_LIMIT') { | ||
| const resetAt = err.resetAt ?? Date.now() + 60_000 | ||
| const waitMs = Math.max(1_000, resetAt - Date.now()) | ||
| parkedUntil.set(token, resetAt) | ||
| log.warn( | ||
| { tokenIdx, parkedUntil: new Date(resetAt).toISOString() }, | ||
| `token#${tokenIdx} rate limited — parking for ${Math.round(waitMs / 1000)}s`, | ||
| ) | ||
| await new Promise((r) => setTimeout(r, waitMs)) | ||
| failed++ | ||
| } else { | ||
| log.error({ url: row.url, err }, 'Unexpected discovery error') | ||
| failed++ | ||
| } | ||
| } |
| async function fetchRefreshPage( | ||
| qx: QueryExecutor, | ||
| cursor: string | null, | ||
| config: DockerhubConfig, | ||
| ): Promise<RefreshImageRow[]> { | ||
| return qx.select( | ||
| ` | ||
| SELECT id, repo_id, image_name | ||
| FROM repo_docker | ||
| WHERE last_synced_at < NOW() - make_interval(hours => $(refreshIntervalHours)) | ||
| AND ($(cursor)::bigint IS NULL OR id > $(cursor)::bigint) | ||
| ORDER BY id | ||
| LIMIT $(batchSize) | ||
| `, |
There was a problem hiding this comment.
Intentional — packages_worker keeps DB access as inline pg-promise SQL by convention (see .claude/skills/packages-worker-add-entrypoint: "do not add files to services/libs/data-access-layer"). The sibling enricher/ and osv/ workers follow the same pattern; packages-db is a separate database from the main CDP DAL.
There was a problem hiding this comment.
Retracting the above — themarolt established the opposite on #4149 and OSV queries already live in data-access-layer/src/packages/osv.ts. Moved in 09d2a00: all repo_docker / repo_docker_pulls_daily / repos.docker_checked_at queries now in services/libs/data-access-layer/src/packages/repoDocker.ts; the worker keeps HTTP + loop orchestration only.
| CREATE TABLE repo_docker_pulls_daily ( | ||
| image_name text NOT NULL, | ||
| date date NOT NULL, | ||
| pulls_total bigint NOT NULL, | ||
| PRIMARY KEY (image_name, date) | ||
| ) | ||
| PARTITION BY RANGE (date); |
There was a problem hiding this comment.
Matches downloads_daily ~30 lines above in the same migration — partition creation is deferred to the pg_partman setup step documented in the comment block (partman.create_parent('public.repo_docker_pulls_daily', 'date', '1 month', 3)). Adding a DEFAULT partition would defeat the point of monthly partitioning. Noted in the PR description under reviewer notes.
| async function processDiscoveryPage( | ||
| qx: QueryExecutor, | ||
| rows: DiscoveryRepoRow[], | ||
| parkedUntil: Map<string, number>, | ||
| config: DockerhubConfig, | ||
| ): Promise<{ hits: number; misses: number; skipped: number; failed: number }> { | ||
| let hits = 0 | ||
| let misses = 0 | ||
| let skipped = 0 | ||
| let failed = 0 | ||
| let nextIdx = 0 |
There was a problem hiding this comment.
Acknowledged — deferring loop-level mock tests to a follow-up. The leaf functions (candidates, fetchDockerhub error classification) are unit-tested, and the loop was validated end-to-end against 65 curated + 1000 random prod repos (results in PR description). The retry-after-park path is now exercised by the fix in d200d43.
Standalone loop worker (modeled on github-repos-enricher) that: - discovers Docker images for GitHub repos via Dockerfile-gated <owner>/<name> probing on hub.docker.com/v2 - refreshes pull/star counts daily into repo_docker - snapshots lifetime pull_count into repo_docker_pulls_daily for delta-at-query-time daily granularity Schema (V1779710880 edited in place, pre-prod): - repos.docker_checked_at + partial index for discovery backlog - repo_docker_pulls_daily partitioned by date (pg_partman, mirrors downloads_daily) - repo_docker_stale_idx on last_synced_at Tested against a 1000-repo random sample from prod public.repositories: 2.6% hit rate on Hub; 87% of repos have no Dockerfile; ghcr.io is the dominant registry for the remainder. CI-workflow parsing and ghcr/quay probes scoped as follow-ups. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Joan Reyero <joan@reyero.io>
6fb6051 to
b23c298
Compare
- Retry the same row after a GitHub rate-limit park instead of abandoning it (cursor would otherwise advance past unprobed repos until end-of-sweep). - Serialize Docker Hub calls via a promise chain so the per-token GitHub fan-out cannot fire concurrent requests against the per-IP Hub budget. - 401/403 from Hub now classified AUTH and propagated, so a misconfigured base URL fails fast instead of silently marking every image gone. - Stop discarding valid 200 responses when x-ratelimit-remaining=0. - Wrap repo_docker + repo_docker_pulls_daily writes in a transaction. - Classify non-JSON GitHub GraphQL bodies as MALFORMED. Not addressed (replied on PR): - Inline SQL stays per packages_worker convention (matches enricher/osv). - repo_docker_pulls_daily partition setup deferred to pg_partman, same as downloads_daily in the same migration. - Loop-level retry/parking tests deferred; validated against 1065 real repos. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Joan Reyero <joan@reyero.io>
| // eslint-disable-next-line @typescript-eslint/no-explicit-any | ||
| let json: any | ||
| try { | ||
| json = await response.json() | ||
| } catch (err) { | ||
| throw new FetchError( | ||
| 'MALFORMED', | ||
| `Non-JSON body for ${owner}/${name}: ${(err as Error).message}`, | ||
| ) | ||
| } |
| let response: Response | ||
| try { | ||
| response = await fetch(GRAPHQL_URL, { | ||
| method: 'POST', | ||
| headers: { | ||
| Authorization: `bearer ${token}`, | ||
| 'Content-Type': 'application/json', | ||
| }, | ||
| body: JSON.stringify({ query: DOCKERFILE_QUERY, variables: { owner, name } }), | ||
| }) | ||
| } catch (err) { | ||
| throw new FetchError( | ||
| 'TRANSIENT', | ||
| `Network error for ${owner}/${name}: ${(err as Error).message}`, | ||
| ) | ||
| } |
| const url = `${baseUrl}/repositories/${imageName}/` | ||
|
|
||
| let response: Response | ||
| try { | ||
| response = await fetch(url, { | ||
| method: 'GET', | ||
| headers: { Accept: 'application/json' }, | ||
| }) | ||
| } catch (err) { | ||
| throw new FetchError('TRANSIENT', `Network error for ${imageName}: ${(err as Error).message}`) | ||
| } |
Per themarolt's review on #4149, packages-db queries belong in services/libs/data-access-layer/src/packages/ alongside osv.ts. The worker now imports fetchStaleRepoDocker, fetchPendingDockerRepos, upsertRepoDockerRow, upsertRepoDockerDailySnapshot, touchRepoDocker, markRepoDockerChecked from @crowd/data-access-layer; dockerhub/upsertRepoDocker.ts is reduced to the tx orchestrator. Query strings unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Joan Reyero <joan@reyero.io>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 09d2a00. Configure here.
| if (err.kind === 'RATE_LIMIT') { | ||
| hubParkedUntil = err.resetAt ?? Date.now() + 60_000 | ||
| continue | ||
| } |
There was a problem hiding this comment.
Hub rate limits exhaust retries
High Severity
In hubFetchInner, each Docker Hub RATE_LIMIT uses continue on the same for loop that caps attempts at MAX_RETRIES. After a few 429s in one fetch, the loop exits and returns null, so discovery can mark a repo as checked with no Hub hit, and refresh can touch a live image as gone.
Reviewed by Cursor Bugbot for commit 09d2a00. Configure here.


Summary
Standalone loop worker (sibling of
github-repos-enricher) that discovers Docker Hub images for repos in packages-db and tracks their pull counts with daily granularity.Discovery (Option B-lite): one GitHub GraphQL call per repo checks for
DockerfileatHEAD:Dockerfile,docker/Dockerfile,build/Dockerfile. If present, probeshub.docker.com/v2/repositories/<owner>/<name>/. Hits are upserted torepo_docker; every repo getsrepos.docker_checked_atset so the backlog drains.Refresh: known images with stale
last_synced_atare re-fetched daily; lifetimepull_countis written torepo_docker.pullsand snapshotted intorepo_docker_pulls_daily(per-day deltas viaLAG()at query time — Hub doesn't expose daily counts).Loop: each tick processes one refresh page then one discovery page; idles when both empty. GitHub calls fan out across
ENRICHER_GITHUB_TOKENSwith per-token parking; Hub calls are sequential with a single per-IP park (Hub rate limit is per-IP, ~180/window).Schema (V1779710880 edited in place — pre-prod)
repos.docker_checked_at timestamptz+ partial indexrepos_docker_pending_idx(WHERE host='github' AND docker_checked_at IS NULL)repo_docker_pulls_daily(image_name, date, pulls_total)partitioned by date (register with pg_partman alongsidedownloads_daily)repo_docker_stale_idxonrepo_docker(last_synced_at)Files
src/dockerhub/{index,types,candidates,detectDockerfile,fetchDockerhub,upsertRepoDocker}.ts+ 15 vitest casessrc/bin/dockerhub-sync.ts,src/config.ts(getDockerhubConfig)scripts/services/dockerhub-sync.yaml,package.jsonscripts (port 9235)Validation against prod data
Ran against a random 1000-repo sample from
public.repositories(prod):<owner>/<name>on Hub<owner>/<name>on Hub7.5 min / 1 token / 0 errors / 0 rate-limits. Top finds:
ollama/ollama(140M pulls),hashicorp/packer(47M),semgrep/semgrep(32M).Follow-ups (scoped, not in this PR)
A CI-workflow-parsing census on the same 1000 repos showed:
scaleway/cli,qmcgaw/gluetun,nervos/ckb,paketobuildpacks/*, …)Ranked by ROI:
registrycolumn onrepo_docker. ~2× total coverage..github/workflowsextraction. +17 Hub images/1000.library/official-image allowlist; broader Dockerfile path probing.Reviewer notes
backend/.env.dist.{local,composed}need theDOCKERHUB_*block appended (couldn't write.env*from the dev session — see commit message for values).pnpm formatinpackages_workeris currently broken (strips TS generics);format-checkfails on pre-existing files too. Not CI-gated for this workspace. Separate fix needed.package.jsonchange is +3 script entries only.🤖 Generated with Claude Code
Note
Medium Risk
Touches packages-db schema (migration edit), long-running external API sync with rate limits, and depends on pg_partman partitions for
repo_docker_pulls_dailyinserts—misconfig or missing partitions could stall snapshots until ops fix partman.Overview
Adds a
dockerhub-syncloop worker (alongsidegithub-repos-enricher) that links GitHub repos in packages-db to Docker Hub images and keeps pull metrics fresh.Discovery: For GitHub repos with no recent
docker_checked_at, one GraphQL call checks for a Dockerfile at common paths; if found, it probes Hub atowner/name(validated/lowercased slugs only—nolibrary/heuristic). Hits upsertrepo_dockerand setrepos.docker_checked_atso the backlog drains.Refresh: Stale
repo_dockerrows are re-fetched on a daily interval; lifetimepull_countupdatesrepo_dockerand a daily snapshot in newrepo_docker_pulls_daily(partitioned; deltas viaLAG()at query time). Hub calls are serialized with per-IP rate-limit parking; GitHub discovery fans out acrossENRICHER_GITHUB_TOKENS.Schema/env/ops: Extends the pre-prod initial migration with
docker_checked_at, partial index for pending discovery,repo_docker_stale_idx, andrepo_docker_pulls_daily; addsDOCKERHUB_*env, DAL inrepoDocker.ts, compose servicedockerhub-sync.yaml, and package scripts/bin entry.Reviewed by Cursor Bugbot for commit 09d2a00. Bugbot is set up for automated code reviews on this repo. Configure here.