Skip to content

feat: dockerhub-sync worker for repo_docker pull counts (CM-1213)#4163

Open
joanreyero wants to merge 3 commits into
mainfrom
feat/CM-1213-dockerhub-sync
Open

feat: dockerhub-sync worker for repo_docker pull counts (CM-1213)#4163
joanreyero wants to merge 3 commits into
mainfrom
feat/CM-1213-dockerhub-sync

Conversation

@joanreyero
Copy link
Copy Markdown
Contributor

@joanreyero joanreyero commented Jun 3, 2026

Summary

Standalone loop worker (sibling of github-repos-enricher) that discovers Docker Hub images for repos in packages-db and tracks their pull counts with daily granularity.

Discovery (Option B-lite): one GitHub GraphQL call per repo checks for Dockerfile at HEAD:Dockerfile, docker/Dockerfile, build/Dockerfile. If present, probes hub.docker.com/v2/repositories/<owner>/<name>/. Hits are upserted to repo_docker; every repo gets repos.docker_checked_at set so the backlog drains.

Refresh: known images with stale last_synced_at are re-fetched daily; lifetime pull_count is written to repo_docker.pulls and snapshotted into repo_docker_pulls_daily (per-day deltas via LAG() at query time — Hub doesn't expose daily counts).

Loop: each tick processes one refresh page then one discovery page; idles when both empty. GitHub calls fan out across ENRICHER_GITHUB_TOKENS with per-token parking; Hub calls are sequential with a single per-IP park (Hub rate limit is per-IP, ~180/window).

Schema (V1779710880 edited in place — pre-prod)

  • repos.docker_checked_at timestamptz + partial index repos_docker_pending_idx (WHERE host='github' AND docker_checked_at IS NULL)
  • repo_docker_pulls_daily(image_name, date, pulls_total) partitioned by date (register with pg_partman alongside downloads_daily)
  • repo_docker_stale_idx on repo_docker(last_synced_at)

Files

  • src/dockerhub/{index,types,candidates,detectDockerfile,fetchDockerhub,upsertRepoDocker}.ts + 15 vitest cases
  • src/bin/dockerhub-sync.ts, src/config.ts (getDockerhubConfig)
  • scripts/services/dockerhub-sync.yaml, package.json scripts (port 9235)

Validation against prod data

Ran against a random 1000-repo sample from public.repositories (prod):

Outcome n %
Hit — <owner>/<name> on Hub 26 2.6%
Dockerfile present, no <owner>/<name> on Hub 102 10.2%
No Dockerfile at probed paths 869 86.9%
GitHub 404 3 0.3%

7.5 min / 1 token / 0 errors / 0 rate-limits. Top finds: ollama/ollama (140M pulls), hashicorp/packer (47M), semgrep/semgrep (32M).

Follow-ups (scoped, not in this PR)

A CI-workflow-parsing census on the same 1000 repos showed:

  • 66 repos (6.6%) have GHA that publishes a container
  • Registry split: ghcr.io 41 · docker.io 35 · quay.io 8 — ghcr is the dominant target for CDP's LF/CNCF-heavy population
  • CI parsing recovers +17 Hub images v1 misses (org/name differs: scaleway/cli, qmcgaw/gluetun, nervos/ckb, paketobuildpacks/*, …)
  • 31 repos publish only to non-Hub registries

Ranked by ROI:

  1. ghcr.io/quay.io probes — ghcr namespaces == GitHub orgs by design, so the existing heuristic works as-is; needs a registry column on repo_docker. ~2× total coverage.
  2. CI-workflow parsing — replace Dockerfile gate with .github/workflows extraction. +17 Hub images/1000.
  3. library/ official-image allowlist; broader Dockerfile path probing.

Reviewer notes

  • backend/.env.dist.{local,composed} need the DOCKERHUB_* block appended (couldn't write .env* from the dev session — see commit message for values).
  • pnpm format in packages_worker is currently broken (strips TS generics); format-check fails on pre-existing files too. Not CI-gated for this workspace. Separate fix needed.
  • package.json change is +3 script entries only.

🤖 Generated with Claude Code


Note

Medium Risk
Touches packages-db schema (migration edit), long-running external API sync with rate limits, and depends on pg_partman partitions for repo_docker_pulls_daily inserts—misconfig or missing partitions could stall snapshots until ops fix partman.

Overview
Adds a dockerhub-sync loop worker (alongside github-repos-enricher) that links GitHub repos in packages-db to Docker Hub images and keeps pull metrics fresh.

Discovery: For GitHub repos with no recent docker_checked_at, one GraphQL call checks for a Dockerfile at common paths; if found, it probes Hub at owner/name (validated/lowercased slugs only—no library/ heuristic). Hits upsert repo_docker and set repos.docker_checked_at so the backlog drains.

Refresh: Stale repo_docker rows are re-fetched on a daily interval; lifetime pull_count updates repo_docker and a daily snapshot in new repo_docker_pulls_daily (partitioned; deltas via LAG() at query time). Hub calls are serialized with per-IP rate-limit parking; GitHub discovery fans out across ENRICHER_GITHUB_TOKENS.

Schema/env/ops: Extends the pre-prod initial migration with docker_checked_at, partial index for pending discovery, repo_docker_stale_idx, and repo_docker_pulls_daily; adds DOCKERHUB_* env, DAL in repoDocker.ts, compose service dockerhub-sync.yaml, and package scripts/bin entry.

Reviewed by Cursor Bugbot for commit 09d2a00. Bugbot is set up for automated code reviews on this repo. Configure here.

Copilot AI review requested due to automatic review settings June 3, 2026 13:40
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Comment thread services/apps/packages_worker/src/dockerhub/index.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new long-running dockerhub-sync worker under services/apps/packages_worker to (1) discover Docker Hub images for GitHub repos and (2) refresh/snapshot Docker Hub pull counts daily, backed by new repos.docker_checked_at and a partitioned repo_docker_pulls_daily table.

Changes:

  • Introduces Docker Hub discovery + refresh loop with per-token GitHub parking and per-IP Hub parking.
  • Adds Docker Hub fetch + Dockerfile-detection utilities, persistence helpers, and initial unit tests.
  • Extends packages-db schema with docker_checked_at, backlog/staleness indexes, and repo_docker_pulls_daily (range-partitioned).

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
services/apps/packages_worker/src/dockerhub/index.ts Core discovery/refresh loop, rate-limit parking, page processing
services/apps/packages_worker/src/dockerhub/fetchDockerhub.ts Docker Hub API client + error classification
services/apps/packages_worker/src/dockerhub/detectDockerfile.ts GitHub GraphQL probe for Dockerfile presence
services/apps/packages_worker/src/dockerhub/upsertRepoDocker.ts Upserts into repo_docker and daily snapshot table
services/apps/packages_worker/src/dockerhub/types.ts Shared types + FetchError
services/apps/packages_worker/src/dockerhub/candidates.ts Candidate image-name generation + validation
services/apps/packages_worker/src/dockerhub/tests/fetchDockerhub.test.ts Unit tests for Hub fetch behavior
services/apps/packages_worker/src/dockerhub/tests/candidates.test.ts Unit tests for candidate generation
services/apps/packages_worker/src/bin/dockerhub-sync.ts Worker entrypoint and shutdown wiring
services/apps/packages_worker/src/config.ts Adds getDockerhubConfig() env parsing
services/apps/packages_worker/package.json Adds start/dev scripts for dockerhub-sync (protected file)
scripts/services/dockerhub-sync.yaml Compose service for running dockerhub-sync
backend/src/osspckgs/migrations/V1779710880__initial_schema.sql Schema: docker_checked_at, indexes, repo_docker_pulls_daily table

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +10 to +30
await qx.result(
`
INSERT INTO repo_docker (repo_id, image_name, pulls, stars, last_synced_at)
VALUES ($(repoId), $(imageName), $(pulls), $(stars), NOW())
ON CONFLICT (image_name) DO UPDATE SET
repo_id = COALESCE(repo_docker.repo_id, EXCLUDED.repo_id),
pulls = EXCLUDED.pulls,
stars = EXCLUDED.stars,
last_synced_at = NOW()
`,
{ repoId, imageName: r.imageName, pulls: r.pulls, stars: r.stars },
)

await qx.result(
`
INSERT INTO repo_docker_pulls_daily (image_name, date, pulls_total)
VALUES ($(imageName), CURRENT_DATE, $(pulls))
ON CONFLICT (image_name, date) DO UPDATE SET pulls_total = EXCLUDED.pulls_total
`,
{ imageName: r.imageName, pulls: r.pulls },
)
Comment on lines +60 to +62
// eslint-disable-next-line @typescript-eslint/no-explicit-any
const json = (await response.json()) as any

Comment on lines +29 to +35
const remaining = parseInt(response.headers.get('x-ratelimit-remaining') ?? '', 10)
const resetSec = parseInt(response.headers.get('x-ratelimit-reset') ?? '0', 10)
const resetMs = resetSec ? resetSec * 1000 + 5_000 : Date.now() + 65_000

if (response.status === 429 || (Number.isFinite(remaining) && remaining <= 0)) {
throw new FetchError('RATE_LIMIT', `Rate limited on ${imageName}`, resetMs)
}
Comment on lines +71 to +74
it('classifies x-ratelimit-remaining: 0 as RATE_LIMIT even on 200', async () => {
mockFetch(200, { pull_count: 1 }, { 'x-ratelimit-remaining': '0' })
await expect(fetchDockerhub(BASE, 'a/b')).rejects.toMatchObject({ kind: 'RATE_LIMIT' })
})
Comment on lines +37 to +44
if (response.status === 404) throw new FetchError('NOT_FOUND', `404 for ${imageName}`)
if (response.status >= 500) {
throw new FetchError('TRANSIENT', `${response.status} for ${imageName}`)
}
if (!response.ok) {
// 400/401/403 etc — treat as a miss; Hub sometimes 400s on malformed slugs.
throw new FetchError('NOT_FOUND', `${response.status} for ${imageName}`)
}
Comment on lines +37 to +47
if (!(err instanceof FetchError)) throw err

if (err.kind === 'NOT_FOUND') return null
if (err.kind === 'MALFORMED') {
log.warn({ imageName }, err.message)
return null
}
if (err.kind === 'RATE_LIMIT') {
hubParkedUntil = err.resetAt ?? Date.now() + 60_000
continue
}
Comment on lines +232 to +252
try {
const outcome = await discoverRepo(qx, row, token, config)
if (outcome === 'hit') hits++
else if (outcome === 'miss') misses++
else skipped++
} catch (err) {
if (err instanceof FetchError && err.kind === 'RATE_LIMIT') {
const resetAt = err.resetAt ?? Date.now() + 60_000
const waitMs = Math.max(1_000, resetAt - Date.now())
parkedUntil.set(token, resetAt)
log.warn(
{ tokenIdx, parkedUntil: new Date(resetAt).toISOString() },
`token#${tokenIdx} rate limited — parking for ${Math.round(waitMs / 1000)}s`,
)
await new Promise((r) => setTimeout(r, waitMs))
failed++
} else {
log.error({ url: row.url, err }, 'Unexpected discovery error')
failed++
}
}
Comment on lines +92 to +105
async function fetchRefreshPage(
qx: QueryExecutor,
cursor: string | null,
config: DockerhubConfig,
): Promise<RefreshImageRow[]> {
return qx.select(
`
SELECT id, repo_id, image_name
FROM repo_docker
WHERE last_synced_at < NOW() - make_interval(hours => $(refreshIntervalHours))
AND ($(cursor)::bigint IS NULL OR id > $(cursor)::bigint)
ORDER BY id
LIMIT $(batchSize)
`,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional — packages_worker keeps DB access as inline pg-promise SQL by convention (see .claude/skills/packages-worker-add-entrypoint: "do not add files to services/libs/data-access-layer"). The sibling enricher/ and osv/ workers follow the same pattern; packages-db is a separate database from the main CDP DAL.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retracting the above — themarolt established the opposite on #4149 and OSV queries already live in data-access-layer/src/packages/osv.ts. Moved in 09d2a00: all repo_docker / repo_docker_pulls_daily / repos.docker_checked_at queries now in services/libs/data-access-layer/src/packages/repoDocker.ts; the worker keeps HTTP + loop orchestration only.

Comment on lines +577 to +583
CREATE TABLE repo_docker_pulls_daily (
image_name text NOT NULL,
date date NOT NULL,
pulls_total bigint NOT NULL,
PRIMARY KEY (image_name, date)
)
PARTITION BY RANGE (date);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matches downloads_daily ~30 lines above in the same migration — partition creation is deferred to the pg_partman setup step documented in the comment block (partman.create_parent('public.repo_docker_pulls_daily', 'date', '1 month', 3)). Adding a DEFAULT partition would defeat the point of monthly partitioning. Noted in the PR description under reviewer notes.

Comment on lines +208 to +218
async function processDiscoveryPage(
qx: QueryExecutor,
rows: DiscoveryRepoRow[],
parkedUntil: Map<string, number>,
config: DockerhubConfig,
): Promise<{ hits: number; misses: number; skipped: number; failed: number }> {
let hits = 0
let misses = 0
let skipped = 0
let failed = 0
let nextIdx = 0
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged — deferring loop-level mock tests to a follow-up. The leaf functions (candidates, fetchDockerhub error classification) are unit-tested, and the loop was validated end-to-end against 65 curated + 1000 random prod repos (results in PR description). The retry-after-park path is now exercised by the fix in d200d43.

Standalone loop worker (modeled on github-repos-enricher) that:
- discovers Docker images for GitHub repos via Dockerfile-gated <owner>/<name>
  probing on hub.docker.com/v2
- refreshes pull/star counts daily into repo_docker
- snapshots lifetime pull_count into repo_docker_pulls_daily for delta-at-query-time
  daily granularity

Schema (V1779710880 edited in place, pre-prod):
- repos.docker_checked_at + partial index for discovery backlog
- repo_docker_pulls_daily partitioned by date (pg_partman, mirrors downloads_daily)
- repo_docker_stale_idx on last_synced_at

Tested against a 1000-repo random sample from prod public.repositories:
2.6% hit rate on Hub; 87% of repos have no Dockerfile; ghcr.io is the dominant
registry for the remainder. CI-workflow parsing and ghcr/quay probes scoped as
follow-ups.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
@joanreyero joanreyero force-pushed the feat/CM-1213-dockerhub-sync branch from 6fb6051 to b23c298 Compare June 3, 2026 15:42
- Retry the same row after a GitHub rate-limit park instead of abandoning it
  (cursor would otherwise advance past unprobed repos until end-of-sweep).
- Serialize Docker Hub calls via a promise chain so the per-token GitHub fan-out
  cannot fire concurrent requests against the per-IP Hub budget.
- 401/403 from Hub now classified AUTH and propagated, so a misconfigured base
  URL fails fast instead of silently marking every image gone.
- Stop discarding valid 200 responses when x-ratelimit-remaining=0.
- Wrap repo_docker + repo_docker_pulls_daily writes in a transaction.
- Classify non-JSON GitHub GraphQL bodies as MALFORMED.

Not addressed (replied on PR):
- Inline SQL stays per packages_worker convention (matches enricher/osv).
- repo_docker_pulls_daily partition setup deferred to pg_partman, same as
  downloads_daily in the same migration.
- Loop-level retry/parking tests deferred; validated against 1065 real repos.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Copilot AI review requested due to automatic review settings June 3, 2026 16:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 3 comments.

Comment on lines +60 to +69
// eslint-disable-next-line @typescript-eslint/no-explicit-any
let json: any
try {
json = await response.json()
} catch (err) {
throw new FetchError(
'MALFORMED',
`Non-JSON body for ${owner}/${name}: ${(err as Error).message}`,
)
}
Comment on lines +23 to +38
let response: Response
try {
response = await fetch(GRAPHQL_URL, {
method: 'POST',
headers: {
Authorization: `bearer ${token}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({ query: DOCKERFILE_QUERY, variables: { owner, name } }),
})
} catch (err) {
throw new FetchError(
'TRANSIENT',
`Network error for ${owner}/${name}: ${(err as Error).message}`,
)
}
Comment on lines +17 to +27
const url = `${baseUrl}/repositories/${imageName}/`

let response: Response
try {
response = await fetch(url, {
method: 'GET',
headers: { Accept: 'application/json' },
})
} catch (err) {
throw new FetchError('TRANSIENT', `Network error for ${imageName}: ${(err as Error).message}`)
}
Per themarolt's review on #4149, packages-db queries belong in
services/libs/data-access-layer/src/packages/ alongside osv.ts. The worker now
imports fetchStaleRepoDocker, fetchPendingDockerRepos, upsertRepoDockerRow,
upsertRepoDockerDailySnapshot, touchRepoDocker, markRepoDockerChecked from
@crowd/data-access-layer; dockerhub/upsertRepoDocker.ts is reduced to the tx
orchestrator. Query strings unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Joan Reyero <joan@reyero.io>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 09d2a00. Configure here.

if (err.kind === 'RATE_LIMIT') {
hubParkedUntil = err.resetAt ?? Date.now() + 60_000
continue
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hub rate limits exhaust retries

High Severity

In hubFetchInner, each Docker Hub RATE_LIMIT uses continue on the same for loop that caps attempts at MAX_RETRIES. After a few 429s in one fetch, the loop exits and returns null, so discovery can mark a repo as checked with no Hub hit, and refresh can touch a live image as gone.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 09d2a00. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants