feat: pom fetcher #4144
Pull request overview
Introduces a new Maven “POM fetcher” worker loop that enriches package data in the osspckgs database by fetching Maven Central metadata and POMs, and adds the corresponding data-access-layer helpers for selecting candidates and upserting packages and maintainers.
Changes:
- Added `@crowd/data-access-layer` `osspckgs` module with queries for Maven enrichment candidates and upserts into `packages`, `maintainers`, and `package_maintainers`.
- Added a `pom-fetcher` worker (config + entrypoint + enrichment loop) that resolves latest Maven versions and extracts POM metadata (licenses, SCM, developers/contributors).
- Wired up scripts/deps for running the new worker (package.json scripts, docker-compose service yaml, lockfile updates).
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| services/libs/data-access-layer/src/osspckgs/types.ts | Adds DB-facing types for osspckgs package/maintainer upserts and universe rows. |
| services/libs/data-access-layer/src/osspckgs/packages.ts | Adds query to list Maven universe packages needing enrichment + upsert into packages. |
| services/libs/data-access-layer/src/osspckgs/maintainers.ts | Adds upserts for maintainers and package_maintainers. |
| services/libs/data-access-layer/src/osspckgs/index.ts | Re-exports osspckgs DAL surface. |
| services/libs/data-access-layer/src/index.ts | Exposes osspckgs DAL from the package root. |
| services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts | Implements batch/concurrent enrichment loop and persistence of extracted metadata. |
| services/apps/packages_worker/src/pom-fetcher/metadata.ts | Resolves latest version via maven-metadata.xml. |
| services/apps/packages_worker/src/pom-fetcher/extract.ts | Fetches POMs and extracts fields with limited parent inheritance traversal. |
| services/apps/packages_worker/src/config.ts | Adds pom-fetcher config loader. |
| services/apps/packages_worker/src/bin/pom-fetcher.ts | Adds runnable entrypoint with shutdown handling. |
| services/apps/packages_worker/package.json | Adds scripts and deps (axios, fast-xml-parser) for pom-fetcher. |
| scripts/services/pom-fetcher.yaml | Adds docker-compose service definition for pom-fetcher. |
| pnpm-lock.yaml | Updates lockfile for new deps (but includes an unexpected workspace importer). |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
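For context on the two fetches the worker performs (`maven-metadata.xml` to resolve the latest version, then the POM itself), the sketch below shows how those URLs follow from Maven coordinates under the standard Maven 2 repository layout. The function names are hypothetical and are not taken from this PR; the default base URL is Maven Central's public endpoint.

```typescript
// Hypothetical helpers, not the PR's code: build repository URLs from
// Maven coordinates using the standard Maven 2 repository layout.
const MAVEN_CENTRAL_BASE_URL = "https://repo1.maven.org/maven2";

// groupId dots become path separators: org.apache.commons -> org/apache/commons
function artifactPath(groupId: string, artifactId: string): string {
  return `${groupId.split(".").join("/")}/${artifactId}`;
}

// Strip trailing slashes so a configured base URL can be written either way.
function normalizeBase(baseUrl: string): string {
  return baseUrl.replace(/\/+$/, "");
}

// URL of the artifact-level maven-metadata.xml (lists versions, latest, release).
function metadataUrl(
  groupId: string,
  artifactId: string,
  baseUrl: string = MAVEN_CENTRAL_BASE_URL,
): string {
  return `${normalizeBase(baseUrl)}/${artifactPath(groupId, artifactId)}/maven-metadata.xml`;
}

// URL of the POM for one specific version.
function pomUrl(
  groupId: string,
  artifactId: string,
  version: string,
  baseUrl: string = MAVEN_CENTRAL_BASE_URL,
): string {
  const base = `${normalizeBase(baseUrl)}/${artifactPath(groupId, artifactId)}`;
  return `${base}/${version}/${artifactId}-${version}.pom`;
}
```

A mirror that follows the same layout can be targeted by passing a different `baseUrl`.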
Your PR title doesn't contain a Jira issue key. Please add one for better traceability.
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…ved to packages_worker) Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 3c087e1.

Summary
Adds a Maven POM fetcher to the packages_worker service that syncs Maven Central package metadata into the packages DB. It pulls candidates from packages_universe, extracts metadata from POM files (with parent-chain resolution), and populates package, version, maintainer, and repository data. This brings the Maven ecosystem to parity with the existing npm pipeline so critical Maven packages get high-quality, enriched metadata for downstream analytics.
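The parent-chain resolution mentioned above can be illustrated with a simplified sketch: a child POM that omits a field takes it from the nearest ancestor that declares it, and the walk stops after a fixed number of hops. The types, the `lookup` callback, and the coordinate-key format are assumptions made for illustration. Real Maven inheritance has more rules than this, and the PR's implementation may differ.

```typescript
// Simplified illustration, not the PR's code. Keys are assumed to be
// "groupId:artifactId:version" strings; lookup stands in for fetch + parse.
interface PomFields {
  license?: string;
  scmUrl?: string;
  parent?: string; // coordinate key of the parent POM, if any
}

interface ResolvedFields {
  license?: string;
  scmUrl?: string;
}

function resolveWithParents(
  startKey: string,
  lookup: (key: string) => PomFields | undefined,
  maxHops: number = 8,
): ResolvedFields {
  const resolved: ResolvedFields = {};
  const seen = new Set<string>(); // guards against cyclic parent references
  let key: string | undefined = startKey;
  let hops = 0;

  while (key !== undefined && !seen.has(key)) {
    seen.add(key);
    const pom = lookup(key);
    if (pom === undefined) break;

    // The nearest declaration wins: only fill fields that are still missing.
    resolved.license = resolved.license ?? pom.license;
    resolved.scmUrl = resolved.scmUrl ?? pom.scmUrl;

    const complete = resolved.license !== undefined && resolved.scmUrl !== undefined;
    if (complete || hops >= maxHops) break;

    key = pom.parent;
    hops += 1;
  }
  return resolved;
}
```

Stopping as soon as every field is filled avoids fetching ancestors that cannot change the result.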
Note
Medium Risk
Large new external-ingestion path with high write volume to packages-db and aggressive default batch/concurrency in sample env; mitigated by idempotent upserts and rate-limit handling, but operational misconfiguration could throttle Maven Central or overload the DB.
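As a rough illustration of the rate-limit handling referred to above, the sketch below retries a request with capped exponential backoff when the response status is 403 or 429. The function names, default delays, and attempt count are assumptions chosen for the example, not values from this PR.

```typescript
// Hypothetical sketch, not the PR's code: capped exponential backoff
// for throttling responses (403 / 429).
interface HttpResult<T> {
  status: number;
  body?: T;
}

function isThrottled(status: number): boolean {
  return status === 403 || status === 429;
}

// attempt is zero-based: 1s, 2s, 4s, ... capped at maxMs (defaults are assumed).
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function requestWithBackoff<T>(
  request: () => Promise<HttpResult<T>>,
  maxAttempts: number = 5,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<HttpResult<T>> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await request();
    if (!isThrottled(result.status)) return result;
    // Wait before the next try; skip the wait after the final attempt.
    if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
  }
  throw new Error(`still throttled after ${maxAttempts} attempts`);
}
```

Lower batch size and concurrency reduce how often this path is hit, which is the operational concern the risk note raises.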
Overview
Adds a Maven POM enrichment pipeline in `packages_worker` that syncs critical Maven packages from Central into `packages-db` (packages, versions, maintainers, repos), with a DB-only path for non-critical packages that is implemented but not scheduled (non-critical Temporal registration is commented out).

Runtime: `packages-worker` now registers a `maven-critical` Temporal schedule (cron `*/1` in code) that runs one batch per tick with incremental behavior (skip full POM work when `latest_version` is unchanged). `maven-backfill` drains the critical queue in a resumable one-shot with full extraction. A separate `maven-worker` entrypoint and compose service run only the Maven schedule for local isolation. New required env: `POM_FETCHER_*` and optional `POM_FETCHER_MAVEN_BASE_URL` (defaults to Central; sample env points at the GCS mirror).

Implementation highlights: HTTP fetch of `maven-metadata.xml` and POMs via axios / fast-xml-parser, parent POM resolution (up to 8 hops) with an in-process LRU cache and request coalescing, namespace-ordered batches for cache locality, 403/429 backoff, and transactional upserts with deadlock retry. `@crowd/data-access-layer` gains `osspckgs` helpers (`listMavenPackagesToSync`, `upsertPackage`, version/maintainer/repo upserts, audit). Maintainer emails are stored as SHA-256 hashes. Unit tests cover prerelease detection, repo URL parsing, and SCM normalization.

Ops / misc: `pnpm-lock.yaml` picks up `axios` and `fast-xml-parser`; local `.env.dist.local` fills placeholder GCS creds and POM fetcher tuning; packages-worker build metadata references `maven-worker`.

Reviewed by Cursor Bugbot for commit 02a4d95. Bugbot is set up for automated code reviews on this repo.
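To make the "in-process LRU cache and request coalescing" idea from the overview concrete, here is a minimal sketch: concurrent requests for the same parent POM share one in-flight promise, and the least recently used entry is evicted when the cache is full. The class name and constructor parameters are hypothetical, and the PR's actual cache may be structured differently.

```typescript
// Hypothetical sketch, not the PR's code. Caching the promise (rather than
// the resolved value) is what coalesces concurrent loads of the same key.
class CoalescingLruCache<V> {
  private readonly entries = new Map<string, Promise<V>>();
  private readonly maxEntries: number;
  private readonly load: (key: string) => Promise<V>;

  constructor(maxEntries: number, load: (key: string) => Promise<V>) {
    this.maxEntries = maxEntries;
    this.load = load;
  }

  get(key: string): Promise<V> {
    const existing = this.entries.get(key);
    if (existing !== undefined) {
      // Map preserves insertion order, so re-inserting marks the key as most recent.
      this.entries.delete(key);
      this.entries.set(key, existing);
      return existing;
    }

    const pending = this.load(key);
    this.entries.set(key, pending);

    // Drop failed loads so a later call can retry instead of reusing the rejection.
    pending.catch(() => {
      if (this.entries.get(key) === pending) this.entries.delete(key);
    });

    // Evict the least recently used entry (first key in insertion order).
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    return pending;
  }
}
```

Processing batches in namespace order, as the overview describes, helps a cache like this because artifacts in the same group often share a parent POM.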