feat: pom fetcher #4144

Open
ulemons wants to merge 19 commits into main from feat/pom-fetcher

@ulemons ulemons commented May 26, 2026

Summary

Adds a Maven POM fetcher to the packages_worker service that syncs Maven Central package metadata into the packages DB. It pulls candidates from packages_universe, extracts metadata from POM files (with parent-chain resolution), and populates package, version, maintainer, and repository data. This brings the Maven ecosystem to parity with the existing npm pipeline so critical Maven packages get high-quality, enriched metadata for downstream analytics.

Changes

  • Two-tier fetch strategy — non-critical packages are DB-only (copy universe stats, no HTTP, ~1000 pkg/sec); critical packages get full POM extraction with parent-chain resolution (max 8 hops) for description, homepage, SCM/repo, licenses, maintainers, and the full version list.
  • Two entry points — bin/packages-worker.ts registers the maven-critical Temporal schedule for incremental syncing (skips POM extraction when the version is unchanged), and bin/maven-backfill.ts (pnpm backfill:maven) does a one-shot, resumable full-extraction backfill. The DB state is the cursor, so re-runs pick up where they left off.
  • Module-level parent POM cache (extract.ts) — coordinate-keyed LRU with request coalescing, caches only successful fetches (never null, to avoid poisoning), no TTL since Maven coordinates are immutable. This is the main lever against Maven Central rate limiting and works because the rank_in_ecosystem ordering clusters sibling artifacts that share parent POMs. Exposes getPomCacheStats() for hit-rate observability.
  • New osspckgs data-access-layer module — query functions for packages, versions, maintainers, and repos (functional, pg-promise via queryExecutor), shared across the worker.
  • Delta API support (deltaApi.ts) for incremental upstream change detection, plus benchmark and data-quality validation scripts.
  • Adds unit tests for the pure normalization functions.
  • Maven-specific config in config.ts (POM_FETCHER_REFRESH_DAYS, POM_CACHE_MAX_ENTRIES, etc.) and .env.dist.local entries.
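
The parent POM cache described in the list above (coordinate-keyed LRU, request coalescing, successes only, no TTL) could look roughly like the sketch below. This is not the code in extract.ts: the class name `CoalescingLruCache`, the injected `fetcher`, and the shape returned by `stats()` are all assumptions made for illustration.

```typescript
// Hypothetical sketch of a coordinate-keyed LRU with request coalescing.
// Names and shapes are illustrative; they are not taken from extract.ts.
type Fetcher<V> = (key: string) => Promise<V | null>;

class CoalescingLruCache<V> {
  // A Map iterates in insertion order, so it doubles as the recency list.
  private readonly entries = new Map<string, V>();
  // Requests currently in progress, keyed by coordinate.
  private readonly inFlight = new Map<string, Promise<V | null>>();
  private readonly maxEntries: number;
  private readonly fetcher: Fetcher<V>;
  private hits = 0;
  private misses = 0;

  constructor(maxEntries: number, fetcher: Fetcher<V>) {
    this.maxEntries = maxEntries;
    this.fetcher = fetcher;
  }

  get(key: string): Promise<V | null> {
    const cached = this.entries.get(key);
    if (cached !== undefined) {
      this.hits++;
      // Re-insert so the key becomes the most recently used one.
      this.entries.delete(key);
      this.entries.set(key, cached);
      return Promise.resolve(cached);
    }

    // Coalescing: concurrent callers for the same key share one request.
    const pending = this.inFlight.get(key);
    if (pending) return pending;

    this.misses++;
    const request = this.fetcher(key)
      .then((value) => {
        // Only successful fetches are stored; null is never cached.
        if (value !== null) this.store(key, value);
        return value;
      })
      .finally(() => this.inFlight.delete(key));
    this.inFlight.set(key, request);
    return request;
  }

  private store(key: string, value: V): void {
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  stats(): { hits: number; misses: number; size: number } {
    return { hits: this.hits, misses: this.misses, size: this.entries.size };
  }
}
```

Because a failed lookup is never stored, a transient error on a shared parent POM is retried on the next request instead of being served as a miss to every sibling artifact.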

Type of change

  • Bug fix
  • New feature
  • Refactor / cleanup
  • Performance improvement
  • Chore / dependency update
  • Documentation

Note

Medium Risk
Large new external-ingestion path with high write volume to packages-db and aggressive default batch/concurrency in sample env; mitigated by idempotent upserts and rate-limit handling, but operational misconfiguration could throttle Maven Central or overload the DB.
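
For context on the rate-limit handling mentioned in this note (described elsewhere in this PR as 403/429 backoff), a minimal sketch of the idea follows. The function name `fetchWithBackoff`, the option names, and the exponential delay policy are assumptions; the PR text does not state the actual delays or retry limits.

```typescript
// Hypothetical sketch: retry a request with exponential backoff when the
// upstream answers 403 or 429. The real policy in the PR may differ.
interface HttpResult {
  status: number;
  body: string | null;
}

interface BackoffOptions {
  maxAttempts: number; // total attempts, including the first one
  baseDelayMs: number; // delay before the first retry
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const RATE_LIMIT_STATUSES = new Set([403, 429]);

async function fetchWithBackoff(
  request: () => Promise<HttpResult>,
  options: BackoffOptions,
): Promise<HttpResult> {
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  let result = await request();
  for (
    let attempt = 1;
    attempt < options.maxAttempts && RATE_LIMIT_STATUSES.has(result.status);
    attempt++
  ) {
    // Delay doubles on each retry: base, 2x base, 4x base, ...
    await sleep(options.baseDelayMs * 2 ** (attempt - 1));
    result = await request();
  }
  // Either a non-rate-limited response, or the last rate-limited one.
  return result;
}
```

The caller still has to decide what to do when the final result is a 403 or 429, for example leaving the package for the next tick.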

Overview
Adds a Maven POM enrichment pipeline in packages_worker that syncs critical Maven packages from Central into packages-db (packages, versions, maintainers, repos), with a DB-only path for non-critical packages that is implemented but not scheduled (non-critical Temporal registration is commented out).

Runtime: packages-worker now registers a maven-critical Temporal schedule (cron */1 in code) that runs one batch per tick with incremental behavior (skip full POM work when latest_version is unchanged). maven-backfill drains the critical queue in a resumable one-shot with full extraction. A separate maven-worker entrypoint and compose service run only the Maven schedule for local isolation. New required env: POM_FETCHER_* and optional POM_FETCHER_MAVEN_BASE_URL (defaults to Central; sample env points at the GCS mirror).
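
The difference between the scheduled incremental run and the backfill, as described above, comes down to one decision per package. A sketch of that decision, with hypothetical names (`needsPomExtraction`, `SyncMode`, `StoredPackage`) that are not taken from the PR:

```typescript
// Hypothetical sketch of the per-package decision described above.
type SyncMode = "incremental" | "backfill";

interface StoredPackage {
  latestVersion: string | null; // value of latest_version currently in the DB
}

function needsPomExtraction(
  mode: SyncMode,
  stored: StoredPackage | undefined,
  upstreamLatest: string,
): boolean {
  // The backfill always performs full extraction.
  if (mode === "backfill") return true;
  // A package never synced before has nothing to compare against.
  if (!stored || stored.latestVersion === null) return true;
  // Incremental runs skip POM work when the latest version is unchanged.
  return stored.latestVersion !== upstreamLatest;
}
```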

Implementation highlights: HTTP fetch of maven-metadata.xml and POMs via axios / fast-xml-parser, parent POM resolution (up to 8 hops) with an in-process LRU cache and request coalescing, namespace-ordered batches for cache locality, 403/429 backoff, and transactional upserts with deadlock retry. @crowd/data-access-layer gains osspckgs helpers (listMavenPackagesToSync, upsertPackage, version/maintainer/repo upserts, audit). Maintainer emails are stored as SHA-256 hashes. Unit tests cover prerelease detection, repo URL parsing, and SCM normalization.
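
To illustrate the bounded parent POM resolution mentioned above, here is a simplified sketch. It assumes a "nearest defined value wins" rule for every field, which is looser than real Maven inheritance, and all names (`resolveWithParents`, `Pom`, `fetchPom`) are hypothetical. Only the 8-hop limit comes from the PR description.

```typescript
// Hypothetical sketch: walk the parent chain, at most 8 hops, and fill in
// fields the child POM leaves undefined. Field names are illustrative.
interface Coordinate {
  groupId: string;
  artifactId: string;
  version: string;
}

interface Pom {
  description?: string;
  homepage?: string;
  scmUrl?: string;
  licenses?: string[];
  parent?: Coordinate;
}

const MAX_PARENT_HOPS = 8;

async function resolveWithParents(
  leaf: Pom,
  fetchPom: (coordinate: Coordinate) => Promise<Pom | null>,
): Promise<Omit<Pom, "parent">> {
  const { parent: firstParent, ...resolved } = leaf;
  let next: Coordinate | undefined = firstParent;
  const seen = new Set<string>(); // guards against cyclic parent references

  for (let hop = 0; hop < MAX_PARENT_HOPS && next; hop++) {
    const key = `${next.groupId}:${next.artifactId}:${next.version}`;
    if (seen.has(key)) break;
    seen.add(key);

    const parentPom = await fetchPom(next);
    if (!parentPom) break; // missing parent: keep what we have so far

    // The value closest to the leaf wins; parents only fill gaps.
    resolved.description = resolved.description ?? parentPom.description;
    resolved.homepage = resolved.homepage ?? parentPom.homepage;
    resolved.scmUrl = resolved.scmUrl ?? parentPom.scmUrl;
    resolved.licenses = resolved.licenses ?? parentPom.licenses;

    next = parentPom.parent;
  }
  return resolved;
}
```

In the pipeline described here, `fetchPom` would presumably be backed by the in-process parent POM cache, so sibling artifacts that share a parent trigger only one HTTP request.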

Ops / misc: pnpm-lock.yaml picks up axios and fast-xml-parser; local .env.dist.local fills placeholder GCS creds and POM fetcher tuning; packages-worker build metadata references maven-worker.

Reviewed by Cursor Bugbot for commit 02a4d95. Bugbot is set up for automated code reviews on this repo.

Copilot AI review requested due to automatic review settings May 26, 2026 15:59

CLAassistant commented May 26, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ mbani01
❌ ulemons

Comment thread services/apps/packages_worker/package.json Fixed
Comment thread services/apps/packages_worker/package.json Fixed
Comment thread services/apps/packages_worker/package.json Fixed
Comment thread services/apps/packages_worker/package.json Fixed

@github-actions github-actions Bot left a comment

Conventional Commits FTW!

@ulemons ulemons changed the base branch from main to feat/track-packages May 26, 2026 16:00

Copilot AI left a comment

Pull request overview

Introduces a new Maven “POM fetcher” worker loop to enrich packages data in the osspckgs database by fetching Maven Central metadata/POMs, and adds corresponding data-access-layer helpers for selecting candidates and upserting packages/maintainers.

Changes:

  • Added @crowd/data-access-layer osspckgs module with queries for Maven enrichment candidates and upserts into packages, maintainers, and package_maintainers.
  • Added a pom-fetcher worker (config + entrypoint + enrichment loop) that resolves latest Maven versions and extracts POM metadata (licenses, SCM, developers/contributors).
  • Wired up scripts/deps for running the new worker (package.json scripts, docker-compose service yaml, lockfile updates).
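
For readers unfamiliar with how a Maven coordinate maps to the files fetched here, the sketch below builds the two URLs using the standard Maven repository layout (dots in the groupId become path separators). The helper names are hypothetical, and the default base URL is the public Maven Central endpoint, assumed here to correspond to the "defaults to Central" setting in the PR description.

```typescript
// Hypothetical helpers: build maven-metadata.xml and POM URLs from a
// coordinate, following the standard Maven repository layout.
const DEFAULT_BASE_URL = "https://repo1.maven.org/maven2";

function groupPath(groupId: string): string {
  // org.apache.commons -> org/apache/commons
  return groupId.split(".").join("/");
}

function trimTrailingSlash(url: string): string {
  return url.endsWith("/") ? url.slice(0, -1) : url;
}

function metadataUrl(
  groupId: string,
  artifactId: string,
  baseUrl: string = DEFAULT_BASE_URL,
): string {
  return `${trimTrailingSlash(baseUrl)}/${groupPath(groupId)}/${artifactId}/maven-metadata.xml`;
}

function pomUrl(
  groupId: string,
  artifactId: string,
  version: string,
  baseUrl: string = DEFAULT_BASE_URL,
): string {
  return `${trimTrailingSlash(baseUrl)}/${groupPath(groupId)}/${artifactId}/${version}/${artifactId}-${version}.pom`;
}
```

Passing a different `baseUrl` is how a mirror would be targeted instead of Central.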

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| services/libs/data-access-layer/src/osspckgs/types.ts | Adds DB-facing types for osspckgs package/maintainer upserts and universe rows. |
| services/libs/data-access-layer/src/osspckgs/packages.ts | Adds query to list Maven universe packages needing enrichment + upsert into packages. |
| services/libs/data-access-layer/src/osspckgs/maintainers.ts | Adds upserts for maintainers and package_maintainers. |
| services/libs/data-access-layer/src/osspckgs/index.ts | Re-exports osspckgs DAL surface. |
| services/libs/data-access-layer/src/index.ts | Exposes osspckgs DAL from the package root. |
| services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts | Implements batch/concurrent enrichment loop and persistence of extracted metadata. |
| services/apps/packages_worker/src/pom-fetcher/metadata.ts | Resolves latest version via maven-metadata.xml. |
| services/apps/packages_worker/src/pom-fetcher/extract.ts | Fetches POMs and extracts fields with limited parent inheritance traversal. |
| services/apps/packages_worker/src/config.ts | Adds pom-fetcher config loader. |
| services/apps/packages_worker/src/bin/pom-fetcher.ts | Adds runnable entrypoint with shutdown handling. |
| services/apps/packages_worker/package.json | Adds scripts and deps (axios, fast-xml-parser) for pom-fetcher. |
| scripts/services/pom-fetcher.yaml | Adds docker-compose service definition for pom-fetcher. |
| pnpm-lock.yaml | Updates lockfile for new deps (but includes an unexpected workspace importer). |
Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/libs/data-access-layer/src/osspckgs/types.ts
Comment thread services/libs/data-access-layer/src/osspckgs/types.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/pom-fetcher/runPomEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/config.ts Outdated
Comment thread pnpm-lock.yaml Outdated
Base automatically changed from feat/track-packages to main May 26, 2026 17:44
@ulemons ulemons changed the title from "Feat/pom fetcher" to "feat: pom fetcher" May 27, 2026
@ulemons ulemons self-assigned this May 27, 2026
Copilot AI review requested due to automatic review settings June 2, 2026 13:49
@ulemons ulemons force-pushed the feat/pom-fetcher branch from b0812f9 to d907907 Compare June 2, 2026 13:49

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 24 changed files in this pull request and generated 16 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts
Comment thread services/libs/data-access-layer/src/osspckgs/versions.ts Outdated
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Copilot AI review requested due to automatic review settings June 3, 2026 19:36
@ulemons ulemons force-pushed the feat/pom-fetcher branch from 1fef57d to 27c4836 Compare June 3, 2026 19:40

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 28 changed files in this pull request and generated 9 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts Outdated
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/schedule.ts
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread backend/.env.dist.local Outdated
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Copilot AI review requested due to automatic review settings June 3, 2026 20:21

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 15 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/libs/data-access-layer/src/osspckgs/versions.ts Outdated
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/schedule.ts
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts Outdated
Comment thread services/apps/packages_worker/package.json
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts Outdated
Copilot AI review requested due to automatic review settings June 3, 2026 20:40
@ulemons ulemons marked this pull request as ready for review June 3, 2026 20:45

github-actions Bot commented Jun 3, 2026

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

  • feat: add user authentication (CM-123)
  • feat: add user authentication (IN-123)

Projects:

  • CM: Community Data Platform
  • IN: Insights

Please add a Jira issue key to your PR title.

Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts Outdated
Comment thread services/apps/packages_worker/package.json
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread backend/.env.dist.local Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 12 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts Outdated
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread backend/.env.dist.local Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/packages.ts
Comment thread services/libs/data-access-layer/src/osspckgs/repos.ts
mbani01 added 3 commits June 4, 2026 09:14
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
…ved to packages_worker)

Signed-off-by: Mouad BANI <mouad-mb@outlook.com>
ulemons added 13 commits June 4, 2026 09:14
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Copilot AI review requested due to automatic review settings June 4, 2026 07:15
@ulemons ulemons force-pushed the feat/pom-fetcher branch from ec419bc to 52f5515 Compare June 4, 2026 07:15
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/schedule.ts
Comment thread services/apps/packages_worker/src/maven/deltaApi.ts Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 27 changed files in this pull request and generated 10 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts Outdated
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread backend/.env.dist.local Outdated
Comment thread services/libs/data-access-layer/src/osspckgs/versions.ts
Comment thread services/apps/packages_worker/package.json
Comment thread services/apps/packages_worker/src/scripts/validateDataQuality.ts Outdated
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Comment thread services/apps/packages_worker/src/maven/schedule.ts
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
Copilot AI review requested due to automatic review settings June 4, 2026 09:03

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 3c087e1.

Comment thread scripts/builders/packages-worker.env

Copilot AI left a comment

Pull request overview

Copilot reviewed 23 out of 27 changed files in this pull request and generated 12 comments.

Files not reviewed (1)
  • pnpm-lock.yaml: Language not supported

Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/runMavenEnrichmentLoop.ts
Comment thread services/apps/packages_worker/src/maven/metadata.ts
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread scripts/builders/packages-worker.env
Comment thread backend/.env.dist.local
Comment thread services/apps/packages_worker/src/maven/extract.ts
Comment thread services/apps/packages_worker/src/maven/README.md
Comment thread services/apps/packages_worker/src/maven/README.md Outdated
Signed-off-by: Umberto Sgueglia <usgueglia@contractor.linuxfoundation.org>
4 participants