Skip to content

fix(os-index): replace os:: prefix with .os suffix as DB uniqueness artifact#35863

Merged
fabrizzio-dotCMS merged 45 commits into
mainfrom
fix/issue-35820-os-index-naming-suffix
Jun 9, 2026
Merged

fix(os-index): replace os:: prefix with .os suffix as DB uniqueness artifact#35863
fabrizzio-dotCMS merged 45 commits into
mainfrom
fix/issue-35820-os-index-naming-suffix

Conversation

@fabrizzio-dotCMS

@fabrizzio-dotCMS fabrizzio-dotCMS commented May 28, 2026

Copy link
Copy Markdown
Member

Summary

Closes #35820

Why this change is safe

This code is still behind the OpenSearch feature flag and has never been rolled out to production. The OS index layer is entirely in active development, which means there is room for this kind of intentional renaming experiment to get the naming strategy right before any production data is involved.

.os suffix — why it matters for correctness

OpenSearch indices are registered under the same logical names as their ES counterparts (e.g. `working_20260528`). When APIs return a `Map<String, ...>` or a `Set` keyed by index name, ES and OS entries would silently collide — the OS entry would overwrite the ES one and data would be lost. Appending a `.os` suffix to every OS index name guarantees that both providers can coexist in the same map/set without any information loss.

Collision-free guarantee: logical index names always end in `_YYYYMMDDHHMMSS` (numeric timestamp) — they can never naturally end in `.os`.

Changes:

  • IndexTag.OS now appends .os (suffix) instead of prepending os:: (prefix)
  • isTagged() checks endsWith(".os"); untag() strips the last 3 chars
  • IndexTag gains a suffix field alongside the existing prefix field — all existing callers unchanged
  • VersionedIndicesAPIImpl comments updated to reflect the new storage format
  • OPENSEARCH_MIGRATION.md — "tag ownership rule" section rewritten with the suffix rationale

No DB migration task: the OS index layer has never shipped to production, so there are no os::name rows to convert. Any dev/QA instance that ran an earlier build of this feature must drop its stale indicies OS rows (or let bootstrap recreate the OS indices) — there is no automatic conversion.

Improved bootstrap sync strategy

initOSCatchup() in ContentletIndexAPIImpl now follows a three-case priority order when OS indices are absent from the cluster:

  1. OS DB record exists, cluster index missing — recreate the OS index using the name already registered in VersionedIndicesAPI. This is the authoritative source in Phase 3.
  2. OS DB empty, ES DB has a record (migration catch-up, Phases 1/2) — mirror the ES logical name so both indices share the same base name. This prevents timestamp divergence — e.g. the ES index working_20240101 produces working_20240101.os, not a freshly-stamped working_20260528.os.
  3. Both empty (fresh install or data-loss) — generate a new timestamp and emit a warning for the operator.

Tests — OSIndexAPIImplIntegrationTest

Two stub tests that only asserted "returns non-null" were upgraded to real integration assertions:

  • test_getClusterHealth_shouldReturnHealthForCreatedIndex — creates a live index, then verifies the health map contains an entry for it with a non-null/non-empty status and positive shard count.
  • test_getIndicesStats_shouldReturnStatsForCreatedIndices — creates live and working indices, then verifies both appear in the stats map with non-negative document count, non-negative raw size, and a non-empty human-readable size string.

Three new phase-routing tests were also added to ContentletIndexAPIImplMigrationIT to verify that getIndicesStats and getClusterHealth correctly merge ES + OS results in Phase 1 and return ES-only results in Phase 0.

Test plan

  • Verify IndexTag.OS.tag("cluster_abc.working_20260528") returns "cluster_abc.working_20260528.os"
  • Verify IndexTag.strip("cluster_abc.working_20260528.os") returns "cluster_abc.working_20260528"
  • Verify IndexTag.resolve("cluster_abc.working_20260528.os") == IndexTag.OS
  • Verify IndexTag.resolve("cluster_abc.working_20260528") == IndexTag.ES (untagged default)
  • On a fresh instance: confirm Phase 1 bootstrap stores name.os in DB and portlet shows correct indices
  • Confirm getIndicesStats / getClusterHealth maps in Phase 1 contain both ES and OS entries (no collision/overwrite)
  • No regression on ES portlet behavior
  • OSIndexAPIImplIntegrationTesttest_getClusterHealth_shouldReturnHealthForCreatedIndex and test_getIndicesStats_shouldReturnStatsForCreatedIndices pass against a live OS cluster

🤖 Generated with Claude Code

…rtifact

Closes #35820

The os:: prefix strategy was causing collisions and readability issues
in the shared `indicies` table. Switches to a .os suffix approach:

- IndexTag.OS now appends ".os" (suffix) instead of prepending "os::" (prefix)
- isTagged() checks endsWith(".os"); untag() strips last 3 chars
- IndexTag gains a `suffix` field alongside the existing `prefix` field
- VersionedIndicesAPIImpl comments updated to reflect the new format
- OPENSEARCH_MIGRATION.md tag ownership rule section rewritten
- Task260528MigrateOsIndiciesSuffix: one-time DB migration converts
  existing os:: rows to the .os suffix format on startup

Collision-free guarantee: logical names always end in _YYYYMMDDHHMMSS
(numeric), so they can never naturally end in .os.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Task260528MigrateOsIndiciesSuffix is not needed — the os:: prefix
was never rolled out to production, so there are no rows to migrate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atorUtil

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ootstrap

- ContentletIndexOperationsOS.toPhysicalName() now appends .os suffix so the
  physical OS index name matches the DB-stored name end-to-end.
- removeContentFromIndexByContentType() resolves physical name before deleteByQuery.
- MappingOperationsOS adds physicalName() helper (cluster prefix + .os suffix) and
  uses it in putMapping / getMapping / getFieldMappingAsMap so all REST calls hit
  the correct index name.
- ContentletIndexAPIImpl.indexReadyOS() passes the physical OS name to indexExists()
  so it actually checks the OS cluster (was checking bare logical name, always false).
- ContentletIndexAPIImpl.initIndex() detects the migration-catchup case (ES ready,
  OS missing) and delegates to initOSCatchup() instead of generating a fresh timestamp,
  eliminating name divergence between ES and OS shadows.
- initOSCatchup() reads current ES working/live names, strips the cluster prefix, and
  creates OS shadows with the same base name + .os, ensuring one-to-one association.
- getIndexDocumentCount(String, IndexTag) now uses the provider's toPhysicalName() so
  OS doc-count queries resolve to the correct physical index name.

Closes #35820

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…untagged

ES("es::", "") implied that ES index names carried an "es::" prefix, which was
never true — ES names are plain logical strings with no vendor marker. Changing
to ES("", "") makes the constant a pure routing discriminator: tag() is a no-op,
isTagged() always returns false, and resolve() still falls back to ES for any
untagged name. No callers applied or checked "es::" outside IndexTag itself.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er recovery

When the OS cluster index is missing but the DB record still exists (e.g. container
restart after index deletion in Phase 3), initOSCatchup must recreate the index
using the name already registered in VersionedIndicesAPI — not from the legacy ES
store, which may be stale or cleaned up in Phase 3.

Priority order:
  1. OS DB record exists → recreate with the registered logical name (covers Phase 3
     recovery and partial-failure restarts; avoids name/timestamp divergence).
  2. OS DB empty, ES DB has record → mirror ES name (Phases 1/2 catchup).
  3. Both empty → fresh timestamp + warn (new install or data-loss scenario).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…timestamp

working_20240101.os → lastIndexOf("_") returns "20240101.os" instead of "20240101",
breaking deleteInactiveLiveWorkingIndices timestamp comparisons. IndexTag.strip()
removes any vendor tag before the substring extraction, making it safe for both
plain ES names and .os-suffixed OS names.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nite recursion

replace_all replaced the call inside the helper itself, causing StackOverflowError.
physicalName() must call osIndexAPI.getNameWithClusterIDPrefix(), not itself.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…x (first dot)

lastIndexOf(".") broke names with .os suffix: cluster_xxx.working_20260528.os → "os".
Using indexOf(".") strips only the cluster_xxx. segment, preserving the rest:
cluster_xxx.working_20260528.os → working_20260528.os ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pattern ^cluster_[^.]+\.(.+)$ anchors to the exact cluster prefix structure
and captures everything after the first dot. Handles all name forms:
  cluster_xxx.working_20260528           → working_20260528
  cluster_xxx.working_20260528235046.os  → working_20260528235046.os
  unqualified name (no match)            → returned unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements all previously-stubbed IndexAPI operations against the
OpenSearch client so OSIndexAPIImpl is feature-complete:

- flushCaches    -> indices().clearCache
- optimize       -> indices().forcemerge
- updateReplicas -> indices().putSettings (same Enterprise-license +
  numeric-replicas gate as ES; throws DotDataException when closed)
- createAlias / getIndexAlias / getAliasToIndexMap -> alias APIs
- deleteInactiveLiveWorkingIndices: mirrors ES but resolves the active
  set from VersionedIndicesAPI so the active OS set is never deleted

Also fixes getLiveWorkingIndicesSortedByCreationDateDesc to sort by the
embedded timestamp (IndexTag.strip) instead of raw string, and restricts
it to OS-tagged indices so OS lifecycle ops never touch legacy ES indices
in single-cluster test profiles.

Writes propagate failures (the PhaseRouter applies shadow vs primary
semantics per phase); reads log and return safe defaults, consistent with
the rest of the class.

Adds integration coverage to OSIndexAPIImplIntegrationTest
(OpenSearchUpgradeSuite): 26 tests green against OpenSearch 3.4.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…exAPIImpl

The IndexAPIImpl router fans out the same vendor-neutral logical name (the
ES name, e.g. live_20260604211839) to both providers. OSIndexAPIImpl only
applied the cluster prefix, not the .os tag, so a router-driven delete
targeted cluster_xxx.live_... (no .os) — which does not exist — logging
index_not_found_exception and orphaning the real cluster_xxx.live_....os
index.

Add a private idempotent toPhysicalName(name) = IndexTag.OS.tag(
getNameWithClusterIDPrefix(name)) and use it in indexExists, createIndex,
delete, closeIndex, openIndex, updateReplicas, flushCaches and optimize.
Alias methods are left untouched (alias naming is a separate concern).
Also make delete treat index_not_found as an idempotent no-op success as
a secondary safety net (must not propagate once OS is primary in Phase 3).

Regression test: test_lifecycleWithUntaggedRouterName_shouldResolveToTaggedIndex.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous commit canonicalized index names to physical .os form across
indexExists/createIndex/closeIndex/openIndex/updateReplicas/flushCaches/
optimize. That broke 23 OpenSearch Upgrade Suite tests: these IT suites
create OS indices by bare name and read/count/search them by bare name
(via getNameWithClusterIDPrefix), so making createIndex append .os made the
write target (...X.os) diverge from the read target (...X). In production
those methods already receive .os-tagged names (from VersionedIndices), so
the tagging was a no-op there and bought nothing.

Revert canonicalization on all of those back to getNameWithClusterIDPrefix.
Keep it only on delete(), which is the actual reported bug: the router
fans out the raw untagged ES name, and the OS shadow must resolve it to the
real ...live_....os index or it orphans it. The index_not_found idempotent
no-op safety net stays. Regression test narrowed to assert the delete-orphan
fix only (test_deleteWithUntaggedRouterName_shouldRemoveTaggedIndex).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fabrizzio-dotCMS and others added 5 commits June 4, 2026 18:23
Reverts commits 384ac05 and 496ce1d (both touched only OSIndexAPIImpl
and its IT). Canonicalizing OSIndexAPIImpl.delete to the .os physical name
broke the OS IT suites: they create/exist/delete OS indices by bare name, so
a delete that resolves bare->.os made cleanup miss the bare index, leaking it
and causing resource_already_exists cascades (54 failing tests). Restoring the
e253d82 baseline brings CI back to its prior known state (the single
original-bug failure). The delete-orphan fix will be reattempted at the OS
operations layer instead of the shared OSIndexAPIImpl.delete.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Impl.delete

The reported bug: deleting an index routed the bare logical name straight to
OSIndexAPIImpl.delete, which only adds the cluster prefix (not the .os tag), so
it targeted a name that does not exist and orphaned the real ...X.os index
(logging index_not_found, swallowed by the dual-write fire-and-forget).

Fix delete the same way createContentIndex already works: iterate
router.writeProviders() and resolve ops.toPhysicalName(indexName) per vendor
(ES -> bare, OS -> .os) before calling ops.indexAPI().delete(physicalName),
tracking the primary result with shadow fire-and-forget. This keeps the generic
OSIndexAPIImpl.delete bare-symmetric (so the OS IT suites that create/delete by
bare name still clean up correctly) and puts the .os resolution where the index
is known to be a content index — exactly mirroring the create path.

Regression test: ContentletIndexAPIImplMigrationIT#test_delete_phase1_removesFromBothClusters
creates DUAL_WORKING in both clusters then deletes via the bare name and asserts
the OS .os index is gone (not orphaned). Works in single-cluster mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The preserve-active assertion depended on the ambient default VersionedIndices
set, which the test neither creates nor controls. In CI that active pointer can
carry a foreign cluster prefix (cluster_default.*) or point at a site-search
index (live_search_*); such names neither round-trip through
removeClusterIdFromName (so the production preserve logic cannot protect them)
nor resolve through indexExists, so the assertion failed non-deterministically.

Scope the preservation assertion to active names that genuinely resolve via
indexExists BEFORE the cleanup. This keeps the test deterministic while still
catching a real regression if a current-cluster active index is wrongly deleted.
The inactive-deletion assertions are unchanged. The latent production concern
(foreign-prefix / site-search active names not protected by
removeActiveLiveAndWorkingFromList) is tracked separately, not bundled here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread dotcms-integration/src/test/java/com/dotcms/OpenSearchUpgradeSuite.java Outdated
Comment thread dotCMS/src/main/java/com/dotcms/content/index/IndexTag.java
Comment thread dotCMS/src/main/java/com/dotcms/content/index/IndexTag.java
fabrizzio-dotCMS and others added 5 commits June 8, 2026 17:19
…faults + add tests

Addresses review feedback on PR #35863:

- IndexTag: add IndexTagTest covering untag/isTagged (+ tag/vendorOf/resolve/strip),
  the helpers nollymar flagged as untested.
- removeClusterIdFromName: the ES and OS impls were byte-identical and differed only
  by the cluster prefix, which is a JVM-wide value. Consolidate the whole prefix
  concern into IndexAPI as default methods over a single getClusterPrefix():
  getNameWithClusterIDPrefix, removeClusterIdFromName and hasClusterPrefix now derive
  from it and are inherited. ESIndexAPI/OSIndexAPIImpl drop their clusterPrefix field,
  test constructors and the three methods; the router drops its delegating overrides.
  Restoring the defaults also closes the M-4 rollback note (OSGi default inheritance).
- Tests override only getClusterPrefix() to inject a deterministic/dotted prefix
  (FakeIndexAPI, ESIndexAPITest, OSIndexAPIImplIntegrationTest). New IndexAPIPrefixTest
  exercises the real defaults and pins the dotted-cluster-id and .os-suffix regressions.
- Naming: rename ContentletIndexAPIImplMigrationIT and ContentletIndexAPIImplPhaseSwitchIT
  to *IntegrationTest to match the suite convention; update OpenSearchUpgradeSuite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@claude

claude Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Pull Request Unsafe to Rollback!!!

  • Category: H-1 — One-Way Data Transformation (Storage Format Change)
  • Risk Level: 🟠 HIGH
  • Why it's unsafe: The entire diff (from the previous analysis baseline through commit 8a8ada82) establishes .os as the end-to-end canonical OS index name — not just a DB artifact. Previously, VersionedIndicesAPIImpl.loadIndices() stripped the tag before returning names to callers (so clean names cluster_x.working_ts were exposed). After 8a8ada82, the private stripTags() helper is deleted and loadIndices() / loadAllIndices() / loadNonVersionedIndices() return the raw .os-suffixed form (cluster_x.working_ts.os) to every caller. If rolled back to N-1 after OS was active, N-1's loadIndices() would call the now-deleted stripTags() which expected to strip os:: (the old prefix form) — it no longer exists. N-1 callers receive cluster_x.working_ts.os as a logical name, which N-1's routing treats as an untagged ES name (IndexTag.resolve("name.os") == ES on N-1 since startsWith("os::") is false) and routes OS operations to the ES provider — data reads silently return wrong results.
  • Code that makes it unsafe:
    • dotCMS/src/main/java/com/dotcms/content/index/VersionedIndicesAPIImpl.javaprivate static VersionedIndices stripTags(VersionedIndices) deleted; loadIndices(), loadAllIndices(), loadNonVersionedIndices() no longer transform the DB-loaded .os-suffixed names — callers now receive the tagged form end-to-end
    • dotCMS/src/main/java/com/dotcms/content/index/IndexTag.javaOS("", ".os"): isTagged uses endsWith(".os"); N-1 isTagged uses startsWith("os::")false for .os-suffixed names — routing silently defaults to ES
  • Mitigating factors:
    • The OS index layer is behind FEATURE_FLAG_OPEN_SEARCH_PHASE; N-1 with the flag disabled will never read OS rows
    • These names were never written in production — risk is limited to dev/staging environments where the flag has been enabled
  • Alternative: Before rolling back in any environment where the OS feature was active, disable FEATURE_FLAG_OPEN_SEARCH_PHASE first so N-1 never reads the OS rows; or manually revert indicies rows from name.os back to os::name format.

  • Category: M-4 — OSGi Public Interface Change (RESOLVED by commit 8a8ada82)
  • Risk Level: ✅ RESOLVED
  • Previously: IndexAPI.removeClusterIdFromName() was abstract in an earlier commit — any OSGi plugin relying on the former default impl would have failed with AbstractMethodError.
  • Now: Commit 8a8ada82 converts removeClusterIdFromName, getNameWithClusterIDPrefix, getClusterPrefix, and hasClusterPrefix to default methods on IndexAPI — backward-compatible. Existing plugin implementations continue to work unchanged. The M-4 risk from the prior analysis no longer applies.

@claude

claude Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Pull Request Unsafe to Rollback!!!

  • Category: H-1 — One-Way Data Transformation (Storage Format Change)
  • Risk Level: 🟠 HIGH
  • Why it's unsafe: The commit range analyzed (c036699221088f) removes VersionedIndicesAPIImpl.stripTags() and changes loadIndices(), loadAllIndices(), and loadNonVersionedIndices() to return the raw .os-suffixed canonical form to every caller (previously those methods stripped the tag before returning). On N-1, IndexTag.OS.isTagged() checks startsWith("os::") — it returns false for the new .os-suffixed names. As a result, any caller that received cluster_x.working_ts.os from the DB after rollback to N-1 would have IndexTag.resolve() classify it as ES (the unrecognized-tag default), silently routing OS reads/writes to the ES provider and returning wrong results.
  • Code that makes it unsafe:
    • dotCMS/src/main/java/com/dotcms/content/index/VersionedIndicesAPIImpl.javaprivate static VersionedIndices stripTags(VersionedIndices) deleted; loadIndices(), loadAllIndices(), loadNonVersionedIndices() now return DB names unchanged (cluster_xxx.working_ts.os) instead of stripping the tag to cluster_xxx.working_ts as N-1 expects
    • dotCMS/src/main/java/com/dotcms/content/index/IndexTag.javaOS("os::", "") changed to OS("", ".os"): N-1 isTagged checks startsWith("os::")false for .os-suffixed names → routing silently defaults to ES
  • Mitigating factors:
    • The OS index layer is behind FEATURE_FLAG_OPEN_SEARCH_PHASE; N-1 with the flag disabled never reads OS rows — risk is limited to dev/staging environments where the flag has been enabled
    • These names were never written in production (feature not yet shipped)
  • Alternative: Before rolling back in any environment where the OS feature was active, disable FEATURE_FLAG_OPEN_SEARCH_PHASE first so N-1 never reads the OS rows; or manually revert indicies rows from name.os back to os::name format before downgrading.

  • Category: M-4 — OSGi Public Interface Change
  • Risk Level: ✅ RESOLVED
  • Previously: IndexAPI.removeClusterIdFromName() was abstract in an earlier commit — any OSGi plugin relying on the former default impl would have failed with AbstractMethodError.
  • Now (commit 8a8ada82): removeClusterIdFromName, getNameWithClusterIDPrefix, getClusterPrefix, and hasClusterPrefix are all default methods on IndexAPI; ESIndexAPI and OSIndexAPIImpl dropped their duplicate impls and inherit the defaults. Existing OSGi plugin implementations continue to work unchanged. The M-4 risk no longer applies.

Merging main brought in #36027's test_searchRaw_nestedTopHits_shouldBePreservedOnBuckets,
which indexed its two live docs into getNameWithClusterIDPrefix(IDX_LIVE) (cluster prefix,
no .os). On this branch the physical OS name carries the .os suffix and searchRaw(live=true)
resolves the live index via VersionedIndices to the .os-tagged physicalLive registered in
setUp(). The docs therefore landed in a different index than the one searched, so the
content_types terms aggregation returned zero buckets and the sanity assertion failed.

Index into physicalLive (= opsOS.toPhysicalName(IDX_LIVE)), matching the proven pattern of
the working-index tests in this file, so write target and search target are the same index.
Production code is unchanged; this was purely test data placement under the .os naming.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Not Safe To Rollback Area : Backend PR changes Java/Maven backend code Area : Documentation PR changes documentation files

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[DEFECT] Confusing empty OS index appears in indices portlet on Phase 1 migration bootstrap

2 participants