fix(os-index): replace os:: prefix with .os suffix as DB uniqueness artifact #35863
Merged
…rtifact
Closes #35820
The os:: prefix strategy was causing collisions and readability issues in the shared `indicies` table. Switches to a .os suffix approach:
- IndexTag.OS now appends ".os" (suffix) instead of prepending "os::" (prefix)
- isTagged() checks endsWith(".os"); untag() strips last 3 chars
- IndexTag gains a `suffix` field alongside the existing `prefix` field
- VersionedIndicesAPIImpl comments updated to reflect the new format
- OPENSEARCH_MIGRATION.md tag ownership rule section rewritten
- Task260528MigrateOsIndiciesSuffix: one-time DB migration converts existing os:: rows to the .os suffix format on startup
Collision-free guarantee: logical names always end in _YYYYMMDDHHMMSS (numeric), so they can never naturally end in .os.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Task260528MigrateOsIndiciesSuffix is not needed — the os:: prefix was never rolled out to production, so there are no rows to migrate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atorUtil
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ootstrap
- ContentletIndexOperationsOS.toPhysicalName() now appends .os suffix so the physical OS index name matches the DB-stored name end-to-end.
- removeContentFromIndexByContentType() resolves physical name before deleteByQuery.
- MappingOperationsOS adds physicalName() helper (cluster prefix + .os suffix) and uses it in putMapping / getMapping / getFieldMappingAsMap so all REST calls hit the correct index name.
- ContentletIndexAPIImpl.indexReadyOS() passes the physical OS name to indexExists() so it actually checks the OS cluster (was checking bare logical name, always false).
- ContentletIndexAPIImpl.initIndex() detects the migration-catchup case (ES ready, OS missing) and delegates to initOSCatchup() instead of generating a fresh timestamp, eliminating name divergence between ES and OS shadows.
- initOSCatchup() reads current ES working/live names, strips the cluster prefix, and creates OS shadows with the same base name + .os, ensuring one-to-one association.
- getIndexDocumentCount(String, IndexTag) now uses the provider's toPhysicalName() so OS doc-count queries resolve to the correct physical index name.
Closes #35820
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…untagged
ES("es::", "") implied that ES index names carried an "es::" prefix, which was
never true — ES names are plain logical strings with no vendor marker. Changing
to ES("", "") makes the constant a pure routing discriminator: tag() is a no-op,
isTagged() always returns false, and resolve() still falls back to ES for any
untagged name. No callers applied or checked "es::" outside IndexTag itself.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
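The tagging rules described in the commits above can be illustrated with a small sketch. This is not the actual `IndexTag` source: `IndexTagSketch` is a hypothetical stand-in, and the method bodies (including the idempotent `tag()`) are assumptions based only on the behaviour the commit messages describe.

```java
// Simplified, hypothetical stand-in for IndexTag; names and bodies are assumptions.
enum IndexTagSketch {
    ES("", ""),      // untagged: acts only as a routing discriminator
    OS("", ".os");   // OpenSearch names carry a ".os" suffix

    private final String prefix;
    private final String suffix;

    IndexTagSketch(String prefix, String suffix) {
        this.prefix = prefix;
        this.suffix = suffix;
    }

    // A tag with neither prefix nor suffix can never mark a name.
    private boolean hasMarker() {
        return !prefix.isEmpty() || !suffix.isEmpty();
    }

    boolean isTagged(String name) {
        return hasMarker() && name.startsWith(prefix) && name.endsWith(suffix);
    }

    // Assumed idempotent: tagging an already-tagged name returns it unchanged.
    String tag(String name) {
        return (!hasMarker() || isTagged(name)) ? name : prefix + name + suffix;
    }

    String untag(String name) {
        if (!isTagged(name)) {
            return name;
        }
        return name.substring(prefix.length(), name.length() - suffix.length());
    }

    // Falls back to ES for any name that carries no vendor marker.
    static IndexTagSketch resolve(String name) {
        for (IndexTagSketch tag : values()) {
            if (tag.isTagged(name)) {
                return tag;
            }
        }
        return ES;
    }

    // Removes whatever vendor tag the name carries, if any.
    static String strip(String name) {
        return resolve(name).untag(name);
    }
}
```

With `ES("", "")`, `ES.tag(name)` returns the name unchanged and `ES.isTagged(name)` is always false, which matches the "pure routing discriminator" description.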
…er recovery
When the OS cluster index is missing but the DB record still exists (e.g. container
restart after index deletion in Phase 3), initOSCatchup must recreate the index
using the name already registered in VersionedIndicesAPI — not from the legacy ES
store, which may be stale or cleaned up in Phase 3.
Priority order:
1. OS DB record exists → recreate with the registered logical name (covers Phase 3
recovery and partial-failure restarts; avoids name/timestamp divergence).
2. OS DB empty, ES DB has record → mirror ES name (Phases 1/2 catchup).
3. Both empty → fresh timestamp + warn (new install or data-loss scenario).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
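The priority order above could be sketched roughly as follows. `OsCatchupNameSketch` and its parameters are hypothetical; the sketch assumes the OS record is stored in its `.os`-tagged form and that the ES name has already had its cluster prefix removed.

```java
import java.util.Optional;

// Hypothetical sketch of the three-case name selection; not the actual initOSCatchup() code.
final class OsCatchupNameSketch {

    static String chooseOsIndexName(Optional<String> osRegistered,
                                    Optional<String> esRegistered,
                                    String freshBaseName) {
        // Case 1: the OS record is authoritative, so reuse the registered name as-is.
        if (osRegistered.isPresent()) {
            return osRegistered.get();
        }
        // Case 2: mirror the ES base name so both shadows share one timestamp.
        if (esRegistered.isPresent()) {
            return esRegistered.get() + ".os";
        }
        // Case 3: nothing registered anywhere; warn and fall back to a fresh name.
        System.err.println("WARN: no registered ES or OS index, using " + freshBaseName);
        return freshBaseName + ".os";
    }
}
```

Passing the fresh name in as a parameter keeps the sketch deterministic; real code would presumably generate the timestamp itself.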
…timestamp
working_20240101.os → lastIndexOf("_") returns "20240101.os" instead of "20240101",
breaking deleteInactiveLiveWorkingIndices timestamp comparisons. IndexTag.strip()
removes any vendor tag before the substring extraction, making it safe for both
plain ES names and .os-suffixed OS names.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
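A minimal reproduction of the timestamp bug described above, assuming the timestamp is taken as the substring after the last underscore. `TimestampSketch` is a hypothetical helper, and the inline suffix removal stands in for `IndexTag.strip()`.

```java
// Hypothetical helper contrasting the naive and the tag-aware timestamp extraction.
final class TimestampSketch {

    // Naive version: the ".os" suffix leaks into the extracted timestamp.
    static String naiveTimestamp(String indexName) {
        return indexName.substring(indexName.lastIndexOf('_') + 1);
    }

    // Tag-aware version: remove the vendor suffix first, then extract.
    static String safeTimestamp(String indexName) {
        String untagged = indexName.endsWith(".os")
                ? indexName.substring(0, indexName.length() - ".os".length())
                : indexName;
        return untagged.substring(untagged.lastIndexOf('_') + 1);
    }
}
```

Here `naiveTimestamp("working_20240101.os")` yields `"20240101.os"`, while `safeTimestamp` yields `"20240101"` for both the plain and the `.os`-suffixed name.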
…nite recursion
replace_all replaced the call inside the helper itself, causing StackOverflowError. physicalName() must call osIndexAPI.getNameWithClusterIDPrefix(), not itself.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…x (first dot)
lastIndexOf(".") broke names with .os suffix: cluster_xxx.working_20260528.os → "os".
Using indexOf(".") strips only the cluster_xxx. segment, preserving the rest:
cluster_xxx.working_20260528.os → working_20260528.os ✓
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pattern ^cluster_[^.]+\.(.+)$ anchors to the exact cluster prefix structure and captures everything after the first dot. Handles all name forms:
cluster_xxx.working_20260528 → working_20260528
cluster_xxx.working_20260528235046.os → working_20260528235046.os
unqualified name (no match) → returned unchanged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
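The pattern from this commit can be exercised in isolation. The regex is taken from the commit message; `ClusterPrefixSketch` and `removeClusterId` are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex quoted in the commit message.
final class ClusterPrefixSketch {

    // "cluster_" + one or more non-dot characters + "." + the captured remainder.
    private static final Pattern CLUSTER_PREFIX = Pattern.compile("^cluster_[^.]+\\.(.+)$");

    static String removeClusterId(String name) {
        Matcher matcher = CLUSTER_PREFIX.matcher(name);
        // No match means the name is unqualified, so it is returned unchanged.
        return matcher.matches() ? matcher.group(1) : name;
    }
}
```

Because `[^.]+` stops at the first dot, a cluster id that itself contains a dot would be split in the wrong place; a later commit in this PR moves to a prefix-based approach and pins that dotted-cluster-id case with a test.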
Implements all previously-stubbed IndexAPI operations against the OpenSearch client so OSIndexAPIImpl is feature-complete:
- flushCaches -> indices().clearCache
- optimize -> indices().forcemerge
- updateReplicas -> indices().putSettings (same Enterprise-license + numeric-replicas gate as ES; throws DotDataException when closed)
- createAlias / getIndexAlias / getAliasToIndexMap -> alias APIs
- deleteInactiveLiveWorkingIndices: mirrors ES but resolves the active set from VersionedIndicesAPI so the active OS set is never deleted
Also fixes getLiveWorkingIndicesSortedByCreationDateDesc to sort by the embedded timestamp (IndexTag.strip) instead of raw string, and restricts it to OS-tagged indices so OS lifecycle ops never touch legacy ES indices in single-cluster test profiles.
Writes propagate failures (the PhaseRouter applies shadow vs primary semantics per phase); reads log and return safe defaults, consistent with the rest of the class.
Adds integration coverage to OSIndexAPIImplIntegrationTest (OpenSearchUpgradeSuite): 26 tests green against OpenSearch 3.4.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…exAPIImpl
The IndexAPIImpl router fans out the same vendor-neutral logical name (the ES name, e.g. live_20260604211839) to both providers. OSIndexAPIImpl only applied the cluster prefix, not the .os tag, so a router-driven delete targeted cluster_xxx.live_... (no .os), which does not exist, logging index_not_found_exception and orphaning the real cluster_xxx.live_....os index.
Add a private idempotent toPhysicalName(name) = IndexTag.OS.tag(getNameWithClusterIDPrefix(name)) and use it in indexExists, createIndex, delete, closeIndex, openIndex, updateReplicas, flushCaches and optimize. Alias methods are left untouched (alias naming is a separate concern).
Also make delete treat index_not_found as an idempotent no-op success as a secondary safety net (must not propagate once OS is primary in Phase 3).
Regression test: test_lifecycleWithUntaggedRouterName_shouldResolveToTaggedIndex.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous commit canonicalized index names to physical .os form across indexExists/createIndex/closeIndex/openIndex/updateReplicas/flushCaches/ optimize. That broke 23 OpenSearch Upgrade Suite tests: these IT suites create OS indices by bare name and read/count/search them by bare name (via getNameWithClusterIDPrefix), so making createIndex append .os made the write target (...X.os) diverge from the read target (...X). In production those methods already receive .os-tagged names (from VersionedIndices), so the tagging was a no-op there and bought nothing. Revert canonicalization on all of those back to getNameWithClusterIDPrefix. Keep it only on delete(), which is the actual reported bug: the router fans out the raw untagged ES name, and the OS shadow must resolve it to the real ...live_....os index or it orphans it. The index_not_found idempotent no-op safety net stays. Regression test narrowed to assert the delete-orphan fix only (test_deleteWithUntaggedRouterName_shouldRemoveTaggedIndex). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reverts commits 384ac05 and 496ce1d (both touched only OSIndexAPIImpl and its IT). Canonicalizing OSIndexAPIImpl.delete to the .os physical name broke the OS IT suites: they create/exist/delete OS indices by bare name, so a delete that resolves bare->.os made cleanup miss the bare index, leaking it and causing resource_already_exists cascades (54 failing tests). Restoring the e253d82 baseline brings CI back to its prior known state (the single original-bug failure). The delete-orphan fix will be reattempted at the OS operations layer instead of the shared OSIndexAPIImpl.delete. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Impl.delete
The reported bug: deleting an index routed the bare logical name straight to OSIndexAPIImpl.delete, which only adds the cluster prefix (not the .os tag), so it targeted a name that does not exist and orphaned the real ...X.os index (logging index_not_found, swallowed by the dual-write fire-and-forget).
Fix delete the same way createContentIndex already works: iterate router.writeProviders() and resolve ops.toPhysicalName(indexName) per vendor (ES -> bare, OS -> .os) before calling ops.indexAPI().delete(physicalName), tracking the primary result with shadow fire-and-forget. This keeps the generic OSIndexAPIImpl.delete bare-symmetric (so the OS IT suites that create/delete by bare name still clean up correctly) and puts the .os resolution where the index is known to be a content index, exactly mirroring the create path.
Regression test: ContentletIndexAPIImplMigrationIT#test_delete_phase1_removesFromBothClusters creates DUAL_WORKING in both clusters then deletes via the bare name and asserts the OS .os index is gone (not orphaned). Works in single-cluster mode.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
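The per-vendor resolution described in this commit might look roughly like the sketch below. `DeleteFanOutSketch` and `Provider` are hypothetical types, not dotCMS classes; the cluster prefix is left out for brevity, and the in-memory `deleted` list merely records which physical names each provider was asked to delete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: each provider resolves its own physical name before deleting.
final class DeleteFanOutSketch {

    static final class Provider {
        final UnaryOperator<String> toPhysicalName;
        final List<String> deleted = new ArrayList<>();

        Provider(UnaryOperator<String> toPhysicalName) {
            this.toPhysicalName = toPhysicalName;
        }

        boolean delete(String physicalName) {
            deleted.add(physicalName);
            return true;
        }
    }

    // The primary result is returned; shadow failures are logged and ignored.
    static boolean delete(String logicalName, Provider primary, List<Provider> shadows) {
        boolean primaryResult = primary.delete(primary.toPhysicalName.apply(logicalName));
        for (Provider shadow : shadows) {
            try {
                shadow.delete(shadow.toPhysicalName.apply(logicalName));
            } catch (RuntimeException e) {
                System.err.println("shadow delete failed: " + e.getMessage());
            }
        }
        return primaryResult;
    }
}
```

With an ES provider that leaves names bare and an OS provider that appends `.os`, one call with `live_20260604211839` reaches ES under the bare name and OS under `live_20260604211839.os`, so the tagged index is no longer orphaned.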
The preserve-active assertion depended on the ambient default VersionedIndices set, which the test neither creates nor controls. In CI that active pointer can carry a foreign cluster prefix (cluster_default.*) or point at a site-search index (live_search_*); such names neither round-trip through removeClusterIdFromName (so the production preserve logic cannot protect them) nor resolve through indexExists, so the assertion failed non-deterministically. Scope the preservation assertion to active names that genuinely resolve via indexExists BEFORE the cleanup. This keeps the test deterministic while still catching a real regression if a current-cluster active index is wrongly deleted. The inactive-deletion assertions are unchanged. The latent production concern (foreign-prefix / site-search active names not protected by removeActiveLiveAndWorkingFromList) is tracked separately, not bundled here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nollymar
reviewed
Jun 5, 2026
…faults + add tests
Addresses review feedback on PR #35863:
- IndexTag: add IndexTagTest covering untag/isTagged (+ tag/vendorOf/resolve/strip), the helpers nollymar flagged as untested.
- removeClusterIdFromName: the ES and OS impls were byte-identical and differed only by the cluster prefix, which is a JVM-wide value. Consolidate the whole prefix concern into IndexAPI as default methods over a single getClusterPrefix(): getNameWithClusterIDPrefix, removeClusterIdFromName and hasClusterPrefix now derive from it and are inherited. ESIndexAPI/OSIndexAPIImpl drop their clusterPrefix field, test constructors and the three methods; the router drops its delegating overrides. Restoring the defaults also closes the M-4 rollback note (OSGi default inheritance).
- Tests override only getClusterPrefix() to inject a deterministic/dotted prefix (FakeIndexAPI, ESIndexAPITest, OSIndexAPIImplIntegrationTest). New IndexAPIPrefixTest exercises the real defaults and pins the dotted-cluster-id and .os-suffix regressions.
- Naming: rename ContentletIndexAPIImplMigrationIT and ContentletIndexAPIImplPhaseSwitchIT to *IntegrationTest to match the suite convention; update OpenSearchUpgradeSuite.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
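The consolidation into default methods can be pictured with a reduced interface. `IndexApiPrefixSketch` is a hypothetical stand-in for `IndexAPI`; the three method names come from the commit message, but their bodies here (including the idempotent prefixing) are assumptions.

```java
// Hypothetical reduction of IndexAPI: one abstract hook, three derived defaults.
interface IndexApiPrefixSketch {

    // The only method an implementation or a test double has to supply.
    String getClusterPrefix();

    default boolean hasClusterPrefix(String name) {
        return name != null && name.startsWith(getClusterPrefix());
    }

    // Assumed idempotent: an already-prefixed name is returned unchanged.
    default String getNameWithClusterIDPrefix(String name) {
        return hasClusterPrefix(name) ? name : getClusterPrefix() + name;
    }

    // Removes exactly the known prefix, so dots inside the cluster id are harmless.
    default String removeClusterIdFromName(String name) {
        return hasClusterPrefix(name) ? name.substring(getClusterPrefix().length()) : name;
    }
}
```

A test can then inject a dotted prefix with a one-line lambda such as `IndexApiPrefixSketch api = () -> "cluster_a.b.";`, which mirrors the "override only getClusterPrefix()" approach the commit describes.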
Contributor
Pull Request Unsafe to Rollback!!!
nollymar
approved these changes
Jun 9, 2026
Merging main brought in #36027's test_searchRaw_nestedTopHits_shouldBePreservedOnBuckets, which indexed its two live docs into getNameWithClusterIDPrefix(IDX_LIVE) (cluster prefix, no .os). On this branch the physical OS name carries the .os suffix and searchRaw(live=true) resolves the live index via VersionedIndices to the .os-tagged physicalLive registered in setUp(). The docs therefore landed in a different index than the one searched, so the content_types terms aggregation returned zero buckets and the sanity assertion failed. Index into physicalLive (= opsOS.toPhysicalName(IDX_LIVE)), matching the proven pattern of the working-index tests in this file, so write target and search target are the same index. Production code is unchanged; this was purely test data placement under the .os naming. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Closes #35820
Why this change is safe
This code is still behind the OpenSearch feature flag and has never been rolled out to production. The OS index layer is entirely in active development, which means there is room for this kind of intentional renaming experiment to get the naming strategy right before any production data is involved.
`.os` suffix: why it matters for correctness
OpenSearch indices are registered under the same logical names as their ES counterparts (e.g. `working_20260528`). When APIs return a `Map<String, ...>` or a `Set` keyed by index name, ES and OS entries would silently collide: the OS entry would overwrite the ES one and data would be lost. Appending a `.os` suffix to every OS index name guarantees that both providers can coexist in the same map/set without any information loss.
Collision-free guarantee: logical index names always end in `_YYYYMMDDHHMMSS` (numeric timestamp) — they can never naturally end in `.os`.
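The collision can be reproduced with two plain maps. `StatsMergeSketch` is a hypothetical illustration, not the actual merge code behind `getIndicesStats`; the `Long` values stand in for whatever per-index payload the real maps carry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: merging ES and OS results keyed by index name.
final class StatsMergeSketch {

    static Map<String, Long> merge(Map<String, Long> esStats,
                                   Map<String, Long> osStats,
                                   boolean tagOsKeys) {
        Map<String, Long> merged = new HashMap<>(esStats);
        for (Map.Entry<String, Long> entry : osStats.entrySet()) {
            // Without the suffix, an OS key equal to an ES key overwrites the ES entry.
            String key = tagOsKeys ? entry.getKey() + ".os" : entry.getKey();
            merged.put(key, entry.getValue());
        }
        return merged;
    }
}
```

If both providers report `working_20260528`, the untagged merge ends up with a single entry holding the OS value, while the tagged merge keeps `working_20260528` and `working_20260528.os` side by side.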
Changes:
- `IndexTag.OS` now appends `.os` (suffix) instead of prepending `os::` (prefix)
- `isTagged()` checks `endsWith(".os")`; `untag()` strips the last 3 chars
- `IndexTag` gains a `suffix` field alongside the existing `prefix` field; all existing callers unchanged
- `VersionedIndicesAPIImpl` comments updated to reflect the new storage format
- `OPENSEARCH_MIGRATION.md`: "tag ownership rule" section rewritten with the suffix rationale
Improved bootstrap sync strategy
`initOSCatchup()` in `ContentletIndexAPIImpl` now follows a three-case priority order when OS indices are absent from the cluster:
1. OS DB record exists: recreate the index with the logical name already registered in `VersionedIndicesAPI`. This is the authoritative source in Phase 3.
2. OS DB empty, ES DB has a record: mirror the ES name, so `working_20240101` produces `working_20240101.os`, not a freshly-stamped `working_20260528.os`.
3. Both empty: fresh timestamp plus a warning (new install or data-loss scenario).
Tests: `OSIndexAPIImplIntegrationTest`
Two stub tests that only asserted "returns non-null" were upgraded to real integration assertions:
- `test_getClusterHealth_shouldReturnHealthForCreatedIndex`: creates a live index, then verifies the health map contains an entry for it with a non-null/non-empty status and positive shard count.
- `test_getIndicesStats_shouldReturnStatsForCreatedIndices`: creates live and working indices, then verifies both appear in the stats map with non-negative document count, non-negative raw size, and a non-empty human-readable size string.
Three new phase-routing tests were also added to `ContentletIndexAPIImplMigrationIT` to verify that `getIndicesStats` and `getClusterHealth` correctly merge ES + OS results in Phase 1 and return ES-only results in Phase 0.
Test plan
- `IndexTag.OS.tag("cluster_abc.working_20260528")` returns `"cluster_abc.working_20260528.os"`
- `IndexTag.strip("cluster_abc.working_20260528.os")` returns `"cluster_abc.working_20260528"`
- `IndexTag.resolve("cluster_abc.working_20260528.os") == IndexTag.OS`
- `IndexTag.resolve("cluster_abc.working_20260528") == IndexTag.ES` (untagged default)
- … `name.os` in DB and portlet shows correct indices
- `getIndicesStats` / `getClusterHealth` maps in Phase 1 contain both ES and OS entries (no collision/overwrite)
- `OSIndexAPIImplIntegrationTest`: `test_getClusterHealth_shouldReturnHealthForCreatedIndex` and `test_getIndicesStats_shouldReturnStatsForCreatedIndices` pass against a live OS cluster
🤖 Generated with Claude Code