SOLR-17767: Add a Java Agent (SIP-24)#4471
Conversation
…tion
- solr/server/build.gradle: add agent fat JAR to libExt (transitive=false)
so it lands in server/lib/ext/ of the packaged distribution used by BATS
- CoreContainer.java: fix Class.forName to use CoreContainer's classloader
instead of null (bootstrap); the agent JAR is in lib/ext/ loaded by
Jetty's server classloader, not the bootstrap classloader
- ViolationMetricsReporter: replace Codahale registerGauge (non-existent
in this OTel-based branch) with observableLongCounter reflective call;
remove buildGauge/Proxy; metrics appear in Prometheus format as
security_agent_violations_{file,network,exit,exec}_total
- test_security_agent.bats: add --user-managed to avoid embedded ZooKeeper
crash; update metrics assertion to Prometheus format name; fix metrics
endpoint URL (no group/prefix params); all 3 tests now pass
Fix sysprop names Add BATS tests to verify actual invocation
…it/network/exec; remove redundant VirtualThreadCompatibilityTest - Add testPrograms sourceSet in agent-sm compiling FileViolation, ExitViolation, NetworkViolation, ExecViolation into a standalone JAR (no Solr/ByteBuddy deps). - Wire AGENT_TEST_PROGRAMS_JAR into integrationTests task; BATS test #4 drops the inline javac+heredoc in favour of the pre-compiled JAR. - Add BATS tests #5 (System.exit), #6 (SocketChannel.connect to RFC 5737 TEST-NET-1), #7 (ProcessBuilder exec) — all verify SecurityException in enforce mode. - Delete VirtualThreadCompatibilityTest: StackWalker virtual-thread compatibility is guaranteed by JDK spec and already exercised by other unit tests. - Fix rat-sources.gradle: exclude src/test-programs/** for :solr:agent-sm.
…atus != 123 in BATS System.exit(0) is unreachable after a successful exit, making refute_output vacuous and masking a broken agent (exit code 0 would trigger assert_failure for the wrong reason). Using exit code 123 as a sentinel lets the BATS test explicitly verify that the interceptor fired (SecurityException in output, status != 123).
# Conflicts: # solr/core/gradle.lockfile # solr/solrj-zookeeper/gradle.lockfile # solr/test-framework/gradle.lockfile
…n-walk exec, codeBase matching
Part A — stale class names:
- dev-docs/security-agent.adoc: SolrSecurityPolicy → AgentPolicy, StackInspector → StackCallerClassChainExtractor
- build.gradle comment: SolrSecurityPolicy, FileAccessInterceptor → AgentPolicy, FileInterceptor
B1 — FileInterceptor "read" was a no-op; add explicit policy check for Files.read() path
B2 — Remove tautological `args instanceof Object` guard (always true)
B4 — SocketChannelInterceptor: implement codeBase matching via StackWalker instead of silently
skipping codeBase-scoped entries; add isCallerFromCodeBase() helper + tests
B5 — ProcessExecInterceptor: walk full chain (like SystemExitInterceptor) instead of checking
only direct caller; add AgentPolicy.isChainThatCanExec() + tests
B6 — Remove 0.0.0.0 from trustedHosts (unspecified bind address, not loopback)
B7 — PermittedPath: use java.io.File.separator instead of hardcoded "/" and "\\"
B8 — SolrAgentEntryPoint: .or(isAbstract()) → .and(not(isAbstract())) in file method matcher
B9 — FileInterceptor: remove redundant second getCallerClass() call inside delete branch
B10 — AgentPolicy.resetForTesting(): add comment explaining deliberate unsynchronized write
B11 — Rename bootOtel → bootAgent in SolrAgentEntryPoint
B12 — Fix typo: testApprovedCallerAnywherInChain → testApprovedCallerAnywhereInChain
B13 — FileInterceptor: move resolveRealPath() Javadoc above the correct method; note why
toRealPath() must not be called from the advice (circular interception)
B14 — ViolationMetricsReporter.findMethod(): match by param type simple names, not just count
B15 — SecurityViolationLogger: clarify stack-trace comment; deduplicate debug block
…onventions
All Solr metrics should start with "solr.*". Rename:
security.agent.violations.{file,network,exit,exec}
→ solr.security.agent.violations.{file,network,exit,exec}
In Prometheus format: solr_security_agent_violations_*_total.
Update BATS test assertion, ref-guide, and Javadoc accordingly.
…Programs; use NIO FileSystems.getSeparator() in PermittedPath - testProgramsAnnotationProcessor config was pulling in error-prone deps not in the dependency lock file; disable annotation processing for trivial test programs. - PermittedPath used java.io.File (forbidden API); switch to FileSystems.getDefault().getSeparator() per Solr's NIO-only policy.
…Interceptor Add `import java.net.URL;`, replace `java.net.URL loc` with `URL loc`, and replace `java.net.InetSocketAddress` parameter type with `InetSocketAddress` since the type is already imported. Fixes CI error-prone -Werror failures.
…yPoint Add imports for SocketChannel, Socket and use simple names for System, Runtime, ProcessBuilder — all were already in java.lang or now imported. Fixes CI error-prone -Werror failures.
… agent-sm - build.gradle: disable error-prone for compileTestProgramsJava (no errorprone JAR on testProgramsAnnotationProcessor config) - SymlinkEscapeTest, SocketChannelInterceptorTest, SolrAgentIntegrationTest: add missing imports and remove FQ type references All 59 tests pass with -Pvalidation.errorprone=true.
The Prometheus output includes OTel scope labels alongside the type label
(e.g. otel_scope_name="org.apache.solr",type="file"), so asserting
on 'solr_security_agent_violations_total{type="file"' fails. Assert
on the base metric name only since the presence of the counter is what
matters.
Throwing SecurityException unconditionally for an unrecognised args[1] type bypassed the violation counter and structured log, and would break legitimate file operations in warn mode. Unknown types are now treated conservatively as mutating (write check applied).
The Javadoc claimed String#matches (full regex) with patterns like 'org\.apache\.solr\..*'. The actual implementation in ApprovedCallSite.matches() uses equals/startsWith semantics: '*' is wildcard, 'pkg.*' matches the package and sub-packages, anything else is an exact class-name match.
On Windows, CodeSource.getLocation().getPath() may return backslash separators while the codeBase string from the policy file uses forward slashes. Normalize both strings to forward slashes before the startsWith/equals comparison so the recursive codeBase check works correctly on all platforms.
The interceptor was using toAbsolutePath().normalize() which does not follow symbolic links, allowing a symlink inside a permitted directory that points outside it to bypass the policy check. Replace with resolveRealPath() which calls toRealPath() to resolve the real on-disk path before the policy check. A ThreadLocal guard (IN_SYMLINK_RESOLVE) prevents infinite recursion in case toRealPath() itself triggers an intercepted file operation. On IOException (path does not yet exist), falls back to toAbsolutePath().normalize(). The same fix is applied to the copy/move target-path resolution.
ByteBuddy binds @origin Method on every intercepted connect() call even though the parameter was never read. Removing it eliminates a reflective Method lookup on each outbound connection attempt.
When a policy grant block included a codeBase attribute, e.g.:
grant codeBase "file:/opt/solr/modules/foo/-" {
permission java.lang.RuntimePermission "exitVM";
};
PolicyLoader was storing the codeBase URL string as the
classNamePattern in ApprovedCallSite. ApprovedCallSite.matches() then
compared it against fully-qualified class names, which can never match
a "file:/.../..." URL string, making all codeBase-scoped exit/exec
grants silently no-ops.
Fix: ApprovedCallSite now carries a separate codeBase field. When
codeBase is set, matchesCodeBase(Class<?>) is used (delegating to the
same SocketChannelInterceptor.isCallerFromCodeBase logic used for
network grants). AgentPolicy.isChainThatCanExit/Exec dispatch to the
right matcher depending on which field is set. The simple
isExitApproved/isExecApproved(String) helpers skip codeBase entries
since they have no Class<?> to inspect.
Also correct the ApprovedCallSite Javadoc (previously claimed SolrCLI
and SolrDispatchFilter are default-approved exit callers, which was
never true), and add an explanatory comment in agent-security.policy
documenting why exitVM grants are intentionally absent from the
default policy.
actions.contains(action) does substring matching, so a malformed policy string like "readwrite" (missing comma) would incorrectly match both "read" and "write". Split the stored actions string into a Set at construction time and use Set.contains() for exact token matching.
Policy files use "[::1]:8983" bracket notation for IPv6 but
InetSocketAddress.getHostString() returns "::1" without brackets.
lastIndexOf(':') found the colon after ']' correctly, but the extracted
entryHost retained the brackets and never equaled the actual host string.
Parse bracket-notation IPv6 entries separately: strip the '[' and ']'
before passing to matchesHost so "::1" matches "[::1]:8983".
Add two tests covering the bracket notation with an exact port and a
port range.
|
Docker and BATS tests currently fail. I believe it is due to resolving of symlinks on linux and translating paths with |
Two bugs that cause Solr to fail to start when the Old Java SecurityManager is active (SOLR_SECURITY_MANAGER_ENABLED=true, the default for Java < 24): 1. resolveRealPath() only caught IOException, not SecurityException. On JDK 21 and earlier, Path.toRealPath() calls sm.checkRead() on the original path internally; if that check fails (e.g. when the Old SM policy does not cover the accessed path), a SecurityException is thrown. Because the advice signature declares 'throws Exception' but the catch clause was IOException-only, the SecurityException propagated through the ByteBuddy advice and aborted the intercepted file operation, preventing Solr from starting. Fix: catch IOException | SecurityException and fall back to toAbsolutePath().normalize() in both cases. 2. Policy paths from agent-security.policy were stored verbatim after variable substitution without resolving symlinks. At runtime, resolveRealPath() returns the real (symlink-resolved) form, which would not match a symlinked policy path. Fix: normalize each FilePermission base path via toRealPath() (same fallback logic as the runtime check) when building the policy, so both sides of the comparison use the canonical form. Also updates the stale comment in agent-security.policy that said "FileInterceptor uses toAbsolutePath(), not toRealPath()".
Gradle's incremental-build state tracking reads the @OutputDirectory before running the task. Docker containers run as uid 8983 (the solr user); any files they create (e.g. logs/) are unreadable by the Gradle runner process, causing an AccessDeniedException before the test even starts. Adding doNotTrackState() bypasses the output-directory inspection so the task always runs regardless of prior leftover artifacts.
ByteBuddy inlines @Advice.OnMethodEnter bytecode into JDK methods (e.g. java.nio.file.Files). Any static helper called from inlined advice must be public; package-private access causes IllegalAccessError at runtime because the inlined call-site executes in the context of the JDK class, not FileInterceptor. This was causing Solr startup failures on Linux with JDK 21 when the Old SecurityManager policy loaded lazily via Files.newByteChannel.
dsmiley
left a comment
There was a problem hiding this comment.
Overall I'm very impressed with the thoroughness here. This is very complete and well thought-out.
It's not clear we'll want to bother running our normal Java tests with the securityh agent -- I don't think it's worth the hassle and maintenance of a parallel test policy.
Disclaimer: I didn't look at the bats tests. I skimmed the agent code, mostly.
There was a problem hiding this comment.
IMO org.apache.solr.servlet.CoreContainerProvider#init is a more suitable place. CoreContainer can be used for embedding and lighter-weight tests that shouldn't have to deal with initialization of this nature.
| |solr.security.agent.extra.policy||`${server.dir}/etc/agent-security-extra.policy`|Path to the operator extension policy file for the Solr security agent. Overrides the default location. An absent file is silently skipped. | ||
|
|
||
| |solr.security.agent.mode||`warn`|Enforcement mode for the Solr security agent: `warn` (log violations, continue) or `enforce` (log violations, block the operation with `SecurityException`). | ||
|
|
||
| |solr.security.agent.skip||`false`|If set to `true`, omits the `-javaagent:` flag from the JVM command line, disabling all Solr security agent controls. Intended for temporary troubleshooting only. |
There was a problem hiding this comment.
I assume these can't be set via env vars because these are pre-EnvUtils?
| * optionally all descendants) for the set of operations encoded in the {@code actions} string | ||
| * ({@code "read"}, {@code "write"}, {@code "delete"}, or any comma-separated combination). | ||
| */ | ||
| public final class PermittedPath { |
| * codeBase matching: the calling class must have been loaded from a location matching the JDK | ||
| * policy {@code codeBase} URL ({@code file:/path/-} recursive, {@code file:/path/to.jar} exact). | ||
| */ | ||
| public final class ApprovedCallSite { |
There was a problem hiding this comment.
not a record?
Please audit these classes for record suitability. Tip: IntelliJ will do the conversion for you and may even suggest it.
| * classes are visible to the bootstrap classloader; SLF4J is excluded from the fat JAR to avoid | ||
| * poisoning the binding before Log4j2 initializes. |
There was a problem hiding this comment.
I keep seeing this repeated; I don't think you need to say anything and not here. Just maybe the build.gradle loudly in the dependency list "// DO NOT DEPEND ON SLF4J OR OTHER LOGGING JARS" is fine to prevent future fools from doing so.
Inlined ByteBuddy advice calls enforceNetworkAccess() from within sun.nio.ch.SocketChannelImpl (java.base). Package-private access causes IllegalAccessError at runtime; public is required.
Any static helper called from @Advice.OnMethodEnter must be public; the advice is inlined into JDK methods (java.base) and package-private access causes IllegalAccessError at runtime.
Implements SIP-24 — a drop-in replacement for the JVM Security Manager (removed in JDK 24) using a Java agent built on ByteBuddy
@Adviceinterceptors.Default mode is
warn-only. No behaviour changes for existing deployments unless you explicitly opt in toenforcemode.Disclaimer: All code written by and reviewed by LLMs. Expected to still have some rough edges.
What this adds
A new Gradle subproject
solr/agent-sm/that builds a self-contained fat JAR. The startup scripts auto-detect and load it via-javaagent:if present insolr/server/lib/ext/. Four categories of protection:FileSystemProvidersubclasses,java.nio.file.Files,FileChannelsubtypes — read/write/delete/copy/move/createSocketChannel.connect(),Socket.connect()System.exit(),Runtime.halt()ProcessBuilder.start(),Runtime.exec()Violations are logged as
SECURITY VIOLATION [TYPE] target=… caller=… mode=…and counted in/admin/metricsunder the node registry as a single labeled countersolr_security_agent_violations_total{type="file"|"network"|"exit"|"exec"}in Prometheus format.Policy is read from
server/etc/agent-security.policy(JDK-style.policysyntax). Variable substitution expands properties (e.g.${solr.solr.home},${solr.port.listen}), with env-var fallback. Operators extend the policy viaserver/etc/agent-security-extra.policyorSOLR_SECURITY_AGENT_EXTRA_POLICY. Five modules with legitimate outbound network needs (jwt-auth,opentelemetry,s3-repository,gcs-repository,cross-dc-manager) are pre-permitted in the default policy.NOTE: The interceptors and policy parser/expander are forked and adapted from the OpenSearch agent-sm project (Apache 2.0). Attribution added in
NOTICE.txtandpackage-info.java.Code review guide
Core loop
SolrAgentEntryPoint.premain()— loads the policy and installs all interceptors. The fat JAR is placed onBoot-Class-Path(SLF4J is intentionally absent, so this is safe) so all classes are visible to the bootstrap classloader and can redefinejava.basemethods.FileInterceptor—@Advice.OnMethodEnteris inlined into JDK bytecode at load time, not called normally. The trusted-filesystem exemption short-circuits for non-file:providers (Lucene's in-memory FS, jar/zip/jrt schemes); Unix domain socket cleanup returns early. Paths are normalised viaPath.normalize()before the policy check to prevent..-traversal bypasses.StackCallerClassChainExtractor— call-chain extraction viaStackWalker. Used bySystemExitInterceptorandRuntimeHaltInterceptorto check whether any class in the full stack is an approved exit caller.PolicyLoader.load()— two-file merge (default + optional operator extension), variable substitution viaPolicyPropertyExpander,codeBase-scoped grants. Parsing is strict: any non-comment top-level token that is not agrantblock causes anIllegalStateException(fail-fast). An absent or comment-only operator extension file is silently accepted; the default policy must contain at least one grant.ViolationMetricsReporter—LongAddercounters only; incremented frompremainviaincrementFile/Network/Exit/Exec(). No OTel dependency. After startup,AgentViolationMetrics(insolr:core) reads the counts reflectively and wires them into the metrics registry.AgentViolationMetrics(org.apache.solr.security,solr:core) — registers a single OTel observable countersolr.security.agent.violationswith labeltype. UsesAttributeKey/Attributes/ObservableLongMeasurementnatively (no reflection for OTel types); only the four plain-longcounter reads cross the bootstrap/app classloader boundary via reflection.SecurityViolationLogger/AgentViolationBridge— two-phase output. During early startup violations go toSystem.err; onceCoreContainercallsAgentViolationBridge.wire()after Log4j2 is up, aConsumer<String>is set onSecurityViolationLogger.reporterand violations route tosolr.logatWARN. Reflection is used because the two modules have no compile-time dependency on each other.Wiring into Solr
bin/solr/bin/solr.cmd— detectsolr-agent-sm-*.jarinlib/ext/and prepend-javaagent:toSOLR_OPTS.SOLR_SECURITY_AGENT_SKIP=truedisables this.solr/server/build.gradle—libExtdependency on:solr:agent-smplaces the agent JAR inserver/lib/ext/in the packaged distribution.CoreContainer.java— callsAgentViolationMetrics.register(metricManager, NODE_REGISTRY)andAgentViolationBridge.wire().Tests
Unit tests (
solr/agent-sm/src/test/) run with the JVM Security Manager active, which causes 3 tests to be skipped. Notable classes:SolrAgentIntegrationTest— end-to-end in enforce mode with counter verificationSymlinkEscapeTest— symlink escape preventionPolicyLoaderOperatorExtensionTest— operator extension merge, DEFAULT/OPERATOR source tagging, fail-fast on garbage policy contentSystemExitInterceptorTest,SocketChannelInterceptorTest,ProcessExecInterceptorTestBATS integration tests (
solr/packaging/test/test_security_agent.bats) — 7 tests covering start-script logic and agent behaviour via small standalone violation programs:solr_security_agent_violations_total{type="file"}etc.)SOLR_SECURITY_AGENT_SKIP=truesuppresses agent entirelySOLR_SECURITY_AGENT_MODEand extra-policy env vars forwarded correctlySystem.exit, outbound connect, and process exec withSecurityExceptionTesting IRL
Things worth eyeballing manually:
server/etc/agent-security.policy— did we miss any legitimate Solr access patterns? Check intra-cluster wildcards (*:${solr.port.listen},*:${solr.zk.port}).-Ptests.useSecurityManager=falseto exercise the 3 normally-skipped tests.What this does NOT do
SolrPaths.assertPathAllowed()call sites — that method is now@Deprecatedbut left in place.https://issues.apache.org/jira/browse/SOLR-17767