Skip to content

SOLR-17767: Add a Java Agent (SIP-24)#4471

Open
janhoy wants to merge 62 commits into
apache:mainfrom
janhoy:15868-java-agent
Open

SOLR-17767: Add a Java Agent (SIP-24)#4471
janhoy wants to merge 62 commits into
apache:mainfrom
janhoy:15868-java-agent

Conversation

@janhoy
Copy link
Copy Markdown
Contributor

@janhoy janhoy commented May 27, 2026

Implements SIP-24 — a drop-in replacement for the JVM Security Manager (removed in JDK 24) using a Java agent built on ByteBuddy @Advice interceptors.

Default mode is warn-only. No behaviour changes for existing deployments unless you explicitly opt in to enforce mode.

Disclaimer: All code written by and reviewed by LLMs. Expected to still have some rough edges.

What this adds

A new Gradle subproject solr/agent-sm/ that builds a self-contained fat JAR. The startup scripts auto-detect and load it via -javaagent: if present in solr/server/lib/ext/. Four categories of protection:

Category Intercepted JDK entry points
File access FileSystemProvider subclasses, java.nio.file.Files, FileChannel subtypes — read/write/delete/copy/move/create
Network access SocketChannel.connect(), Socket.connect()
JVM exit System.exit(), Runtime.halt()
Process exec ProcessBuilder.start(), Runtime.exec()

Violations are logged as SECURITY VIOLATION [TYPE] target=… caller=… mode=… and counted in /admin/metrics under the node registry as a single labeled counter solr_security_agent_violations_total{type="file"|"network"|"exit"|"exec"} in Prometheus format.

Policy is read from server/etc/agent-security.policy (JDK-style .policy syntax). Variable substitution expands properties (e.g. ${solr.solr.home}, ${solr.port.listen}), with env-var fallback. Operators extend the policy via server/etc/agent-security-extra.policy or SOLR_SECURITY_AGENT_EXTRA_POLICY. Five modules with legitimate outbound network needs (jwt-auth, opentelemetry, s3-repository, gcs-repository, cross-dc-manager) are pre-permitted in the default policy.

NOTE: The interceptors and policy parser/expander are forked and adapted from the OpenSearch agent-sm project (Apache 2.0). Attribution added in NOTICE.txt and package-info.java.

Code review guide

Core loop

  1. SolrAgentEntryPoint.premain() — loads the policy and installs all interceptors. The fat JAR is placed on Boot-Class-Path (SLF4J is intentionally absent, so this is safe) so all classes are visible to the bootstrap classloader and can redefine java.base methods.

  2. FileInterceptor@Advice.OnMethodEnter is inlined into JDK bytecode at load time, not called normally. The trusted-filesystem exemption short-circuits for non-file: providers (Lucene's in-memory FS, jar/zip/jrt schemes); Unix domain socket cleanup returns early. Paths are normalised via Path.normalize() before the policy check to prevent ..-traversal bypasses.

  3. StackCallerClassChainExtractor — call-chain extraction via StackWalker. Used by SystemExitInterceptor and RuntimeHaltInterceptor to check whether any class in the full stack is an approved exit caller.

  4. PolicyLoader.load() — two-file merge (default + optional operator extension), variable substitution via PolicyPropertyExpander, codeBase-scoped grants. Parsing is strict: any non-comment top-level token that is not a grant block causes an IllegalStateException (fail-fast). An absent or comment-only operator extension file is silently accepted; the default policy must contain at least one grant.

  5. ViolationMetricsReporterLongAdder counters only; incremented from premain via incrementFile/Network/Exit/Exec(). No OTel dependency. After startup, AgentViolationMetrics (in solr:core) reads the counts reflectively and wires them into the metrics registry.

  6. AgentViolationMetrics (org.apache.solr.security, solr:core) — registers a single OTel observable counter solr.security.agent.violations with label type. Uses AttributeKey/Attributes/ObservableLongMeasurement natively (no reflection for OTel types); only the four plain-long counter reads cross the bootstrap/app classloader boundary via reflection.

  7. SecurityViolationLogger / AgentViolationBridge — two-phase output. During early startup violations go to System.err; once CoreContainer calls AgentViolationBridge.wire() after Log4j2 is up, a Consumer<String> is set on SecurityViolationLogger.reporter and violations route to solr.log at WARN. Reflection is used because the two modules have no compile-time dependency on each other.

Wiring into Solr

  • bin/solr / bin/solr.cmd — detect solr-agent-sm-*.jar in lib/ext/ and prepend -javaagent: to SOLR_OPTS. SOLR_SECURITY_AGENT_SKIP=true disables this.
  • solr/server/build.gradlelibExt dependency on :solr:agent-sm places the agent JAR in server/lib/ext/ in the packaged distribution.
  • CoreContainer.java — calls AgentViolationMetrics.register(metricManager, NODE_REGISTRY) and AgentViolationBridge.wire().

Tests

Unit tests (solr/agent-sm/src/test/) run with the JVM Security Manager active, which causes 3 tests to be skipped. Notable classes:

  • SolrAgentIntegrationTest — end-to-end in enforce mode with counter verification
  • SymlinkEscapeTest — symlink escape prevention
  • PolicyLoaderOperatorExtensionTest — operator extension merge, DEFAULT/OPERATOR source tagging, fail-fast on garbage policy content
  • SystemExitInterceptorTest, SocketChannelInterceptorTest, ProcessExecInterceptorTest

BATS integration tests (solr/packaging/test/test_security_agent.bats) — 7 tests covering start-script logic and agent behaviour via small standalone violation programs:

  • Test 1 — warn mode active by default; metrics registered (solr_security_agent_violations_total{type="file"} etc.)
  • Test 2SOLR_SECURITY_AGENT_SKIP=true suppresses agent entirely
  • Test 3SOLR_SECURITY_AGENT_MODE and extra-policy env vars forwarded correctly
  • Tests 4–7 — enforce mode blocks file read, System.exit, outbound connect, and process exec with SecurityException

Testing IRL

# Unit tests (62 tests, 3 skipped under SecurityManager)
./gradlew :solr:agent-sm:test

# BATS integration tests
./gradlew :solr:packaging:integrationTests --tests test_security_agent.bats

# Start Solr in warn mode (default) and check violation counters
bin/solr start
curl http://localhost:8983/solr/admin/metrics
# Look for: solr_security_agent_violations_total{type="file"}, ...{type="network"}, etc.

# Opt in to enforce mode
SOLR_SECURITY_AGENT_MODE=enforce bin/solr start

Things worth eyeballing manually:

  • server/etc/agent-security.policy — did we miss any legitimate Solr access patterns? Check intra-cluster wildcards (*:${solr.port.listen}, *:${solr.zk.port}).
  • Run tests with -Ptests.useSecurityManager=false to exercise the 3 normally-skipped tests.

What this does NOT do

  • No enforce mode in the broader Solr test suite yet (warn only; enforce flip is a follow-up per SIP-24).
  • Does not remove existing SolrPaths.assertPathAllowed() call sites — that method is now @Deprecated but left in place.

https://issues.apache.org/jira/browse/SOLR-17767

janhoy added 8 commits May 26, 2026 20:20
…tion

- solr/server/build.gradle: add agent fat JAR to libExt (transitive=false)
  so it lands in server/lib/ext/ of the packaged distribution used by BATS
- CoreContainer.java: fix Class.forName to use CoreContainer's classloader
  instead of null (bootstrap); the agent JAR is in lib/ext/ loaded by
  Jetty's server classloader, not the bootstrap classloader
- ViolationMetricsReporter: replace Codahale registerGauge (non-existent
  in this OTel-based branch) with observableLongCounter reflective call;
  remove buildGauge/Proxy; metrics appear in Prometheus format as
  security_agent_violations_{file,network,exit,exec}_total
- test_security_agent.bats: add --user-managed to avoid embedded ZooKeeper
  crash; update metrics assertion to Prometheus format name; fix metrics
  endpoint URL (no group/prefix params); all 3 tests now pass
Fix sysprop names
Add BATS tests to verify actual invocation
…it/network/exec; remove redundant VirtualThreadCompatibilityTest

- Add testPrograms sourceSet in agent-sm compiling FileViolation, ExitViolation,
  NetworkViolation, ExecViolation into a standalone JAR (no Solr/ByteBuddy deps).
- Wire AGENT_TEST_PROGRAMS_JAR into integrationTests task; BATS test #4 drops
  the inline javac+heredoc in favour of the pre-compiled JAR.
- Add BATS tests #5 (System.exit), #6 (SocketChannel.connect to RFC 5737 TEST-NET-1),
  #7 (ProcessBuilder exec) — all verify SecurityException in enforce mode.
- Delete VirtualThreadCompatibilityTest: StackWalker virtual-thread compatibility
  is guaranteed by JDK spec and already exercised by other unit tests.
- Fix rat-sources.gradle: exclude src/test-programs/** for :solr:agent-sm.
…atus != 123 in BATS

System.exit(0) is unreachable after a successful exit, making refute_output vacuous
and masking a broken agent (exit code 0 would trigger assert_failure for the wrong
reason). Using exit code 123 as a sentinel lets the BATS test explicitly verify that
the interceptor fired (SecurityException in output, status != 123).
# Conflicts:
#	solr/core/gradle.lockfile
#	solr/solrj-zookeeper/gradle.lockfile
#	solr/test-framework/gradle.lockfile
janhoy added 11 commits May 27, 2026 15:52
…n-walk exec, codeBase matching

Part A — stale class names:
- dev-docs/security-agent.adoc: SolrSecurityPolicy → AgentPolicy, StackInspector → StackCallerClassChainExtractor
- build.gradle comment: SolrSecurityPolicy, FileAccessInterceptor → AgentPolicy, FileInterceptor

B1 — FileInterceptor "read" was a no-op; add explicit policy check for Files.read() path
B2 — Remove tautological `args instanceof Object` guard (always true)
B4 — SocketChannelInterceptor: implement codeBase matching via StackWalker instead of silently
     skipping codeBase-scoped entries; add isCallerFromCodeBase() helper + tests
B5 — ProcessExecInterceptor: walk full chain (like SystemExitInterceptor) instead of checking
     only direct caller; add AgentPolicy.isChainThatCanExec() + tests
B6 — Remove 0.0.0.0 from trustedHosts (unspecified bind address, not loopback)
B7 — PermittedPath: use java.io.File.separator instead of hardcoded "/" and "\\"
B8 — SolrAgentEntryPoint: .or(isAbstract()) → .and(not(isAbstract())) in file method matcher
B9 — FileInterceptor: remove redundant second getCallerClass() call inside delete branch
B10 — AgentPolicy.resetForTesting(): add comment explaining deliberate unsynchronized write
B11 — Rename bootOtel → bootAgent in SolrAgentEntryPoint
B12 — Fix typo: testApprovedCallerAnywherInChain → testApprovedCallerAnywhereInChain
B13 — FileInterceptor: move resolveRealPath() Javadoc above the correct method; note why
      toRealPath() must not be called from the advice (circular interception)
B14 — ViolationMetricsReporter.findMethod(): match by param type simple names, not just count
B15 — SecurityViolationLogger: clarify stack-trace comment; deduplicate debug block
…onventions

All Solr metrics should start with "solr.*". Rename:
  security.agent.violations.{file,network,exit,exec}
→ solr.security.agent.violations.{file,network,exit,exec}

In Prometheus format: solr_security_agent_violations_*_total.
Update BATS test assertion, ref-guide, and Javadoc accordingly.
…Programs; use NIO FileSystems.getSeparator() in PermittedPath

- testProgramsAnnotationProcessor config was pulling in error-prone deps not in
  the dependency lock file; disable annotation processing for trivial test programs.
- PermittedPath used java.io.File (forbidden API); switch to
  FileSystems.getDefault().getSeparator() per Solr's NIO-only policy.
…Interceptor

Add `import java.net.URL;`, replace `java.net.URL loc` with `URL loc`,
and replace `java.net.InetSocketAddress` parameter type with `InetSocketAddress`
since the type is already imported. Fixes CI error-prone -Werror failures.
…yPoint

Add imports for SocketChannel, Socket and use simple names for System,
Runtime, ProcessBuilder — all were already in java.lang or now imported.
Fixes CI error-prone -Werror failures.
… agent-sm

- build.gradle: disable error-prone for compileTestProgramsJava (no errorprone
  JAR on testProgramsAnnotationProcessor config)
- SymlinkEscapeTest, SocketChannelInterceptorTest, SolrAgentIntegrationTest:
  add missing imports and remove FQ type references

All 59 tests pass with -Pvalidation.errorprone=true.
janhoy added 5 commits May 28, 2026 12:56
The Prometheus output includes OTel scope labels alongside the type label
(e.g. otel_scope_name="org.apache.solr",type="file"), so asserting
on 'solr_security_agent_violations_total{type="file"' fails. Assert
on the base metric name only since the presence of the counter is what
matters.
Throwing SecurityException unconditionally for an unrecognised args[1]
type bypassed the violation counter and structured log, and would break
legitimate file operations in warn mode. Unknown types are now treated
conservatively as mutating (write check applied).
The Javadoc claimed String#matches (full regex) with patterns like
'org\.apache\.solr\..*'. The actual implementation in
ApprovedCallSite.matches() uses equals/startsWith semantics: '*' is
wildcard, 'pkg.*' matches the package and sub-packages, anything else
is an exact class-name match.
On Windows, CodeSource.getLocation().getPath() may return backslash
separators while the codeBase string from the policy file uses forward
slashes. Normalize both strings to forward slashes before the
startsWith/equals comparison so the recursive codeBase check works
correctly on all platforms.
The interceptor was using toAbsolutePath().normalize() which does not
follow symbolic links, allowing a symlink inside a permitted directory
that points outside it to bypass the policy check.

Replace with resolveRealPath() which calls toRealPath() to resolve the
real on-disk path before the policy check. A ThreadLocal guard
(IN_SYMLINK_RESOLVE) prevents infinite recursion in case toRealPath()
itself triggers an intercepted file operation. On IOException (path
does not yet exist), falls back to toAbsolutePath().normalize().

The same fix is applied to the copy/move target-path resolution.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 58 out of 62 changed files in this pull request and generated 4 comments.

Comment thread solr/agent-sm/src/java/org/apache/solr/security/agent/ApprovedCallSite.java Outdated
Comment thread solr/server/etc/agent-security.policy
janhoy added 2 commits May 28, 2026 13:26
ByteBuddy binds @origin Method on every intercepted connect() call even
though the parameter was never read. Removing it eliminates a reflective
Method lookup on each outbound connection attempt.
When a policy grant block included a codeBase attribute, e.g.:

  grant codeBase "file:/opt/solr/modules/foo/-" {
    permission java.lang.RuntimePermission "exitVM";
  };

PolicyLoader was storing the codeBase URL string as the
classNamePattern in ApprovedCallSite. ApprovedCallSite.matches() then
compared it against fully-qualified class names, which can never match
a "file:/.../..." URL string, making all codeBase-scoped exit/exec
grants silently no-ops.

Fix: ApprovedCallSite now carries a separate codeBase field. When
codeBase is set, matchesCodeBase(Class<?>) is used (delegating to the
same SocketChannelInterceptor.isCallerFromCodeBase logic used for
network grants). AgentPolicy.isChainThatCanExit/Exec dispatch to the
right matcher depending on which field is set. The simple
isExitApproved/isExecApproved(String) helpers skip codeBase entries
since they have no Class<?> to inspect.

Also correct the ApprovedCallSite Javadoc (previously claimed SolrCLI
and SolrDispatchFilter are default-approved exit callers, which was
never true), and add an explanatory comment in agent-security.policy
documenting why exitVM grants are intentionally absent from the
default policy.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 58 out of 62 changed files in this pull request and generated no new comments.

janhoy added 2 commits May 28, 2026 13:46
actions.contains(action) does substring matching, so a malformed policy
string like "readwrite" (missing comma) would incorrectly match both
"read" and "write". Split the stored actions string into a Set at
construction time and use Set.contains() for exact token matching.
@janhoy janhoy marked this pull request as ready for review May 28, 2026 11:53
Policy files use "[::1]:8983" bracket notation for IPv6 but
InetSocketAddress.getHostString() returns "::1" without brackets.
lastIndexOf(':') found the colon after ']' correctly, but the extracted
entryHost retained the brackets and never equaled the actual host string.

Parse bracket-notation IPv6 entries separately: strip the '[' and ']'
before passing to matchesHost so "::1" matches "[::1]:8983".

Add two tests covering the bracket notation with an exact port and a
port range.
@janhoy janhoy requested review from dsmiley, gus-asf and malliaridis May 28, 2026 11:57
@janhoy
Copy link
Copy Markdown
Contributor Author

janhoy commented May 29, 2026

Docker and BATS tests currently fail. I believe it is due to resolving of symlinks on linux and translating paths with toRealPath() gives a different absolute path than what is predefined in the policy file. It works locally on a Mac. So maybe we need to apply the same toRealPath() and Path.normalize() steps to the paths provided in the policy through ${solr.install.dir} etc before comparison? Currently investigating...

janhoy added 5 commits May 29, 2026 23:40
Two bugs that cause Solr to fail to start when the Old Java
SecurityManager is active (SOLR_SECURITY_MANAGER_ENABLED=true, the
default for Java < 24):

1. resolveRealPath() only caught IOException, not SecurityException.
   On JDK 21 and earlier, Path.toRealPath() calls sm.checkRead() on
   the original path internally; if that check fails (e.g. when the
   Old SM policy does not cover the accessed path), a SecurityException
   is thrown. Because the advice signature declares 'throws Exception'
   but the catch clause was IOException-only, the SecurityException
   propagated through the ByteBuddy advice and aborted the intercepted
   file operation, preventing Solr from starting.
   Fix: catch IOException | SecurityException and fall back to
   toAbsolutePath().normalize() in both cases.

2. Policy paths from agent-security.policy were stored verbatim after
   variable substitution without resolving symlinks. At runtime,
   resolveRealPath() returns the real (symlink-resolved) form, which
   would not match a symlinked policy path. Fix: normalize each
   FilePermission base path via toRealPath() (same fallback logic as
   the runtime check) when building the policy, so both sides of the
   comparison use the canonical form.

Also updates the stale comment in agent-security.policy that said
"FileInterceptor uses toAbsolutePath(), not toRealPath()".
Gradle's incremental-build state tracking reads the @OutputDirectory
before running the task. Docker containers run as uid 8983 (the solr
user); any files they create (e.g. logs/) are unreadable by the
Gradle runner process, causing an AccessDeniedException before the
test even starts.

Adding doNotTrackState() bypasses the output-directory inspection so
the task always runs regardless of prior leftover artifacts.
ByteBuddy inlines @Advice.OnMethodEnter bytecode into JDK methods (e.g.
java.nio.file.Files). Any static helper called from inlined advice must be
public; package-private access causes IllegalAccessError at runtime because
the inlined call-site executes in the context of the JDK class, not
FileInterceptor. This was causing Solr startup failures on Linux with JDK 21
when the Old SecurityManager policy loaded lazily via Files.newByteChannel.
Copy link
Copy Markdown
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm very impressed with the thoroughness here. This is very complete and well thought-out.

It's not clear we'll want to bother running our normal Java tests with the securityh agent -- I don't think it's worth the hassle and maintenance of a parallel test policy.

Disclaimer: I didn't look at the bats tests. I skimmed the agent code, mostly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO org.apache.solr.servlet.CoreContainerProvider#init is a more suitable place. CoreContainer can be used for embedding and lighter-weight tests that shouldn't have to deal with initialization of this nature.

Comment thread solr/docker/build.gradle
Comment on lines +95 to +99
|solr.security.agent.extra.policy||`${server.dir}/etc/agent-security-extra.policy`|Path to the operator extension policy file for the Solr security agent. Overrides the default location. An absent file is silently skipped.

|solr.security.agent.mode||`warn`|Enforcement mode for the Solr security agent: `warn` (log violations, continue) or `enforce` (log violations, block the operation with `SecurityException`).

|solr.security.agent.skip||`false`|If set to `true`, omits the `-javaagent:` flag from the JVM command line, disabling all Solr security agent controls. Intended for temporary troubleshooting only.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume these can't be set via env vars because these are pre-EnvUtils?

* optionally all descendants) for the set of operations encoded in the {@code actions} string
* ({@code "read"}, {@code "write"}, {@code "delete"}, or any comma-separated combination).
*/
public final class PermittedPath {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a record?

* codeBase matching: the calling class must have been loaded from a location matching the JDK
* policy {@code codeBase} URL ({@code file:/path/-} recursive, {@code file:/path/to.jar} exact).
*/
public final class ApprovedCallSite {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a record?

Please audit these classes for record suitability. Tip: IntelliJ will do the conversion for you and may even suggest it.

Comment on lines +35 to +36
* classes are visible to the bootstrap classloader; SLF4J is excluded from the fat JAR to avoid
* poisoning the binding before Log4j2 initializes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I keep seeing this repeated; I don't think you need to say anything and not here. Just maybe the build.gradle loudly in the dependency list "// DO NOT DEPEND ON SLF4J OR OTHER LOGGING JARS" is fine to prevent future fools from doing so.

janhoy added 3 commits May 30, 2026 11:40
Inlined ByteBuddy advice calls enforceNetworkAccess() from within
sun.nio.ch.SocketChannelImpl (java.base). Package-private access causes
IllegalAccessError at runtime; public is required.
Any static helper called from @Advice.OnMethodEnter must be public;
the advice is inlined into JDK methods (java.base) and package-private
access causes IllegalAccessError at runtime.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants