Honor LLM_TIMEOUT, expose generation params, and harden black-box reports #549

Open
VoidChecksum wants to merge 5 commits into usestrix:main from VoidChecksum:fix/llm-config-and-blackbox-reporting

@VoidChecksum VoidChecksum commented Jun 9, 2026

Four small, independently verified LLM/reporting fixes. Per maintainer request these are bundled into a single PR rather than four; each commit is self-contained and references its tracking issue, so they can be split or cherry-picked individually if preferred.

1. fix(llm): honor LLM_TIMEOUT during scans, not just warm-up (closes #426)

LLM_TIMEOUT is documented as the request timeout and is the recommended workaround for slow local models, but it was only applied to the warm-up call in strix/interface/main.py. Scan calls go through the SDK's LiteLLM model, which invokes litellm.acompletion without a timeout and falls back to LiteLLM's module default (6000s), so export LLM_TIMEOUT=600 had no effect on the actual run. _configure_litellm_request_timeout now sets litellm.request_timeout from settings inside configure_sdk_model_defaults.
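
A minimal sketch of the mechanism described above, not the PR's actual code. `fake_litellm` is a hypothetical stand-in for the `litellm` module so the example runs without the dependency, and `complete` is an invented helper that mimics a call falling back to the module-level default when no per-request timeout is passed.

```python
from types import SimpleNamespace
from typing import Optional

# Hypothetical stand-in for the litellm module. 6000 mirrors the module
# default cited in the description above.
fake_litellm = SimpleNamespace(request_timeout=6000)


def complete(prompt: str, timeout: Optional[float] = None) -> float:
    """Return the timeout a call would use (stand-in for acompletion)."""
    return timeout if timeout is not None else fake_litellm.request_timeout


def configure_request_timeout(timeout: int) -> None:
    """Set the module-level default, which later calls fall back to."""
    fake_litellm.request_timeout = timeout


before = complete("scan step")   # no per-call timeout: module default applies
configure_request_timeout(600)   # e.g. export LLM_TIMEOUT=600
after = complete("scan step")    # now picks up the configured value
```

Because scan calls pass no explicit timeout, changing the module-level default is the only lever this sketch has, which matches the reasoning given for the fix.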

2. fix(reporting): drop fabricated code_locations in black-box scans (closes #321)

In a black-box scan no source tree exists, yet create_vulnerability_report accepted code_locations unconditionally, letting the model fabricate file paths/line numbers into the customer-facing report. The existing is_whitebox flag is now threaded into the tool run-context (it propagates to child agents via dict(parent_ctx)), and code_locations are dropped when the scan is black-box; black-box guidance was added to the tool docstring and system prompt.
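
The propagation and the drop can be sketched roughly as follows, using the `is_whitebox` guard as first described here. The helper names and the shape of a `code_locations` entry are assumptions for illustration; only `dict(parent_ctx)` and the `is_whitebox` key come from the description.

```python
from typing import Any, Optional


def start_child_context(parent_ctx: dict) -> dict:
    """Shallow-copy the parent context so a child agent inherits its flags."""
    return dict(parent_ctx)


def filter_code_locations(ctx: dict, code_locations: Optional[list]) -> Optional[list]:
    """Drop code_locations when the context says no source tree is available."""
    if not ctx.get("is_whitebox") and code_locations:
        return None
    return code_locations


root_ctx: dict[str, Any] = {"is_whitebox": False}   # black-box scan
child_ctx = start_child_context(root_ctx)           # child inherits the flag

# Hypothetical entry shape, for illustration only.
locations = [{"file": "src/db.py", "line": 42}]

dropped = filter_code_locations(child_ctx, locations)
kept = filter_code_locations({"is_whitebox": True}, locations)
```

A missing key behaves like a black-box scan here, since `ctx.get("is_whitebox")` returns `None`; whether that fail-closed default matches the real tool is not stated in the description.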

3. feat(llm): expose temperature, top_p, max_tokens (closes #514)

Adds optional STRIX_LLM_TEMPERATURE / STRIX_LLM_TOP_P / STRIX_LLM_MAX_TOKENS, threaded through make_model_settings into the SDK ModelSettings. All unset by default → provider defaults, so no behavior change. Params a model rejects are dropped automatically (litellm.drop_params is already enabled).
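
As a rough illustration of "unset means provider default", the sketch below reads the three variables from a mapping and leaves each field as `None` when absent. `ModelSettingsSketch` and `make_model_settings_sketch` are hypothetical stand-ins, not the SDK's `ModelSettings` or the PR's `make_model_settings`.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


@dataclass
class ModelSettingsSketch:
    """Hypothetical stand-in for the SDK's ModelSettings."""
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_tokens: Optional[int] = None


def _parse(env: Mapping[str, str], name: str, cast: Callable):
    """Return the cast value, or None when the variable is unset or empty."""
    raw = env.get(name)
    return cast(raw) if raw not in (None, "") else None


def make_model_settings_sketch(env: Mapping[str, str]) -> ModelSettingsSketch:
    return ModelSettingsSketch(
        temperature=_parse(env, "STRIX_LLM_TEMPERATURE", float),
        top_p=_parse(env, "STRIX_LLM_TOP_P", float),
        max_tokens=_parse(env, "STRIX_LLM_MAX_TOKENS", int),
    )


unset = make_model_settings_sketch({})
tuned = make_model_settings_sketch({"STRIX_LLM_TEMPERATURE": "0.2"})
```

With nothing set, every field stays `None`, which is the "no behavior change" case; setting one variable affects only that field.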

4. docs(local-models): document context-window sizing (closes #286)

Documents that agentic prompts easily exceed small local-runtime context windows (Ollama defaults to ~4096 tokens) and how to raise it. Docs only.
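
To make the sizing concern concrete, here is a back-of-the-envelope check. The four-characters-per-token ratio is a crude heuristic (real tokenizers vary by model), and the prompt sizes and the output reserve are invented numbers, not measurements from Strix.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers vary by model."""
    return int(len(text) / chars_per_token) + 1


def fits_context(parts: list, context_window: int, reserve_for_output: int = 512) -> bool:
    """Check whether the prompt parts plus an output reserve fit the window."""
    used = sum(estimate_tokens(p) for p in parts)
    return used + reserve_for_output <= context_window


# Illustrative sizes only: system prompt, tool schema, conversation history.
prompt_parts = ["s" * 12_000, "t" * 8_000, "h" * 4_000]

fits_small = fits_context(prompt_parts, context_window=4096)
fits_large = fits_context(prompt_parts, context_window=32_768)
```

Under these assumed sizes the prompt overflows a 4096-token window well before any history accumulates, which is the situation the documentation change warns about.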

Verification

  • ruff 0.11.13 (lint + format), mypy 1.16.0 (strict, --install-types), and bandit are green on the full changed set.
  • Behavioral checks run locally for each change: the timeout setter, the black-box code_locations drop (and whitebox-preserve), and generation-param threading with None defaults.
  • strix --help and importing every changed module both succeed.
  • No test harness is added — the project verifies statically, so introducing pytest infra would be out of scope for these fixes.

Strix's agentic prompts (large system prompt + tool schema + growing
history) exceed the small default context many local runtimes use
(Ollama defaults to ~4096 tokens), which silently clips context and
manifests as looping agents, truncated/unexecuted tool calls, and
mid-scan failures.

Add a "Context Window Sizing" section to the local-models docs: the
4096 pitfall, how to raise it (Ollama OLLAMA_CONTEXT_LENGTH / Modelfile
num_ctx, LM Studio, llama.cpp -c, vLLM --max-model-len), symptoms of
clipping for self-diagnosis, recommended minimum per scan mode, and a
note on the VRAM tradeoff.

Fixes usestrix#286

Add optional generation parameters so users can steer model behavior,
particularly useful for local / OpenAI-compatible models that need a
lower temperature for steadier tool calling (see usestrix#514).

- STRIX_LLM_TEMPERATURE, STRIX_LLM_TOP_P, STRIX_LLM_MAX_TOKENS
  (all unset by default -> provider defaults, so no behavior change).
- Threaded through make_model_settings into the SDK ModelSettings used
  for the scan. Params a given model rejects are dropped automatically
  (litellm.drop_params is already enabled).
- Documented in docs/advanced/configuration.mdx.

Closes usestrix#514

In a black-box scan no source tree is available, yet
create_vulnerability_report accepted code_locations unconditionally and
render_vulnerability_md emitted a "Code Analysis" section from them. The
model could therefore fabricate file paths, line numbers, and snippets
into the customer-facing report (usestrix#321).

Thread the existing is_whitebox flag into the tool run-context (it
propagates to child agents via dict(parent_ctx)) and drop code_locations
when the scan is black-box. Also add black-box guidance to the reporting
tool docstring and the system prompt so the model does not assert source
locations it cannot see.

Fixes usestrix#321

LLM_TIMEOUT is documented (docs/advanced/configuration.mdx) as the
request timeout for LLM calls and is offered as the workaround for slow
local models, but it was only applied to the warm-up call in
strix/interface/main.py. Scan LLM calls go through the SDK's LiteLLM
model, which invokes litellm.acompletion without a timeout and falls
back to LiteLLM's module default (6000s) -- so `export LLM_TIMEOUT=600`
had no effect on the actual run.

Set litellm.request_timeout from settings in configure_sdk_model_defaults
so the documented setting takes effect for LiteLLM-routed scans.

Fixes usestrix#426

greptile-apps Bot commented Jun 9, 2026

Contributor

Greptile Summary

This PR bundles four self-contained fixes: honoring LLM_TIMEOUT in scan calls (guarded by model_fields_set so users without the env var keep LiteLLM's default), dropping fabricated code_locations in true black-box scans while correctly preserving them for repository-type targets via the new source_in_scope flag, exposing temperature/top_p/max_tokens as optional generation params with Pydantic range constraints and None defaults, and documenting context-window sizing for local models.

  • LLM_TIMEOUT fix: _configure_litellm_request_timeout is called only when LLM_TIMEOUT is explicitly in the environment (model_fields_set check), so users who never set it are not silently capped at the 300 s Pydantic default.
  • Black-box code_locations fix: source_in_scope = is_whitebox or (type == "repository") is threaded into the root context and inherited by child agents via dict(parent_ctx) in _start_child_runner; the reporting tool drops code_locations only when this flag is falsy.
  • Generation params: Pydantic ge/le constraints on temperature and top_p surface bad values at config-load time; litellm.drop_params handles provider-level incompatibilities.

Confidence Score: 4/5

Safe to merge with awareness that litellm.request_timeout is now mutated as a module-level global on every scan start, which can affect concurrent in-flight requests if two scans with different timeouts overlap.

The _configure_litellm_request_timeout call writes to litellm.request_timeout, a module-level variable read by every acompletion call. Because it is invoked inside run_strix_scan rather than once at process startup, two concurrent scans with different LLM_TIMEOUT values can race to overwrite it mid-flight, potentially applying the wrong timeout to an already in-flight request. The model_fields_set guard correctly avoids the 300 s default regression, but the global-mutation race is a real production risk for any host running parallel scans.

strix/config/models.py — the new _configure_litellm_request_timeout function and its call site in configure_sdk_model_defaults.

Important Files Changed

| Filename | Overview |
| --- | --- |
| strix/config/models.py | Adds _configure_litellm_request_timeout called conditionally via model_fields_set check; extends the pre-existing module-level litellm global mutation pattern to request_timeout, which is called on every scan start and affects all in-flight concurrent requests. |
| strix/config/settings.py | Adds temperature, top_p, max_tokens optional fields with correct Pydantic ge/le constraints and None defaults; no behavior change when unset. |
| strix/core/runner.py | Correctly introduces source_in_scope (covers both local_code and repository scan types), threads it into the root context for child propagation, and wires the three new generation params into make_model_settings. |
| strix/tools/reporting/tool.py | Guards code_locations on the new source_in_scope context key, correctly preserving them for repository and local-code scans while dropping fabricated paths for true black-box targets. |
| strix/core/inputs.py | Adds temperature, top_p, max_tokens as optional kwargs to make_model_settings and passes them unconditionally to ModelSettings; None values are handled by the SDK/litellm layer. |
| strix/agents/prompts/system_prompt.jinja | Adds one-line black-box reporting instruction to the execution guidelines; no logic change. |
| docs/advanced/configuration.mdx | Documents the three new generation params; temperature range 0.0–2.0 is the constraint enforced by Strix but provider-specific narrower ranges (e.g. Anthropic: 0.0–1.0) are not mentioned. |
| docs/llm-providers/local.mdx | Adds context-window sizing guidance for Ollama, llama.cpp, LM Studio, and vLLM; docs-only change. |

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
docs/advanced/configuration.mdx:33-35
**Out-of-range temperature causes API errors, not silent drops**

The description says "Parameters unsupported by a given model are dropped automatically," but `litellm.drop_params` only removes parameters that the model does not accept **at all** — it does not sanitize out-of-range values. A user who sets `STRIX_LLM_TEMPERATURE=1.5` and points Strix at an Anthropic model (valid range: `0.0–1.0`) will get a provider API error at runtime rather than the graceful behaviour the copy implies. Clarifying that the statement refers to unsupported param *names* rather than out-of-range *values* would prevent confusion.
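
The distinction between unsupported names and out-of-range values can be sketched as below. `drop_unsupported_names` is a hypothetical helper written for this illustration; it is not LiteLLM's implementation of `drop_params`, and the supported-name set is invented.

```python
def drop_unsupported_names(params: dict, supported: set) -> dict:
    """Keep only parameters whose names are accepted; values pass through as-is."""
    return {name: value for name, value in params.items() if name in supported}


# Invented example: the model accepts these three names and nothing else.
supported_names = {"temperature", "top_p", "max_tokens"}

requested = {"temperature": 1.5, "top_p": 0.9, "logit_bias": {}}
sent = drop_unsupported_names(requested, supported_names)
```

The unknown name `logit_bias` is removed, but `temperature=1.5` is forwarded unchanged, so a provider with a narrower valid range would still reject the request.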

Reviews (2): Last reviewed commit: "Address review feedback on #549"

Comment thread strix/tools/reporting/tool.py Outdated
Comment on lines +492 to +498
if not inner.get("is_whitebox") and code_locations:
    # Black-box scan: no source tree is available, so any file paths /
    # line numbers / snippets in code_locations can only be fabricated.
    # Drop them so a hallucinated "Code Analysis" section can never reach
    # the customer-facing report (#321).
    logger.info("Black-box scan: dropping code_locations from report %r", title)
    code_locations = None

P1 is_whitebox=False covers repository scans that have cloned source

is_whitebox is False for any target whose type is not "local_code" — including "repository" scans where the code is actually cloned to /workspace. For those scans the agent receives the workspace path in its task prompt (see build_root_task) and can legitimately read the source and populate accurate code_locations. After this change those valid locations are silently dropped, producing vulnerability reports with no code_locations even though real line numbers were identified.

The guard should distinguish between scans that are truly source-free (e.g., web_application, ip_address) and scans that simply use the "repository" target type. A concrete failure: a user scans a cloned repo via the "repository" target type; the agent identifies an injected SQL query at src/db.py:42; the report arrives at the customer with code_locations removed and a log line reading "Black-box scan: dropping code_locations".
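
One way to express the distinction this comment asks for is to key the guard on whether source is in scope rather than on `is_whitebox`. The sketch below follows the `source_in_scope = is_whitebox or (type == "repository")` form quoted in the review summary; the function names and the entry shape are assumptions.

```python
from typing import Optional


def compute_source_in_scope(is_whitebox: bool, target_type: str) -> bool:
    """Source is available for local-code (whitebox) and repository targets."""
    return is_whitebox or target_type == "repository"


def filter_code_locations(ctx: dict, code_locations: Optional[list]) -> Optional[list]:
    """Drop code_locations only for scans with no source in scope."""
    if not ctx.get("source_in_scope") and code_locations:
        return None
    return code_locations


# Hypothetical entry shape, for illustration only.
locations = [{"file": "src/db.py", "line": 42}]

repo_ctx = {"source_in_scope": compute_source_in_scope(False, "repository")}
web_ctx = {"source_in_scope": compute_source_in_scope(False, "web_application")}

repo_result = filter_code_locations(repo_ctx, locations)
web_result = filter_code_locations(web_ctx, locations)
```

With this guard the repository scan from the failure scenario keeps its `src/db.py:42` location, while a `web_application` target still has fabricated locations removed.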

Comment thread strix/config/models.py
Comment on lines +132 to +146
def _configure_litellm_request_timeout(timeout: int) -> None:
    """Apply the configured ``LLM_TIMEOUT`` to LiteLLM-routed scan calls.

    The SDK's LiteLLM model invokes ``litellm.acompletion`` without an explicit
    per-request timeout, so without this it falls back to LiteLLM's module
    default and the documented ``LLM_TIMEOUT`` only affected the warm-up call
    (``strix/interface/main.py``). Setting the module-level default makes
    ``export LLM_TIMEOUT=600`` take effect for the actual scan, restoring the
    documented behavior for slow local / self-hosted models.
    """
    import litellm

    # litellm doesn't re-export request_timeout in its public surface, but it is
    # the module-level default it reads for each acompletion call.
    litellm.request_timeout = timeout  # type: ignore[attr-defined]

P1 Effective default timeout drops from ~6000 s to 300 s for all existing users

Before this PR, scan calls used LiteLLM's built-in module default (~6000 s); after it they use the LLM_TIMEOUT Pydantic default of 300 s. Any user who has not explicitly set LLM_TIMEOUT — including existing deployments and the project's own CI — will now hit a 300 s per-request wall where they previously had ~6000 s. For anyone running a slow local model (Ollama, llama.cpp) this will cause ReadTimeout errors on the first multi-step scan call that exceeds five minutes, with no indication that the timeout was silently tightened.
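
A possible guard against this regression is to override the module default only when the user actually set `LLM_TIMEOUT`. The thread later settles on a `model_fields_set` check; the stdlib-only sketch below approximates the same idea by testing the environment mapping directly. `explicit_timeout`, `configure`, and the `SimpleNamespace` stand-ins for the `litellm` module are all hypothetical.

```python
from types import SimpleNamespace
from typing import Mapping, Optional


def explicit_timeout(env: Mapping[str, str]) -> Optional[int]:
    """Return LLM_TIMEOUT as an int only if the user set it, else None."""
    raw = env.get("LLM_TIMEOUT")
    return int(raw) if raw else None


def configure(env: Mapping[str, str], module: SimpleNamespace) -> None:
    """Override the module default only for an explicitly set timeout."""
    timeout = explicit_timeout(env)
    if timeout is not None:
        module.request_timeout = timeout


# Two hypothetical stand-ins for the litellm module, one per scenario.
untouched = SimpleNamespace(request_timeout=6000)
overridden = SimpleNamespace(request_timeout=6000)

configure({}, untouched)                        # user never set LLM_TIMEOUT
configure({"LLM_TIMEOUT": "600"}, overridden)   # user set it explicitly
```

Users who never set the variable keep the library default, so the 300 s settings default never reaches the scan path in this sketch.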

Comment thread strix/config/models.py Outdated
llm = settings.llm
set_tracing_disabled(True)
_configure_litellm_compatibility()
_configure_litellm_request_timeout(llm.timeout)

P2 Module-level global mutated on every run_strix_scan call

configure_sdk_model_defaults (and now _configure_litellm_request_timeout) writes to litellm's module-level state. If a host process runs two concurrent scans that were started with different LLM_TIMEOUT values, the second call to configure_sdk_model_defaults will overwrite litellm.request_timeout for the first scan's in-flight requests. The pattern is pre-existing for api_key/api_base, but extending it to request_timeout (now called per scan, not once at startup) makes the race window much more likely to affect real requests.
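
The overwrite can be reproduced deterministically without threads. In the sketch below, `shared_module` is a hypothetical stand-in for module-level state, and `ScanConfig` is an invented per-scan alternative shown only for contrast; it is not something the PR implements.

```python
from dataclasses import dataclass
from types import SimpleNamespace

# Hypothetical stand-in for module-level state shared by every scan.
shared_module = SimpleNamespace(request_timeout=6000)


def configure_scan(timeout: int) -> None:
    """Each scan start overwrites the single shared value."""
    shared_module.request_timeout = timeout


def timeout_for_next_request() -> int:
    """Any scan's next request reads whatever value was written last."""
    return shared_module.request_timeout


configure_scan(600)   # scan A starts, expecting 600 s
configure_scan(60)    # scan B starts in the same process
seen_by_scan_a = timeout_for_next_request()   # scan A now sees B's value


@dataclass(frozen=True)
class ScanConfig:
    """Invented alternative: carry the timeout with the scan itself."""
    timeout: int


def timeout_for_request(config: ScanConfig) -> int:
    return config.timeout


scan_a = ScanConfig(timeout=600)
scan_b = ScanConfig(timeout=60)
```

With one scan per process the shared value is written once and the issue does not arise; it only matters for hosts that run several scans in the same interpreter.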

Comment thread strix/config/settings.py Outdated
Comment on lines +38 to +40
reasoning_effort: ReasoningEffort = Field(default="high", alias="STRIX_REASONING_EFFORT")
timeout: int = Field(default=300, alias="LLM_TIMEOUT")
temperature: float | None = Field(default=None, alias="STRIX_LLM_TEMPERATURE")

P2 No range validation on temperature and top_p

Pydantic will accept any float for both fields. Out-of-range values (e.g., STRIX_LLM_TEMPERATURE=3.5) will be forwarded to LiteLLM. For most providers this causes an API error at runtime rather than a startup-time config failure, and users have no immediate feedback that the value is wrong. top_p values outside [0, 1] are universally invalid. Adding ge=0.0 / le=2.0 (temperature) and ge=0.0, le=1.0 (top_p) constraints would surface problems early.
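
The effect of the suggested bounds can be approximated with the standard library alone. This sketch mimics `ge`/`le` constraints using a dataclass `__post_init__`; `GenerationParams` and `_check_range` are hypothetical names, and the real settings class uses Pydantic `Field` constraints rather than this code.

```python
from dataclasses import dataclass
from typing import Optional


def _check_range(name: str, value: Optional[float], low: float, high: float) -> None:
    """Raise ValueError when a set value falls outside [low, high]."""
    if value is not None and not (low <= value <= high):
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")


@dataclass(frozen=True)
class GenerationParams:
    """Hypothetical stand-in for the bounded settings fields."""
    temperature: Optional[float] = None
    top_p: Optional[float] = None

    def __post_init__(self) -> None:
        # Bounds taken from the review suggestion above.
        _check_range("temperature", self.temperature, 0.0, 2.0)
        _check_range("top_p", self.top_p, 0.0, 1.0)


valid = GenerationParams(temperature=0.2, top_p=1.0)

try:
    GenerationParams(temperature=3.5)
    rejected = False
except ValueError:
    rejected = True
```

Validation at construction time means a bad value fails when configuration is loaded instead of on the first provider call.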

- reporting: gate code_locations on whether source is in scope
  (local_code OR repository), not is_whitebox. is_whitebox is False for
  repository targets even though their source is cloned to /workspace,
  so the prior guard silently dropped valid code_locations on repo
  scans. runner now threads source_in_scope; create_vulnerability_report
  keys off it.
- llm timeout: only override litellm.request_timeout when LLM_TIMEOUT is
  explicitly set (llm.model_fields_set), so users who never set it keep
  LiteLLM's ~6000s default instead of being silently capped at the 300s
  default.
- settings: bound STRIX_LLM_TEMPERATURE to [0,2], STRIX_LLM_TOP_P to
  [0,1], STRIX_LLM_MAX_TOKENS to >=1 so invalid values fail at config
  load instead of at the provider; docs note the ranges.

@VoidChecksum

Author

Thanks for the review — addressed in 3804421.

1. is_whitebox=False covers repository scans (P1) — fixed. Correct catch: collect_local_sources mounts source for both local_code and repository, but is_whitebox is local_code-only, so the guard would have dropped valid code_locations on repo scans. The runner now threads source_in_scope = local_code OR repository, and create_vulnerability_report keys off that instead of is_whitebox. Repository and local-code scans keep their locations; only truly source-free scans (URL/IP/domain) drop them.

2. Default timeout 6000s → 300s regression (P1) — fixed. _configure_litellm_request_timeout is now only called when LLM_TIMEOUT is explicitly set (llm.model_fields_set). Users who never set it keep LiteLLM's built-in default; the documented setting still takes effect for scans when provided. This directly avoids regressing the slow-local-model case from #426.

3. Module-level global mutation / concurrent scans (P2) — intentionally out of scope. As noted, this is the pre-existing pattern for api_key/api_base in the same function, and Strix runs one scan per process (CLI). Reworking SDK-global config into per-scan state is a broader refactor that shouldn't ride along with these fixes — happy to file a separate issue if you'd like it tracked.

4. No range validation on temperature/top_p (P2) — fixed. Added ge=0.0, le=2.0 (temperature), ge=0.0, le=1.0 (top_p), and ge=1 (max_tokens) so bad values fail at config load; docs note the ranges.

All changes verified locally: ruff (lint + format), mypy --strict (with --install-types), bandit green, plus focused behavioral tests for the source-in-scope guard, the timeout gating, and the new bounds.

@bearsyankees

Collaborator

@greptile
