feat(waterdata): Migrate to httpx and add async parallel chunker by thodson-usgs · Pull Request #285 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-05-21T22:17:16Z

Summary

Stacked on top of the chunker-unified work (PR #283); rebase to main once that lands.

Replaces requests with httpx package-wide.
Adds an async parallel branch to the multi-value chunker, controlled by API_USGS_CONCURRENT (default 16, chosen as a server-friendly sweet spot; set it to 1 for the legacy sequential path).
Integrates the async path with the ChunkPlan / ChunkedCall architecture from #283 (feat(waterdata): Auto-chunk OGC requests over the URL byte limit): the sync path drives ChunkedCall.resume() over one shared httpx.Client; the parallel path uses _fan_out_async to iterate the same plan via asyncio.gather + asyncio.Semaphore over one shared httpx.AsyncClient.
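
For readers unfamiliar with the asyncio.gather + asyncio.Semaphore pattern named above, here is a minimal stdlib-only sketch. It is not the PR's `_fan_out_async`; `fan_out`, `fake_fetch`, and the chunk labels are hypothetical stand-ins, and the real path issues HTTP sub-requests rather than sleeping.

```python
import asyncio


async def fan_out(sub_requests, fetch, limit=16):
    """Run ``fetch`` over every sub-request, at most ``limit`` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(item):
        # The semaphore caps how many fetches are in flight at once.
        async with semaphore:
            return await fetch(item)

    # gather returns results in input order, so they line up with the plan.
    return await asyncio.gather(*(bounded(item) for item in sub_requests))


async def fake_fetch(chunk):
    # Hypothetical stand-in for one HTTP sub-request.
    await asyncio.sleep(0.01)
    return f"rows for {chunk}"


results = asyncio.run(
    fan_out(["sites 0-99", "sites 100-199", "sites 200-299"], fake_fetch, limit=2)
)
```

With `limit=1` the same function degenerates to one request at a time, which is roughly the trade-off the env var exposes.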

Benchmarked at 2.48× speedup on a 19,602-site / 6-state get_daily call at API_USGS_CONCURRENT=16 (2.78s vs 6.89s sequential, distinct date windows per trial so cache reuse can't bias the result).

Why httpx

httpx ships sync and async clients on a unified API, so the same request shape powers both the existing synchronous getters and the new parallel path. requests is feature-frozen and has no async story; a thread-pool bolt-on would have been a one-off rather than a primitive reusable elsewhere.

`API_USGS_CONCURRENT`

| Value | Behavior |
| --- | --- |
| unset / blank | parallel, cap = 16 (`_CONCURRENCY_DEFAULT`) |
| `≥ 2` | parallel, semaphore-capped at that value |
| `1` | serial (sync `ChunkedCall.resume()` path) |
| `unbounded` | parallel, no per-call cap; caller owns the burst risk |
| `0`, negative, malformed | `ValueError` at call time |
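
The table can be read as a small parsing function. The sketch below is an illustration of those rules, not the PR's code: `parse_concurrency` and `concurrency_from_env` are hypothetical names, `None` is an assumed encoding for "no cap", and exact-match handling of `unbounded` is also an assumption.

```python
import os

_CONCURRENCY_DEFAULT = 16


def parse_concurrency(raw):
    """Return a positive int cap, or None when the value is 'unbounded'."""
    if raw is None or not raw.strip():
        # Unset or blank: fall back to the default parallel cap.
        return _CONCURRENCY_DEFAULT
    text = raw.strip()
    if text == "unbounded":
        return None
    try:
        value = int(text)
    except ValueError:
        raise ValueError(
            f"API_USGS_CONCURRENT must be a positive integer or 'unbounded', got {raw!r}"
        ) from None
    if value < 1:
        raise ValueError(f"API_USGS_CONCURRENT must be >= 1, got {value}")
    return value


def concurrency_from_env():
    """Read the env var at call time so a bad value fails when the getter runs."""
    return parse_concurrency(os.environ.get("API_USGS_CONCURRENT"))
```

A return value of `1` would select the serial path; anything else selects the parallel one.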

Both modes share one connection pool across all sub-requests of a single chunked call via the _chunked_session (sync) / _chunked_async_session (async) ContextVars: _walk_pages / _walk_pages_async / get_stats_data check them first and open a fresh client only when none is published.

Parallel path safety contracts

The parallel branch preserves the same safety contracts the serial path provides:

Probe-first quota check. _fan_out_async issues the first sub-request alone, reads x-ratelimit-remaining from its response, and raises RequestExceedsQuota before dispatching the rest if the remaining plan can't fit the window. Matches ChunkedCall._check_quota_after_first.
Resumable interruptions. asyncio.gather runs with return_exceptions=True, so a sibling's transient failure (RateLimited / ServiceUnavailable) doesn't lose the completed work. The raised ChunkInterrupted.call is a ChunkedCall holding the sparse-indexed completed sub-requests; exc.call.resume() re-issues only the unfinished indices via the sync fetch_once path.
Event-loop detection. asyncio.run() raises inside an already-running loop (Jupyter / IPython kernels, async apps). The wrapper calls asyncio.get_running_loop() first and, when one is active, falls back to the serial path with a UserWarning instead of crashing.
Missing-fetch_async warning. If API_USGS_CONCURRENT requests parallel but the decorator wasn't wired with fetch_async=, the wrapper warns + runs serial rather than silently no-op'ing the env var.
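
Two of these contracts rest on standard asyncio behavior, sketched below with the standard library only. This is an illustration under stated assumptions, not the PR's implementation: `TransientError` stands in for RateLimited / ServiceUnavailable, and `gather_keeping_partial`, `loop_is_running`, and `flaky_fetch` are hypothetical names.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for RateLimited / ServiceUnavailable."""


async def gather_keeping_partial(sub_requests, fetch):
    """Fetch everything; return (completed, failed), both keyed by plan index."""
    outcomes = await asyncio.gather(
        *(fetch(item) for item in sub_requests), return_exceptions=True
    )
    completed, failed = {}, {}
    for index, outcome in enumerate(outcomes):
        if isinstance(outcome, TransientError):
            failed[index] = outcome  # candidate for a later resume
        elif isinstance(outcome, BaseException):
            raise outcome  # non-transient errors still propagate
        else:
            completed[index] = outcome  # sparse: only finished indices
    return completed, failed


def loop_is_running():
    """True when asyncio.run() would raise because a loop is already active."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def flaky_fetch(item):
    await asyncio.sleep(0)
    if item == "b":
        raise TransientError("simulated 503")
    return item.upper()


completed, failed = asyncio.run(
    gather_keeping_partial(["a", "b", "c"], flaky_fetch)
)
```

Because `completed` keeps the original plan indices, a resume step only needs to re-issue the indices that appear in `failed`.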

Three httpx behavior diffs handled defensively

httpx.InvalidURL raised when a URL component > 64 KB (e.g. all California stream sites comma-joined in one query). Caught by _safe_request_bytes (treats "too big to construct" as "doesn't fit", so the planner's halving loop keeps shrinking) and again in ChunkPlan.__init__ so canonical-URL recovery can fall through to a worst-case sub-request URL.
httpx.Response.elapsed only populated on close (not by httpx.MockTransport / pytest-httpx). _safe_elapsed falls back to timedelta(0).
httpx.Response.url is a read-only property. _set_response_url rewrites it via the bound request, with a fallback path for Mock-shaped test responses.
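
The elapsed-time fallback can be sketched without httpx. The classes below are hypothetical stand-ins, and the assumption that an unavailable `.elapsed` surfaces as a `RuntimeError` (or as a missing attribute on mock-shaped objects) should be checked against the installed httpx version; `safe_elapsed` illustrates the idea behind `_safe_elapsed`, not its actual body.

```python
from datetime import timedelta


class UnreadResponse:
    """Hypothetical stand-in for a response whose timing was never recorded."""

    @property
    def elapsed(self):
        raise RuntimeError("elapsed is only available after the response is closed")


class ClosedResponse:
    """Hypothetical stand-in for a response that was read and closed normally."""

    elapsed = timedelta(milliseconds=120)


def safe_elapsed(response):
    """Return the response's elapsed time, or zero when it is unavailable."""
    try:
        return response.elapsed
    except (RuntimeError, AttributeError):
        # Mock transports may never populate the timing; treat it as zero.
        return timedelta(0)
```

Returning `timedelta(0)` keeps cumulative `query_time` arithmetic working in mocked tests at the cost of under-reporting time there.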

Backwards-compat

BaseMetadata.header is now httpx.Headers instead of requests.structures.CaseInsensitiveDict. Case-insensitive lookups (md.header.get("x-ratelimit-remaining")) keep working; literal dict equality (md.header == {"k": "v"}) no longer holds because httpx.Headers carries auto-added entries (content-type, content-length).
BaseMetadata.url is coerced to str (previously str on requests.Response; now str(httpx.Response.url)).
API_USGS_CONCURRENT defaults to 16 (parallel). Set =1 to opt back into the sequential path.

Test plan

374 mocked tests pass after migrating to pytest-httpx. tests/conftest.py (new) is a requests_mock-shaped shim over httpx_mock (URL-prefix match, complete_qs strict-mode parity, request-history view); an autouse fixture pins API_USGS_CONCURRENT=1 so the historical mocked suite stays on the deterministic serial path.
Async-path coverage: four new async-mode test functions (one parametrized over running-loop and missing-async) in tests/waterdata_chunking_test.py opt out of the serial-pin autouse and exercise (a) successful fan-out, (b) probe-first RequestExceedsQuota, (c) resumable ServiceInterrupted.call after a mid-fan-out failure (resume runs serially on the unfinished indices), (d) the running-event-loop fallback emits a UserWarning and runs serial, (e) the missing-fetch_async warning fires when the env asks for parallel.
ruff check and ruff format --check pass.
Live-API CI sweep — most pre-existing column-drift failures (get_daily / get_stats_* / get_channel) are resolved by the schema-aware _get_resp_data / _handle_stats_nesting rewrite incorporated via the rebase.

Out of scope (follow-ups)

High-concurrency memory: _fan_out_async materializes all (df, response) pairs before combining. Consider streaming-combine via asyncio.as_completed if users push concurrency very high.
NEWS.md entry — left for the merger to draft.

🤖 Generated with Claude Code

The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) previously failed with HTTP 414 when the request URL exceeded the server's ~8 KB byte limit. The common chained-query pattern (pull a long site list from `get_monitoring_locations`, then feed it into `get_daily`) was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )

    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value list parameter AND the cql-text `filter` (split on its top-level `OR` clauses) as a chunkable axis. Greedy halving splits the biggest chunk across all axes until each sub-request URL fits the limit; the chunker fans out into multiple HTTP requests under the hood and returns one combined DataFrame. Callers see no API change.

Every axis (a list-shaped kwarg, or the filter split into its top-level `OR` clauses) is represented by an `_Axis` dataclass: the args key, the tuple of indivisible atoms (site IDs or clauses), and the joiner used to compose them back into URL text (`,` for list axes, ` OR ` for the filter axis). `ChunkPlan` extracts the chunkable axes for a request and runs greedy halving against the biggest chunk across all axes until the worst-case sub-request URL fits. `ChunkedCall` iterates the joint cartesian product of axis chunks and drives the sub-requests to completion. Requests that already fit get a trivial single-step plan, so there is one code path either way.
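
The greedy-halving idea can be sketched in a few lines. This is a simplified illustration, not `ChunkPlan` itself: `build_url`, `plan_chunks`, `iter_sub_args`, the example URL, and the 64-byte limit are all hypothetical, every axis is joined with `,`, and `ValueError` stands in for whatever the real planner raises when a request cannot shrink further.

```python
from itertools import product

BASE = "https://example.test/q?"  # hypothetical endpoint


def build_url(args):
    return BASE + "&".join(f"{key}={','.join(vals)}" for key, vals in args.items())


def plan_chunks(args, limit):
    """Halve the largest chunk on any axis until the worst-case URL fits."""
    chunks = {key: [list(vals)] for key, vals in args.items()}

    def joined_len(part):
        return len(",".join(part))

    def worst_case():
        # The longest chunk on each axis yields the longest sub-request URL.
        return {key: max(parts, key=joined_len) for key, parts in chunks.items()}

    while len(build_url(worst_case()).encode()) > limit:
        key, index = max(
            ((k, i) for k, parts in chunks.items()
             for i, part in enumerate(parts) if len(part) > 1),
            key=lambda ki: joined_len(chunks[ki[0]][ki[1]]),
            default=(None, None),
        )
        if key is None:
            raise ValueError("request cannot be split further")
        part = chunks[key].pop(index)
        mid = len(part) // 2
        chunks[key][index:index] = [part[:mid], part[mid:]]
    return chunks


def iter_sub_args(chunks):
    """Yield one args dict per element of the cartesian product of chunks."""
    keys = list(chunks)
    for combo in product(*(chunks[key] for key in keys)):
        yield dict(zip(keys, combo))


request = {
    "site": ["USGS-01", "USGS-02", "USGS-03", "USGS-04"],
    "code": ["00060", "00065"],
}
plan = plan_chunks(request, limit=64)
sub_requests = list(iter_sub_args(plan))
```

A request that already fits never enters the loop and yields a single sub-request, which mirrors the trivial single-step plan described above.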
After the first sub-request, `ChunkedCall` reads `x-ratelimit-remaining`; if the rest of the plan can't fit the current per-key rate-limit window, it raises `RequestExceedsQuota` reporting the deficit before burning more budget. Set `API_USGS_LIMIT=0` to bypass the pre-emptive check.

Mid-stream transient failures surface as a `ChunkInterrupted` subclass: `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for HTTP 5xx. Both carry the partial result plus a resumable call handle on `exc.call`:

    import time

    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()

`ChunkedCall.resume` opens one `requests.Session` for the entire fan-out and publishes it via a `ContextVar` so paginated-loop helpers downstream (`_walk_pages`, `get_stats_data` via the new `_paginate` helper) reuse the same connection pool across every sub-request, which saves one TCP/TLS handshake per sub-request after the first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk fan-out against the live USGS API (1.78s shared vs 3.03s per-sub-request).

One behavior change for paginated/chunked calls:

- `BaseMetadata.url` still reflects the user's original query (unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request headers so downstream code that branches on `x-ratelimit-remaining` sees current state (was: first page's headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across every page/sub-request (was: first page's elapsed).
- New module `dataretrieval.waterdata.chunking`: joint planner, exception hierarchy (`_RetryableTransportError`, `RateLimited`, `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`, `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`), `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator, shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated into a `_paginate` strategy helper that `_walk_pages` and `get_stats_data` both delegate to; typed transport exceptions moved out to `chunking` so the layer direction is strictly `utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter and filter-chunkability detector kept as primitives the joint planner consumes.

80 new unit tests in `tests/waterdata_chunking_test.py` covering the planner, axis extraction, cartesian-product enumeration, rate-limit gating, resume idempotency and equivalence, transient-error classification, shared-session reuse, and a URL-construction stress test against the real `_construct_api_requests` builder (not a fake): 500 USGS site IDs × 20 datetime OR-clauses, asserting every sub-request URL stays under 8000 bytes and the joint planner beats the bail-floor worst case. Mid-pagination 429/5xx now also covered for both the OGC and stats paginators.

Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870), generalized from one filter axis to N joint axes.

Also fixes a handful of pre-existing docstring typos in `waterdata/api.py` (`meaining` → `meaning`, `instantanous` → `instantaneous`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

``slots=True`` for ``@dataclass`` requires Python 3.10. The package declares ``requires-python = ">=3.9"`` and CI tests 3.9, so the import was failing test collection on the 3.9 matrix cell. Dropping the kwarg loses a small memory optimization on short-lived ``_Axis`` instances (not material) and restores compatibility. Also aligns one residual "sub-chunk" comment to "chunk" — the rest of the file already uses "chunk". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Both functions exist only to serve ``ChunkPlan.__init__`` and their parameters (``args``, ``axes``, ``chunks``) duplicate state the plan already holds. Folding them in as ``ChunkPlan._plan`` and ``ChunkPlan._worst_case_args`` makes the "planning IS construction" framing honest, removes parameter threading, and disambiguates the mutation target (``self.chunks`` rather than a passed-in dict). ``_extract_axes`` stays module-level — it operates on a raw args dict, has no plan state, and is imported directly by tests. No behavior change; 47/47 chunking tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Drop justifications of design alternatives we rejected (ContextVar vs. _FetchOnce kwarg; pinning separators against past de-sync), and trim the snapshot/copy rationale comments to keep the load-bearing asymmetry note without the surrounding prose. No behavior change; 47/47 chunking tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… one comment

A second copy of ``_paginated_failure_message`` had survived a rebase and was silently shadowing the typed-isinstance version with a worse string-prefix branch. Removed. Also drops a "(the closure used to do ...)" comment that describes a past iteration of ``get_stats_data``'s follow-up closure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* `_combine_chunk_frames` all-empty now preserves the GeoDataFrame type of its input. Returning a plain ``pd.DataFrame()`` would have downgraded the result on geopandas installs, defeating the exact contract the function's docstring describes.
* `_combine_chunk_frames` single-frame fast path now returns ``.copy()``. The live ``ChunkedCall.partial_frame`` property used to alias ``_chunks[0][0]``, so caller mutations would corrupt the stored chunk frame.
* `ChunkPlan.iter_sub_args` passthrough now yields ``dict(self.args)`` to match the chunked branch (and the docstring); the old version yielded ``self.args`` directly, a latent mutation footgun.
* Quota check now fires after every non-final chunk, not only after chunk 0. The old ``len(_chunks) == 1`` gating silently disabled the pre-emptive guard on every ``resume()`` after a ``ChunkInterrupted``, and missed concurrent-drain mid-call. Method renamed ``_check_quota_after_first`` → ``_check_quota_remaining``.
* ``RequestExceedsQuota`` now carries a ``.call`` handle to the originating ``ChunkedCall``. The first chunk's already-fetched data was previously unrecoverable because the exception is a ``ValueError`` with no resume affordance.
* ``_get_resp_data`` now treats a 200 with ``numberReturned > 0`` but missing ``features`` key as an empty page rather than crashing with ``KeyError``, mirroring the hardening already applied to ``_handle_stats_nesting``.

Six regression tests added; one existing test (``test_walk_pages_wraps_initial_page_parse_error``) updated to use a ``JSONDecodeError`` instead of the now-handled missing-features case. 97/97 chunking + utils tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace ``requests`` with ``httpx`` package-wide and add an async parallel fan-out for the multi-value chunker, gated on the ``API_USGS_CONCURRENT`` env var.

* ``httpx`` ships sync and async clients on a unified API, so the same request shape powers both the synchronous getters callers use today and the new ``_fan_out_async`` parallel path; the feature-frozen ``requests`` had no async story.
* ``API_USGS_CONCURRENT=1`` (default in tests) keeps the serial ``ChunkedCall.resume()`` path over one shared ``httpx.Client``. ``API_USGS_CONCURRENT=N`` (N > 1; default 16 in production) or ``unbounded`` fans the plan out through ``_fan_out_async`` over one shared ``httpx.AsyncClient``, bounded by ``asyncio.Semaphore(N)``.
* Both paths publish their client on a ``ContextVar`` (``_chunked_session`` / ``_chunked_async_session``) so paginated helpers downstream reuse the connection pool across every sub-request of a chunked call.
* The parallel path preserves the same safety contracts as the serial path: it probes the first sub-request alone to read ``x-ratelimit-remaining`` before fanning out the rest (``RequestExceedsQuota``), and uses ``asyncio.gather(return_exceptions=True)`` so a transient failure surfaces as a ``ChunkInterrupted`` whose ``.call`` is a ``ChunkedCall`` holding the sparse-indexed completed sub-requests; ``exc.call.resume()`` re-issues only the unfinished ones via the sync path.
* The wrapper falls back to the serial path (with a ``UserWarning``) when ``asyncio.get_running_loop()`` returns (so Jupyter / IPython kernels and async apps don't see a confusing ``RuntimeError``), and when the decorator was set up without a ``fetch_async=`` sibling.
* Three defensive helpers smooth over httpx behaviours that ``requests`` didn't have: ``_safe_request_bytes`` swallows ``httpx.InvalidURL`` so the planner's halving loop keeps shrinking past httpx's 64 KB URL cap; ``_safe_elapsed`` falls back to ``timedelta(0)`` when ``.elapsed`` is missing (mock transports); ``_set_response_url`` rewrites the URL via the bound request, since httpx makes ``Response.url`` read-only.

Tests: ``pyproject.toml`` switches ``requests``/``requests-mock`` to ``httpx``/``pytest-httpx``; ``tests/conftest.py`` adds a ``requests_mock``-shaped shim over ``httpx_mock`` and an autouse fixture pinning ``API_USGS_CONCURRENT=1`` so historical tests stay on the deterministic serial path. New async-mode tests cover the parallel fan-out, the probe-first quota check, the resumable ``ChunkInterrupted.call`` after a mid-fan-out failure, the running-event-loop fallback, and the missing-``fetch_async`` warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

thodson-usgs force-pushed the httpx-migration branch 7 times, most recently from 2d4cdf5 to 0e30f7b Compare May 22, 2026 21:16

thodson-usgs force-pushed the httpx-migration branch 5 times, most recently from c86f469 to 231f3d7 Compare May 23, 2026 04:02

thodson-usgs force-pushed the httpx-migration branch from 231f3d7 to 4de8b5a Compare May 23, 2026 04:23

thodson-usgs and others added 5 commits May 23, 2026 10:56

thodson-usgs force-pushed the httpx-migration branch from 4de8b5a to 3764e3f Compare May 23, 2026 22:04

thodson-usgs wants to merge 7 commits into DOI-USGS:main from thodson-usgs:httpx-migration
