feat(waterdata): Migrate to httpx and add async parallel chunker #285

Draft
thodson-usgs wants to merge 7 commits into DOI-USGS:main from thodson-usgs:httpx-migration

thodson-usgs (Collaborator) commented May 21, 2026

Summary

Stacked on top of the chunker-unified work (PR #283); rebase to main once that lands.

  • Replaces requests with httpx package-wide.
  • Adds an async parallel branch to the multi-value chunker, controlled by API_USGS_CONCURRENT (default 16, the server-friendly sweet spot; set =1 for the legacy sequential path).
  • Integrates the async path with the ChunkPlan / ChunkedCall architecture from #283 (feat(waterdata): Auto-chunk OGC requests over the URL byte limit): the sync path drives ChunkedCall.resume() over one shared httpx.Client; the parallel path uses _fan_out_async to iterate the same plan via asyncio.gather + asyncio.Semaphore over one shared httpx.AsyncClient.

Benchmarked at 2.48× speedup on a 19,602-site / 6-state get_daily call at API_USGS_CONCURRENT=16 (2.78s vs 6.89s sequential, distinct date windows per trial so cache reuse can't bias the result).

Why httpx

httpx ships sync and async clients on a unified API, so the same request shape powers both the existing synchronous getters and the new parallel path. requests is in a long-standing feature freeze and has no async story; a thread-pool bolt-on would have been a one-off rather than a primitive reusable elsewhere.

API_USGS_CONCURRENT

| Value | Behavior |
| --- | --- |
| unset / blank | parallel, cap = 16 (_CONCURRENCY_DEFAULT) |
| ≥ 2 | parallel, semaphore-capped at that value |
| 1 | serial (sync ChunkedCall.resume() path) |
| unbounded | parallel, no per-call cap; caller owns the burst risk |
| 0, negative, malformed | ValueError at call time |
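
As a rough illustration of the value table above, the env var could be parsed along these lines. This is a minimal sketch, not the package's code: `parse_concurrency` and the `None`-means-unbounded convention are assumptions made for the example; only `_CONCURRENCY_DEFAULT = 16` and the accepted values come from the table.

```python
from typing import Optional

_CONCURRENCY_DEFAULT = 16  # default cap named in the table


def parse_concurrency(raw: Optional[str]) -> Optional[int]:
    """Map a raw API_USGS_CONCURRENT value to a semaphore cap.

    Hypothetical helper: returns an int cap (1 means serial) or None for
    "unbounded". Raises ValueError for 0, negative, or malformed input.
    """
    if raw is None or raw.strip() == "":
        return _CONCURRENCY_DEFAULT
    text = raw.strip().lower()
    if text == "unbounded":
        return None
    try:
        value = int(text)
    except ValueError:
        raise ValueError(
            f"API_USGS_CONCURRENT must be a positive integer or 'unbounded', got {raw!r}"
        ) from None
    if value < 1:
        raise ValueError(
            f"API_USGS_CONCURRENT must be a positive integer or 'unbounded', got {raw!r}"
        )
    return value
```

A caller would typically pass `os.environ.get("API_USGS_CONCURRENT")` and branch on the result: `1` selects the serial path, anything else the parallel one.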

Both modes share one connection pool across all sub-requests of a single chunked call via the _chunked_session (sync) / _chunked_async_session (async) ContextVars: _walk_pages / _walk_pages_async / get_stats_data read them before falling back to opening a fresh client.

Parallel path safety contracts

The parallel branch preserves the same safety contracts the serial path provides:

  • Probe-first quota check. _fan_out_async issues the first sub-request alone, reads x-ratelimit-remaining from its response, and raises RequestExceedsQuota before dispatching the rest if the remaining plan can't fit the window. Matches ChunkedCall._check_quota_after_first (renamed _check_quota_remaining later in this stack).
  • Resumable interruptions. asyncio.gather runs with return_exceptions=True, so a sibling's transient failure (RateLimited / ServiceUnavailable) doesn't lose the completed work. The raised ChunkInterrupted.call is a ChunkedCall holding the sparse-indexed completed sub-requests; exc.call.resume() re-issues only the unfinished indices via the sync fetch_once path.
  • Event-loop detection. asyncio.run() raises inside an already-running loop (Jupyter / IPython kernels, async apps). The wrapper calls asyncio.get_running_loop() first and, when one is active, falls back to the serial path with a UserWarning instead of crashing.
  • Missing-fetch_async warning. If API_USGS_CONCURRENT requests parallel execution but the decorator wasn't wired with fetch_async=, the wrapper warns and runs serially rather than silently ignoring the env var.
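
The probe-first and resumable-interruption contracts above can be sketched with plain asyncio. This is an illustrative outline only: `fan_out`, the `fetch(i)` coroutine returning `(payload, quota_remaining)`, and the use of `RuntimeError` in place of `RequestExceedsQuota` are all assumptions, not the PR's `_fan_out_async`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple


async def fan_out(
    fetch: Callable[[int], Awaitable[Tuple[str, int]]],
    n_chunks: int,
    concurrency: int,
) -> Tuple[Dict[int, str], Dict[int, BaseException]]:
    """Probe chunk 0 alone, then fetch the rest under a semaphore.

    Returns (completed, failed), both keyed by chunk index, so a caller
    can later re-issue only the failed indices.
    """
    completed: Dict[int, str] = {}
    failed: Dict[int, BaseException] = {}
    if n_chunks == 0:
        return completed, failed

    # Probe: the first sub-request runs alone so its quota header can be
    # inspected before any further budget is spent.
    payload, remaining = await fetch(0)
    completed[0] = payload
    pending = n_chunks - 1
    if remaining < pending:
        raise RuntimeError(
            f"{pending} sub-requests pending, only {remaining} left in the window"
        )

    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(index: int) -> str:
        async with semaphore:  # at most `concurrency` requests in flight
            result, _ = await fetch(index)
            return result

    indices = list(range(1, n_chunks))
    # return_exceptions=True keeps sibling results when one task fails.
    results = await asyncio.gather(
        *(bounded(i) for i in indices), return_exceptions=True
    )
    for index, result in zip(indices, results):
        if isinstance(result, BaseException):
            failed[index] = result
        else:
            completed[index] = result
    return completed, failed
```

Keying both dictionaries by chunk index is what makes the completed set "sparse": a resume step only needs the indices present in `failed`.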

Three httpx behavior diffs handled defensively

  • httpx.InvalidURL is raised when a URL component exceeds 64 KB (e.g. all California stream sites comma-joined in one query). It is caught by _safe_request_bytes (which treats "too big to construct" as "doesn't fit", so the planner's halving loop keeps shrinking) and again in ChunkPlan.__init__ so canonical-URL recovery can fall through to a worst-case sub-request URL.
  • httpx.Response.elapsed is only populated on close (and not at all by httpx.MockTransport / pytest-httpx). _safe_elapsed falls back to timedelta(0).
  • httpx.Response.url is a read-only property. _set_response_url rewrites it via the bound request, with a fallback path for Mock-shaped test responses.
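
For the elapsed-time fallback, a helper of roughly this shape would do. It is a guess at the idea rather than the PR's `_safe_elapsed`: it assumes that reading `.elapsed` too early raises `RuntimeError` (as httpx does for a response that has not been read or closed) and that mock-shaped objects may lack the attribute entirely.

```python
from datetime import timedelta
from typing import Any


def safe_elapsed(response: Any) -> timedelta:
    """Return response.elapsed, or timedelta(0) when it is unavailable.

    Hypothetical stand-in for the PR's helper; works on any object with an
    optional `.elapsed` attribute.
    """
    try:
        return response.elapsed
    except (RuntimeError, AttributeError):
        # RuntimeError: elapsed not populated yet; AttributeError: mock
        # object without the attribute.
        return timedelta(0)
```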

Backwards-compat

  • BaseMetadata.header is now httpx.Headers instead of requests.structures.CaseInsensitiveDict. Case-insensitive lookups (md.header.get("x-ratelimit-remaining")) keep working; literal dict equality (md.header == {"k": "v"}) no longer holds because httpx.Headers carries auto-added entries (content-type, content-length).
  • BaseMetadata.url is coerced to str (previously str on requests.Response; now str(httpx.Response.url)).
  • API_USGS_CONCURRENT defaults to 16 (parallel). Set =1 to opt back into the sequential path.

Test plan

  • 374 mocked tests pass after migrating to pytest-httpx. tests/conftest.py (new) is a requests_mock-shaped shim over httpx_mock (URL-prefix match, complete_qs strict-mode parity, request-history view); an autouse fixture pins API_USGS_CONCURRENT=1 so the historical mocked suite stays on the deterministic serial path.
  • Async-path coverage: four new async-mode test functions (one parametrized over running-loop and missing-async) in tests/waterdata_chunking_test.py opt out of the serial-pin autouse and exercise (a) successful fan-out, (b) probe-first RequestExceedsQuota, (c) resumable ServiceInterrupted.call after a mid-fan-out failure (resume runs serially on the unfinished indices), (d) the running-event-loop fallback emits a UserWarning and runs serial, (e) the missing-fetch_async warning fires when the env asks for parallel.
  • ruff check and ruff format --check pass.
  • Live-API CI sweep — most pre-existing column-drift failures (get_daily / get_stats_* / get_channel) are resolved by the schema-aware _get_resp_data / _handle_stats_nesting rewrite incorporated via the rebase.

Out of scope (follow-ups)

  • High-concurrency memory: _fan_out_async materializes all (df, response) pairs before combining. Consider streaming-combine via asyncio.as_completed if users push concurrency very high.
  • NEWS.md entry — left for the merger to draft.

🤖 Generated with Claude Code

@thodson-usgs thodson-usgs force-pushed the httpx-migration branch 7 times, most recently from 2d4cdf5 to 0e30f7b Compare May 22, 2026 21:16
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )
    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.
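
A simplified sketch of the greedy-halving idea follows. It is not the `ChunkPlan` implementation: `plan_chunks`, the `fixed_bytes` stand-in for the non-chunkable part of the URL, and the byte-length estimate are assumptions chosen to keep the example self-contained.

```python
from typing import Dict, List, Sequence, Tuple

Chunk = Tuple[str, ...]


def plan_chunks(
    axes: Dict[str, Tuple[Sequence[str], str]],  # key -> (atoms, joiner)
    fixed_bytes: int,
    limit: int,
) -> Dict[str, List[Chunk]]:
    """Halve the biggest chunk until the worst-case sub-request fits."""
    chunks: Dict[str, List[Chunk]] = {
        key: [tuple(atoms)] for key, (atoms, _joiner) in axes.items()
    }

    def size(key: str, chunk: Chunk) -> int:
        return len(axes[key][1].join(chunk).encode("utf-8"))

    def biggest(key: str) -> Chunk:
        return max(chunks[key], key=lambda chunk: size(key, chunk))

    while True:
        # Worst case: the sub-request pairing every axis's biggest chunk.
        worst = {key: biggest(key) for key in chunks}
        if fixed_bytes + sum(size(k, c) for k, c in worst.items()) <= limit:
            return chunks
        splittable = [(size(k, c), k, c) for k, c in worst.items() if len(c) > 1]
        if not splittable:
            raise ValueError("cannot fit the limit even with one atom per chunk")
        _, key, chunk = max(splittable)  # biggest chunk across all axes
        mid = len(chunk) // 2
        index = chunks[key].index(chunk)
        chunks[key][index : index + 1] = [chunk[:mid], chunk[mid:]]
```

With 100 thirteen-character site IDs, `fixed_bytes=100`, and `limit=500`, this yields four chunks of 25 IDs each; a request that already fits comes back as a single chunk per axis.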

Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.
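
The axis and cartesian-product description above might look like this in miniature. `Axis` and `iter_sub_args` here are illustrative stand-ins for `_Axis` and the plan's iteration, and joining every chunk into URL text is an assumption of this sketch.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Dict, Iterator, List, Sequence, Tuple


@dataclass(frozen=True)
class Axis:
    key: str                 # kwarg name, e.g. "monitoring_location_id"
    atoms: Tuple[str, ...]   # indivisible units (site IDs or OR clauses)
    joiner: str              # "," for list axes, " OR " for the filter


def iter_sub_args(
    args: Dict[str, Any],
    axes: Sequence[Axis],
    chunks: Dict[str, List[Tuple[str, ...]]],
) -> Iterator[Dict[str, Any]]:
    """Yield one args dict per combination of axis chunks."""
    if not axes:
        yield dict(args)  # trivial single-step plan; copy to avoid aliasing
        return
    for combo in product(*(chunks[axis.key] for axis in axes)):
        sub_args = dict(args)
        for axis, chunk in zip(axes, combo):
            sub_args[axis.key] = axis.joiner.join(chunk)
        yield sub_args
```

Two site chunks crossed with two filter chunks produce four sub-requests, each carrying the untouched kwargs alongside its own slice of each axis.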

After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.

Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:

    import time
    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()

`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).
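
The shared-session plumbing can be pictured with a small stdlib-only sketch. The `Session` class is a stand-in for a real HTTP session, and `shared_session` / `fetch_page` are hypothetical names; only the pattern (publish on a `ContextVar`, reuse downstream, fall back to a private session) reflects the description above.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Iterator, Optional, Tuple


class Session:
    """Stand-in for an HTTP session; counts how many were opened."""

    opened = 0

    def __init__(self) -> None:
        Session.opened += 1
        self.closed = False

    def close(self) -> None:
        self.closed = True


_shared: ContextVar[Optional[Session]] = ContextVar("shared_session", default=None)


@contextmanager
def shared_session() -> Iterator[Session]:
    """Open one session and publish it for the duration of the block."""
    session = Session()
    token = _shared.set(session)
    try:
        yield session
    finally:
        _shared.reset(token)
        session.close()


def fetch_page(url: str) -> Tuple[str, Session]:
    """Reuse the published session if any, else open a private one."""
    session = _shared.get()
    owns_session = session is None
    if session is None:
        session = Session()
    try:
        return url, session  # a real helper would issue the request here
    finally:
        if owns_session:
            session.close()
```

Inside `with shared_session():` every `fetch_page` call sees the same object, so only one connection pool is created for the whole fan-out.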

Behavior changes for paginated/chunked calls:

- `BaseMetadata.url` still reflects the user's original query
  (unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
  headers so downstream code that branches on
  `x-ratelimit-remaining` sees current state (was: first page's
  headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
  every page/sub-request (was: first page's elapsed).

- New module `dataretrieval.waterdata.chunking`: joint planner,
  exception hierarchy (`_RetryableTransportError`, `RateLimited`,
  `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
  `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
  `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
  shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
  into a `_paginate` strategy helper that `_walk_pages` and
  `get_stats_data` both delegate to; typed transport exceptions
  moved out to `chunking` so the layer direction is strictly
  `utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
  and filter-chunkability detector kept as primitives the joint
  planner consumes.

80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.

Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@thodson-usgs thodson-usgs force-pushed the httpx-migration branch 5 times, most recently from c86f469 to 231f3d7 Compare May 23, 2026 04:02
``slots=True`` for ``@dataclass`` requires Python 3.10. The package
declares ``requires-python = ">=3.9"`` and CI tests 3.9, so the import
was failing test collection on the 3.9 matrix cell. Dropping the kwarg
loses a small memory optimization on short-lived ``_Axis`` instances
(not material) and restores compatibility.

Also aligns one residual "sub-chunk" comment to "chunk" — the rest of
the file already uses "chunk".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
thodson-usgs and others added 5 commits May 23, 2026 10:56
Both functions exist only to serve ``ChunkPlan.__init__`` and their
parameters (``args``, ``axes``, ``chunks``) duplicate state the plan
already holds. Folding them in as ``ChunkPlan._plan`` and
``ChunkPlan._worst_case_args`` makes the "planning IS construction"
framing honest, removes parameter threading, and disambiguates the
mutation target (``self.chunks`` rather than a passed-in dict).

``_extract_axes`` stays module-level — it operates on a raw args dict,
has no plan state, and is imported directly by tests.

No behavior change; 47/47 chunking tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop justifications of design alternatives we rejected (ContextVar
vs. _FetchOnce kwarg; pinning separators against past de-sync), and
trim the snapshot/copy rationale comments to keep the load-bearing
asymmetry note without the surrounding prose.

No behavior change; 47/47 chunking tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… one comment

A second copy of ``_paginated_failure_message`` had survived a rebase
and was silently shadowing the typed-isinstance version with a
worse string-prefix branch. Removed.

Also drops a "(the closure used to do ...)" comment that describes a
past iteration of ``get_stats_data``'s follow-up closure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* `_combine_chunk_frames` all-empty now preserves the GeoDataFrame
  type of its input; returning a plain ``pd.DataFrame()`` would have
  downgraded the result on geopandas installs, defeating the exact
  contract the function's docstring describes.
* `_combine_chunk_frames` single-frame fast path now returns
  ``.copy()`` — the live ``ChunkedCall.partial_frame`` property used
  to alias ``_chunks[0][0]`` so caller mutations would corrupt the
  stored chunk frame.
* `ChunkPlan.iter_sub_args` passthrough now yields ``dict(self.args)``
  to match the chunked branch (and the docstring); the old version
  yielded ``self.args`` directly, a latent mutation footgun.
* Quota check now fires after every non-final chunk, not only after
  chunk 0 — the old ``len(_chunks) == 1`` gating silently disabled
  the pre-emptive guard on every ``resume()`` after a
  ``ChunkInterrupted``, and missed concurrent-drain mid-call. Method
  renamed ``_check_quota_after_first`` → ``_check_quota_remaining``.
* ``RequestExceedsQuota`` now carries a ``.call`` handle to the
  originating ``ChunkedCall`` — the first chunk's already-fetched
  data was previously unrecoverable because the exception is a
  ``ValueError`` with no resume affordance.
* ``_get_resp_data`` now treats a 200 with ``numberReturned > 0`` but
  missing ``features`` key as an empty page rather than crashing with
  ``KeyError``, mirroring the hardening already applied to
  ``_handle_stats_nesting``.

Six regression tests added; one existing test
(``test_walk_pages_wraps_initial_page_parse_error``) updated to use a
``JSONDecodeError`` instead of the now-handled missing-features case.
97/97 chunking + utils tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace ``requests`` with ``httpx`` package-wide and add an async
parallel fan-out for the multi-value chunker, controlled by the
``API_USGS_CONCURRENT`` env var.

* ``httpx`` ships sync and async clients on a unified API, so the
  same request shape powers both the synchronous getters callers
  use today and the new ``_fan_out_async`` parallel path; the
  feature-frozen ``requests`` had no async story.
* ``API_USGS_CONCURRENT=1`` (default in tests) keeps the serial
  ``ChunkedCall.resume()`` path over one shared ``httpx.Client``.
  ``API_USGS_CONCURRENT=N`` (N > 1; default 16 in production) or
  ``unbounded`` fans the plan out through ``_fan_out_async`` over
  one shared ``httpx.AsyncClient``, bounded by
  ``asyncio.Semaphore(N)``.
* Both paths publish their client on a ``ContextVar``
  (``_chunked_session`` / ``_chunked_async_session``) so paginated
  helpers downstream reuse the connection pool across every
  sub-request of a chunked call.
* The parallel path preserves the same safety contracts as the
  serial path: it probes the first sub-request alone to read
  ``x-ratelimit-remaining`` before fanning out the rest
  (``RequestExceedsQuota``), and uses ``asyncio.gather(
  return_exceptions=True)`` so a transient failure surfaces as a
  ``ChunkInterrupted`` whose ``.call`` is a ``ChunkedCall`` holding
  the sparse-indexed completed sub-requests; ``exc.call.resume()``
  re-issues only the unfinished ones via the sync path.
* The wrapper falls back to the serial path (with a
  ``UserWarning``) when ``asyncio.get_running_loop()`` finds an
  active loop, so Jupyter / IPython kernels and async apps don't
  see a confusing ``RuntimeError``, and also when the decorator
  was set up without a ``fetch_async=`` sibling.
* Three defensive helpers smooth over httpx behaviours that
  ``requests`` didn't have: ``_safe_request_bytes`` swallows
  ``httpx.InvalidURL`` so the planner's halving loop keeps
  shrinking past httpx's 64 KB URL cap; ``_safe_elapsed`` falls
  back to ``timedelta(0)`` when ``.elapsed`` is missing (mock
  transports); ``_set_response_url`` rewrites the URL via the
  bound request, since httpx makes ``Response.url`` read-only.
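
The running-loop fallback from the list above reduces to a short dispatch pattern. The sketch below is an assumption-laden outline, not the wrapper in this PR: `dispatch` and its two callables are hypothetical, and passing a coroutine factory (rather than a coroutine) is a choice made here so nothing is left un-awaited on the serial branch.

```python
import asyncio
import warnings
from typing import Any, Callable, Coroutine, TypeVar

T = TypeVar("T")


def dispatch(
    serial: Callable[[], T],
    parallel: Callable[[], Coroutine[Any, Any, T]],
) -> T:
    """Run `parallel` via asyncio.run(), or `serial` if a loop is active."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(parallel())
    warnings.warn(
        "An event loop is already running; falling back to the serial path.",
        UserWarning,
        stacklevel=2,
    )
    return serial()
```

Called from ordinary synchronous code this takes the parallel branch; called from a coroutine (as happens in a Jupyter cell) it warns and runs the serial callable instead.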

Tests: ``pyproject.toml`` switches ``requests``/``requests-mock``
to ``httpx``/``pytest-httpx``; ``tests/conftest.py`` adds a
``requests_mock``-shaped shim over ``httpx_mock`` and an autouse
fixture pinning ``API_USGS_CONCURRENT=1`` so historical tests
stay on the deterministic serial path. New async-mode tests cover
the parallel fan-out, the probe-first quota check, the resumable
``ChunkInterrupted.call`` after a mid-fan-out failure, the
running-event-loop fallback, and the missing-``fetch_async``
warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>