feat(waterdata): Migrate to httpx and add async parallel chunker #285
Draft
thodson-usgs wants to merge 7 commits into
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:
    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )

    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )
This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.
Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.
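The planning step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `ChunkPlan`: the names `plan_chunks`, `worst_case_url_len`, and `iter_sub_requests`, the placeholder `BASE_URL`, and the exact 8000-byte budget are all assumptions made for the example.

```python
from itertools import product
from urllib.parse import quote

URL_LIMIT = 8000  # assumed byte budget; the description only says "~8 KB"
BASE_URL = "https://example.invalid/items?"  # placeholder endpoint


def encoded_len(joiner, chunk):
    """Bytes one chunk occupies in the query string once URL-encoded."""
    return len(quote(joiner.join(chunk), safe=""))


def worst_case_url_len(chunks_by_axis, joiners):
    """Upper bound on URL length: the longest chunk on every axis at once."""
    total = len(BASE_URL)
    for key, chunks in chunks_by_axis.items():
        total += len(key) + 2  # "key=" plus a "&" separator
        total += max(encoded_len(joiners[key], c) for c in chunks)
    return total


def plan_chunks(axes, joiners, limit=URL_LIMIT):
    """Greedy halving: split the biggest divisible chunk until the URL fits."""
    chunks_by_axis = {key: [tuple(atoms)] for key, atoms in axes.items()}
    while worst_case_url_len(chunks_by_axis, joiners) > limit:
        splittable = [
            (encoded_len(joiners[key], chunk), key, i)
            for key, chunks in chunks_by_axis.items()
            for i, chunk in enumerate(chunks)
            if len(chunk) > 1  # single atoms are indivisible
        ]
        if not splittable:
            raise ValueError("cannot shrink the request below the URL limit")
        _, key, i = max(splittable)
        chunk = chunks_by_axis[key][i]
        mid = len(chunk) // 2
        chunks_by_axis[key][i : i + 1] = [chunk[:mid], chunk[mid:]]
    return chunks_by_axis


def iter_sub_requests(chunks_by_axis, joiners):
    """Yield one args dict per element of the cartesian product of chunks."""
    keys = list(chunks_by_axis)
    for combo in product(*(chunks_by_axis[k] for k in keys)):
        yield {k: joiners[k].join(chunk) for k, chunk in zip(keys, combo)}
```

A request that already fits never enters the `while` loop, so it comes back as a single-chunk plan and `iter_sub_requests` yields exactly one sub-request, which mirrors the "one code path either way" remark.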
After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.
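A rough sketch of that pre-emptive check, under stated assumptions: the `check_quota` helper, its signature, and the `deficit` attribute are hypothetical, and the real `RequestExceedsQuota` in the PR may carry different fields. Each remaining sub-request is counted as one request, which ignores pagination.

```python
import os


class RequestExceedsQuota(ValueError):
    """Illustrative stand-in; the real class lives in the PR's chunking module."""

    def __init__(self, needed, remaining):
        self.deficit = needed - remaining
        super().__init__(
            f"{needed} sub-requests still to send but only {remaining} "
            f"left in the rate-limit window (short by {self.deficit})"
        )


def check_quota(headers, total, completed, env=os.environ):
    """Raise before sending more if the rest of the plan cannot fit."""
    if env.get("API_USGS_LIMIT") == "0":
        return  # pre-emptive check bypassed (assumed reading of the env var)
    raw = headers.get("x-ratelimit-remaining")
    if raw is None:
        return  # header absent: nothing to compare against
    needed = total - completed
    remaining = int(raw)
    if needed > remaining:
        raise RequestExceedsQuota(needed, remaining)
```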
Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:
    import time

    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()
`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).
Behavior changes for paginated/chunked calls:
- `BaseMetadata.url` still reflects the user's original query
(unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
headers so downstream code that branches on
`x-ratelimit-remaining` sees current state (was: first page's
headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
every page/sub-request (was: first page's elapsed).
- New module `dataretrieval.waterdata.chunking`: joint planner,
exception hierarchy (`_RetryableTransportError`, `RateLimited`,
`ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
`ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
`ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
into a `_paginate` strategy helper that `_walk_pages` and
`get_stats_data` both delegate to; typed transport exceptions
moved out to `chunking` so the layer direction is strictly
`utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
and filter-chunkability detector kept as primitives the joint
planner consumes.
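For readers unfamiliar with the top-level-OR primitive, a simplified version might look like the following. It is a sketch under assumptions: `split_top_level_or` is a hypothetical name, it only recognises ` OR ` with single spaces on both sides, and it is not the splitter that ships in `dataretrieval.waterdata.filters`.

```python
def split_top_level_or(expr):
    """Split a cql-text filter on OR keywords outside parentheses and quotes."""
    clauses = []
    depth = 0          # current parenthesis nesting
    in_quote = False   # inside a '...' string literal
    start = 0
    i = 0
    while i < len(expr):
        ch = expr[i]
        if ch == "'":
            # An escaped quote ('') toggles twice, leaving the state unchanged.
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and expr[i : i + 4].upper() == " OR ":
                clauses.append(expr[start:i].strip())
                start = i + 4
                i += 4
                continue
        i += 1
    clauses.append(expr[start:].strip())
    return clauses
```

Since OR binds more loosely than AND, each returned clause can be sent as its own filter and the results concatenated; joining the clauses back with ` OR ` gives the filter-axis chunks the planner works with.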
80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx failures are
now also covered for both the OGC and stats paginators.
Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
``slots=True`` for ``@dataclass`` requires Python 3.10. The package declares ``requires-python = ">=3.9"`` and CI tests 3.9, so the import was failing test collection on the 3.9 matrix cell. Dropping the kwarg loses a small memory optimization on short-lived ``_Axis`` instances (not material) and restores compatibility.

Also aligns one residual "sub-chunk" comment to "chunk"; the rest of the file already uses "chunk".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both functions exist only to serve ``ChunkPlan.__init__`` and their parameters (``args``, ``axes``, ``chunks``) duplicate state the plan already holds. Folding them in as ``ChunkPlan._plan`` and ``ChunkPlan._worst_case_args`` makes the "planning IS construction" framing honest, removes parameter threading, and disambiguates the mutation target (``self.chunks`` rather than a passed-in dict).

``_extract_axes`` stays module-level: it operates on a raw args dict, has no plan state, and is imported directly by tests.

No behavior change; 47/47 chunking tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop justifications of design alternatives we rejected (ContextVar vs. _FetchOnce kwarg; pinning separators against past de-sync), and trim the snapshot/copy rationale comments to keep the load-bearing asymmetry note without the surrounding prose.

No behavior change; 47/47 chunking tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… one comment

A second copy of ``_paginated_failure_message`` had survived a rebase and was silently shadowing the typed-isinstance version with a worse string-prefix branch. Removed.

Also drops a "(the closure used to do ...)" comment that describes a past iteration of ``get_stats_data``'s follow-up closure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* `_combine_chunk_frames` all-empty now preserves the GeoDataFrame type of its input; returning a plain ``pd.DataFrame()`` would have downgraded the result on geopandas installs, defeating the exact contract the function's docstring describes.
* `_combine_chunk_frames` single-frame fast path now returns ``.copy()``; the live ``ChunkedCall.partial_frame`` property used to alias ``_chunks[0][0]``, so caller mutations would corrupt the stored chunk frame.
* `ChunkPlan.iter_sub_args` passthrough now yields ``dict(self.args)`` to match the chunked branch (and the docstring); the old version yielded ``self.args`` directly, a latent mutation footgun.
* Quota check now fires after every non-final chunk, not only after chunk 0. The old ``len(_chunks) == 1`` gating silently disabled the pre-emptive guard on every ``resume()`` after a ``ChunkInterrupted``, and missed concurrent-drain mid-call. Method renamed ``_check_quota_after_first`` → ``_check_quota_remaining``.
* ``RequestExceedsQuota`` now carries a ``.call`` handle to the originating ``ChunkedCall``; the first chunk's already-fetched data was previously unrecoverable because the exception is a ``ValueError`` with no resume affordance.
* ``_get_resp_data`` now treats a 200 with ``numberReturned > 0`` but missing ``features`` key as an empty page rather than crashing with ``KeyError``, mirroring the hardening already applied to ``_handle_stats_nesting``.

Six regression tests added; one existing test (``test_walk_pages_wraps_initial_page_parse_error``) updated to use a ``JSONDecodeError`` instead of the now-handled missing-features case. 97/97 chunking + utils tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace ``requests`` with ``httpx`` package-wide and add an opt-in async parallel fan-out for the multi-value chunker, gated on the ``API_USGS_CONCURRENT`` env var.

* ``httpx`` ships sync and async clients on a unified API, so the same request shape powers both the synchronous getters callers use today and the new ``_fan_out_async`` parallel path; the unmaintained ``requests`` had no async story.
* ``API_USGS_CONCURRENT=1`` (default in tests) keeps the serial ``ChunkedCall.resume()`` path over one shared ``httpx.Client``. ``API_USGS_CONCURRENT=N`` (N > 1; default 16 in production) or ``unbounded`` fans the plan out through ``_fan_out_async`` over one shared ``httpx.AsyncClient``, bounded by ``asyncio.Semaphore(N)``.
* Both paths publish their client on a ``ContextVar`` (``_chunked_session`` / ``_chunked_async_session``) so paginated helpers downstream reuse the connection pool across every sub-request of a chunked call.
* The parallel path preserves the same safety contracts as the serial path: it probes the first sub-request alone to read ``x-ratelimit-remaining`` before fanning out the rest (``RequestExceedsQuota``), and uses ``asyncio.gather(return_exceptions=True)`` so a transient failure surfaces as a ``ChunkInterrupted`` whose ``.call`` is a ``ChunkedCall`` holding the sparse-indexed completed sub-requests; ``exc.call.resume()`` re-issues only the unfinished ones via the sync path.
* The wrapper falls back to the serial path (with a ``UserWarning``) in two cases: when ``asyncio.get_running_loop()`` returns, meaning a loop is already running, so Jupyter / IPython kernels and async apps don't see a confusing ``RuntimeError``; and when the decorator was set up without a ``fetch_async=`` sibling.
* Three defensive helpers smooth over httpx behaviours that ``requests`` didn't have: ``_safe_request_bytes`` swallows ``httpx.InvalidURL`` so the planner's halving loop keeps shrinking past httpx's 64 KB URL cap; ``_safe_elapsed`` falls back to ``timedelta(0)`` when ``.elapsed`` is missing (mock transports); ``_set_response_url`` rewrites the URL via the bound request, since httpx makes ``Response.url`` read-only.

Tests: ``pyproject.toml`` switches ``requests``/``requests-mock`` to ``httpx``/``pytest-httpx``; ``tests/conftest.py`` adds a ``requests_mock``-shaped shim over ``httpx_mock`` and an autouse fixture pinning ``API_USGS_CONCURRENT=1`` so historical tests stay on the deterministic serial path. New async-mode tests cover the parallel fan-out, the probe-first quota check, the resumable ``ChunkInterrupted.call`` after a mid-fan-out failure, the running-event-loop fallback, and the missing-``fetch_async`` warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
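The probe-first, semaphore-bounded fan-out described in this commit can be approximated with plain `asyncio`. The sketch below is an assumption-laden illustration: `fan_out`, `TransientError`, and `Interrupted` are hypothetical names standing in for `_fan_out_async`, the `RateLimited` / `ServiceUnavailable` errors, and `ChunkInterrupted`, and the quota check after the probe is reduced to a comment.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for a retryable 429 / 5xx failure."""


class Interrupted(Exception):
    """Carries completed results by index plus the indices still pending."""

    def __init__(self, completed, pending):
        super().__init__(f"{len(pending)} sub-request(s) still pending")
        self.completed = completed
        self.pending = pending


async def fan_out(sub_requests, fetch, limit=16):
    if not sub_requests:
        return []
    semaphore = asyncio.Semaphore(limit)

    async def bounded(args):
        async with semaphore:  # at most `limit` fetches in flight
            return await fetch(args)

    # Probe: run the first sub-request alone. A real implementation would
    # inspect its rate-limit headers here before dispatching the rest.
    first = await bounded(sub_requests[0])

    # return_exceptions=True keeps sibling results when one task fails.
    rest = await asyncio.gather(
        *(bounded(args) for args in sub_requests[1:]),
        return_exceptions=True,
    )
    outcomes = [first, *rest]

    for outcome in outcomes:
        # Non-transient errors are not resumable; surface them unchanged.
        if isinstance(outcome, BaseException) and not isinstance(outcome, TransientError):
            raise outcome

    completed = {i: r for i, r in enumerate(outcomes) if not isinstance(r, TransientError)}
    pending = [i for i, r in enumerate(outcomes) if isinstance(r, TransientError)]
    if pending:
        raise Interrupted(completed, pending)
    return [completed[i] for i in range(len(outcomes))]
```

Keeping completed results keyed by their original index is what lets a later resume step re-issue only the pending indices and still reassemble the results in plan order.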
Summary
Stacked on top of the `chunker-unified` work (PR #283); rebase to `main` once that lands.

- Replaces `requests` with `httpx` package-wide.
- Adds an async parallel fan-out for the multi-value chunker, gated on `API_USGS_CONCURRENT` (default 16 for the server-friendly sweet spot; set `=1` for the legacy sequential path).
- Reuses the `ChunkPlan` / `ChunkedCall` arch from #283 (feat(waterdata): Auto-chunk OGC requests over the URL byte limit): sync drives `ChunkedCall.resume()` over one shared `httpx.Client`; parallel uses `_fan_out_async` to iterate the same plan via `asyncio.gather` + `asyncio.Semaphore` over one shared `httpx.AsyncClient`.

Benchmarked at 2.48× speedup on a 19,602-site / 6-state `get_daily` call at `API_USGS_CONCURRENT=16` (2.78s vs 6.89s sequential, distinct date windows per trial so cache reuse can't bias the result).

Why httpx
`httpx` ships sync and async clients on a unified API, so the same request shape powers both the existing synchronous getters and the new parallel path. `requests` is unmaintained and has no async story; a thread-pool bolt-on would have been a one-off rather than a primitive reusable elsewhere.

| `API_USGS_CONCURRENT` | Behavior |
| --- | --- |
| unset | parallel, default of 16 (`_CONCURRENCY_DEFAULT`) |
| ≥ 2 | parallel fan-out, at most N sub-requests in flight |
| 1 | serial (legacy `ChunkedCall.resume()` path) |
| `unbounded` | parallel fan-out with no concurrency cap |
| 0, negative, malformed | `ValueError` at call time |

Connection-pool sharing across all sub-requests of a single chunked call works in both modes via the `_chunked_session` (sync) / `_chunked_async_session` (async) `ContextVar`s: `_walk_pages` / `_walk_pages_async` / `get_stats_data` read them as fallbacks before opening a fresh client.

Parallel path safety contracts
The parallel branch preserves the same safety contracts the serial path provides:

- `_fan_out_async` issues the first sub-request alone, reads `x-ratelimit-remaining` from its response, and raises `RequestExceedsQuota` before dispatching the rest if the remaining plan can't fit the window. Matches `ChunkedCall._check_quota_after_first`.
- `asyncio.gather` runs with `return_exceptions=True`, so a sibling's transient failure (`RateLimited` / `ServiceUnavailable`) doesn't lose the completed work. The raised `ChunkInterrupted.call` is a `ChunkedCall` holding the sparse-indexed completed sub-requests; `exc.call.resume()` re-issues only the unfinished indices via the sync `fetch_once` path.
- `asyncio.run()` raises inside an already-running loop (Jupyter / IPython kernels, async apps). The wrapper calls `asyncio.get_running_loop()` first and, when one is active, falls back to the serial path with a `UserWarning` instead of crashing.
- Missing-`fetch_async` warning: if `API_USGS_CONCURRENT` requests parallel but the decorator wasn't wired with `fetch_async=`, the wrapper warns and runs serial rather than silently no-op'ing the env var.

Three httpx behavior diffs handled defensively
- `httpx.InvalidURL` is raised when a URL component exceeds 64 KB (e.g. all California stream sites comma-joined in one query). Caught by `_safe_request_bytes` (treats "too big to construct" as "doesn't fit", so the planner's halving loop keeps shrinking) and again in `ChunkPlan.__init__` so canonical-URL recovery can fall through to a worst-case sub-request URL.
- `httpx.Response.elapsed` is only populated on close (not by `httpx.MockTransport` / `pytest-httpx`). `_safe_elapsed` falls back to `timedelta(0)`.
- `httpx.Response.url` is a read-only property. `_set_response_url` rewrites it via the bound request, with a fallback path for `Mock`-shaped test responses.

Backwards-compat
- `BaseMetadata.header` is now `httpx.Headers` instead of `requests.structures.CaseInsensitiveDict`. Case-insensitive lookups (`md.header.get("x-ratelimit-remaining")`) keep working; literal dict equality (`md.header == {"k": "v"}`) no longer holds because `httpx.Headers` carries auto-added entries (content-type, content-length).
- `BaseMetadata.url` is coerced to `str` (previously `str` on `requests.Response`; now `str(httpx.Response.url)`).
- `API_USGS_CONCURRENT` defaults to 16 (parallel). Set `=1` to opt back into the sequential path.

Test plan
- Test suite migrated from `requests-mock` to `pytest-httpx`. `tests/conftest.py` (new) is a `requests_mock`-shaped shim over `httpx_mock` (URL-prefix match, `complete_qs` strict-mode parity, request-history view); an autouse fixture pins `API_USGS_CONCURRENT=1` so the historical mocked suite stays on the deterministic serial path.
- New async-mode tests (… `running-loop` and `missing-async`) in `tests/waterdata_chunking_test.py` opt out of the serial-pin autouse and exercise (a) successful fan-out, (b) probe-first `RequestExceedsQuota`, (c) resumable `ServiceInterrupted.call` after a mid-fan-out failure (resume runs serially on the unfinished indices), (d) the running-event-loop fallback emits a `UserWarning` and runs serial, (e) the missing-`fetch_async` warning fires when the env asks for parallel.
- `ruff check` and `ruff format --check` pass.
- … (`get_daily` / `get_stats_*` / `get_channel`) are resolved by the schema-aware `_get_resp_data` / `_handle_stats_nesting` rewrite incorporated via the rebase.

Out of scope (follow-ups)
- `_fan_out_async` materializes all `(df, response)` pairs before combining. Consider streaming-combine via `asyncio.as_completed` if users push concurrency very high.
- `NEWS.md` entry: left for the merger to draft.

🤖 Generated with Claude Code
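The mode selection summarised in this description (env-var parsing plus the two serial fallbacks) could be sketched as below. The function names `parse_concurrency` and `choose_mode`, the warning texts, and the use of `None` to represent `unbounded` are assumptions for illustration; only the value semantics and the `UserWarning` fallbacks come from the description.

```python
import asyncio
import warnings

DEFAULT_CONCURRENCY = 16  # mirrors the documented default


def parse_concurrency(raw):
    """Map the raw env value to an int >= 1, or None for 'unbounded'."""
    if raw is None:
        return DEFAULT_CONCURRENCY
    text = raw.strip().lower()
    if text == "unbounded":
        return None
    try:
        value = int(text)
    except ValueError:
        raise ValueError(f"API_USGS_CONCURRENT is malformed: {raw!r}") from None
    if value < 1:
        raise ValueError(f"API_USGS_CONCURRENT must be >= 1, got {value}")
    return value


def in_running_loop():
    """True when called from inside an already-running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def choose_mode(raw, has_fetch_async=True):
    """Return 'serial' or 'parallel' for the given env value."""
    if parse_concurrency(raw) == 1:
        return "serial"
    if in_running_loop():
        # asyncio.run() would raise here, so degrade instead of crashing.
        warnings.warn("event loop already running; using the serial path",
                      UserWarning, stacklevel=2)
        return "serial"
    if not has_fetch_async:
        warnings.warn("no async fetcher configured; using the serial path",
                      UserWarning, stacklevel=2)
        return "serial"
    return "parallel"
```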