Skip to content

feat(waterdata): Auto-chunk OGC requests over the URL byte limit#283

Draft
thodson-usgs wants to merge 6 commits into
DOI-USGS:mainfrom
thodson-usgs:chunker-unified
Draft

feat(waterdata): Auto-chunk OGC requests over the URL byte limit#283
thodson-usgs wants to merge 6 commits into
DOI-USGS:mainfrom
thodson-usgs:chunker-unified

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

@thodson-usgs thodson-usgs commented May 18, 2026

Summary

The OGC waterdata getters (get_daily, get_continuous, get_field_measurements, and the rest of the multi-value-capable functions) previously failed with HTTP 414 when the request URL exceeded the server's ~8 KB byte limit. The common chained-query pattern — pull a long site list from get_monitoring_locations, then feed it into get_daily — was the main offender.

This PR introduces a joint chunker that models every multi-value list parameter AND the cql-text filter (split on its top-level OR clauses) as a chunkable axis. Greedy halving splits the biggest chunk across all axes until each sub-request URL fits the limit; the chunker fans out into multiple HTTP requests under the hood and returns one combined DataFrame. Callers see no API change — the same call that used to 414 now just works.

Mirrors R dataRetrieval's #870, generalized from one filter axis to N joint axes.

Minimal example

from dataretrieval.waterdata import get_daily, get_monitoring_locations

# Pull every Ohio stream gage.
sites_df, _ = get_monitoring_locations(
    state_name="Ohio",
    site_type_code="ST",
    skip_geometry=True,
)

# Before: HTTP 414 once the site list grew past ~500.
# After:  transparently chunked into multiple sub-requests, one
#         combined DataFrame returned.
df, md = get_daily(
    monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
    parameter_code="00060",
    time="P7D",
)

Recovery from mid-call interruptions

If the per-key rate-limit window empties partway through a fan-out, the chunker raises QuotaExhausted carrying the partial result plus a resumable call handle — call exc.call.resume() once the window resets and only the still-pending sub-requests are re-issued:

import time
from dataretrieval.waterdata import get_daily
from dataretrieval.waterdata.chunking import ChunkInterrupted

try:
    df, md = get_daily(monitoring_location_id=long_list)
except ChunkInterrupted as exc:
    time.sleep(exc.retry_after or 5 * 60)
    df, md = exc.call.resume()

HTTP 5xx mid-call surfaces analogously as ServiceInterrupted. Both subclass ChunkInterrupted so callers who want one retry policy can catch the base class.

What's in the box

  • Joint planner (ChunkPlan) — extracts every chunkable axis (list-shaped kwargs and the cql-text filter), runs greedy halving on the biggest chunk across all axes until the worst-case sub-request URL fits the limit. Passthrough requests get a trivial single-step plan.
  • Typed transport exceptionsRateLimited / ServiceUnavailable (raised by _raise_for_non_200), recognized by the chunker's _classify_chunk_error and wrapped as resumable QuotaExhausted / ServiceInterrupted.
  • Pre-emptive quota check — after the first sub-request, ChunkedCall reads x-ratelimit-remaining; if the rest of the plan can't fit the window it raises RequestExceedsQuota reporting the deficit before burning more budget. Set API_USGS_LIMIT=0 to bypass.
  • Shared requests.Session across the fan-outChunkedCall.resume opens one Session and publishes it via a ContextVar; _walk_pages and get_stats_data (now sharing a _paginate strategy helper) borrow it instead of opening a fresh one per sub-request. Measured 41% wall-clock reduction on a 2000-site / 8-chunk fan-out against the live USGS API (1.78s vs 3.03s).
  • Schema-aware GeoJSON extraction_get_resp_data and _handle_stats_nesting now build the non-geopandas DataFrame directly from each feature's properties + (optional) id + (optional) geometry.coordinates, instead of json_normalize-ing the whole feature and reactively dropping envelope columns. Closes a recently-surfaced bug where USGS migrating to full GeoJSON geometry objects ({type, coordinates}) caused a geometry_type column (constant "Point") to leak through.
  • Strict one-way layerutils → chunking. The chunker has zero runtime deps on utils (only docstring cross-refs).

Behavior change to be aware of

Paginated/chunked calls now report:

  • BaseMetadata.url — still reflects the user's original query (unchanged).
  • BaseMetadata.header — the last page/sub-request headers (so x-ratelimit-remaining is current). Was: first page's headers.
  • BaseMetadata.query_time — cumulative wall-clock across every page/sub-request. Was: first page's elapsed.

Downstream code that read md.header["x-ratelimit-remaining"] or md.query_time from a chunked call will see different (more useful) values.

The non-geopandas DataFrame for OGC + stats responses no longer carries the geometry_type column (it was always "Point"); callers reading it should switch to checking for geometry directly. Verified bit-identical to origin/main on all four representative docstring-example queries for the columns both produce; the only delta is the removed geometry_type noise column.

Test coverage

80 new unit tests in tests/waterdata_chunking_test.py covering:

  • Planner axis extraction and greedy-halving allocation
  • Cartesian-product enumeration of sub-args
  • Rate-limit gating (RequestExceedsQuota)
  • Resume idempotency and equivalence (an interrupted-then-resumed run produces a byte-identical frame to an uninterrupted run)
  • Transient-error classification (429 vs 5xx vs non-transient)
  • Shared-session reuse across sub-requests and isolation between resume calls
  • URL-construction stress test against the real _construct_api_requests builder (not a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting every sub-request URL stays under 8000 bytes and the joint planner beats the bail-floor worst case
  • Mid-pagination 429/5xx now also covered for both the OGC and stats paginators

Test plan

  • pytest tests/waterdata_chunking_test.py tests/waterdata_utils_test.py — 80/80 pass
  • Full pytest tests/ — 357 pass, 3 skipped, 0 fail (5 previously-failing waterdata_test.py tests now pass after the schema-aware fix)
  • ruff check . clean
  • Live-API smoke test against get_daily for 2000+ Ohio stream gages (chunks transparently into 8 sub-requests, one combined frame returned)
  • Resume equivalence verified end-to-end (interrupted run + resume() produces byte-identical frame to uninterrupted run)
  • Bit-identical comparison vs origin/main on 4 docstring-example queries (only delta: the intentionally-removed geometry_type noise column)
  • Aggressive chunker stress: 2000/3000/all-2888-Ohio-sites and 1500 sites + 15-clause OR-filter — all 4 scenarios pass with correct canonical URLs (up to 58 KB)

Copy link
Copy Markdown
Collaborator Author

@thodson-usgs thodson-usgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pay close attention to the layout: are all variables and functions placed logically into their modules? Or has the logic been mixed up.

Comment thread dataretrieval/waterdata/chunking.py Outdated
- Chunkable dims include multi-value list params (sites, parameter
codes, ...) and the cql-text ``filter`` (split at top-level ``OR``
to keep each chunk valid CQL).
- The planner enumerates candidate filter chunk counts
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is k here?

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +125 to +130
class QuotaExhausted(RuntimeError):
"""Raised mid-chunked-call when the API's reported remaining quota
(``x-ratelimit-remaining`` header) drops below the configured safety
floor. The chunker stops before issuing the next sub-request to
avoid a mid-call HTTP 429 that would silently truncate paginated
results.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bug. A mid-call HTTP 429 should not silently truncate. If it does, fix it, then we won't need to defend against this case.

Comment thread dataretrieval/waterdata/filters.py Outdated
# per-request budget from ``_WATERDATA_URL_BYTE_LIMIT``.
_CQL_FILTER_CHUNK_LEN = 5000

# Empirically the API replies HTTP 414 above ~8200 bytes of full URL —
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved to a different module now?

thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 18, 2026
…r helpers, clarify docs

Three review responses bundled together:

- chunking.py module docstring: define ``k`` as the candidate filter
  chunk count before using it in the planner description.
- ``QuotaExhausted`` docstring: drop the stale "silently truncate"
  framing. PR DOI-USGS#273 / DOI-USGS#279 already raise on a mid-pagination 429, so
  this exception is the structured-recovery alternative (partial
  frames in hand) rather than a defense against silent truncation.
- Move chunker-only orphans from filters.py to chunking.py:
  ``_WATERDATA_URL_BYTE_LIMIT`` (the URL byte ceiling),
  ``_FetchOnce`` TypeVar, ``_combine_chunk_frames``, and
  ``_combine_chunk_responses``. filters.py was a leftover home from
  the pre-unification two-decorator stack; these helpers have no
  callers outside the chunker. Test ``test_multi_value_chunked_lazy_url_limit``
  now monkeypatches the constant on its new module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 18, 2026
Three test docstrings/comments still framed their reasoning against the
removed two-decorator stack (PR DOI-USGS#283 unified them). Reword to describe
the current joint-planner behavior on its own terms:

- ``test_plan_joint_fans_out_filter_when_list_alone_cannot_fit``: drop
  the "previous two-decorator design" aside.
- ``test_chunkable_params_skips_filter_passed_as_list``: rewrite the
  "inner filters.chunked is the only place that may shrink filter"
  line to point at ``_plan_joint``.
- ``stress_chunker._bail_floor_baseline``: reframe the baseline as
  "degenerate singleton plan" rather than "worst case the old
  two-decorator design produced."

No behavioral changes; prose only. Chunker tests + offline stress
test still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator Author

@thodson-usgs thodson-usgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix my comments

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +3 to +8
PR 233 routes most services through GET with comma-separated values
(e.g. ``monitoring_location_id=USGS-A,USGS-B,...``). Long lists and
long top-level-``OR`` CQL filters can independently blow the server's
~8 KB URL byte limit. This module adds a single decorator that plans
both chunking dimensions together and iterates the joint cartesian
product so each sub-request URL fits.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tighten this to summarize the correct design. Don't mention PR 233 or the old design.

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +90 to +95
# Default cap on the number of sub-requests a single chunked call may
# emit. The USGS Water Data API rate-limits each HTTP request
# (including pagination), so the true budget is
# ``hourly_quota / avg_pages_per_chunk``. 1000 matches the default
# hourly quota.
_DEFAULT_MAX_CHUNKS = 1000
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just set this as x-ratelimit-remaining by default

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +97 to +101
# When ``x-ratelimit-remaining`` drops below this between sub-requests,
# the chunker bails with ``QuotaExhausted`` rather than risk a mid-call
# HTTP 429. Carries the partial result so callers can resume from a
# known offset instead of retrying the whole chunked call from scratch.
_DEFAULT_QUOTA_SAFETY_FLOOR = 50
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after the retrieving the first chunk, just check x-ratelimit-remaining, and if the plan will not fit within our current rait limit, bail and return an Error message that the query would exceed our rate limit and report by how much.

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +134 to +144
class QuotaExhausted(RuntimeError):
"""Raised mid-chunked-call when the API's reported remaining quota
(``x-ratelimit-remaining`` header) drops below the configured safety
floor. The chunker stops before issuing the next sub-request and
surfaces the partial result so callers can resume after the hourly
window resets.

A bare 429 raised by ``_walk_pages`` would also abort the call but
discard the chunks completed so far; this exception is the
structured-recovery alternative, triggered pre-emptively while the
accumulated frames are still in hand.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revise this. This error should raise if we prematurely exhaust our quota because another processes is using it faster than predicted. A single process should never incounter this, because it raises a RequestExceedsQuota value error after checking the limit in the first chunk

Comment thread dataretrieval/waterdata/chunking.py Outdated
"""Decorator that splits multi-value list params and cql-text
filters across sub-requests so each sub-request URL fits
``url_limit`` bytes (defaults to ``_WATERDATA_URL_BYTE_LIMIT``)
and the joint cartesian-product plan stays ≤ ``max_chunks``
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again max_chunks should be replaced by the our current rate limit (which we know after receiving the first chunk)

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +560 to +564
Between sub-requests the wrapper reads ``x-ratelimit-remaining`` from
each response. If it drops below ``quota_safety_floor`` (default
``_DEFAULT_QUOTA_SAFETY_FLOOR``), the wrapper raises
``QuotaExhausted`` carrying the combined partial result and the
chunk offset so callers can resume after the hourly window resets.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also change. There is no quota_saftey_floor anymore. But raise the QuotaExhausted if we receive an HTTP error telling us the quota was exhuasted.

thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 19, 2026
…mic rate-limit gate

Addresses PR DOI-USGS#283 review feedback. The static caps
(``_DEFAULT_MAX_CHUNKS=1000``, ``_DEFAULT_QUOTA_SAFETY_FLOOR=50``) and
the matching ``max_chunks`` / ``quota_safety_floor`` decorator
parameters are replaced by a quota check that runs after the first
sub-request, using the real ``x-ratelimit-remaining`` value rather
than a guessed cap.

Behavior:

- After the first sub-request the wrapper reads
  ``x-ratelimit-remaining``. If the rest of the plan won't fit in
  the current rate-limit window, it raises a new
  ``RequestExceedsQuota(ValueError)`` carrying ``planned_chunks``,
  ``available``, and ``deficit`` so the message reports exactly how
  far over budget the call is. The first chunk has already been
  issued; the wrapper stops there rather than burn the rest of the
  quota on a call that will fail mid-way.

- ``QuotaExhausted`` is now triggered only when an actual HTTP 429
  propagates from a sub-request (detected by walking ``__cause__``
  for ``RuntimeError("429: ...")``, the shape ``_raise_for_non_200``
  produces and ``_walk_pages`` wraps). A single-process caller
  should not normally see this — ``RequestExceedsQuota``
  short-circuits in chunk 1; arrival here implies a concurrent
  consumer drained the bucket faster than predicted. Carries the
  partial frame for resume. ``partial_response`` becomes ``None``
  when the 429 hits chunk 0 (no banked responses).

- A non-429 ``RuntimeError`` (e.g. 500) propagates unchanged so the
  real cause surfaces to the caller.

- When the server doesn't echo ``x-ratelimit-remaining``,
  ``_read_remaining`` returns ``_QUOTA_UNKNOWN``; the wrapper skips
  the post-first-chunk quota check (no signal → don't synthesize a
  block).

Planner: ``_plan_list_chunks`` / ``_plan_joint`` no longer carry a
``max_chunks`` cap. ``RequestTooLarge`` fires only when nothing more
can be split (the genuine URL-byte floor). The rate-limit gate
replaces the static cap.

Module docstring rewritten to summarize the current design (joint
planning + dynamic quota gate); historical PR 233 / two-decorator
references dropped.

Tests: ten obsolete cap/floor tests removed; eight new tests added
covering ``RequestExceedsQuota`` after chunk 0, deficit reporting,
the no-header skip path, mid-call 429 → ``QuotaExhausted`` with
partial frame, the first-chunk 429 (partial_response=None) edge
case, and non-429 ``RuntimeError`` pass-through.

``_fetch_once`` in ``utils.py`` calls the decorator with defaults
only, so no call-site changes are needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs
Copy link
Copy Markdown
Collaborator Author

Pushed 01e579e reworking the quota machinery per all the inline comments. Summary:

Module docstring (line 8 comment): tightened. No more PR 233 / two-decorator references — it just describes the current design (joint planner + dynamic quota gate).

_DEFAULT_MAX_CHUNKS / _DEFAULT_QUOTA_SAFETY_FLOOR (line 95, 101 comments): both deleted. After the first sub-request the wrapper reads x-ratelimit-remaining directly; if the remaining plan won't fit in the current window, it raises a new RequestExceedsQuota(ValueError) carrying planned_chunks, available, and deficit so the message reports exactly how far over budget the call is.

QuotaExhausted (line 144 comment): rewritten. It now fires only when an actual HTTP 429 propagates from a sub-request (detected by walking __cause__ for RuntimeError("429: ..."), the shape _raise_for_non_200 produces). The docstring states explicitly that a single-process caller should not normally see this — RequestExceedsQuota short-circuits in chunk 1; arrival here implies a concurrent consumer drained the bucket faster than predicted. Carries the partial frame for resume. partial_response is None when the 429 hits chunk 0.

max_chunks / quota_safety_floor decorator params (line 552, 564 comments): removed. _plan_list_chunks and _plan_joint no longer carry a max_chunks cap; RequestTooLarge fires only on the genuine "nothing left to split" floor. The rate-limit gate replaces the static cap.

Side effects: _fetch_once in utils.py already called the decorator with defaults only, so no call-site changes were needed. Ten obsolete cap/floor tests were removed and eight new ones added covering RequestExceedsQuota after chunk 0, deficit reporting, the no-header skip path, mid-call 429 → QuotaExhausted with partial frame, the first-chunk 429 (partial_response=None) edge case, and non-429 RuntimeError pass-through. All chunker tests + offline stress test pass.

Copy link
Copy Markdown
Collaborator Author

@thodson-usgs thodson-usgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix these things

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +12 to +18
Planning: for a filter with ``n_clauses`` top-level OR clauses, try
candidate filter chunk counts ``k = 1, 2, 4, ..., n_clauses``. For
each, partition clauses into ``k`` count-balanced groups joined by
``OR``, take the longest (URL-encoded) group as the worst-case filter,
then plan list-dim chunking by greedy halving against the remaining
budget. Keep the candidate with the smallest ``list_count × k``.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds like filter chunking is given priority over list-dim chunking. Shouldn't the filter be treated as any other dim?

Comment thread dataretrieval/waterdata/chunking.py Outdated
Comment on lines +439 to +440
@classmethod
def from_args(
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be cleaner to refactor this to a class.init

Comment thread dataretrieval/waterdata/chunking.py Outdated
return head


class _ChunkExecution:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we call this QueryExecutor? Really a class should be a noun. So QueryOrchestrator or ChunkManager or ChunkExecutor

thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 21, 2026
…or, axis-symmetric docstring

Addresses three PR DOI-USGS#283 review comments:

- **Module docstring reframed for axis symmetry.** The previous text
  read as "filter is the outer loop, list dims are inner," which
  obscured that both axis kinds are chunkable dimensions. The new
  framing leads with "every multi-value list parameter and the filter
  are chunkable axes" and explains *why* the algorithm enumerates
  filter counts in the outer loop (filter chunking is discrete in
  OR-clause cardinality; list dims are continuously halvable) rather
  than presenting the asymmetry as arbitrary.

- **``ChunkPlan.from_args`` → ``ChunkPlan.__init__``.** Now that the
  passthrough case is just a trivial plan (never ``None``), the
  classmethod-constructor pattern was unjustified. ``__init__`` does
  the planning directly: ``ChunkPlan(args, build_request, url_limit)``
  reads as "construct a plan for these args." Dropped ``@dataclass``;
  the fields are still simple attributes, just assigned in ``__init__``.
  Extracted the search loop to a free helper ``_search_best_chunking``
  so ``__init__`` stays readable.

- **``_ChunkExecution`` → ``_ChunkExecutor``.** Classes should be nouns;
  "Execution" reads as an event, "Executor" as an actor. Pairs cleanly
  with ``ChunkPlan`` — the plan is the recipe, the executor runs it.

The wrapper is unchanged in shape:

    return ChunkPlan(args, build_request, limit).execute(fetch_once)

Tests updated to use the direct constructor; all 145 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 22, 2026
Replaces ``requests`` with ``httpx`` package-wide and adds an async
parallel branch to the multi-value chunker, governed by
``API_USGS_CONCURRENT`` (parallel-by-default; set ``=1`` for the
legacy sequential path). Benchmarked at ~5.3x speedup on a 52k-site /
10-state ``get_daily`` call.

The parallel fan-out runs on a single shared ``httpx.AsyncClient`` so
sub-requests amortize one TCP+TLS handshake — impossible with the
sync ``requests`` stack without a thread pool. Built on top of the
``ChunkPlan`` / ``ChunkedCall`` arch from DOI-USGS#283: the sync path drives
``ChunkedCall.resume()`` (resumable, with ``ChunkInterrupted``
guarantees); the parallel path uses ``_fan_out_async`` to iterate the
same plan via ``asyncio.gather`` + ``Semaphore``. Both paths publish
their client via ``ContextVar`` so ``_walk_pages`` and
``get_stats_data`` reuse one client across sub-requests.

Backwards-compat: ``BaseMetadata.header`` is now ``httpx.Headers``
(case-insensitive dict reads still work; literal dict equality breaks
because ``httpx.Headers`` carries auto-added entries like ``host``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs thodson-usgs changed the title refactor(waterdata): Unify list and filter chunkers into one joint planner feat(waterdata): Auto-chunk OGC requests over the URL byte limit May 22, 2026
@thodson-usgs thodson-usgs force-pushed the chunker-unified branch 3 times, most recently from 11c848e to 068e756 Compare May 22, 2026 21:13
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )
    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.

Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.

After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.

Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:

    import time
    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()

`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).

One behavior change for paginated/chunked calls:

- `BaseMetadata.url` still reflects the user's original query
  (unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
  headers so downstream code that branches on
  `x-ratelimit-remaining` sees current state (was: first page's
  headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
  every page/sub-request (was: first page's elapsed).

- New module `dataretrieval.waterdata.chunking`: joint planner,
  exception hierarchy (`_RetryableTransportError`, `RateLimited`,
  `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
  `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
  `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
  shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
  into a `_paginate` strategy helper that `_walk_pages` and
  `get_stats_data` both delegate to; typed transport exceptions
  moved out to `chunking` so the layer direction is strictly
  `utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
  and filter-chunkability detector kept as primitives the joint
  planner consumes.

80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.

Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
``slots=True`` for ``@dataclass`` requires Python 3.10. The package
declares ``requires-python = ">=3.9"`` and CI tests 3.9, so the import
was failing test collection on the 3.9 matrix cell. Dropping the kwarg
loses a small memory optimization on short-lived ``_Axis`` instances
(not material) and restores compatibility.

Also aligns one residual "sub-chunk" comment to "chunk" — the rest of
the file already uses "chunk".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
thodson-usgs and others added 4 commits May 23, 2026 10:56
Both functions exist only to serve ``ChunkPlan.__init__`` and their
parameters (``args``, ``axes``, ``chunks``) duplicate state the plan
already holds. Folding them in as ``ChunkPlan._plan`` and
``ChunkPlan._worst_case_args`` makes the "planning IS construction"
framing honest, removes parameter threading, and disambiguates the
mutation target (``self.chunks`` rather than a passed-in dict).

``_extract_axes`` stays module-level — it operates on a raw args dict,
has no plan state, and is imported directly by tests.

No behavior change; 47/47 chunking tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop justifications of design alternatives we rejected (ContextVar
vs. _FetchOnce kwarg; pinning separators against past de-sync), and
trim the snapshot/copy rationale comments to keep the load-bearing
asymmetry note without the surrounding prose.

No behavior change; 47/47 chunking tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… one comment

A second copy of ``_paginated_failure_message`` had survived a rebase
and was silently shadowing the typed-isinstance version with a
worse string-prefix branch. Removed.

Also drops a "(the closure used to do ...)" comment that describes a
past iteration of ``get_stats_data``'s follow-up closure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* `_combine_chunk_frames` all-empty now preserves the GeoDataFrame
  type of its input — returning a plain ``pd.DataFrame()`` would have
  downgraded the result on geopd installs, defeating the exact
  contract the function's docstring describes.
* `_combine_chunk_frames` single-frame fast path now returns
  ``.copy()`` — the live ``ChunkedCall.partial_frame`` property used
  to alias ``_chunks[0][0]`` so caller mutations would corrupt the
  stored chunk frame.
* `ChunkPlan.iter_sub_args` passthrough now yields ``dict(self.args)``
  to match the chunked branch (and the docstring); the old version
  yielded ``self.args`` directly, a latent mutation footgun.
* Quota check now fires after every non-final chunk, not only after
  chunk 0 — the old ``len(_chunks) == 1`` gating silently disabled
  the pre-emptive guard on every ``resume()`` after a
  ``ChunkInterrupted``, and missed concurrent-drain mid-call. Method
  renamed ``_check_quota_after_first`` → ``_check_quota_remaining``.
* ``RequestExceedsQuota`` now carries a ``.call`` handle to the
  originating ``ChunkedCall`` — the first chunk's already-fetched
  data was previously unrecoverable because the exception is a
  ``ValueError`` with no resume affordance.
* ``_get_resp_data`` now treats a 200 with ``numberReturned > 0`` but
  missing ``features`` key as an empty page rather than crashing with
  ``KeyError``, mirroring the hardening already applied to
  ``_handle_stats_nesting``.

Six regression tests added; one existing test
(``test_walk_pages_wraps_initial_page_parse_error``) updated to use a
``JSONDecodeError`` instead of the now-handled missing-features case.
97/97 chunking + utils tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant