feat: expose variety of features from DF54 update #1554
DataFusion 53 deprecated `TableFunctionImpl::call(args: &[Expr])` in favor of `call_with_args(args: TableFunctionArgs)`. `PyTableFunction` was migrated in 5a64b0d; this brings the FFI example along so it no longer relies on the deprecated entry point. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR apache#1541 introduced `with_logical_extension_codec` / `with_physical_extension_codec` setters typed as `codec: Any`. The Rust extractors accept either a raw `PyCapsule` or any object exposing `__datafusion_logical_extension_codec__` / `__datafusion_physical_extension_codec__`. Add `LogicalExtensionCodecExportable` / `PhysicalExtensionCodecExportable` Protocols in `python/datafusion/user_defined.py` (matching the existing `ScalarUDFExportable` pattern) and tighten both setter signatures to `Protocol | _PyCapsule`. Pure typing change; no runtime behavior diff. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
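The structural-typing idea behind these Protocols can be sketched with the standard library alone. This is an illustration, not the code in `python/datafusion/user_defined.py`: the method's return annotation, the `MyCodec` and `NotACodec` classes, and the `accepts_codec` helper are all assumptions made for the example, and a plain `object()` stands in for a real `PyCapsule`.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class LogicalExtensionCodecExportable(Protocol):
    """Structural type: any object exposing the export dunder qualifies."""

    # Return annotation is an assumption; the real hook hands back a PyCapsule.
    def __datafusion_logical_extension_codec__(self) -> Any: ...


class MyCodec:
    """Hypothetical user codec. It never subclasses the Protocol."""

    def __datafusion_logical_extension_codec__(self) -> Any:
        return object()  # stand-in for a real PyCapsule


class NotACodec:
    """Hypothetical class without the export dunder."""


def accepts_codec(codec: object) -> bool:
    """Return True when the object structurally matches the Protocol."""
    return isinstance(codec, LogicalExtensionCodecExportable)


print(accepts_codec(MyCodec()))    # matches by shape, not by inheritance
print(accepts_codec(NotACodec()))  # lacks the dunder, so it does not match
```

A static type checker applies the same structural rule to a setter annotated with the Protocol, which is why the change can tighten the signature without touching runtime behavior.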
Upstream exposes both `get_field(expr, name)` and `get_field_path(expr, [names...])`, but both ultimately call the same scalar UDF with a base expression plus one or more name args. Collapse the Python surface into a single variadic `get_field(expr, *names)` that accepts either a one-step lookup or a path of names, dispatching through a single Rust binding. Note in `.ai/skills/check-upstream/SKILL.md` that `get_field_path` is covered by the variadic form so future audits do not flag it as a gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
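The shape of the variadic signature can be illustrated without DataFusion. The sketch below walks nested dictionaries instead of building expressions, so it only mirrors the calling convention. The `ValueError` for an empty name list is an assumption, since the commit message does not say which exception the real binding raises.

```python
from typing import Any


def get_field(value: dict[str, Any], *names: str) -> Any:
    """Follow one or more field names into a nested mapping.

    Stand-in for the expression-building function: same call shape,
    but it operates on plain dicts rather than DataFusion expressions.
    """
    if not names:
        # Assumed error type; the real binding may raise something else.
        raise ValueError("get_field requires at least one field name")
    current: Any = value
    for name in names:
        current = current[name]
    return current


row = {"outer": {"inner": 42}}
one_step = get_field(row, "outer")        # single lookup
path = get_field(row, "outer", "inner")   # path of names, one call
print(one_step, path)
```

Because a single name is just a path of length one, the one-step and multi-step forms share one code path, which is the reason a separate `get_field_path` is unnecessary on the Python side.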
Wrap upstream `SessionContext::read_batches`, which materializes a DataFrame directly from a sequence of `RecordBatch`es without registering a named table. The single-batch convenience `SessionContext.read_batch` is implemented in pure Python by calling `read_batches([batch])`, so the Rust side only needs the one binding. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
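The delegation described here is small enough to sketch in a few lines. `SessionContextSketch` is a hypothetical stand-in, and it returns a plain list where the real method returns a DataFrame.

```python
from typing import Any, Iterable


class SessionContextSketch:
    """Hypothetical stand-in for SessionContext; not the real class."""

    def read_batches(self, batches: Iterable[Any]) -> list[Any]:
        # The real method calls into the Rust binding and returns a DataFrame.
        return list(batches)

    def read_batch(self, batch: Any) -> list[Any]:
        # Single-batch convenience: wrap in a list and reuse read_batches.
        return self.read_batches([batch])


ctx = SessionContextSketch()
print(ctx.read_batch("batch-0"))
print(ctx.read_batches(["batch-0", "batch-1"]))
```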
Expose `udf(name)` / `udaf(name)` / `udwf(name)` lookups symmetric with the existing `register_udf` / `register_udaf` / `register_udwf` setters, plus `udfs()` / `udafs()` / `udwfs()` for enumerating registered function names. Looked-up functions come back as the same `ScalarUDF` / `AggregateUDF` / `WindowUDF` wrappers users already get from registration, so they can be called as expressions or re-registered into a different session. Returns Vec<String> from the list helpers (sorted) rather than the raw HashSet upstream returns, so calling code gets a stable ordering. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
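A toy registry shows the lookup, listing, and re-registration pattern. Everything here is a hypothetical stand-in: the `FunctionRegistrySketch` class, the use of plain callables as "UDFs", and the `KeyError` on a missing name are assumptions, not the behavior of the real `SessionContext`.

```python
from typing import Callable


class FunctionRegistrySketch:
    """Hypothetical registry mirroring the register/lookup/list symmetry."""

    def __init__(self) -> None:
        self._udfs: dict[str, Callable[[int], int]] = {}

    def register_udf(self, name: str, func: Callable[[int], int]) -> None:
        self._udfs[name] = func

    def udf(self, name: str) -> Callable[[int], int]:
        # Assumed error type for a missing name.
        if name not in self._udfs:
            raise KeyError(f"no scalar UDF registered under {name!r}")
        return self._udfs[name]

    def udfs(self) -> list[str]:
        # Sort so callers see a stable order regardless of insertion order.
        return sorted(self._udfs)


source = FunctionRegistrySketch()
source.register_udf("triple", lambda x: x * 3)
source.register_udf("double", lambda x: x * 2)

# Look a function up by name and re-register it in a second registry.
target = FunctionRegistrySketch()
target.register_udf("double", source.udf("double"))

print(source.udfs())
print(target.udf("double")(21))
```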
pyarrow.parquet promotes timestamp[s] to timestamp[ms] on write (apache/arrow#41382), so the read array never matched the input. Cast the expected array to timestamp[ms] in test_simple_select to assert DataFusion reads what Arrow actually stored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DataFrameHtmlFormatter(repr_rows=..., max_rows=...) fires the deprecation warning before raising ValueError, but pytest.raises does not catch warnings. The escaping warning surfaced in every pytest run. Wrap the call in both pytest.raises and pytest.warns so the warning is asserted, not leaked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
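The underlying problem, a call that warns first and raises second, can be reproduced with the standard library. The sketch below is an analogue rather than the test in this PR: it uses `warnings.catch_warnings` and `try`/`except` in place of `pytest.warns` and `pytest.raises`, and `make_formatter` with its messages is a hypothetical stand-in for `DataFrameHtmlFormatter`.

```python
import warnings


def make_formatter(repr_rows=None, max_rows=None):
    """Hypothetical stand-in: warn on the deprecated argument, then reject the pair."""
    if repr_rows is not None:
        warnings.warn(
            "repr_rows is deprecated; use max_rows",
            DeprecationWarning,
            stacklevel=2,
        )
    if repr_rows is not None and max_rows is not None:
        raise ValueError("pass only one of repr_rows and max_rows")


def check_warns_and_raises():
    """Capture the warning and the exception from the same call."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record even normally-hidden warnings
        try:
            make_formatter(repr_rows=5, max_rows=10)
        except ValueError:
            raised = True
        else:
            raised = False
    warned = any(issubclass(w.category, DeprecationWarning) for w in caught)
    return warned, raised


print(check_warns_and_raises())
```

Catching only the exception would leave the warning unhandled, which is how it ended up in every test run; asserting both makes the warning part of the test's contract.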
Add Examples docstrings (doctest) for `udf` / `udaf` / `udwf` / `udfs` / `udafs` / `udwfs` that demonstrate the lookup pattern, including a late-binding example where the function name comes from configuration. Add tests covering config-driven dispatch and built-in UDAF / UDWF lookup so the documented patterns are exercised end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR updates the Python bindings and examples to expose additional DataFusion 54-era functionality (notably UDF/UDAF/UDWF discovery + lookup helpers and Arrow RecordBatch ingestion conveniences), and adjusts tests/tooling accordingly.
Changes:
- Add `SessionContext.read_batch` / `read_batches` plus UDF/UDAF/UDWF lookup & listing helpers (`udf`/`udaf`/`udwf`, `udfs`/`udafs`/`udwfs`).
- Extend `functions.get_field` to support multi-segment nested field paths (and update the Rust binding accordingly).
- Update tests to cover the new API surface and adjust timestamp/parquet and deprecation-warning expectations; bump pre-commit hook version.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/datafusion/context.py | Adds batch-reading helpers and UDF/UDAF/UDWF discovery + lookup methods; improves codec type hints. |
| python/datafusion/functions.py | Updates get_field to accept nested paths. |
| python/datafusion/user_defined.py | Introduces Protocol type hints for logical/physical extension codec exportables. |
| crates/core/src/context.rs | Exposes read_batches and function-registry lookup/listing to Python via PyO3. |
| crates/core/src/functions.rs | Updates internal get_field binding to accept a vector of path segments. |
| examples/datafusion-ffi-example/src/table_function.rs | Updates example for upstream TableFunctionImpl API changes (call_with_args). |
| python/tests/test_context.py | Adds coverage for read_batch/read_batches. |
| python/tests/test_dataframe.py | Adjusts test to assert both DeprecationWarning and ValueError. |
| python/tests/test_functions.py | Adds coverage for nested-path get_field and empty-arg error behavior. |
| python/tests/test_sql.py | Removes timestamp[s] xfail and compensates for parquet timestamp unit promotion. |
| python/tests/test_udf.py | Adds coverage for UDF/UDAF/UDWF lookup + late-binding dispatch. |
| .pre-commit-config.yaml | Bumps actionlint hook version to fix CI failures. |
| .ai/skills/check-upstream/SKILL.md | Documents that get_field_path is covered by variadic get_field. |
| FFI interface. ``codec`` must either be a raw ``FFI_LogicalExtensionCodec`` | ||
| ``PyCapsule`` or an object exposing | ||
| ``__datafusion_logical_extension_codec__``. |
Do we need this addition? Isn't this redundant with the typing?
|
|
||
| This only supports codecs that have been implemented using the | ||
| FFI interface. | ||
| FFI interface. ``codec`` must either be a raw |
Ditto on shadowing type hint
|
|
||
| >>> df = df.with_column( | ||
| ... "outer", | ||
| ... dfn.functions.named_struct([("inner", dfn.col("s"))]), |
NIT: Not required here, but the doctest namespace already imports `functions` as `F`, which would make this addition less verbose.
| """ | ||
| self.ctx.register_record_batches(name, partitions) | ||
|
|
||
| def read_batch(self, batch: pa.RecordBatch) -> DataFrame: |
I would consider it more pythonic for `read_batches` to accept `RecordBatch | Iterable[RecordBatch]`
| name: Name of the registered scalar UDF. | ||
|
|
||
| Raises: | ||
| Exception: If no scalar UDF is registered under ``name``. |
I don't recall if this is the convention across the code base but on quick look I'd expect to just return ScalarUDF | None
| def udf(self, name: str) -> ScalarUDF: | ||
| """Look up a registered scalar UDF by name. | ||
|
|
||
| Returns the same :py:class:`~datafusion.user_defined.ScalarUDF` |
Shadows the return type.
Which issue does this PR close?
No single issue — this is wave 1 of follow-up work after the DataFusion 54 upgrade (#1532). Each commit is self-contained and can be reviewed independently.
Rationale for this change
DataFusion 54 introduced or deprecated several pieces of upstream API surface that the Python bindings had not yet caught up with. This PR closes the highest-value gaps.
What changes are included in this PR?
- `LogicalExtensionCodecExportable` / `PhysicalExtensionCodecExportable` Protocols, to make hinting signatures more understandable
- No separate `get_field_path`; it is instead folded into `get_field` to be more pythonic
- `SessionContext.read_batches` / `read_batch`
Are there any user-facing changes?
Yes, but they are all additions. No breaking changes to existing public APIs.