[core] Introduce file resource management #8179

Open
gavin9402 wants to merge 9 commits into apache:master from gavin9402:resource-management

@gavin9402 gavin9402 commented Jun 9, 2026

Contributor

Purpose

Introduce resource management capabilities to the REST Catalog, providing a unified way to manage file resources (FILE, JAR, PY, ARCHIVE) associated with databases. This lays the foundation for upcoming ML model and function features, where users will need to reference and manage external file resources such as model artifacts, UDF JARs, and Python scripts.

Changes

Resource Model (paimon-api)

  • Resource interface and AbstractResource base class — define the resource abstraction with properties like name, type, description, URI, and custom properties
  • FileResource, JarResource, PyResource, ArchiveResource — concrete resource types for FILE/JAR/PY/ARCHIVE
  • ResourceType enum — four supported resource types
  • ResourceChange — change operations for altering resources (setProperty, removeProperty, setDescription, setUri)
  • ResourceDeserializer — Jackson deserializer for polymorphic resource deserialization
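The resource model above can be sketched in a few lines. This is a minimal Python illustration, not the PR's Java code: the class names `SetProperty`, `RemoveProperty`, `SetDescription`, `SetUri` and the helper `apply_changes` are hypothetical stand-ins for the `ResourceChange` operations, and a single `Resource` dataclass replaces the per-type subclasses.

```python
from dataclasses import dataclass, field, replace
from enum import Enum
from typing import Dict, Iterable, Optional


class ResourceType(Enum):
    """The four resource types named in the PR description."""
    FILE = "FILE"
    JAR = "JAR"
    PY = "PY"
    ARCHIVE = "ARCHIVE"


@dataclass(frozen=True)
class Resource:
    """Simplified stand-in for the Resource abstraction (assumed field names)."""
    name: str
    type: ResourceType
    uri: str
    description: Optional[str] = None
    properties: Dict[str, str] = field(default_factory=dict)


# Hypothetical change records, one per operation listed for ResourceChange.
@dataclass(frozen=True)
class SetProperty:
    key: str
    value: str


@dataclass(frozen=True)
class RemoveProperty:
    key: str


@dataclass(frozen=True)
class SetDescription:
    description: Optional[str]


@dataclass(frozen=True)
class SetUri:
    uri: str


def apply_changes(resource: Resource, changes: Iterable[object]) -> Resource:
    """Return a new Resource with the changes applied in order."""
    for change in changes:
        if isinstance(change, SetProperty):
            props = {**resource.properties, change.key: change.value}
            resource = replace(resource, properties=props)
        elif isinstance(change, RemoveProperty):
            props = {k: v for k, v in resource.properties.items() if k != change.key}
            resource = replace(resource, properties=props)
        elif isinstance(change, SetDescription):
            resource = replace(resource, description=change.description)
        elif isinstance(change, SetUri):
            resource = replace(resource, uri=change.uri)
        else:
            raise TypeError(f"unsupported change: {change!r}")
    return resource
```

Modelling each change as a small immutable record keeps an alter request an ordered list of independent operations, which is one plausible reading of how `alterResource` consumes `ResourceChange` values.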

REST API (paimon-api)

  • ResourcePaths — URL path builders for resource endpoints (/resources, /resource-details, /resources/{name})
  • RESTApi — 8 new resource management API methods: listResources, listResourcesPaged, listResourceDetailsPaged, getResource, createResource, dropResource, alterResource, listResourcesPagedGlobally
  • Request/Response classes: CreateResourceRequest, AlterResourceRequest, GetResourceResponse, ListResourcesResponse, ListResourceDetailsResponse, ListResourcesGloballyResponse

Catalog Interface (paimon-core)

  • Catalog — 8 new interface methods for resource CRUD + ResourceAlreadyExistException and ResourceNotExistException inner exception classes
  • AbstractCatalog — default UnsupportedOperationException implementations
  • DelegateCatalog — delegation implementations
  • RESTCatalog — full REST-backed implementations
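The catalog layering above (unsupported by default, overridden by concrete catalogs, with dedicated exceptions) might look roughly like this. It is a Python sketch under assumptions: the method signatures, the `ignore_if_exists` / `ignore_if_not_exists` flags, and `InMemoryCatalog` are illustrative and not taken from the PR; `NotImplementedError` stands in for Java's `UnsupportedOperationException`.

```python
from typing import Dict, List, Tuple


class ResourceAlreadyExistException(Exception):
    """Raised when creating a resource whose name is already taken."""


class ResourceNotExistException(Exception):
    """Raised when a requested resource cannot be found."""


class AbstractCatalog:
    """Base catalog: resource operations are unsupported unless overridden."""

    def create_resource(self, database, name, resource, ignore_if_exists=False):
        raise NotImplementedError("resources are not supported by this catalog")

    def get_resource(self, database, name):
        raise NotImplementedError("resources are not supported by this catalog")

    def drop_resource(self, database, name, ignore_if_not_exists=False):
        raise NotImplementedError("resources are not supported by this catalog")

    def list_resources(self, database) -> List[str]:
        raise NotImplementedError("resources are not supported by this catalog")


class InMemoryCatalog(AbstractCatalog):
    """Hypothetical catalog used only to illustrate the CRUD semantics."""

    def __init__(self) -> None:
        self._resources: Dict[Tuple[str, str], dict] = {}

    def create_resource(self, database, name, resource, ignore_if_exists=False):
        key = (database, name)
        if key in self._resources:
            if ignore_if_exists:
                return
            raise ResourceAlreadyExistException(f"{database}.{name}")
        self._resources[key] = dict(resource)

    def get_resource(self, database, name):
        try:
            return self._resources[(database, name)]
        except KeyError:
            raise ResourceNotExistException(f"{database}.{name}") from None

    def drop_resource(self, database, name, ignore_if_not_exists=False):
        removed = self._resources.pop((database, name), None)
        if removed is None and not ignore_if_not_exists:
            raise ResourceNotExistException(f"{database}.{name}")

    def list_resources(self, database) -> List[str]:
        return sorted(n for (db, n) in self._resources if db == database)
```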

Tests (paimon-core)

  • RESTApiJsonTest — JSON serialization/deserialization tests for resource request/response classes
  • RESTCatalogTest — integration tests for resource CRUD operations
  • RESTCatalogServer — mock REST server with resource management route handlers
  • MockRESTMessage — test helper methods for constructing resource test data

API Summary

| Operation | Method | Endpoint |
| --- | --- | --- |
| List resources | GET | /v1/{prefix}/databases/{db}/resources |
| List resource details | GET | /v1/{prefix}/databases/{db}/resource-details |
| Get resource | GET | /v1/{prefix}/databases/{db}/resources/{name} |
| Create resource | POST | /v1/{prefix}/databases/{db}/resources |
| Drop resource | DELETE | /v1/{prefix}/databases/{db}/resources/{name} |
| Alter resource | POST | /v1/{prefix}/databases/{db}/resources/{name} |
| List resources globally | GET | /v1/{prefix}/resources |
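As a rough illustration of how the endpoints in the API summary could be assembled, here is a Python sketch of a path builder. The method names, the `request_line` helper, and the percent-encoding of path segments are assumptions made for this example; only the URL shapes and HTTP methods come from the summary.

```python
from urllib.parse import quote


def _seg(value: str) -> str:
    # Percent-encode one path segment (the encoding policy is an assumption).
    return quote(value, safe="")


class ResourcePaths:
    """Builds the resource endpoint paths listed in the API summary."""

    def __init__(self, prefix: str) -> None:
        self._base = f"/v1/{_seg(prefix)}"

    def resources(self, database: str) -> str:
        return f"{self._base}/databases/{_seg(database)}/resources"

    def resource_details(self, database: str) -> str:
        return f"{self._base}/databases/{_seg(database)}/resource-details"

    def resource(self, database: str, name: str) -> str:
        return f"{self.resources(database)}/{_seg(name)}"

    def global_resources(self) -> str:
        return f"{self._base}/resources"


def request_line(paths: ResourcePaths, operation: str, database: str = "", name: str = ""):
    """Map an operation name (hypothetical keys) to (HTTP method, path)."""
    routes = {
        "list": ("GET", lambda: paths.resources(database)),
        "list_details": ("GET", lambda: paths.resource_details(database)),
        "get": ("GET", lambda: paths.resource(database, name)),
        "create": ("POST", lambda: paths.resources(database)),
        "drop": ("DELETE", lambda: paths.resource(database, name)),
        "alter": ("POST", lambda: paths.resource(database, name)),
        "list_globally": ("GET", lambda: paths.global_resources()),
    }
    method, build = routes[operation]
    return method, build()
```

Note that create and alter both use POST and differ only in whether the path ends with the resource name.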

Tests

mvn -pl paimon-core -am -Pfast-build -DfailIfNoTests=false -Dtest="RESTApiJsonTest,RESTCatalogTest" test

@TheR1sing3un

Member

This is a very surprising PR. Is there a relevant PIP that provides more background?

@gavin9402

Contributor Author

> This is a very surprising PR. Is there a relevant PIP that provides more background?

Thank you for your suggestion. I will submit the PIP as soon as possible.

@JingsongLi

Contributor

For the first PR, I think you can focus on introducing Resource.

@gavin9402 gavin9402 force-pushed the resource-management branch from 57a260d to 81f5235 Compare June 10, 2026 02:03
@gavin9402

Contributor Author

> For the first PR, I think you can focus on introducing Resource.

Sure, let me revise it.

@gavin9402 gavin9402 force-pushed the resource-management branch from 81f5235 to 25c4603 Compare June 10, 2026 02:50
@gavin9402 gavin9402 changed the title from "[core] Introduce resource && ML model management" to "[core] Introduce file resource management" Jun 10, 2026
@gavin9402 gavin9402 requested a review from JingsongLi June 10, 2026 09:44

private final Identifier identifier;
@Nullable private final String comment;
private final String uri;

Contributor

How to use this URI? How to get rest token for this file?

Contributor Author

The current design delegates URI handling to the file systems integrated with the engine. The metastore service is only responsible for permission management of the resource entity.

For example, in Daft, we assign the URI to the resources field of the PyFileResourceFunction instance.

    def _get_function(self, ident: Identifier) -> Function:
        ...
        paimon_daft_func: FunctionDefinition = self._inner.get_function(str(ident)).definitions()["daft"]
        ...

        # file_resources may be a list attribute or a callable method
        raw_resources = paimon_daft_func.file_resources
        resources = raw_resources() if callable(raw_resources) else raw_resources

        return PyFileResourceFunction(
            identifier=ident,
            module_name=paimon_daft_func.class_name,
            binding_name=paimon_daft_func.function_name,
            resources=[item.uri for item in resources],
        )

During execution, the engine resolves it by fetching the resources through the corresponding file system, and file permissions are also handled by the file system.

    async def run_plan(
        self,
        plan: LocalPhysicalPlan,
        exec_cfg: PyDaftExecutionConfig,
        context: dict[str, str] | None,
        added_resources: dict[str, int] | None = None,
        **inputs: (
            Input | list[ray.ObjectRef]
        ),  # PyMicroPartitions are separated from Inputs because they are Ray ObjectRefs, which will be resolved by Ray.
    ) -> AsyncGenerator[MicroPartition | FlightPartitions | SwordfishTaskMetadata, None]:
        """Run a plan on swordfish and yield partitions."""
        if added_resources:
            file_resource_manager.resolve(added_resources)

More simply, we could also handle this with Paimon FileIO. The engine's behavior is in fact similar.
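The delegation described in this comment, where the catalog stores only a URI and the engine picks a file system to read it, can be sketched as a scheme-keyed registry. This is a hypothetical Python illustration: `FileResourceResolver`, `register`, and `read_local` are invented names and do not reflect Daft's `file_resource_manager` or Paimon's FileIO.

```python
from pathlib import Path
from typing import Callable, Dict
from urllib.parse import urlparse
from urllib.request import url2pathname

# A reader takes a full URI and returns the file contents.
Reader = Callable[[str], bytes]


def read_local(uri: str) -> bytes:
    """Read a file:// URI (or a plain path) from the local file system."""
    return Path(url2pathname(urlparse(uri).path)).read_bytes()


class FileResourceResolver:
    """Hypothetical engine-side resolver: dispatches on the URI scheme."""

    def __init__(self) -> None:
        self._readers: Dict[str, Reader] = {"file": read_local}

    def register(self, scheme: str, reader: Reader) -> None:
        # Engines would plug in object-store readers (for example oss, s3) here.
        self._readers[scheme] = reader

    def resolve(self, uri: str) -> bytes:
        scheme = urlparse(uri).scheme or "file"
        reader = self._readers.get(scheme)
        if reader is None:
            raise ValueError(f"no file system registered for scheme {scheme!r}")
        # Access control is whatever the chosen file system enforces.
        return reader(uri)
```

In this arrangement the catalog never sees credentials for the underlying storage, which is the trade-off the reviewer questions in the next comment.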

Contributor

For our REST Catalog, file system permissions should be managed by the Catalog, so it feels like Resource also needs to expose FileIO.

Contributor Author

Your suggestion is absolutely right. I moved Resource to the paimon-core module and implemented the toBytes and newInputStream methods for it.

However, this introduces a small side effect: if Resource needs to be used in Function in the future, then Function would also need to be refactored into the paimon-core module.
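A resource that carries a catalog-provided FileIO and exposes read methods might look like the following Python sketch. The `FileIO` protocol, `InMemoryFileIO`, and `FileResource` are simplified assumptions for illustration; they mirror the idea of `toBytes` and `newInputStream`, not the actual paimon-core classes.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Dict, Protocol


class FileIO(Protocol):
    """Minimal stand-in for a catalog-managed file IO abstraction."""

    def new_input_stream(self, uri: str) -> BinaryIO:
        ...


class InMemoryFileIO:
    """Hypothetical FileIO backed by a dict, used only for this example."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = dict(files)

    def new_input_stream(self, uri: str) -> BinaryIO:
        if uri not in self._files:
            raise FileNotFoundError(uri)
        return io.BytesIO(self._files[uri])


@dataclass(frozen=True)
class FileResource:
    """Resource that reads its own content through the injected FileIO."""

    name: str
    uri: str
    file_io: FileIO

    def new_input_stream(self) -> BinaryIO:
        return self.file_io.new_input_stream(self.uri)

    def to_bytes(self) -> bytes:
        # Close the stream once the content has been read.
        with self.new_input_stream() as stream:
            return stream.read()
```

Because the FileIO is injected by the catalog, the caller never chooses a file system itself, which is the property the reviewer asked for.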

@gavin9402 gavin9402 requested a review from JingsongLi June 13, 2026 09:01
private final String uri;
private final long size;
private final long lastModifiedTime;
private final UriReaderFactory uriReaderFactory;

Contributor

Why use only UriReaderFactory? This is not the FileIO from the Catalog.

Contributor Author

Updated: now using fileIOForData.

@gavin9402

Contributor Author

Python implementation added

@gavin9402 gavin9402 requested a review from JingsongLi June 22, 2026 06:43
uri,
response.size(),
response.lastModifiedTime(),
fileIOForData(new Path(uri), identifier));

Contributor

This still breaks resource reads when the REST catalog has data-token.enabled=true. fileIOForData wraps the URI with RESTTokenFileIO, and RESTTokenFileIO.refreshToken() always calls loadTableToken(identifier). Here the identifier is the resource identifier, not a table identifier, so resource.toBytes() / newInputStream() will try to fetch /tables/<resource>/token and fail as soon as a token-enabled catalog is used. Please either use a non-table-token FileIO for resources or add a resource-token path instead of reusing the table-token flow.

uri,
response.size,
response.last_modified_time,
self.file_io_for_data(uri, identifier) if uri else None,

Contributor

This has the same token-path problem as the Java implementation. When data-token.enabled is set, file_io_for_data returns a RESTTokenFileIO, whose refresh_token() calls load_table_token(table_identifier). The identifier passed here is a resource identifier, so reading the returned resource will request a table token for the resource name and fail in token-enabled REST catalogs. Please use a resource-aware token flow here, or avoid the table-token FileIO for resource URIs.

@JingsongLi

Contributor

We use loadTableToken to refresh the FileIO, but this is a resource, not a table. This is indeed a problem; I need to think about it.

@gavin9402

Contributor Author

> We use loadTableToken to refresh the FileIO, but this is a resource, not a table. This is indeed a problem; I need to think about it.

Perhaps we should make the refreshToken method support identifiers of any type.
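One way to picture both the problem and this suggestion is a token-refreshing FileIO whose loader is chosen by identifier kind. The Python sketch below is purely illustrative: the `kind` field, `load_resource_token`, and `TokenFileIO` are hypothetical, and no resource-token endpoint is confirmed anywhere in this thread.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Identifier:
    database: str
    name: str
    kind: str = "table"  # hypothetical discriminator: "table" or "resource"


# Stand-in for server state: the only table the mock server knows about.
TABLES = {("db", "orders")}


def load_table_token(ident: Identifier) -> Dict[str, str]:
    """Mimics the table-token route: fails for anything that is not a table."""
    if (ident.database, ident.name) not in TABLES:
        raise LookupError(f"no table {ident.database}.{ident.name}")
    return {"scope": "table", "target": f"{ident.database}.{ident.name}"}


def load_resource_token(ident: Identifier) -> Dict[str, str]:
    """Hypothetical resource-token route; not an existing endpoint."""
    return {"scope": "resource", "target": f"{ident.database}.{ident.name}"}


Loader = Callable[[Identifier], Dict[str, str]]


class TokenFileIO:
    """Refreshes its token with a loader selected by identifier kind."""

    def __init__(self, ident: Identifier, loaders: Dict[str, Loader]) -> None:
        self._ident = ident
        self._loaders = loaders
        self.token: Optional[Dict[str, str]] = None

    def refresh_token(self) -> Dict[str, str]:
        loader = self._loaders.get(self._ident.kind)
        if loader is None:
            raise ValueError(f"no token loader for kind {self._ident.kind!r}")
        self.token = loader(self._ident)
        return self.token
```

Wiring every kind to `load_table_token` reproduces the failure the reviewer describes, while registering a separate loader per kind lets the same refresh logic serve both tables and resources.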

3 participants