Add Cloud Native AI Scheduling Challenges Whitepaper by rajaskakodkar · Pull Request #2164 · cncf/toc

rajaskakodkar · 2026-05-15T16:51:15Z

Adds the Cloud Native Scheduling Challenges Whitepaper

Signed-off-by: Rajas Kakodkar <rajaskakodkar16@gmail.com>

andreyvelich

Thanks for this effort @rajaskakodkar!
Overall, this looks great. I left a few thoughts.

andreyvelich · 2026-05-19T13:16:09Z

+* Data transformation: Normalizing, encoding categorical variables, feature scaling  
+* Data splitting: Dividing data into training, validation, and test sets
+
+From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline.


Maybe we can also mention unstructured data?

Suggested change

From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline.

From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Additionally, GPUs work well for unstructured data like images because image processing involves massive parallel math operations.

Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline.
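To make the partition-level parallelism concrete, here is a minimal sketch rather than anything taken from the whitepaper. The cleaning rules, `clean_partition`, and `clean_dataset` are hypothetical names chosen for illustration, and a thread pool stands in for what would be separate pods or Job completions on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_partition(rows):
    """Drop rows with missing values and normalize strings (illustrative rules only)."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # skip incomplete rows
        cleaned.append(
            {k: v.strip().lower() if isinstance(v, str) else v for k, v in row.items()}
        )
    return cleaned


def clean_dataset(partitions, max_workers=4):
    # Partitions share no state, so each one can be cleaned independently.
    # pool.map() returns results in the same order as the input partitions.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean_partition, partitions))


partitions = [
    [{"label": " Cat ", "size": 3}, {"label": None, "size": 1}],
    [{"label": "DOG", "size": 5}],
]
result = clean_dataset(partitions)
print(result)
```

Because no partition depends on another, the scheduler is free to place each unit of work wherever CPU and I/O capacity is available.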

andreyvelich · 2026-05-19T13:17:09Z

+
+From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline.
+
+Kubernetes resources like Jobs and CronJobs handle these workloads reasonably well. Workflow orchestrators (Airflow, Argo Workflows, Flyte) coordinate multi-step pipelines.


What about Spark here?

Suggested change

Kubernetes resources like Jobs and CronJobs handle these workloads reasonably well. Workflow orchestrators (Airflow, Argo Workflows, Flyte) coordinate multi-step pipelines.

Kubernetes resources like Jobs, CronJobs, and SparkApplications handle these workloads reasonably well. Workflow orchestrators (Airflow, Argo Workflows, Flyte) coordinate multi-step pipelines.
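For readers less familiar with the built-in resources, a scheduled preparation job could look roughly like the sketch below. It builds a `batch/v1` CronJob manifest as a plain dictionary; the job name, image, schedule, and resource requests are placeholder assumptions, not values from the whitepaper.

```python
import json


def preparation_cronjob(name, image, schedule):
    """Build a batch/v1 CronJob manifest for a CPU-bound preparation step."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # standard cron syntax
            "concurrencyPolicy": "Forbid",  # do not start a run while one is active
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [
                                {
                                    "name": "prepare",
                                    "image": image,
                                    # CPU and memory only: no GPU is requested.
                                    "resources": {
                                        "requests": {"cpu": "4", "memory": "8Gi"}
                                    },
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


# Hypothetical name, image, and schedule, used only for illustration.
manifest = preparation_cronjob(
    "nightly-data-prep", "registry.example.com/data-prep:latest", "0 2 * * *"
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.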

andreyvelich · 2026-05-19T13:21:19Z

+Model development has two distinct activities that are often combined:
+
+* **Feature engineering** transforms prepared data into input features the model can use. This involves creating new variables, encoding categorical data, and selecting which features to include. Feature engineering is computationally similar to data preparation—CPU and I/O bound, parallelizable, often triggered by new data.  
+* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.


Would it make sense to add a topic around HPO (hyperparameter optimization)?

Suggested change

* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.

* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.

* **Hyperparameter tuning** optimizes how the model learns rather than the structure of the model itself. This includes adjusting parameters such as learning rate, batch size, optimizer choice, number of epochs, and dropout rates. Unlike architecture design, hyperparameter tuning is compute-intensive because it requires repeatedly training and evaluating many model variants. These tuning jobs are highly parallelizable and are commonly distributed across GPUs or clusters.
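A small sketch may help show why tuning parallelizes so well: every trial is an independent training run. The example below is a toy grid search under stated assumptions. `run_trial` uses a made-up loss surface instead of real training, and the thread pool stands in for separate trial pods.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_trial(params):
    """Stand-in for one training run; returns (validation_loss, params)."""
    lr = params["learning_rate"]
    batch_size = params["batch_size"]
    # Toy loss surface with its minimum at lr=0.01, batch_size=64 (an assumption).
    loss = (lr - 0.01) ** 2 * 1e4 + (batch_size - 64) ** 2 / 1e4
    return loss, params


def grid_search(search_space, max_parallel_trials=4):
    names = list(search_space)
    # One trial per combination of hyperparameter values.
    trials = [dict(zip(names, values)) for values in product(*search_space.values())]
    # Trials share no state, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_parallel_trials) as pool:
        results = list(pool.map(run_trial, trials))
    best = min(results, key=lambda result: result[0])
    return best, len(trials)


search_space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
(best_loss, best_params), num_trials = grid_search(search_space)
print(num_trials, best_params, best_loss)
```

The trial count is the product of the value counts per hyperparameter, which is why tuning jobs can quickly demand far more compute than a single training run.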

andreyvelich · 2026-05-19T17:38:18Z

+* Tightly coupled: All workers must run simultaneously  
+* Sensitive to topology: Communication speed depends on GPU interconnects
+
+The default Kubernetes scheduler cannot handle these requirements. It will start pods as resources become available, potentially leaving a job stuck with partial resources indefinitely.


With the efforts around WAS, I wouldn't mention this. Maybe we can say:
cc @helayoty @kannon92 @mm4tt

Suggested change

The default Kubernetes scheduler cannot handle these requirements. It will start pods as resources become available, potentially leaving a job stuck with partial resources indefinitely.

These characteristics require additional Kubernetes scheduler capabilities to support efficient all-or-nothing placement and topology-aware scheduling.
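As an illustration of what all-or-nothing placement means, the sketch below either finds a node for every worker or places none of them. It is a simplified first-fit model, not how any particular scheduler implements gang scheduling; `gang_place` and the node names are hypothetical, and topology is ignored.

```python
def gang_place(free_gpus_per_node, workers, gpus_per_worker):
    """Return {worker_index: node} if every worker fits, else None (nothing placed)."""
    # Work on a copy so a failed attempt leaves the cluster view untouched.
    remaining = dict(free_gpus_per_node)
    placement = {}
    for worker in range(workers):
        # First-fit: take the first node with enough free GPUs.
        node = next(
            (name for name, free in remaining.items() if free >= gpus_per_worker),
            None,
        )
        if node is None:
            return None  # one worker cannot be placed, so the whole gang is rejected
        remaining[node] -= gpus_per_worker
        placement[worker] = node
    return placement


cluster = {"node-a": 4, "node-b": 2}
fits = gang_place(cluster, workers=3, gpus_per_worker=2)
too_big = gang_place(cluster, workers=4, gpus_per_worker=2)
print(fits)
print(too_big)
```

The key property is that a rejected gang holds no GPUs, so it cannot block other jobs while waiting for capacity.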

andreyvelich · 2026-05-19T17:47:20Z

+
+* **Long-running jobs.** A training run is not a request that completes in milliseconds. It is a job that runs for days or weeks. Interrupting it wastes all the work done since the last checkpoint. The scheduler must account for job duration, not just instantaneous resource needs.  
+* **Massive resource consumption.** Training large models requires hundreds or thousands of GPUs running simultaneously. A single job can consume the majority of a cluster's capacity for extended periods. This is not "scale horizontally by adding pods"—it is "reserve a large fraction of the cluster for one workload."  
+* **Tightly coupled distribution.** Distributed training uses collective communication patterns where all workers must participate. You cannot start with 7 of 8 workers and add the 8th later. You cannot lose one worker and continue with the remaining 7\. Either all workers are running, or the job cannot proceed. This is fundamentally different from web services, where losing one replica just shifts load to the others.  


Suggested change

* **Tightly coupled distribution.** Distributed training uses collective communication patterns where all workers must participate. You cannot start with 7 of 8 workers and add the 8th later. You cannot lose one worker and continue with the remaining 7\. Either all workers are running, or the job cannot proceed. This is fundamentally different from web services, where losing one replica just shifts load to the others.

* **Tightly coupled distribution.** Distributed training uses collective communication patterns where all workers must participate. You cannot start with 7 of 8 workers and add the 8th later. You cannot lose one worker and continue with the remaining 7. Either all workers are running, or the job cannot proceed. This is fundamentally different from web services, where losing one replica just shifts load to the others.
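The "7 of 8 workers" point can be modeled with a barrier, which behaves like a very reduced collective operation: no participant proceeds until all have arrived. This is only an analogy under stated assumptions. `run_step` is a hypothetical helper, threads stand in for workers, and real collectives (for example, all-reduce) exchange data rather than just synchronizing.

```python
import threading


def run_step(world_size, workers_present, timeout_s=0.5):
    """Return True if the collective step completes, False if any worker is missing."""
    barrier = threading.Barrier(world_size)
    outcomes = []
    lock = threading.Lock()

    def worker():
        try:
            # Blocks until world_size workers have arrived, or the timeout expires.
            barrier.wait(timeout=timeout_s)
            ok = True
        except threading.BrokenBarrierError:
            # A timeout breaks the barrier for every waiting worker.
            ok = False
        with lock:
            outcomes.append(ok)

    threads = [threading.Thread(target=worker) for _ in range(workers_present)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(outcomes) == world_size and all(outcomes)


full_gang = run_step(world_size=8, workers_present=8)
missing_one = run_step(world_size=8, workers_present=7)
print(full_gang, missing_one)
```

With one worker missing, the other seven wait until the timeout and then all fail together, which mirrors why a partially started training job makes no progress.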

andreyvelich · 2026-05-19T19:50:24Z

+
+## ML Platform Tools
+
+These tools provide higher-level abstractions for ML workflows:


Suggested change

These tools provide higher-level abstractions for ML workflows:

These tools provide higher-level abstractions for AI workloads:

andreyvelich · 2026-05-19T19:51:02Z

+These tools provide higher-level abstractions for ML workflows:
+
+* **Kubeflow**  
+  * **Kubeflow Trainer** supports distributed training across frameworks (PyTorch, TensorFlow, PaddlePaddle, XGBoost). Provides job abstractions that handle worker coordination, including gang scheduling requirements.  


Ref: https://github.com/kubeflow/trainer#overview

Suggested change

* **Kubeflow Trainer** supports distributed training across frameworks (PyTorch, TensorFlow, PaddlePaddle, XGBoost). Provides job abstractions that handle worker coordination, including gang scheduling requirements.

* **Kubeflow Trainer** is a Kubernetes-native distributed AI platform for scalable LLM fine-tuning and training of AI models across a wide range of frameworks, including PyTorch, MLX, HuggingFace, DeepSpeed, JAX, XGBoost, and more. Provides job abstractions that handle worker coordination, including gang scheduling requirements and HPC workloads orchestration such as MPI and Flux.

andreyvelich · 2026-05-19T19:56:42Z

+| GPU Sharing | DRA (GA, K8s 1.34+) | KAI | HAMi, KubeRay, Volcano | Both | MIG requires DRA or vendor tools |
+| Scalability | Cluster Autoscaler, Karpenter | Armada, KAI, Kueue, Slinky, Volcano | interLink | Both | Large-scale scheduling is challenging |
+| I/O Bottlenecks | PersistentVolumes | \- | Fluid | Both | Storage and caching solutions |
+| Fault Tolerance | \- | Slinky,  | Kubeflow (elastic training) | Training | Framework-dependent |


andreyvelich · 2026-05-19T20:02:44Z

+| Preemption | PriorityClass (pod-level) | KAI, Kueue, Slinky, Volcano | \- | Both | Job-level preemption needs external tools |
+| Priority Scheduling | PriorityClass | All batch schedulers | \- | Both | Job-level priority in batch schedulers |
+| Reservation & Backfill | \- | Slinky, Volcano, YuniKorn | \- | Training | Advanced feature in some schedulers |
+| Topology Awareness (Node) | Topology Manager (NUMA), DRA CPU Driver (CPU topology) | KAI, Kueue, Slinky, Volcano | \- | Both | GPU interconnect awareness varies |
+| Topology Awareness (Cluster) | Topology Spread Constraints, DRANET (network DRA Driver) (limited) | KAI, Kueue, Slinky, Volcano | \- | Both | Network topology awareness is emerging |


andreyvelich · 2026-05-19T20:03:27Z

+**For ML engineers working with existing infrastructure:**
+
+1. Understand what scheduling tools are available in your cluster.  
+2. Use the appropriate job abstractions (PyTorchJob, MPIJob, etc.) rather than raw pods.  


Suggested change

2. Use the appropriate job abstractions (PyTorchJob, MPIJob, etc.) rather than raw pods.

2. Use the appropriate job abstractions (TrainJob, MPIJob, etc.) rather than raw pods.

Add Cloud Native AI Scheduling Challenges Whitepaper

a84e737

Signed-off-by: Rajas Kakodkar <rajaskakodkar16@gmail.com>

rajaskakodkar requested review from a team as code owners May 15, 2026 16:51

andreyvelich reviewed May 19, 2026

rajaskakodkar wants to merge 1 commit into cncf:main from rajaskakodkar:scheduling-whitepaper


3 participants


Suggested change

| Fault Tolerance | \- | Slinky,  | Kubeflow (elastic training) | Training | Framework-dependent |

| Fault Tolerance | \- | Slinky,  | Kubeflow Trainer | Training | Framework-dependent |
