opt: optimatize cpu cache start time#1319
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements lazy registration for pinned host memory and refactors the initialization sequence in the base model. However, several significant issues were identified during the review. Moving the module readiness check after CUDA graph initialization potentially violates memory pinning requirements for graph capture, which could lead to runtime errors. Additionally, the lack of error handling in the registration worker thread could cause the main thread to hang indefinitely if a failure occurs. The assumption that host and device pointers are identical is considered brittle and unsafe for certain architectures. Finally, the shift to lazy registration may negatively impact performance by executing time-consuming memory pinning sequentially rather than in parallel with other initialization tasks.
| self._init_cudagraph() | ||
| self._init_prefill_cuda_graph() | ||
| self._check_max_len_infer() | ||
| self._wait_other_modules_ready() |
There was a problem hiding this comment.
Moving _wait_other_modules_ready() after _init_cudagraph() and _init_prefill_cuda_graph() appears to violate the constraint mentioned in the removed comment (previously at line 120), which stated that this wait must happen before CUDA graph initialization to avoid "error capture".
If the CPU cache is involved in the forward pass (e.g., for KV cache offloading) and that pass is captured in a CUDA graph, the memory must be pinned before capture to ensure the graph correctly handles the memory mapping. Furthermore, since the registration is now lazy (as seen in kv_cache_utils.py), the pinning hasn't even started when the graphs are initialized, which could lead to capturing unpinned memory states or runtime errors during graph replay.
| res = cuda.cudaHostGetDevicePointer(ctypes.byref(device_ptr), host_ptr, 0) | ||
| if res != 0: | ||
| raise Exception(f"cudaHostGetDevicePointer failed with error code {res}") | ||
| assert host_ptr.value == device_ptr.value |
There was a problem hiding this comment.
There is a risk of an infinite loop in handle.wait(). If an exception occurs in this _worker thread (for example, if cudaHostRegister fails at line 275 or this assertion fails), the execution will never reach handle.tasks_finished.set(). This will cause the while loop in wait() (line 320) to block indefinitely as it waits for an event that will never be set.
Consider wrapping the worker logic in a try...finally block to ensure the event is always signaled, or modify the wait() loop to check if the thread is still alive.
| cpu_cache_tensor = self._build_tensor_view(shm_ptr=shm_ptr) | ||
| assert shm_ptr == cpu_cache_tensor.data_ptr() |
There was a problem hiding this comment.
Using shm_ptr (the host virtual address) directly to build the tensor view assumes that the host and device pointers for pinned memory are always identical. While this is common on x86_64 Linux, it is not guaranteed by the CUDA API.
Although an assertion was added in the registration worker to check this, this approach is brittle. If the pointers were to differ, this code would pass a host pointer where a device pointer is expected, leading to crashes or undefined behavior during GPU operations. It is safer to continue using the device pointer explicitly returned by cudaHostGetDevicePointer.
|
|
||
| th = threading.Thread(target=_worker, name=f"cpu_cache_register_{shm_ptr}", daemon=True) | ||
| handle.thread = th | ||
| th.start() |
There was a problem hiding this comment.
Removing th.start() makes the registration lazy, meaning the background thread only starts when wait() is explicitly called. Combined with moving the wait to the end of basemodel.__init__, this change effectively eliminates the parallelism between CPU cache registration and other heavy initialization tasks like weight loading and CUDA graph warmup.
Since cudaHostRegister can be time-consuming for large storage sizes, this sequential execution likely increases the total startup time, which seems to contradict the PR's objective of optimizing start time.
e305849 to
cea5737
Compare
No description provided.