Skip to content

opt: optimatize cpu cache start time#1319

Merged
hiworldwzj merged 3 commits into
mainfrom
opt_cpu_cache
May 26, 2026
Merged

opt: optimatize cpu cache start time#1319
hiworldwzj merged 3 commits into
mainfrom
opt_cpu_cache

Conversation

@blueswhen
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements lazy registration for pinned host memory and refactors the initialization sequence in the base model. However, several significant issues were identified during the review. Moving the module readiness check after CUDA graph initialization potentially violates memory pinning requirements for graph capture, which could lead to runtime errors. Additionally, the lack of error handling in the registration worker thread could cause the main thread to hang indefinitely if a failure occurs. The assumption that host and device pointers are identical is considered brittle and unsafe for certain architectures. Finally, the shift to lazy registration may negatively impact performance by executing time-consuming memory pinning sequentially rather than in parallel with other initialization tasks.

Comment thread lightllm/common/basemodel/basemodel.py Outdated
self._init_cudagraph()
self._init_prefill_cuda_graph()
self._check_max_len_infer()
self._wait_other_modules_ready()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Moving _wait_other_modules_ready() after _init_cudagraph() and _init_prefill_cuda_graph() appears to violate the constraint mentioned in the removed comment (previously at line 120), which stated that this wait must happen before CUDA graph initialization to avoid "error capture".

If the CPU cache is involved in the forward pass (e.g., for KV cache offloading) and that pass is captured in a CUDA graph, the memory must be pinned before capture to ensure the graph correctly handles the memory mapping. Furthermore, since the registration is now lazy (as seen in kv_cache_utils.py), the pinning hasn't even started when the graphs are initialized, which could lead to capturing unpinned memory states or runtime errors during graph replay.

Comment thread lightllm/utils/kv_cache_utils.py Outdated
res = cuda.cudaHostGetDevicePointer(ctypes.byref(device_ptr), host_ptr, 0)
if res != 0:
raise Exception(f"cudaHostGetDevicePointer failed with error code {res}")
assert host_ptr.value == device_ptr.value
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a risk of an infinite loop in handle.wait(). If an exception occurs in this _worker thread (for example, if cudaHostRegister fails at line 275 or this assertion fails), the execution will never reach handle.tasks_finished.set(). This will cause the while loop in wait() (line 320) to block indefinitely as it waits for an event that will never be set.

Consider wrapping the worker logic in a try...finally block to ensure the event is always signaled, or modify the wait() loop to check if the thread is still alive.

Comment thread lightllm/common/cpu_cache/creator.py Outdated
Comment on lines +35 to +36
cpu_cache_tensor = self._build_tensor_view(shm_ptr=shm_ptr)
assert shm_ptr == cpu_cache_tensor.data_ptr()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using shm_ptr (the host virtual address) directly to build the tensor view assumes that the host and device pointers for pinned memory are always identical. While this is common on x86_64 Linux, it is not guaranteed by the CUDA API.

Although an assertion was added in the registration worker to check this, this approach is brittle. If the pointers were to differ, this code would pass a host pointer where a device pointer is expected, leading to crashes or undefined behavior during GPU operations. It is safer to continue using the device pointer explicitly returned by cudaHostGetDevicePointer.


th = threading.Thread(target=_worker, name=f"cpu_cache_register_{shm_ptr}", daemon=True)
handle.thread = th
th.start()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Removing th.start() makes the registration lazy, meaning the background thread only starts when wait() is explicitly called. Combined with moving the wait to the end of basemodel.__init__, this change effectively eliminates the parallelism between CPU cache registration and other heavy initialization tasks like weight loading and CUDA graph warmup.

Since cudaHostRegister can be time-consuming for large storage sizes, this sequential execution likely increases the total startup time, which seems to contradict the PR's objective of optimizing start time.

@blueswhen blueswhen force-pushed the opt_cpu_cache branch 3 times, most recently from e305849 to cea5737 Compare May 26, 2026 05:32
@hiworldwzj hiworldwzj merged commit 5adbf00 into main May 26, 2026
1 check passed
@hiworldwzj hiworldwzj deleted the opt_cpu_cache branch May 26, 2026 08:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants