Download dataset files atomically to avoid corrupt cache (#4097)#4142
Download dataset files atomically to avoid corrupt cache (#4097)#4142gaoflow wants to merge 8 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4142 +/- ##
==========================================
+ Coverage 79.61% 79.64% +0.03%
==========================================
Files 120 120
Lines 12786 12791 +5
==========================================
+ Hits 10180 10188 +8
+ Misses 2606 2603 -3
Flags with carried forward coverage won't be shown. Click here to find out more.
|
|
Thanks! Two comments:
|
`_download` wrote directly to the destination cache path, so a concurrent reader (e.g. a parallel pytest worker sharing `settings.datasetdir`) could find the file present via `path.is_file()` and read it while it was still being written, getting a corrupted/partial file. Download to a temporary file in the same directory and atomically rename it into place once complete, so the destination only ever appears fully written. On failure the temporary file is removed and the destination is left untouched (it may belong to another process that finished first).
53dc24b to
a367e0e
Compare
|
Thanks for the review!
Disclosure: these changes were made by an AI coding agent (Claude) running under my account and direction. |
|
No pre-commit does not pass. Please use your brain in addition to an agent. I have infinite tokens and can run agents myself. |
|
pre-commit passes now — that was a ruff |
Fixes #4097.
Problem
_download(used by_check_datafile_present_and_downloadfor every cached dataset) writes directly to the destination cache path:As @flying-sheep diagnosed in #4097, this races under parallel execution (e.g.
pytest-xdistworkers sharingsettings.datasetdir):pathand writing into it;path.is_file(), sees the partially-written file, and reads it → corrupted file / error.Fix
Download to a
NamedTemporaryFilein the same directory and atomicallyPath.replace()it into place once the download is complete (the second option suggested in the issue). A rename on the same filesystem is atomic, sopathonly ever becomes visible fully written; concurrent downloads still work, the loser's temp file just replaces in turn.On failure the temporary file is removed, and
pathitself is left untouched — previously the error path unconditionallyunlink-edpath, which under the race could delete a sibling process's already-completed download.Verification
Added
test_download_atomicintests/test_datasets.py: it mocksurlopenand asserts that the destination path does not exist while bytes are still being streamed, that the final content is correct, and that no temporary file is left behind. The test fails on the current code (destination appeared before the download finished) and passes with this change. I also verified that both an immediate failure (e.g.HTTPError) and a mid-stream failure propagate the exception and leave neither a destination nor a leftover temporary file.