Fix race condition on files in libgit2#4229
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates the vendored libgit2 LFS patch to mitigate rename-related races and improve diagnostics when finalizing downloaded LFS objects.
Changes:
- Removes the pre-
unlinkof the destination file before renaming the temp file into place. - Adds helpers to detect whether parent directories exist for improved
rename(...)=ENOENTdiagnostics. - Captures and logs
errnomore robustly during rename failures (and logs Win32 error code).
Comments suppressed due to low confidence (1)
third_party/libgit2/lfs.patch:1
GetLastError()is logged after calls top_access()/p_stat()(vialfs_parent_dir_exists). Those calls can overwrite the thread’s last-error value on Windows, so the loggedGetLastErroris likely unrelated to thep_renamefailure. CaptureDWORD winerr = GetLastError();immediately after the failedp_rename(before any other syscalls), and log the captured value later.
diff --git a/.gitignore b/.gitignore
- ResumeShutdown, ResumeTerminate, ResumeCtrlC: Only wait for .lfs_part files (not pointer files) - Pointer files exist BEFORE download starts (part of git repo), causing shutdown to arrive AFTER completion - Updated ResumeTerminate child process to detect ANY .lfs_part file, not just the large model's This ensures shutdown/cancel signals arrive DURING download, not after completion, allowing proper LFS finalization with .lfs_part artifacts preserved for resume.
| ASSERT_FALSE(std::filesystem::exists(gitDir)); | ||
|
|
||
| #ifdef _WIN32 | ||
| // On Windows, gtest stdout capture can conflict with SPDLOG stdout sink. |
There was a problem hiding this comment.
We could use log file with ovms to check, unless sth we expect does not appear in spdlong but only in stdout?
There was a problem hiding this comment.
Logger configuration is process-global and initialized earlier in the test
binary, so a per-test --log_path does not reliably capture this warning.
| #endif | ||
|
|
||
| EXPECT_TRUE(observedPartialDownload); | ||
| SPDLOG_INFO("ResumeCtrlC test state: observedPartialDownload={}, interruptionSent={}", |
| #ifdef _WIN32 | ||
| SKIP_AND_EXIT_IF_NOT_RUNNING_UNSTABLE(); | ||
| #endif |
There was a problem hiding this comment.
What are those env variables? Why are they needed?
Also couldn't we have static utility funcitons that would do proper sending signals in tests? Now test logic is containing both setup how the initial state is prepared, then details how to properly send signals depending on OS.
Its not easily readable at first glance what is actual test logic, what is preparation, what is mechanism to simulate ctrl c etc.
Also I see magic constants:
ASSERT_EQ(std::filesystem::file_size(modelPath), 52417240);
have const size_t MAIN_MODEL_SIZE
TOKENIZER_MODEL_SIZE
| std::error_code ec; | ||
| const bool hasMainRef = std::filesystem::exists(mainRefPath, ec); | ||
| ec.clear(); | ||
| const bool modelExists = std::filesystem::exists(modelPath, ec); |
There was a problem hiding this comment.
no ec check after this call?
| // Wait until pull is in progress and repository is already resumable. | ||
| std::error_code ec; | ||
| const bool hasMainRef = std::filesystem::exists(mainRefPath, ec); | ||
| ec.clear(); |
There was a problem hiding this comment.
error code cleared without checking? why is it even cleared? won't the last call before check from line 841 override it anyway before check?
| auto lfsCandidates = ovms::libgit2::findLfsLikeFiles(downloadPath, true); | ||
| auto hasOpenvinoModelPointer = std::find_if(lfsCandidates.begin(), lfsCandidates.end(), | ||
| [](const std::filesystem::path& p) { return p.filename() == "openvino_model.bin"; }) != lfsCandidates.end(); | ||
| const std::string partPath = ovms::FileSystem::appendSlash(basePath) + "openvino_model.binlfs_part"; | ||
| const bool hasPartFile = std::filesystem::exists(partPath); | ||
| if (hasOpenvinoModelPointer || hasPartFile) { | ||
| const bool hasLfsArtifacts = !lfsCandidates.empty(); | ||
| const bool modelInFlight = modelExists && (modelSize > 0) && (modelSize < 52417240); | ||
| if (hasMainRef && (hasLfsArtifacts || modelInFlight)) { |
| auto lfsCandidates = ovms::libgit2::findLfsLikeFiles(downloadPath, true); | ||
| const bool hasOpenvinoModelPointer = std::find_if(lfsCandidates.begin(), lfsCandidates.end(), | ||
| [](const std::filesystem::path& p) { return p.filename() == "openvino_model.bin"; }) != lfsCandidates.end(); | ||
| const std::string partPath = ovms::FileSystem::appendSlash(basePath) + "openvino_model.binlfs_part"; | ||
| const bool hasPartFile = std::filesystem::exists(partPath); | ||
| if (hasOpenvinoModelPointer || hasPartFile) { | ||
| const bool hasLfsArtifacts = !lfsCandidates.empty(); | ||
| const bool modelInFlight = modelExists && (modelSize > 0) && (modelSize < 52417240); | ||
| if (hasMainRef && (hasLfsArtifacts || modelInFlight)) { |
| const bool hasPartFile = std::filesystem::exists(partPath); | ||
| if (hasOpenvinoModelPointer || hasPartFile) { | ||
| const bool hasLfsArtifacts = !lfsCandidates.empty(); | ||
| const bool modelInFlight = modelExists && (modelSize > 0) && (modelSize < 52417240); |
There was a problem hiding this comment.
hardcoded limits - why this upper bound has been chosen?
please move to const
| ASSERT_TRUE(TerminateProcess(pi.hProcess, 1)); | ||
| ASSERT_EQ(WaitForSingleObject(pi.hProcess, 10000), WAIT_OBJECT_0); | ||
| interruptionSent = true; | ||
| ASSERT_EQ(WaitForSingleObject(pi.hProcess, 120000), WAIT_OBJECT_0); |
There was a problem hiding this comment.
move value to const variable so we know the context
| std::error_code ec; | ||
| const bool hasMainRef = std::filesystem::exists(mainRefPath, ec); | ||
| ec.clear(); | ||
| const bool modelExists = std::filesystem::exists(modelPath, ec); | ||
| std::uintmax_t modelSize = 0; | ||
| if (modelExists) { | ||
| ec.clear(); | ||
| modelSize = std::filesystem::file_size(modelPath, ec); | ||
| if (ec) { | ||
| modelSize = 0; | ||
| } | ||
| } |
There was a problem hiding this comment.
Duplicate of part of logic above (same comments apply here as well).
Could we have it in a function?
| const bool hasPartFile = std::filesystem::exists(partPath); | ||
| if (hasOpenvinoModelPointer || hasPartFile) { | ||
| const bool hasLfsArtifacts = !lfsCandidates.empty(); | ||
| const bool modelInFlight = modelExists && (modelSize > 0) && (modelSize < 52417240); |
There was a problem hiding this comment.
duplicate use of hardcoded values
🛠 Summary
Fixes a race condition in the LFS shutdown path and hardens the interrupt-and-resume test coverage for the HF model pull module.
Production fix:
Adds clearStaleErrorFile() in libgit2.cpp to remove stale lfs_error.txt artifacts left by a previous interrupted pull. Without this, a false error file caused subsequent resume attempts to misidentify the repository state and fail.
Adds a shutdown check inside the LFS write callback so that an in-progress LFS download honours a cancellation request promptly rather than completing the current chunk before noticing the shutdown.
Test improvements:
Refactors HfPull and HfPullCache fixtures to centralize repeated model path setup into shared members, eliminating per-test duplication.
Removes the deterministic fallback helper that masked the real SIGKILL / SIGINT code paths; both ResumeTerminate and ResumeCtrlC now require observing genuine in-progress LFS activity before signalling.
Replaces fragile fixed-sleep interruption gating with a polling loop that checks for main plus LFS artifacts or an in-flight model file, making the tests reliable across both fast and slow network conditions.
🧪 Checklist
``