Skip to content

Fix race condition on files in libgit2#4229

Open
rasapala wants to merge 15 commits into
mainfrom
fix_race_condition
Open

Fix race condition on files in libgit2#4229
rasapala wants to merge 15 commits into
mainfrom
fix_race_condition

Conversation

@rasapala
Copy link
Copy Markdown
Collaborator

@rasapala rasapala commented May 21, 2026

🛠 Summary

Fixes a race condition in the LFS shutdown path and hardens the interrupt-and-resume test coverage for the HF model pull module.

Production fix:

Adds clearStaleErrorFile() in libgit2.cpp to remove stale lfs_error.txt artifacts left by a previous interrupted pull. Without this, a false error file caused subsequent resume attempts to misidentify the repository state and fail.
Adds a shutdown check inside the LFS write callback so that an in-progress LFS download honours a cancellation request promptly rather than completing the current chunk before noticing the shutdown.

Test improvements:

Refactors HfPull and HfPullCache fixtures to centralize repeated model path setup into shared members, eliminating per-test duplication.
Removes the deterministic fallback helper that masked the real SIGKILL / SIGINT code paths; both ResumeTerminate and ResumeCtrlC now require observing genuine in-progress LFS activity before signalling.
Replaces fragile fixed-sleep interruption gating with a polling loop that checks for main plus LFS artifacts or an in-flight model file, making the tests reliable across both fast and slow network conditions.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.
    ``

Copilot AI review requested due to automatic review settings May 21, 2026 09:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the vendored libgit2 LFS patch to mitigate rename-related races and improve diagnostics when finalizing downloaded LFS objects.

Changes:

  • Removes the pre-unlink of the destination file before renaming the temp file into place.
  • Adds helpers to detect whether parent directories exist for improved rename(...)=ENOENT diagnostics.
  • Captures and logs errno more robustly during rename failures (and logs Win32 error code).
Comments suppressed due to low confidence (1)

third_party/libgit2/lfs.patch:1

  • GetLastError() is logged after calls to p_access() / p_stat() (via lfs_parent_dir_exists). Those calls can overwrite the thread’s last-error value on Windows, so the logged GetLastError is likely unrelated to the p_rename failure. Capture DWORD winerr = GetLastError(); immediately after the failed p_rename (before any other syscalls), and log the captured value later.
diff --git a/.gitignore b/.gitignore

Comment thread third_party/libgit2/lfs.patch
Comment thread third_party/libgit2/lfs.patch Outdated
Comment thread third_party/libgit2/lfs.patch
Comment thread src/pull_module/libgit2.cpp Outdated
Comment thread third_party/libgit2/lfs.patch
Comment thread third_party/libgit2/lfs.patch Outdated
Comment thread third_party/libgit2/lfs.patch Outdated
Comment thread third_party/libgit2/lfs.patch
rasapala added 10 commits May 22, 2026 11:54
- ResumeShutdown, ResumeTerminate, ResumeCtrlC: Only wait for .lfs_part files (not pointer files)
- Pointer files exist BEFORE download starts (part of git repo), causing shutdown to arrive AFTER completion
- Updated ResumeTerminate child process to detect ANY .lfs_part file, not just the large model's

This ensures shutdown/cancel signals arrive DURING download, not after completion,
allowing proper LFS finalization with .lfs_part artifacts preserved for resume.
Comment thread src/test/pull_hf_model_test.cpp Outdated
ASSERT_FALSE(std::filesystem::exists(gitDir));

#ifdef _WIN32
// On Windows, gtest stdout capture can conflict with SPDLOG stdout sink.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use log file with ovms to check, unless sth we expect does not appear in spdlong but only in stdout?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logger configuration is process-global and initialized earlier in the test
binary, so a per-test --log_path does not reliably capture this warning.

Comment thread src/test/pull_hf_model_test.cpp Outdated
#endif

EXPECT_TRUE(observedPartialDownload);
SPDLOG_INFO("ResumeCtrlC test state: observedPartialDownload={}, interruptionSent={}",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEBUG?

Comment thread src/test/pull_hf_model_test.cpp Outdated
Comment on lines 428 to 430
#ifdef _WIN32
SKIP_AND_EXIT_IF_NOT_RUNNING_UNSTABLE();
#endif
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left skipped?

Comment thread src/test/pull_hf_model_test.cpp Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are those env variables? Why are they needed?
Also couldn't we have static utility funcitons that would do proper sending signals in tests? Now test logic is containing both setup how the initial state is prepared, then details how to properly send signals depending on OS.

Its not easily readable at first glance what is actual test logic, what is preparation, what is mechanism to simulate ctrl c etc.

Also I see magic constants:
ASSERT_EQ(std::filesystem::file_size(modelPath), 52417240);

have const size_t MAIN_MODEL_SIZE
TOKENIZER_MODEL_SIZE

Comment thread src/test/pull_hf_model_test.cpp Outdated
std::error_code ec;
const bool hasMainRef = std::filesystem::exists(mainRefPath, ec);
ec.clear();
const bool modelExists = std::filesystem::exists(modelPath, ec);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no ec check after this call?

Comment thread src/test/pull_hf_model_test.cpp Outdated
// Wait until pull is in progress and repository is already resumable.
std::error_code ec;
const bool hasMainRef = std::filesystem::exists(mainRefPath, ec);
ec.clear();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error code cleared without checking? why is it even cleared? won't the last call before check from line 841 override it anyway before check?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread src/pull_module/libgit2.cpp Outdated
Comment thread src/test/pull_hf_model_test.cpp Outdated
Comment on lines +846 to +849
auto lfsCandidates = ovms::libgit2::findLfsLikeFiles(downloadPath, true);
auto hasOpenvinoModelPointer = std::find_if(lfsCandidates.begin(), lfsCandidates.end(),
[](const std::filesystem::path& p) { return p.filename() == "openvino_model.bin"; }) != lfsCandidates.end();
const std::string partPath = ovms::FileSystem::appendSlash(basePath) + "openvino_model.binlfs_part";
const bool hasPartFile = std::filesystem::exists(partPath);
if (hasOpenvinoModelPointer || hasPartFile) {
const bool hasLfsArtifacts = !lfsCandidates.empty();
const bool modelInFlight = modelExists && (modelSize > 0) && (modelSize < 52417240);
if (hasMainRef && (hasLfsArtifacts || modelInFlight)) {
Comment thread src/test/pull_hf_model_test.cpp Outdated
Comment on lines +1026 to +1029
auto lfsCandidates = ovms::libgit2::findLfsLikeFiles(downloadPath, true);
const bool hasOpenvinoModelPointer = std::find_if(lfsCandidates.begin(), lfsCandidates.end(),
[](const std::filesystem::path& p) { return p.filename() == "openvino_model.bin"; }) != lfsCandidates.end();
const std::string partPath = ovms::FileSystem::appendSlash(basePath) + "openvino_model.binlfs_part";
const bool hasPartFile = std::filesystem::exists(partPath);
if (hasOpenvinoModelPointer || hasPartFile) {
const bool hasLfsArtifacts = !lfsCandidates.empty();
const bool modelInFlight = modelExists && (modelSize > 0) && (modelSize < 52417240);
if (hasMainRef && (hasLfsArtifacts || modelInFlight)) {
Comment thread src/test/pull_hf_model_test.cpp Outdated
const bool hasPartFile = std::filesystem::exists(partPath);
if (hasOpenvinoModelPointer || hasPartFile) {
const bool hasLfsArtifacts = !lfsCandidates.empty();
const bool modelInFlight = modelExists && (modelSize > 0) && (modelSize < 52417240);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hardcoded limits - why this upper bound has been chosen?
please move to const

Comment thread src/test/pull_hf_model_test.cpp Outdated
ASSERT_TRUE(TerminateProcess(pi.hProcess, 1));
ASSERT_EQ(WaitForSingleObject(pi.hProcess, 10000), WAIT_OBJECT_0);
interruptionSent = true;
ASSERT_EQ(WaitForSingleObject(pi.hProcess, 120000), WAIT_OBJECT_0);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move value to const variable so we know the context

Comment thread src/test/pull_hf_model_test.cpp Outdated
Comment on lines +1014 to +1025
std::error_code ec;
const bool hasMainRef = std::filesystem::exists(mainRefPath, ec);
ec.clear();
const bool modelExists = std::filesystem::exists(modelPath, ec);
std::uintmax_t modelSize = 0;
if (modelExists) {
ec.clear();
modelSize = std::filesystem::file_size(modelPath, ec);
if (ec) {
modelSize = 0;
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate of part of logic above (same comments apply here as well).
Could we have it in a function?

Comment thread src/test/pull_hf_model_test.cpp Outdated
const bool hasPartFile = std::filesystem::exists(partPath);
if (hasOpenvinoModelPointer || hasPartFile) {
const bool hasLfsArtifacts = !lfsCandidates.empty();
const bool modelInFlight = modelExists && (modelSize > 0) && (modelSize < 52417240);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate use of hardcoded values

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants