fix(scoring): re-enqueue scoring after commit to avoid stuck SCORING …#2420

Open
AybH26 wants to merge 1 commit into codalab:develop from AybH26:fix/scoring-stuck-on-broker-error

AybH26 commented Jun 16, 2026

fix(scoring): re-enqueue scoring after commit to avoid stuck SCORING rows (#2419)

Closes #2419

Issue

Submissions can stay in Scoring forever. The compute worker PATCHes the submission to status=SCORING to trigger the scoring step, but if the broker (RabbitMQ) has a brief hiccup at that exact moment, the status row commits while the scoring task is never published. The submission then sits in Scoring indefinitely: the 24h cleanup (submission_status_cleanup, src/apps/competitions/tasks.py:797-806) only rescues RUNNING rows, so there is no recovery path today. Participants saw this as "stuck in Scoring for many hours" / "submitted ~12h ago, still scoring".

Root cause

In SubmissionCreationSerializer.update() (src/apps/api/serializers/submissions.py) the scoring re-enqueue was published synchronously, inside the request transaction, with no retry and no transaction.on_commit guard:

if validated_data.get("status") == Submission.SCORING:
    # Start scoring because we're "SCORING" status now from compute worker
    from competitions.tasks import run_submission
    run_submission(submission.pk, tasks=[submission.task], is_scoring=True)

Two problems compound:

  • run_submission is called before super().update() commits, so a site-worker can pick up (and start) the scoring task while the row does not yet reflect SCORING.
  • Any broker error (ConnectionResetError, OperationalError, AMQP timeout) bubbles up from inside the PATCH handler. The worker side then swallows it in _update_status (literal comment in compute_worker/compute_worker.py:632-643: "Always catch exception and never raise error"), so the row remains in Scoring with no task ever queued.
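
The ordering problem in the first bullet can be illustrated with a toy model. This is only a sketch: `committed` stands in for the database, `queue` for the broker, and both function names are hypothetical and do not exist in the codebase.

```python
# Toy model only: `committed` stands in for the database, `queue` for the broker.
committed = {"status": "Running"}
queue = []


def worker_peek():
    """Return the status a site-worker would read if it picked the task up now."""
    return committed["status"]


def patch_publish_inside_transaction():
    """Old ordering: publish first, commit afterwards."""
    pending = {"status": "Scoring"}   # written, but not yet committed
    queue.append("score")             # task is visible to workers immediately
    seen_by_worker = worker_peek()    # worker races ahead of the commit
    committed.update(pending)         # commit happens afterwards
    return seen_by_worker


def patch_publish_after_commit():
    """New ordering: commit first, publish afterwards."""
    pending = {"status": "Scoring"}
    committed.update(pending)         # commit first
    queue.append("score")             # then publish
    return worker_peek()
```

In this model the worker reads the stale status with the first ordering and the SCORING status with the second, which is the guarantee the deferred enqueue is meant to provide.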

Fix

Defer the scoring enqueue until after the DB transaction commits, and explicitly mark the submission Failed (with a clear status_details) if the publish itself fails, so no submission stays in a non-terminal limbo. update() is also wrapped in @transaction.atomic to make the commit boundary explicit.

     if validated_data.get("status") == Submission.SCORING:
-        # Start scoring because we're "SCORING" status now from compute worker
+        # Re-enqueue scoring AFTER the new status is committed: otherwise the
+        # site-worker may pick the task up before the row reflects SCORING,
+        # and a broker error here would leave the row stuck in SCORING forever.
         from competitions.tasks import run_submission
-        run_submission(submission.pk, tasks=[submission.task], is_scoring=True)
+        submission_pk = submission.pk
+        scoring_task = submission.task
+
+        def _enqueue_scoring():
+            try:
+                run_submission(submission_pk, tasks=[scoring_task], is_scoring=True)
+            except Exception:
+                logger.exception(
+                    "Failed to re-enqueue scoring for submission %s; marking Failed",
+                    submission_pk,
+                )
+                Submission.objects.filter(
+                    pk=submission_pk, status=Submission.SCORING,
+                ).update(
+                    status=Submission.FAILED,
+                    status_details="Broker unavailable when re-enqueuing scoring task",
+                )
+
+        transaction.on_commit(_enqueue_scoring)
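
The behaviour the diff aims for can be sketched without Django. Everything below is an assumption made for illustration: `FakeTransaction` is a hypothetical stand-in for an atomic block that runs its on_commit hooks only when the block exits cleanly, `row` is a plain dict rather than a Submission, and `publish` replaces run_submission.

```python
import logging

logger = logging.getLogger(__name__)

SCORING, FAILED = "Scoring", "Failed"


class FakeTransaction:
    """Hypothetical stand-in for an atomic block: hooks run only on success."""

    def __init__(self):
        self._hooks = []

    def on_commit(self, func):
        self._hooks.append(func)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            for hook in self._hooks:
                hook()
        return False  # do not swallow exceptions; on failure the hooks never run


def update_to_scoring(row, publish):
    """Set the row to SCORING, then publish once the 'transaction' has committed."""
    with FakeTransaction() as tx:
        row["status"] = SCORING

        def _enqueue_scoring():
            try:
                publish(row["pk"])
            except Exception:
                logger.exception(
                    "Failed to re-enqueue scoring for submission %s", row["pk"]
                )
                # Only overwrite rows still in SCORING, mirroring the filter in the diff.
                if row["status"] == SCORING:
                    row["status"] = FAILED
                    row["status_details"] = (
                        "Broker unavailable when re-enqueuing scoring task"
                    )

        tx.on_commit(_enqueue_scoring)
    return row
```

With a working `publish` the row stays in SCORING and the task is queued; with a `publish` that raises, the row ends up Failed with the status_details from the diff instead of staying in SCORING. The status check before the overwrite plays the role of the `status=Submission.SCORING` filter, so a row that has already moved on is left untouched.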

…rows

When the compute worker PATCHes a submission to status=SCORING, the API serializer used to call run_submission() synchronously inside the same DB transaction. If the broker (RabbitMQ) was unreachable at that exact moment, the status row would commit but the scoring task would never be published, leaving the submission stuck in SCORING forever (no recovery: the 24h cleanup only rescues RUNNING rows).

Move the enqueue into transaction.on_commit so the task is only published after the SCORING status is durably committed, and explicitly mark the submission as Failed (with a clear status_details) if the publish still fails, so the row never stays in a non-terminal limbo state. Wrap update() in @transaction.atomic to make the commit boundary explicit.

Development

Successfully merging this pull request may close these issues.

Submissions stuck in "Scoring" when broker error occurs during compute_worker PATCH (non-transactional re-enqueue)