Skip to content

[SYNSD-2567] Fix httpx.ResponseNotRead crash in download chunk retry path#1413

Open
BryanFauble wants to merge 1 commit into
developfrom
synsd-2567-download-retry
Open

[SYNSD-2567] Fix httpx.ResponseNotRead crash in download chunk retry path#1413
BryanFauble wants to merge 1 commit into
developfrom
synsd-2567-download-retry

Conversation

@BryanFauble

@BryanFauble BryanFauble commented Jun 22, 2026

Copy link
Copy Markdown
Member

Problem

A user reported a crash while downloading an 8 MiB+ S3 file (multi-threaded download path). A chunk range request hit a retryable HTTP status, and the retry machinery then crashed trying to read the body of a streaming response that was never read:

download_async.py:477  _stream_and_write_chunk
  -> retry.py:498       with_retry_time_based
  -> retry.py:576       _is_retryable        response_message = response_message or _get_message(response)
  -> retry.py:661       _get_message         return response.text
  -> httpx _models.py   raise ResponseNotRead()

The download eventually succeeded because the outer count-based retry in download_by_file_handle() (60 retries / ~30 min) re-ran the whole download. So this is a resiliency bug, not data loss: a transient retryable status (e.g. 503/429 from S3) that should be silently retried at the chunk level instead raised an unexpected ResponseNotRead, aborted the file, deleted it, and forced a full-file restart — wasting the work of every other in-flight chunk.

Root cause

The streaming download deliberately tells the retry layer not to read response bodies, because the body is an unconsumed stream:

  • download_async.py calls with_retry_time_based(..., read_response_content=False).
  • _execute_stream_and_write_chunk opens with session.stream(...) as response: and calls _raise_for_status_httpx(response, read_response_content=False), which raises SynapseHTTPError without reading the body and attaches the unread, stream-closed response as ex.response.

read_response_content was only forwarded to _log_for_retry, not to _is_retryable. So _is_retryable unconditionally called _get_message(response)response.text, raising httpx.ResponseNotRead. Its except (AttributeError, ValueError) did not catch it — httpx.ResponseNotRead is a RuntimeError subclass (ResponseNotRead -> StreamError -> RuntimeError) — so the exception escaped the retry loop it was meant to drive.

This affected both with_retry_time_based and with_retry_time_based_async.

Download lifecycle (REST API as source of truth)

S3 file handle, multi-threaded path. Range requests must return 206 Partial Content per the Synapse/S3 REST contract; transient throttling/availability errors come back as 429/503 etc. and are meant to be retried.

sequenceDiagram
    autonumber
    participant DBH as download_by_file_handle()<br/>(outer retry: 60x / ~30min)
    participant MT as download_from_url_multi_threaded()
    participant DL as _MultithreadedDownloader.download_file()
    participant EX as _execute_download_tasks()
    participant CW as _stream_and_write_chunk()<br/>(with_retry_time_based)
    participant ST as _execute_stream_and_write_chunk()
    participant S3 as S3 (presigned URL)

    DBH->>MT: download (file > 8MiB, S3_FILE_HANDLE)
    MT->>DL: download_file(request)
    DL->>S3: GET (stream) Content-Length  [retry 403, read_response_content=False]
    S3-->>DL: 200 + headers
    DL->>DL: _prep_file() (truncate to empty)
    DL->>EX: create N chunk tasks (8MiB ranges)

    loop each chunk (concurrent, thread pool)
        EX->>CW: _stream_and_write_chunk_wrapper -> run_in_executor
        CW->>ST: with_retry_time_based(expected=206, read_response_content=False)
        ST->>S3: GET Range: bytes=start-end  (session.stream)
        alt 206 Partial Content (happy path)
            S3-->>ST: 206 + body stream
            ST->>ST: response.read(), _write_chunk(seek+write), record span
            ST-->>CW: end byte
        else transient 429/503 (THE BUG)
            S3-->>ST: 503 (body NOT read)
            ST->>ST: _raise_for_status_httpx(read_response_content=False)<br/>raises SynapseHTTPError(response=unread stream)
            ST-->>CW: raises
            Note over CW: _is_retryable() ignores read_response_content,<br/>calls _get_message()->response.text<br/>=> httpx.ResponseNotRead  ❌ escapes retry
        end
    end

    EX->>EX: completed_task.result() re-raises ResponseNotRead
    EX->>EX: set _aborted, log "Failed downloading", os.remove(path)
    EX-->>DL: raise
    DL-->>MT: raise
    MT-->>DBH: raise
    DBH->>DBH: is_retryable_download_error? -> restart ENTIRE file (wasteful)
Loading

…path

The multi-threaded chunk download passes read_response_content=False to the
retry layer because the chunk's HTTP body is an unconsumed httpx stream. That
flag was forwarded to _log_for_retry but not to _is_retryable, so on a retryable
status (e.g. 503/429 from S3) _is_retryable called _get_message() ->
response.text on the unread stream, raising httpx.ResponseNotRead. Because that
is a RuntimeError subclass, _get_message's except did not catch it and the
exception escaped the retry loop, aborting the file and forcing a wasteful
full-file restart via the outer retry.

- Thread read_response_content into _is_retryable and guard both _get_message
  call sites with it (status-code retry decisions need no body)
- Defensively widen _get_message's except to include httpx.StreamError
  (parent of httpx.ResponseNotRead)
- Add sync + async regression tests using a genuinely unread httpx streaming
  response
@BryanFauble BryanFauble requested a review from a team as a code owner June 22, 2026 21:16
Copilot AI review requested due to automatic review settings June 22, 2026 21:16

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a resiliency bug in the retry layer where handling a retryable HTTP status for an unread streaming httpx.Response could crash the retry loop with httpx.ResponseNotRead (notably in the multi-threaded download chunk retry path). This ensures chunk-level retries work as intended without aborting and restarting the entire file download.

Changes:

  • Thread read_response_content into _is_retryable() and propagate it from both sync/async time-based retry helpers.
  • Make _get_message() resilient to unread streaming responses by catching httpx.StreamError (covers ResponseNotRead).
  • Add sync + async regression tests that reproduce the unread-streaming-response scenario and assert the retry loop does not crash.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
synapseclient/core/retry.py Propagates read_response_content into retry decision logic and hardens message extraction for streaming responses.
tests/unit/synapseclient/core/unit_test_retry.py Adds regression tests covering unread httpx streaming responses for both sync and async time-based retries.
Comments suppressed due to low confidence (1)

synapseclient/core/retry.py:607

  • _is_retryable() can crash when _get_message(response) returns None (e.g., truncated body or unread/streaming body). In that case response_message.lower() raises AttributeError, which will escape the retry loop. This becomes more likely now that _get_message catches httpx.StreamError and returns None. Guard against a None message before calling .lower() / substring checks.
        elif read_response_content and response.status_code not in range(200, 299):
            # For all other non 200 messages look for retryable errors in the body or reason field
            response_message = response_message or _get_message(response)
            if (
                any([msg.lower() in response_message.lower() for msg in retry_errors])
                # special case for message throttling
                or response_message
                and (
                    "Please slow down.  You may send a maximum of 10 message"
                    in response_message
                )
            ):

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@andrewelamb andrewelamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a possible bug unrelated to the actual ticket. LGTM!

# For all other non 200 messages look for retryable errors in the body or reason field
response_message = response_message or _get_message(response)
if (
any([msg.lower() in response_message.lower() for msg in retry_errors])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is out of scope for this ticket, but response_message could be None at this point and calling lower on it would cause an exception.

@thomasyu888 thomasyu888 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 LGTM! Thanks for doing this, the diagram was really helpful in stepping through where read_response_content is False, and the comment on httpx.ResponseNotRead being a subclass of httpx.StreamError is helpful.

@linglp linglp left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BryanFauble ! I spent some time understanding the problem and the fixes. It seems like that the real fix is the guard in _is_retryable to avoid reading response content, and the fix in _get_message is a good defensive addition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants