[SYNSD-2567] Fix httpx.ResponseNotRead crash in download chunk retry path#1413
[SYNSD-2567] Fix httpx.ResponseNotRead crash in download chunk retry path#1413BryanFauble wants to merge 1 commit into
Conversation
…path The multi-threaded chunk download passes read_response_content=False to the retry layer because the chunk's HTTP body is an unconsumed httpx stream. That flag was forwarded to _log_for_retry but not to _is_retryable, so on a retryable status (e.g. 503/429 from S3) _is_retryable called _get_message() -> response.text on the unread stream, raising httpx.ResponseNotRead. Because that is a RuntimeError subclass, _get_message's except did not catch it and the exception escaped the retry loop, aborting the file and forcing a wasteful full-file restart via the outer retry. - Thread read_response_content into _is_retryable and guard both _get_message call sites with it (status-code retry decisions need no body) - Defensively widen _get_message's except to include httpx.StreamError (parent of httpx.ResponseNotRead) - Add sync + async regression tests using a genuinely unread httpx streaming response
There was a problem hiding this comment.
Pull request overview
Fixes a resiliency bug in the retry layer where handling a retryable HTTP status for an unread streaming httpx.Response could crash the retry loop with httpx.ResponseNotRead (notably in the multi-threaded download chunk retry path). This ensures chunk-level retries work as intended without aborting and restarting the entire file download.
Changes:
- Thread
read_response_contentinto_is_retryable()and propagate it from both sync/async time-based retry helpers. - Make
_get_message()resilient to unread streaming responses by catchinghttpx.StreamError(coversResponseNotRead). - Add sync + async regression tests that reproduce the unread-streaming-response scenario and assert the retry loop does not crash.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
synapseclient/core/retry.py |
Propagates read_response_content into retry decision logic and hardens message extraction for streaming responses. |
tests/unit/synapseclient/core/unit_test_retry.py |
Adds regression tests covering unread httpx streaming responses for both sync and async time-based retries. |
Comments suppressed due to low confidence (1)
synapseclient/core/retry.py:607
_is_retryable()can crash when_get_message(response)returnsNone(e.g., truncated body or unread/streaming body). In that caseresponse_message.lower()raisesAttributeError, which will escape the retry loop. This becomes more likely now that_get_messagecatcheshttpx.StreamErrorand returnsNone. Guard against aNonemessage before calling.lower()/ substring checks.
elif read_response_content and response.status_code not in range(200, 299):
# For all other non 200 messages look for retryable errors in the body or reason field
response_message = response_message or _get_message(response)
if (
any([msg.lower() in response_message.lower() for msg in retry_errors])
# special case for message throttling
or response_message
and (
"Please slow down. You may send a maximum of 10 message"
in response_message
)
):
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
andrewelamb
left a comment
There was a problem hiding this comment.
I found a possible bug unrelated to the actual ticket. LGTM!
| # For all other non 200 messages look for retryable errors in the body or reason field | ||
| response_message = response_message or _get_message(response) | ||
| if ( | ||
| any([msg.lower() in response_message.lower() for msg in retry_errors]) |
There was a problem hiding this comment.
This is out of scope for this ticket, but response_message could be None at this point and calling lower on it would cause an exception.
thomasyu888
left a comment
There was a problem hiding this comment.
🔥 LGTM! Thanks for doing this, the diagram was really helpful in stepping through where read_response_content is False, and the comment on httpx.ResponseNotRead being a subclass of httpx.StreamError is helpful.
linglp
left a comment
There was a problem hiding this comment.
Thanks @BryanFauble ! I spent some time understanding the problem and the fixes. It seems like that the real fix is the guard in _is_retryable to avoid reading response content, and the fix in _get_message is a good defensive addition.
Problem
A user reported a crash while downloading an 8 MiB+ S3 file (multi-threaded download path). A chunk range request hit a retryable HTTP status, and the retry machinery then crashed trying to read the body of a streaming response that was never read:
The download eventually succeeded because the outer count-based retry in
download_by_file_handle()(60 retries / ~30 min) re-ran the whole download. So this is a resiliency bug, not data loss: a transient retryable status (e.g. 503/429 from S3) that should be silently retried at the chunk level instead raised an unexpectedResponseNotRead, aborted the file, deleted it, and forced a full-file restart — wasting the work of every other in-flight chunk.Root cause
The streaming download deliberately tells the retry layer not to read response bodies, because the body is an unconsumed stream:
download_async.pycallswith_retry_time_based(..., read_response_content=False)._execute_stream_and_write_chunkopenswith session.stream(...) as response:and calls_raise_for_status_httpx(response, read_response_content=False), which raisesSynapseHTTPErrorwithout reading the body and attaches the unread, stream-closed response asex.response.read_response_contentwas only forwarded to_log_for_retry, not to_is_retryable. So_is_retryableunconditionally called_get_message(response)→response.text, raisinghttpx.ResponseNotRead. Itsexcept (AttributeError, ValueError)did not catch it —httpx.ResponseNotReadis aRuntimeErrorsubclass (ResponseNotRead -> StreamError -> RuntimeError) — so the exception escaped the retry loop it was meant to drive.This affected both
with_retry_time_basedandwith_retry_time_based_async.Download lifecycle (REST API as source of truth)
S3 file handle, multi-threaded path. Range requests must return 206 Partial Content per the Synapse/S3 REST contract; transient throttling/availability errors come back as 429/503 etc. and are meant to be retried.
sequenceDiagram autonumber participant DBH as download_by_file_handle()<br/>(outer retry: 60x / ~30min) participant MT as download_from_url_multi_threaded() participant DL as _MultithreadedDownloader.download_file() participant EX as _execute_download_tasks() participant CW as _stream_and_write_chunk()<br/>(with_retry_time_based) participant ST as _execute_stream_and_write_chunk() participant S3 as S3 (presigned URL) DBH->>MT: download (file > 8MiB, S3_FILE_HANDLE) MT->>DL: download_file(request) DL->>S3: GET (stream) Content-Length [retry 403, read_response_content=False] S3-->>DL: 200 + headers DL->>DL: _prep_file() (truncate to empty) DL->>EX: create N chunk tasks (8MiB ranges) loop each chunk (concurrent, thread pool) EX->>CW: _stream_and_write_chunk_wrapper -> run_in_executor CW->>ST: with_retry_time_based(expected=206, read_response_content=False) ST->>S3: GET Range: bytes=start-end (session.stream) alt 206 Partial Content (happy path) S3-->>ST: 206 + body stream ST->>ST: response.read(), _write_chunk(seek+write), record span ST-->>CW: end byte else transient 429/503 (THE BUG) S3-->>ST: 503 (body NOT read) ST->>ST: _raise_for_status_httpx(read_response_content=False)<br/>raises SynapseHTTPError(response=unread stream) ST-->>CW: raises Note over CW: _is_retryable() ignores read_response_content,<br/>calls _get_message()->response.text<br/>=> httpx.ResponseNotRead ❌ escapes retry end end EX->>EX: completed_task.result() re-raises ResponseNotRead EX->>EX: set _aborted, log "Failed downloading", os.remove(path) EX-->>DL: raise DL-->>MT: raise MT-->>DBH: raise DBH->>DBH: is_retryable_download_error? -> restart ENTIRE file (wasteful)