Skip to content

Fix: Extended Query Protocol Parse packets corrupted by interceptor#18

Open
nik-localstack wants to merge 1 commit into
masterfrom
fix/parametrized-upsert
Open

Fix: Extended Query Protocol Parse packets corrupted by interceptor#18
nik-localstack wants to merge 1 commit into
masterfrom
fix/parametrized-upsert

Conversation

@nik-localstack
Copy link
Copy Markdown

@nik-localstack nik-localstack commented May 22, 2026

Motivation

psycopg v3 (and asyncpg) use the Extended Query Protocol, sending a Parse packet whose body ends with binary uint32 parameter-type OIDs.
The interceptor sliced the body with data[1:-2], leaking those OID bytes into a UTF-8 decode.
The jsonb OID (3802 = 0x0EDA) contains 0xDA which is invalid UTF-8 and as a result crashing the interceptor and leaving the connection in a broken state.

Fix

Use find(b"\x00") to locate the true null-terminators of the statement name and query text, so the binary OID suffix is never fed to the text decoder.

Tests

A new e2e regression test that connects via psycopg v3 through the proxy and executes a parameterized INSERT with jsonb and text arguments.

@nik-localstack nik-localstack self-assigned this May 22, 2026
…y OID suffix

The Parse packet body (type 'P') has the format:
  statement_name\x00 + query\x00 + int16(param_count) + uint32[] OIDs
The old handler used data[1:-2] / data[-2:] which treated only the last
2 bytes as the param-count field, leaking binary OID bytes into the query
slice fed to _intercept_query().  When a parameter type OID contains a
byte >= 0x80 (e.g. jsonb OID 3802 = 0x00000EDA), the subsequent UTF-8
decode raised UnicodeDecodeError, causing the connection to drop or hang.
Fix: use find(b'\x00') to locate the true boundaries of statement name
and query text, so the binary suffix is never touched by the text decoder.
Adds a regression test using psycopg v3 (Extended Query Protocol) with
a jsonb+text parameterized INSERT through the proxy.

Co-authored-by: GitHub Copilot <copilot@github.com>
@nik-localstack nik-localstack force-pushed the fix/parametrized-upsert branch from 2964192 to a143e8b Compare May 23, 2026 10:10
@nik-localstack nik-localstack marked this pull request as ready for review May 23, 2026 10:27
Copy link
Copy Markdown

@bentsku bentsku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's interesting! Nice find!

I looked a bit in the PG docs to find more info about the structure of the Parse message to understand it.

I mostly have a comment about the nested find() usage, and about the testing, if we could add some Interceptors in our tests to be able to assert what is intercepted so we could make sure we're properly understanding the requests.

Would that make sense?

# Parse packet body:
# statement_name\x00 + query\x00 + int16(param_count) + uint32[]
# Keep the binary suffix untouched (count + OID array).
statement_end = data.find(b"\x00")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if find() is the best tool here when we need to find multiple times? split might be easier but I can see the null byte being part of the query being an expectation of _intercept_query, so this isn't great as it removes it...

split_data = data.split(b"\x00")
if len(split_data) == 3:
    statement, raw_query, params = split_data
    query = self._intercept_query(raw_query, ic_queries)
    # I would have preferred to use `b"\x00".join` but `_intercept_query` adds the null byte at the end and is patched in LocalStack, so better not
    data = statement + b"\x00" + query + params

And in _intercept_query, we could do this:

# Remove null byte at the end
- query = query[:-1].decode(codec)
+ query = query.rstrip(b"\x00").decode(codec)

Or if we keep the find() one, I would maybe simplify it a little:

statement_end = data.find(b"\x00")
query_end = data.rfind(b"\x00")
if statement_end != -1 and query_end != -1:
    query_start = statement_end + 1
    # _intercept_query expects the null byte separator ending the query part
    query = self._intercept_query(
        data[query_start : query_end + 1], ic_queries
    )
    # return statement + rewritten query + parameters
    data = data[:query_start] + query + data[query_end + 1 :]

params = data[-2:]
data = statement + query + params
# Parse packet body:
# statement_name\x00 + query\x00 + int16(param_count) + uint32[]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this for the Parse message type:

Parse (F)

  • Byte1('P')
    Identifies the message as a Parse command.
  • Int32
    Length of message contents in bytes, including self.
  • String
    The name of the destination prepared statement (an empty string selects the unnamed prepared statement).
  • String
    The query string to be parsed.
  • Int16
    The number of parameter data types specified (may be zero). Note that this is not an indication of the number of parameters that might appear in the query string, only the number that the frontend wants to prespecify types for.

Then, for each parameter, there is the following:

  • Int32
    Specifies the object ID of the parameter data type. Placing a zero here is equivalent to leaving the type unspecified.

And we can see String is separated by null byte here:

String(s)
A null-terminated string (C-style string). There is no specific length limitation on strings. If s is specified it is the exact value that will appear, otherwise the value is variable. Eg. String, String("user").

Comment thread tests/test_proxy.py
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to the issue, I wonder if we should add a dummy QueryInterceptor here in the proxy and store what we intercept, and assert on the intercepter data, that we have exactly what we are expecting? This way, we could really assert that the interceptor is working as expected, if that's easy enough to add

# Query, ends with b'\x00'
data = self._intercept_query(data, ic_queries)
elif packet_type == b"P":
# Statement that needs parsing.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be worth keeping/updating the comment that this is a Parse message with statements?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants