Fix: Extended Query Protocol Parse packets corrupted by interceptor#18
Fix: Extended Query Protocol Parse packets corrupted by interceptor#18nik-localstack wants to merge 1 commit into
Conversation
…y OID suffix The Parse packet body (type 'P') has the format: statement_name\x00 + query\x00 + int16(param_count) + uint32[] OIDs The old handler used data[1:-2] / data[-2:] which treated only the last 2 bytes as the param-count field, leaking binary OID bytes into the query slice fed to _intercept_query(). When a parameter type OID contains a byte >= 0x80 (e.g. jsonb OID 3802 = 0x00000EDA), the subsequent UTF-8 decode raised UnicodeDecodeError, causing the connection to drop or hang. Fix: use find(b'\x00') to locate the true boundaries of statement name and query text, so the binary suffix is never touched by the text decoder. Adds a regression test using psycopg v3 (Extended Query Protocol) with a jsonb+text parameterized INSERT through the proxy. Co-authored-by: GitHub Copilot <copilot@github.com>
2964192 to
a143e8b
Compare
bentsku
left a comment
There was a problem hiding this comment.
Ah, that's interesting! Nice find!
I looked a bit in the PG docs to find more info about the structure of the Parse message to understand it.
- https://www.postgresql.org/docs/8.1/protocol-flow.html#AEN60706
- https://www.postgresql.org/docs/8.1/protocol-message-types.html
- https://www.postgresql.org/docs/8.1/protocol-message-formats.html#:~:text=of%20the%20parameter.-,Parse%20(F),-Byte1('P
I mostly have a comment about the nested find() usage, and about the testing, if we could add some Interceptors in our tests to be able to assert what is intercepted so we could make sure we're properly understanding the requests.
Would that make sense?
| # Parse packet body: | ||
| # statement_name\x00 + query\x00 + int16(param_count) + uint32[] | ||
| # Keep the binary suffix untouched (count + OID array). | ||
| statement_end = data.find(b"\x00") |
There was a problem hiding this comment.
I wonder if find() is the best tool here when we need to find multiple times? split might be easier but I can see the null byte being part of the query being an expectation of _intercept_query, so this isn't great as it removes it...
split_data = data.split(b"\x00")
if len(split_data) == 3:
statement, raw_query, params = split_data
query = self._intercept_query(raw_query, ic_queries)
# I would have preferred to use `b"\x00".join` but `_intercept_query` adds the null byte at the end and is patched in LocalStack, so better not
data = statement + b"\x00" + query + params
And in _intercept_query, we could do this:
# Remove null byte at the end
- query = query[:-1].decode(codec)
+ query = query.rstrip(b"\x00").decode(codec)Or if we keep the find() one, I would maybe simplify it a little:
statement_end = data.find(b"\x00")
query_end = data.rfind(b"\x00")
if statement_end != -1 and query_end != -1:
query_start = statement_end + 1
# _intercept_query expects the null byte separator ending the query part
query = self._intercept_query(
data[query_start : query_end + 1], ic_queries
)
# return statement + rewritten query + parameters
data = data[:query_start] + query + data[query_end + 1 :]| params = data[-2:] | ||
| data = statement + query + params | ||
| # Parse packet body: | ||
| # statement_name\x00 + query\x00 + int16(param_count) + uint32[] |
There was a problem hiding this comment.
I found this for the Parse message type:
Parse (F)
- Byte1('P')
Identifies the message as a Parse command.- Int32
Length of message contents in bytes, including self.- String
The name of the destination prepared statement (an empty string selects the unnamed prepared statement).- String
The query string to be parsed.- Int16
The number of parameter data types specified (may be zero). Note that this is not an indication of the number of parameters that might appear in the query string, only the number that the frontend wants to prespecify types for.Then, for each parameter, there is the following:
- Int32
Specifies the object ID of the parameter data type. Placing a zero here is equivalent to leaving the type unspecified.
And we can see String is separated by null byte here:
String(s)
A null-terminated string (C-style string). There is no specific length limitation on strings. If s is specified it is the exact value that will appear, otherwise the value is variable. Eg. String, String("user").
There was a problem hiding this comment.
related to the issue, I wonder if we should add a dummy QueryInterceptor here in the proxy and store what we intercept, and assert on the intercepter data, that we have exactly what we are expecting? This way, we could really assert that the interceptor is working as expected, if that's easy enough to add
| # Query, ends with b'\x00' | ||
| data = self._intercept_query(data, ic_queries) | ||
| elif packet_type == b"P": | ||
| # Statement that needs parsing. |
There was a problem hiding this comment.
might be worth keeping/updating the comment that this is a Parse message with statements?
Motivation
psycopg v3 (and asyncpg) use the Extended Query Protocol, sending a Parse packet whose body ends with binary uint32 parameter-type OIDs.
The interceptor sliced the body with
data[1:-2], leaking those OID bytes into a UTF-8 decode.The jsonb OID (
3802=0x0EDA) contains0xDAwhich is invalid UTF-8 and as a result crashing the interceptor and leaving the connection in a broken state.Fix
Use
find(b"\x00")to locate the true null-terminators of the statement name and query text, so the binary OID suffix is never fed to the text decoder.Tests
A new e2e regression test that connects via psycopg v3 through the proxy and executes a parameterized INSERT with jsonb and text arguments.