Parser: fix exponential parse time on speculative prefix parsing#2352
Parser: fix exponential parse time on speculative prefix parsing#2352LucaCappelletti94 wants to merge 1 commit into
Conversation
b6a762e to
7dcc611
Compare
|
No more exponential-class crashes for a day! Maybe we are done! |
| // Memoize failed speculative reserved-word parses. When | ||
| // the reserved arm (CASE, CURRENT_TIME, etc.) does | ||
| // exponential work but the unreserved fallback ultimately | ||
| // succeeds, the overall `parse_prefix` returns `Ok` and the | ||
| // outer cache never fires. Chains like `case-case-...c` | ||
| // need this per-arm cache to break the doubling. |
There was a problem hiding this comment.
I wasn't able to follow this comment, what is 'outer cache' and 'break doubling' referring to?
There was a problem hiding this comment.
Outer cache is the cache in the parent call before the recursive call, and doubling is that at each layer of depth it went in before, the operations doubled (hence the previous 2^layers complexity from before)
| } | ||
| let result = self.parse_prefix_inner(); | ||
| if let Err(ref e) = result { | ||
| self.failed_prefix_positions.insert(start_index, e.clone()); |
There was a problem hiding this comment.
wondering about memory usage, how large do we expect the caches to get in the worse case scenarios?
There was a problem hiding this comment.
I have a ~2GB limit on the fuzzer processes and after many billions of iterations it did not manage to OOM - it hit before the ASAN thread number limit than an OOM, so at the very least I believe it is not exponential. I can try and get a scaling though.
There was a problem hiding this comment.
hmm yeah I'm mostly worried about the increased memory usage. A lot of deployments don't have e.g. 2GB to allocate to parsing a sql query. I think the main issue is that the map contains strings, maybe if its a much cheaper/copy object somehow we were looking to store, it might be feasible.
There was a problem hiding this comment.
Got a scaling for the cache memory. The two caches are keyed by token index, so each entry is about 40 bytes (8 byte usize key plus a 32 byte ParserError). Panels plot entries, heap String bytes, and total memory vs chain length N for valid SQL, the nested if(current_time(...x chain, and the wide case-...c chain.
- On valid SQL both maps stay empty (0 entries at a 27 KiB SELECT).
- Heap String bytes are bounded by recursion depth, not input: past the recursion limit the cached error is RecursionLimitExceeded with no String, so strings peak at ~1.8 KiB and never grow with input.
- Total memory is linear with a small constant: capped at ~4 KiB for deep nesting, at most ~13x input for the adversarial wide shape (159 KiB at a 12 KiB input). Reaching 2 GB would need a ~154 MB single statement.
So it is linear and never exponential, and valid SQL costs nothing.
There was a problem hiding this comment.
valid SQL costs nothing
My understanding is that e.g. on the parse_expr_prefix_by_reserved_word path, each unique attempt will add an error string into the map, if so then we're potentially looking at one entry in that map per word roughly. is that the the case?
To be clear, its not about a particular amount of memory, main thing is that we're not increasing memory usage of the parser significantly - if the additional memory usage grows as a function of the sql string, already that is problematic, then to improve it we would like to have each entry as minimal as we can, or potentially consider other solutions
There was a problem hiding this comment.
I am not sure I follow, currently the runtime is exponential, so the parser is just failing on these inputs. With the fix, the time requirement becomes linear and the the memory requirements remain linear with input size even for pathological cases

parse_prefixspeculatively tries the next token as a reserved-word expression head and, on failure, falls back to treating it as an unreserved identifier. Both arms can independently recurse into the same downstream position. Two complementary caches now break the 2^N exploration: an outer per-position failure cache catches chains where both arms fail at every level, and a per-arm cache on the reserved-wordtry_parsecatches chains where the arm fails but the fallback succeeds.Supersedes #2349 (named-arg chain) and #2351 (NOT-prefix chain). Does NOT supersede #2350 (compound-keyword
.not-bchain, which lives inparse_compound_expr).Same machine, release build, apache main
182eae81baseline:if(current_time(...x, n=30.foo(t--,i)chain, n=25 (was #2349)x-not-b.chain, n=30 (was #2351)case\t-chain, n=30 (new)cAse#f-#9-cAse#f...fuzz artifactRegression tests in
tests/sqlparser_common.rsand benches undersqlparser_benchcover theif(current_timeandcase-caseshapes.