Parser: fix exponential parse time on speculative prefix parsing by LucaCappelletti94 · Pull Request #2352 · apache/datafusion-sqlparser-rs

LucaCappelletti94 · 2026-05-23T22:16:23Z

parse_prefix speculatively tries the next token as a reserved-word expression head and, on failure, falls back to treating it as an unreserved identifier. Both arms can independently recurse into the same downstream position. Two complementary caches now break the 2^N exploration: an outer per-position failure cache catches chains where both arms fail at every level, and a per-arm cache on the reserved-word try_parse catches chains where the arm fails but the fallback succeeds.

Supersedes #2349 (named-arg chain) and #2351 (NOT-prefix chain). Does NOT supersede #2350 (compound-keyword .not-b chain, which lives in parse_compound_expr).

Same machine, release build, apache main 182eae81 baseline:

Input	Dialect	Size	Before	After
`if(current_time(...x`, n=30	PG	394 B	>60s	591us
`.foo(t--,i)` chain, n=25 (was #2349)	PG	308 B	>60s	550us
`x-not-b.` chain, n=30 (was #2351)	Generic	247 B	>60s	1.45ms
`case\t-` chain, n=30 (new)	SQLite	181 B	>60s	700us
`cAse#f-#9-cAse#f...` fuzz artifact	PG	142 B	86ms	486us

Regression tests in tests/sqlparser_common.rs and benches under sqlparser_bench cover the if(current_time and case-case shapes.

LucaCappelletti94 · 2026-05-23T22:17:56Z

Me watching yet another fuzzer crash:

LucaCappelletti94 · 2026-05-25T08:53:26Z

No more exponential-class crashes for a day! Maybe we are done!

iffyio · 2026-05-27T10:52:11Z

+                // Memoize failed speculative reserved-word parses. When
+                // the reserved arm (CASE, CURRENT_TIME, etc.) does
+                // exponential work but the unreserved fallback ultimately
+                // succeeds, the overall `parse_prefix` returns `Ok` and the
+                // outer cache never fires. Chains like `case-case-...c`
+                // need this per-arm cache to break the doubling.


I wasn't able to follow this comment, what is 'outer cache' and 'break doubling' referring to?

Outer cache is the cache in the parent call before the recursive call, and doubling is that at each layer of depth it went in before, the operations doubled (hence the previous 2^layers complexity from before)

iffyio · 2026-05-27T10:53:33Z

+        }
+        let result = self.parse_prefix_inner();
+        if let Err(ref e) = result {
+            self.failed_prefix_positions.insert(start_index, e.clone());


wondering about memory usage, how large do we expect the caches to get in the worse case scenarios?

I have a ~2GB limit on the fuzzer processes and after many billions of iterations it did not manage to OOM - it hit before the ASAN thread number limit than an OOM, so at the very least I believe it is not exponential. I can try and get a scaling though.

hmm yeah I'm mostly worried about the increased memory usage. A lot of deployments don't have e.g. 2GB to allocate to parsing a sql query. I think the main issue is that the map contains strings, maybe if its a much cheaper/copy object somehow we were looking to store, it might be feasible.

Got a scaling for the cache memory. The two caches are keyed by token index, so each entry is about 40 bytes (8 byte usize key plus a 32 byte ParserError). Panels plot entries, heap String bytes, and total memory vs chain length N for valid SQL, the nested if(current_time(...x chain, and the wide case-...c chain.

On valid SQL both maps stay empty (0 entries at a 27 KiB SELECT).

Heap String bytes are bounded by recursion depth, not input: past the recursion limit the cached error is RecursionLimitExceeded with no String, so strings peak at ~1.8 KiB and never grow with input.

Total memory is linear with a small constant: capped at ~4 KiB for deep nesting, at most ~13x input for the adversarial wide shape (159 KiB at a 12 KiB input). Reaching 2 GB would need a ~154 MB single statement.

So it is linear and never exponential, and valid SQL costs nothing.

valid SQL costs nothing

My understanding is that e.g. on the parse_expr_prefix_by_reserved_word path, each unique attempt will add an error string into the map, if so then we're potentially looking at one entry in that map per word roughly. is that the the case?

To be clear, its not about a particular amount of memory, main thing is that we're not increasing memory usage of the parser significantly - if the additional memory usage grows as a function of the sql string, already that is problematic, then to improve it we would like to have each entry as minimal as we can, or potentially consider other solutions

I am not sure I follow, currently the runtime is exponential, so the parser is just failing on these inputs. With the fix, the time requirement becomes linear and the the memory requirements remain linear with input size even for pathological cases

Parser: fix exponential parse time on speculative prefix parsing

7dcc611

LucaCappelletti94 force-pushed the pathological6 branch from b6a762e to 7dcc611 Compare May 24, 2026 07:36

This was referenced May 24, 2026

Parser: fix exponential parse time on named-arg chains #2349

Closed

Parser: fix exponential parse time on speculative NOT prefix parsing #2351

Closed

iffyio reviewed May 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parser: fix exponential parse time on speculative prefix parsing#2352

Parser: fix exponential parse time on speculative prefix parsing#2352
LucaCappelletti94 wants to merge 1 commit into
apache:mainfrom
LucaCappelletti94:pathological6

LucaCappelletti94 commented May 23, 2026 •

edited

Loading

Uh oh!

LucaCappelletti94 commented May 23, 2026

Uh oh!

LucaCappelletti94 commented May 25, 2026

Uh oh!

iffyio May 27, 2026

Uh oh!

LucaCappelletti94 May 27, 2026

Uh oh!

iffyio May 27, 2026

Uh oh!

LucaCappelletti94 May 27, 2026

Uh oh!

iffyio May 27, 2026

Uh oh!

LucaCappelletti94 May 27, 2026

Uh oh!

iffyio May 28, 2026

Uh oh!

LucaCappelletti94 May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

LucaCappelletti94 commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

LucaCappelletti94 commented May 23, 2026

Uh oh!

LucaCappelletti94 commented May 25, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LucaCappelletti94 commented May 23, 2026 •

edited

Loading