Optimize parsers: O(n²) → O(n) offset calculation by rhukster · Pull Request #121 · thunderer/Shortcode

rhukster · 2026-06-17T17:29:52Z

Summary

RegexParser and WordpressParser computed each match's character offset with mb_strlen(substr($text, 0, $match[1])), which rescans the entire prefix for every match. The total work therefore grows quadratically once the number of shortcodes scales with document size (O(n²)), and on large documents it dominates the parse time.

This PR accumulates the character offset incrementally, measuring only the segment of text between the previous match and the current one. preg_match_all (and the WordPress loop) return matches in ascending offset order, so the running total stays exact. Each byte of the text is measured at most once, so the offset calculation is O(n).
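The incremental accumulation described above can be sketched roughly as follows. This is an illustration only, not the code from the PR: the function name `characterOffsets()` and its signature are hypothetical, and it assumes UTF-8 input and byte offsets as reported by `PREG_OFFSET_CAPTURE`.

```php
<?php
// Hypothetical helper, not the library's code: converts the byte offsets
// reported by PREG_OFFSET_CAPTURE into character offsets with a running total.
function characterOffsets(string $text, string $pattern): array
{
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    $offsets = [];
    $lastByte = 0; // byte offset of the previous match
    $lastChar = 0; // character offset of the previous match

    foreach ($matches[0] as [$matched, $byteOffset]) {
        // Measure only the bytes between the previous match and this one,
        // instead of re-measuring the whole prefix from byte 0.
        $segment = substr($text, $lastByte, $byteOffset - $lastByte);
        $lastChar += mb_strlen($segment, 'UTF-8');
        $lastByte = $byteOffset;
        $offsets[] = $lastChar;
    }

    return $offsets;
}
```

The running total is only valid because the matches arrive in ascending offset order; if they did not, the segment length could go negative and the sum would drift.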

Benchmark

504 KB document, 1,500 shortcodes:

| Parser | Before | After | Speedup |
| --- | --- | --- | --- |
| `RegexParser` | 223.5 ms | 3.4 ms | 66x |
| `WordpressParser` | 221.7 ms | 1.0 ms | 222x |

RegularParser is unaffected by the offset change (it tokenizes differently).

Also included (small, behavior-preserving)

- Replace the hand-rolled per-character backslash escaping (preg_replace('/(.)/us', '\\$0', ...)) with preg_quote() in RegularParser and RegexBuilderUtility.
- Drop a no-op array_filter() in RegexParser::parse(); parseSingle() always returns a ParsedShortcode, never null.
- Iterate replacements with a reverse for loop instead of allocating an array_reverse() copy in Processor.
- Short-circuit the Shortcode parameter-type validation on the first invalid value instead of array_filter over all of them.
- Cache a few Syntax accessors in RegexParser.
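The reverse iteration over replacements mentioned in the list above might look something like the sketch below. It is not the Processor code from this PR: `applyReplacements()` and the `[offset, length, text]` tuple layout are assumptions made for illustration, with offsets counted in UTF-8 characters and the list sorted by ascending offset.

```php
<?php
// Hypothetical sketch, not the library's Processor. Each replacement is
// [character offset, character length, new text], sorted by ascending offset.
function applyReplacements(string $text, array $replacements): string
{
    // Walk from the last replacement to the first so that an edit never
    // shifts the offsets of replacements that are still waiting to be
    // applied. Indexing backwards avoids building an array_reverse() copy.
    for ($i = count($replacements) - 1; $i >= 0; $i--) {
        [$offset, $length, $new] = $replacements[$i];
        $text = mb_substr($text, 0, $offset, 'UTF-8')
            . $new
            . mb_substr($text, $offset + $length, null, 'UTF-8');
    }

    return $text;
}
```

Applying the edits back to front is what keeps the stored offsets usable without adjustment; the loop direction changes only how the list is traversed, not the result.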

Tests

No behavior change. The full suite passes: 288 tests, 2256 assertions.

RegexParser and WordpressParser recomputed each match's character offset with mb_strlen(substr($text, 0, $match[1])), rescanning the whole prefix for every match, which is O(n^2) in the number of shortcodes.

Accumulate the character offset incrementally instead, measuring only the new segment since the previous match. Matches come back in ascending offset order, so the running total stays exact.

On a 504 KB document with 1,500 shortcodes:

RegexParser 223.5 ms -> 3.4 ms (66x)
WordpressParser 221.7 ms -> 1.0 ms (222x)

Also:

- replace the hand-rolled per-character backslash escaping with preg_quote() in RegularParser and RegexBuilderUtility,
- drop a no-op array_filter() in RegexParser::parse() (parseSingle() never returns null),
- iterate replacements with a reverse for-loop instead of allocating an array_reverse() copy in Processor,
- short-circuit the Shortcode parameter-type check on the first invalid value instead of array_filter over all of them.

No behavior change: the full test suite passes (288 tests, 2256 assertions).

rhukster wants to merge 1 commit into thunderer:master from rhukster:master
