Optimize parsers: O(n²) → O(n) offset calculation #121

Open
rhukster wants to merge 1 commit into thunderer:master from rhukster:master

@rhukster

Summary

RegexParser and WordpressParser computed each match's character offset with mb_strlen(substr($text, 0, $match[1])), which rescans the entire prefix for every match. That makes parsing O(n²) in the number of shortcodes, and on large documents it dominates the parse time.

This PR accumulates the character offset incrementally, measuring only the new segment of text since the previous match. preg_match_all (and the WordPress loop) return matches in ascending offset order, so the running total stays exact. The result is O(n).
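A minimal sketch of the idea, not the code from this PR. The helper names `charOffsetsNaive` and `charOffsetsIncremental` are hypothetical, and the pattern is a stand-in for the parsers' real shortcode regex. It assumes UTF-8 input and matches collected with `preg_match_all` and `PREG_OFFSET_CAPTURE`, which reports byte offsets rather than character offsets.

```php
<?php
// Hypothetical illustration only; names and pattern are assumptions,
// not identifiers from the library.

/** Quadratic variant: re-measures the whole prefix for every match. */
function charOffsetsNaive(string $text, string $pattern): array
{
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    $offsets = [];
    foreach ($matches[0] as $match) {
        // $match[1] is a byte offset; convert it by counting the characters
        // in everything that precedes the match.
        $offsets[] = mb_strlen(substr($text, 0, $match[1]), 'UTF-8');
    }
    return $offsets;
}

/** Linear variant: measures only the text since the previous match. */
function charOffsetsIncremental(string $text, string $pattern): array
{
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    $offsets = [];
    $byteCursor = 0; // byte offset of the previous match
    $charCursor = 0; // character offset of the previous match
    foreach ($matches[0] as $match) {
        $byteOffset = $match[1];
        // Matches arrive in ascending byte order, so this segment is never
        // negative and every byte of $text is measured at most once.
        $segment = substr($text, $byteCursor, $byteOffset - $byteCursor);
        $charCursor += mb_strlen($segment, 'UTF-8');
        $byteCursor = $byteOffset;
        $offsets[] = $charCursor;
    }
    return $offsets;
}
```

For the text `żółw [a] ść [b]` and the pattern `/\[\w+\]/u`, both helpers return `[5, 12]`, while the underlying byte offsets are 8 and 17. The incremental variant relies on the ascending order of matches; with unordered matches the running total would no longer be valid.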

Benchmark

504 KB document, 1,500 shortcodes:

| Parser | Before | After | Speedup |
|---|---|---|---|
| RegexParser | 223.5 ms | 3.4 ms | 66x |
| WordpressParser | 221.7 ms | 1.0 ms | 222x |

RegularParser is unaffected by the offset change (it tokenizes differently).

Also included (small, behavior-preserving)

  • Replace the hand-rolled per-character backslash escaping (preg_replace('/(.)/us', '\\$0', ...)) with preg_quote() in RegularParser and RegexBuilderUtility.
  • Drop a no-op array_filter() in RegexParser::parse(); parseSingle() always returns a ParsedShortcode, never null.
  • Iterate replacements with a reverse for loop instead of allocating an array_reverse() copy in Processor.
  • Short-circuit the Shortcode parameter-type validation on the first invalid value instead of running array_filter() over all of them, and cache a few Syntax accessors in RegexParser.
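As an illustration of the reverse-loop item above, here is a hedged sketch rather than the Processor's actual code. The function `applyReplacements` and the `[offset, length, value]` triple layout are assumptions made for this example, and it works on byte offsets for simplicity, which may differ from how the library tracks positions.

```php
<?php
// Hypothetical sketch; the function name and the data layout are assumptions.

/**
 * Applies replacements to $text.
 *
 * @param string $text
 * @param array  $replacements list of [byteOffset, byteLength, replacement],
 *                             assumed sorted ascending and non-overlapping
 */
function applyReplacements(string $text, array $replacements): string
{
    // Walk from the last replacement to the first so that offsets of the
    // earlier replacements stay valid after each splice. Indexing backwards
    // avoids allocating a reversed copy of the array.
    for ($i = count($replacements) - 1; $i >= 0; $i--) {
        [$offset, $length, $value] = $replacements[$i];
        $text = substr_replace($text, $value, $offset, $length);
    }
    return $text;
}
```

Calling `applyReplacements('Hello [a] and [b]!', [[6, 3, 'ALPHA'], [14, 3, 'B']])` yields `Hello ALPHA and B!`. Processing in forward order would shift the second offset once the first replacement changed the string length, which is why the reverse direction matters.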

Tests

No behavior change. The full suite passes: 288 tests, 2256 assertions.

RegexParser and WordpressParser recomputed each match's character offset
with mb_strlen(substr($text, 0, $match[1])), rescanning the whole prefix
for every match, which is O(n^2) in the number of shortcodes. Accumulate
the character offset incrementally instead, measuring only the new segment
since the previous match. Matches come back in ascending offset order, so
the running total stays exact.

On a 504 KB document with 1,500 shortcodes:

  RegexParser      223.5 ms -> 3.4 ms   (66x)
  WordpressParser  221.7 ms -> 1.0 ms   (222x)

Also:
- replace the hand-rolled per-character backslash escaping with preg_quote()
  in RegularParser and RegexBuilderUtility,
- drop a no-op array_filter() in RegexParser::parse() (parseSingle() never
  returns null),
- iterate replacements with a reverse for-loop instead of allocating an
  array_reverse() copy in Processor,
- short-circuit the Shortcode parameter-type check on the first invalid
  value instead of array_filter over all of them.

No behavior change: the full test suite passes (288 tests, 2256 assertions).