Perl-compatible regular expression optimizer


0. Contents

1. Purpose

Optimizes perl-compatible regular expressions.

2. Usage

The general syntax for running the program is: regex-opt <regexp>

Example:
regex-opt 'xaz|xbz|xcz'
x[a-c]z

Try running regex-opt on Abigail's 7 kilobyte URL regexp. The result should be about 5 kilobytes long.

3. Supported syntax

  • * (repeat 0-inf)
  • + (repeat 1-inf)
  • ? (repeat 0-1)
  • {n} (repeat n)
  • {n,} (repeat n-inf)
  • {n,m} (repeat n-m)
  • . (accept any char except \n)
  • [a-z] (character sets)
  • [^a-z] (inverse character sets)
  • [[:alpha:]] (character classes)
  • \s (and other character classes and escapes)
  • x|y (alternatives)
  • (?:x|y) (non-capturing grouping)
  • *? (non-greedy repeat)

4. Unsupported syntax

  • ^ (match string-begin)
  • $ (match string-end)
  • () (capturing is converted to noncapturing)
  • Any (? -command that is not mentioned in supported syntax
  • Unicode-specific markup

5. Optimizations performed

  • Character set optimization: [A-Zabcdefgh-yz0-9%] becomes [[:alnum:]%]
  • Alternate characters: y|[yp]|[zx] becomes [px-z]
  • Counting: aaa* and aa+ become a{2,} and (a?){3} becomes a{0,3}
  • Combining: abcde|xycde becomes (?:ab|xy)cde
  • Parenthesis reduction: ((abc)) becomes abc, (xx|yy)|zz becomes xx|yy|zz
  • Compression: xyzyzxyzyz becomes (?:x(?:yz){2}){2}
    • This might not be always a good thing.
  • Choice counting: a+|aa+ becomes a+, (b|) becomes b?, dxxxxb|dxxxb|dxxb|dxb becomes dx{1,4}b

6. Optimizations not performed

  • Combining counts:
    • a?|b? should become (?:a|b)?, now becomes a?|b?
  • Redundancy removal (removal of alternatives that are subsets of other alternatives):
    • xfooy|x[a-q]+y should become x[a-q]+y, now becomes x(?:foo|[a-q]+)y
Help in solving these shortcomings would be welcome.

7. Copying

regex-opt has been written by Joel Yliluoma, a.k.a. Bisqwit,
and is distributed under the terms of the General Public License (GPL).

8. Requirements

For compiling you need the following GNU tools: g++, make.

9. Downloading

Downloading help

  • Do not download everything - you only need one file (newest version for your platform)!
  • Do not use download accelerators or you will be banned from this server before your download is complete!

The most recent source code (bleeding edge) for regex-opt can also be downloaded by cloning the Git repository by:

Date (Y-md-Hi) acc        Size Name                
2012-1218-1714 r--       28819 regex-opt-1.2.3.tar.bz2
2012-1218-1714 r--       30572 regex-opt-1.2.3.tar.gz
2007-0227-1646 r--       27855 regex-opt-1.2.2.tar.bz2
2007-0227-1646 r--       29034 regex-opt-1.2.2.tar.gz
2007-0227-1646 r--        1093 patch-regex-opt-1.2.1-1.2.2.bz2
2007-0227-1646 r--         982 patch-regex-opt-1.2.1-1.2.2.gz
2007-0221-1727 r--       27830 regex-opt-1.2.1.tar.bz2
2007-0221-1727 r--       28989 regex-opt-1.2.1.tar.gz
2007-0221-1727 r--        4617 patch-regex-opt-1.2.0-1.2.1.bz2
2007-0221-1727 r--        4541 patch-regex-opt-1.2.0-1.2.1.gz
2006-0117-0047 r--       27194 regex-opt-1.2.0.tar.bz2
2006-0117-0047 r--       28317 regex-opt-1.2.0.tar.gz
2006-0117-0047 r--       12255 patch-regex-opt-1.1.1-1.2.0.bz2
2006-0117-0047 r--       17098 patch-regex-opt-1.1.1-1.2.0.gz
2006-0116-2359 r--       27106 regex-opt-1.1.1.tar.bz2
2006-0116-2359 r--       28154 regex-opt-1.1.1.tar.gz
2006-0116-2359 r--        8769 patch-regex-opt-1.1.0-1.1.1.bz2
2006-0116-2359 r--        8829 patch-regex-opt-1.1.0-1.1.1.gz
2005-1025-1531 r--       23442 regex-opt-1.1.0.tar.bz2
2005-1025-1531 r--       23938 regex-opt-1.1.0.tar.gz
2005-1025-1531 r--        2247 patch-regex-opt-1.0.0-1.1.0.bz2
2005-1025-1531 r--        2056 patch-regex-opt-1.0.0-1.1.0.gz
2004-0712-2034 r--       23287 regex-opt-1.0.0.tar.bz2
Back to the source directory index at Bisqwit's homepage