Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. - Jamie Zawinski
I love regular expressions. I like their pseudo computational theory roots (regular expressions as we think of them are not exactly "regular" expressions in the mathematical sense, but it's mostly the same ideas), and they are always fun to write. I think that grep is the single greatest piece of software ever written.
That being said, I generally don't like to get too fancy with regexps in production code. For one, as we discovered, they can end up with exponential running time. That's bad when putting in an invalid email address in a form field acts as a denial of service attack on your site. They make great filters, and pretty good matchers when you need to pull out part of a string. And if you're writing a one-line shell script for log analysis, they are wonderful.
But in production code, regular expressions tend to violate the most important general principle for enterprise development: readability. A single, simple-looking regex can hide a ton of defects. When someone goes to fix a pattern that a regex is letting through, it's very difficult to ensure they didn't break other patterns.
Of course, it's not necessary to hand-write a full-blown parser for most things. Simple regexes are good, so sometimes some combination of basic hand parsing and regular expressions can be very effective. There's no shame in using a few simple expressions instead of one big complicated one. Regexes also aren't the best tool for every problem.
For example, say you're trying to get the first 400 characters of a long string, but you want it to end at a word boundary, running slightly over 400 characters if necessary. The regex for that is something like: /^(.{400}[^\s]*)\s/s (Perl format). Now which is more readable, that thing, or:
firstWordBoundaryAfter400Characters = fooString.IndexOf(" ", 400);
return fooString.Substring(0, firstWordBoundaryAfter400Characters + 1);
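Here's a rough Python sketch of the two versions side by side, mostly to show where the edge cases hide. The function names and the fall-back-to-the-whole-string behavior are my assumptions for illustration. Unlike the snippet above, both helpers drop the trailing space so their results can be compared directly, and the regex version accepts any whitespace while the find version only looks for a plain space.

```python
import re

LIMIT = 400  # the cut-off used in the example above


def truncate_with_regex(text: str, limit: int = LIMIT) -> str:
    # Same idea as /^(.{400}[^\s]*)\s/s: take `limit` characters, then the
    # rest of the current word, and require whitespace right after it.
    match = re.match(r"(.{%d}\S*)\s" % limit, text, flags=re.DOTALL)
    # Assumption: return the whole string when nothing matches
    # (string too short, or no whitespace after the limit).
    return match.group(1) if match else text


def truncate_with_find(text: str, limit: int = LIMIT) -> str:
    # str.find returns -1 when there is no space at or after `limit`,
    # which also covers strings shorter than `limit`.
    boundary = text.find(" ", limit)
    return text if boundary == -1 else text[:boundary]


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print(truncate_with_regex(sample, 12))
    print(truncate_with_find(sample, 12))
```

For plain space-separated text, both helpers should return the same thing.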
I know which one I'd rather fix a defect in. It's more explicit, and if it needs modification or fixing later, side effects are less likely. That's a pretty easy example though. But it does point out that sometimes it's easy to start seeing regexes as the answer to everything. Things like this can help readability too:
if (fooString.IndexOf("something") >= 0)
    fooString.regexReplace(somestuff)
else
    fooString.regexReplace(otherstuff)
Sometimes doing lots of ors/ands in a regex can get totally hairy, so splitting them out into simpler expressions works well too (and helps show unit test coverage).
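As a small illustration of that kind of split, here's a Python sketch. The rule itself is made up (mask timestamps on error lines, collapse whitespace everywhere else), and `normalize_line` is a hypothetical name. The point is only that a plain substring check picks the branch, and each branch holds one simple expression that can be unit tested on its own.

```python
import re

# Two small, single-purpose expressions instead of one big alternation.
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}")
WHITESPACE_RUN = re.compile(r"\s+")


def normalize_line(line: str) -> str:
    # Hypothetical rule, for illustration only.
    if "ERROR" in line:
        # Error lines: hide the timestamp, leave the rest alone.
        return TIMESTAMP.sub("<time>", line)
    # Everything else: squeeze runs of whitespace down to one space.
    return WHITESPACE_RUN.sub(" ", line)


if __name__ == "__main__":
    print(normalize_line("ERROR at 12:34:56 in worker"))
    print(normalize_line("all   good\there"))
```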
And finally, the links to tons and tons of info about regular expressions.
Regular-Expressions.info - Covers everything in great detail.
Good discussion of performance - This is good for its explanation of exponential running time issues:
Something like "(.*.*)*x" in a string not containing "x" can take exponential time, n^n, so a hundred characters of text could potentially take years to match unless an early match is found. The inner term ".*.*" takes n^2 because of the multiplicative effects of concatenation operator. (I used two wildcards just in case "(.*)*" is optimized away.) The outer wildcard compounds the running time by a power of n.
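If you want to see the blow-up for yourself, here's a minimal Python sketch. It uses a different pattern than the one quoted above: `(a+)+$`, a commonly cited nested-quantifier example, run against a string of a's ending in a "b" so the match has to fail. The pattern choice and the `time_failed_match` helper are my own assumptions. On a backtracking engine like Python's `re`, the time typically grows very quickly as the input gets longer, so the sizes are kept small.

```python
import re
import time

# Nested quantifiers: a shape that can trigger catastrophic backtracking.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n: int) -> float:
    """Return the seconds spent failing to match n 'a's followed by 'b'."""
    text = "a" * n + "b"  # the trailing 'b' means the match cannot succeed
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the engine only learns this after backtracking
    return elapsed


if __name__ == "__main__":
    # Keep n small; each extra character can noticeably increase the time.
    for n in range(10, 19, 2):
        print(n, f"{time_failed_match(n):.4f}s")
```

Exact timings depend on the machine and the regex engine, so treat the printed numbers as a rough picture of the trend, not a benchmark.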
This is Rob Meyer's weblog, a weblog focused on software development and system administration based on 10 years of experience.