Four more rules for Regular Expressions

Now that we have two basic base rules, we would like to add four more. Then we can build regular expressions recursively from small regular expressions. The first three of these rules describe the only three necessary operators in regular expressions. The fourth rule deals with parentheses:

  1. Alternation: If p and q are regular expressions, then will also p|q be a regular expression. The expression p|q matches the union of the strings matched by p and q. Think of it as: either p or q.
  2. Concatenation: If p and q are two regular expressions, then will also pq be a regular expression. Note that the symbol for concatenation is invisible. Some literature use + for concatenation, as in:p+q. The expression pq denotes a language consisting of all strings with a prefix matched by p, directly followed by a suffix matched by q.
  3. Closure: If p is a regular expression, then will also p* be a regular expression. This is the closure of concatenation with the expression itself. The expression p* match all strings that Completely can be divided into zero or more substrings, each of which is matched by p.
  4. Parentheses: If p is a regular expression, then will also (p) be a regular expression. This means that we can enclose an expression in parentheses without altering its meaning.

In addition to these rules we will shortly put some convenience rules for operator precedence. They are not necessary, but allow us to write shorter and more readable regular expressions. Quite soon you will also see real regular expression examples with these three operators. These six rules are sufficient for writing all possible regular expressions. All other regex operators you’ve seen are just abstractions and syntactic sugar. Or?

Do you remember George Bernard Shaw’s “The golden rule is that there are no golden rules.” and Mark Twain’s “It is a good idea to obey all the rules when you’re young just so you’ll have the strength to break them when you’re old.”? This is exactly how we should do. For now, these rules are all we need. However, in modern regex engines, there are functions like a back reference, and lookarounds. To implement these, we need more rules. But until we’re there, it will be very useful to think of regular expressions as a system consisting of only these six rules.

Regular expression is thus a mathematical theory and modern regex engines are based on a super set of that theory. With the help of the theory, we can prove the following:

  • For each regular expression, we can construct at least one deterministic finite automaton and at least one nondeterministic finite automaton, so that all three solves the same problem.
  • For every finite automaton — deterministic as well as nondeterministic — we can write a regular expression, so that both solve the same problem.

Solving a problem here means to determine whether a string is part of a language. The proofs mentioned above are not reproduced here, but they are easily accessed as they exist in every textbook on automata theory. The beauty of this analogy between regular expressions and finite automata is that I can explain several key features of regular expressions for you, with the help of graphs of finite automata. And as a matter of fact, a regex engine is really just a compiler that translates our hand-written regular expressions to computer friendly finite automata, or possibly the more advanced pushdown automata.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC


%d bloggers like this: