
Now that we have two basic base rules, we would like to add four more. Then we can build regular expressions recursively from small regular expressions. The first three of these rules describe the only three necessary operators in regular expressions. The fourth rule deals with parentheses:
- Alternation: If
pandqare regular expressions, then will alsop|qbe a regular expression. The expressionp|qmatches the union of the strings matched bypand q. Think of it as: eitherporq. - Concatenation: If
pandqare two regular expressions, then will alsopqbe a regular expression. Note that the symbol for concatenation is invisible. Some literature use+for concatenation, as in:p+q. The expressionpqdenotes a language consisting of all strings with a prefix matched byp, directly followed by a suffix matched byq. - Closure: If
pis a regular expression, then will alsop*be a regular expression. This is the closure of concatenation with the expression itself. The expressionp*match all strings that Completely can be divided into zero or more substrings, each of which is matched byp. - Parentheses: If
pis a regular expression, then will also(p)be a regular expression. This means that we can enclose an expression in parentheses without altering its meaning.
In addition to these rules we will shortly put some convenience rules for operator precedence. They are not necessary, but allow us to write shorter and more readable regular expressions. Quite soon you will also see real regular expression examples with these three operators. These six rules are sufficient for writing all possible regular expressions. All other regex operators you’ve seen are just abstractions and syntactic sugar. Or?
Do you remember George Bernard Shaw’s “The golden rule is that there are no golden rules.” and Mark Twain’s “It is a good idea to obey all the rules when you’re young just so you’ll have the strength to break them when you’re old.”? This is exactly how we should do. For now, these rules are all we need. However, in modern regex engines, there are functions like a back reference, and lookarounds. To implement these, we need more rules. But until we’re there, it will be very useful to think of regular expressions as a system consisting of only these six rules.
Regular expression is thus a mathematical theory and modern regex engines are based on a super set of that theory. With the help of the theory, we can prove the following:
- For each regular expression, we can construct at least one deterministic finite automaton and at least one nondeterministic finite automaton, so that all three solves the same problem.
- For every finite automaton — deterministic as well as nondeterministic — we can write a regular expression, so that both solve the same problem.
Solving a problem here means to determine whether a string is part of a language. The proofs mentioned above are not reproduced here, but they are easily accessed as they exist in every textbook on automata theory. The beauty of this analogy between regular expressions and finite automata is that I can explain several key features of regular expressions for you, with the help of graphs of finite automata. And as a matter of fact, a regex engine is really just a compiler that translates our hand-written regular expressions to computer friendly finite automata, or possibly the more advanced pushdown automata.

0 Responses to “Four more rules for Regular Expressions”