Regular Expression Alternation

From rule number 2 and rule number 3 we can define paradigms, that is, a number of possible patterns. This means that we combine two or more languages by applying the set operator union to them. The union of the sets {a, b} and {c, d} is {a, b, c, d}: all the elements that are in at least one of the sets. In Boolean logic, we call this the inclusive or. In regular expressions, it is called alternation and is written with a vertical bar |. Here are some examples:

'a'.match /a|b/ #=> #<MatchData "a"> - a is either a or b
'ab'.match /a|b/ #=> #<MatchData "a"> - the earliest match in the string is chosen
'ba'.match /a|b/ #=> #<MatchData "b"> - the earliest match in the string is chosen
'c'.match /a|b/ #=> nil - here we found neither a nor b
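
The link between set union and alternation can be sketched in a few lines of Ruby. This is only an illustration: the constant names and the in_language? helper are made up for this example, and the \A and \z anchors are there so that the pattern has to match the whole string rather than a part of it.

```ruby
require 'set'

# Two finite languages, each modelled as a set of strings (illustrative names).
LANGUAGE_ONE = Set['a', 'b']
LANGUAGE_TWO = Set['c', 'd']

# Set#| is set union: every string that is in at least one of the languages.
LANGUAGE_UNION = LANGUAGE_ONE | LANGUAGE_TWO

# Hypothetical helper: does the pattern match the entire string?
def in_language?(pattern, text)
  /\A(?:#{pattern.source})\z/.match?(text)
end

# The alternation a|b|c|d accepts every string in the union...
puts LANGUAGE_UNION.all? { |s| in_language?(/a|b|c|d/, s) }
# ...and rejects a string that is in neither language.
puts in_language?(/a|b|c|d/, 'e')
```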

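Note that a match always starts at the earliest possible position in the string. When more than one alternative can match at that position, most regex engines select the leftmost alternative, that is, the one written first in the pattern. There are exceptions to this rule: a regex engine based on a DFA or a POSIX NFA selects the longest alternative. Most regex engines, Ruby's included, are traditional (backtracking) NFA engines and select the leftmost.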
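
A small sketch of how this looks in Ruby, whose engine is of the backtracking kind. The first_match helper is a made-up name for this example; it just returns the matched text, or nil when there is no match.

```ruby
# Hypothetical helper: the text of the first match, or nil.
def first_match(pattern, text)
  m = pattern.match(text)
  m && m[0]
end

# Both alternatives can match at position 0. Ruby takes the one written first,
# even though the other alternative would match more of the string.
puts first_match(/foo|foobar/, 'foobar')   # prints foo
puts first_match(/foobar|foo/, 'foobar')   # prints foobar

# Position in the string comes first: b is the second alternative,
# but it matches earlier in the string than a does.
puts first_match(/a|b/, 'ba')              # prints b
```

A POSIX-style engine would be expected to report foobar in both of the first two cases, since it prefers the longest match at the earliest position.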

Can you write a regular expression that matches all binary strings of length one? The binary alphabet is { 0, 1 }. Since there aren’t a huge number of binary strings of length one, you can pretty quickly list them: { 0, 1 }. The regular expression with alternation then becomes 0|1:

'0'.match /0|1/ #=> #<MatchData "0">
'1'.match /0|1/ #=> #<MatchData "1">
'2'.match /0|1/ #=> nil
'10'.match /0|1/ #=> #<MatchData "1">

There are four binary strings of length two: {00, 01, 10, 11}. We can match them with 00|10|01|11:

'10'.match /00|10|01|11/ #=> #<MatchData "10">
'01'.match /00|10|01|11/ #=> #<MatchData "01">
'12'.match /00|10|01|11/ #=> nil
'11'.match /00|10|01|11/ #=> #<MatchData "11">
'1210'.match /00|10|01|11/ #=> #<MatchData "10">
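
Listing the alternatives by hand gets tedious as the strings grow longer. As a rough sketch, the list can be generated and handed to Regexp.union, which joins its arguments with |. The method names binary_strings and binary_pattern are invented for this example.

```ruby
# All binary strings of length n, e.g. ["00", "01", "10", "11"] for n = 2.
def binary_strings(n)
  %w[0 1].repeated_permutation(n).map(&:join)
end

# An alternation with one alternative per binary string of length n.
def binary_pattern(n)
  Regexp.union(binary_strings(n))
end

puts binary_pattern(2).source              # prints 00|01|10|11
puts binary_pattern(2).match('1210')[0]    # prints 10
puts binary_strings(3).length              # prints 8
```

The alternatives come out in a different order than in the hand-written pattern above, which makes no difference here because all of them have the same length.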

Maybe you didn’t notice, but we used concatenation in the regular expression above. (Can you see the invisible concatenation symbol between the two binary digits in the regular expression? If not, maybe you should make an appointment with an optometrist. Or maybe not: not even an optometrist can help you see invisible symbols.) Each of the binary strings of length two is made up of two concatenated binary strings of length one. Since concatenation has higher precedence than alternation, we didn’t need any parentheses.
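
To see the precedence rule in action, here is a minimal sketch. The matches_entirely? helper is a made-up name; it wraps the pattern in \A and \z so that the whole string has to match, and (?:...) is a group that does not capture anything.

```ruby
# Hypothetical helper: true only if the pattern matches the whole string.
def matches_entirely?(pattern, text)
  /\A(?:#{pattern.source})\z/.match?(text)
end

# ab|cd is read as (ab)|(cd): concatenation binds tighter than alternation.
puts matches_entirely?(/ab|cd/, 'ab')        # prints true
puts matches_entirely?(/ab|cd/, 'cd')        # prints true
puts matches_entirely?(/ab|cd/, 'abd')       # prints false

# To alternate between b and c only, parentheses are required.
puts matches_entirely?(/a(?:b|c)d/, 'abd')   # prints true
puts matches_entirely?(/a(?:b|c)d/, 'ab')    # prints false
```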

Alternation is commutative in terms of the language it describes: for two regular expressions p and q it holds that p|q = q|p. It is also associative: p|(q|r) = (p|q)|r. An interesting and very useful fact is that concatenation distributes over alternation. This means that for all regular expressions p, q and r it’s true that p(q|r) = pq|pr and (p|q)r = pr|qr. The consequence of that is that (0|1)(0|1) = (0|1)0|(0|1)1 = 00|10|01|11. So another way to match any binary string of length two is:

'10'.match /(0|1)(0|1)/ #=> #<MatchData "10" 1:"1" 2:"0">
'01'.match /(0|1)(0|1)/ #=> #<MatchData "01" 1:"0" 2:"1">
'12'.match /(0|1)(0|1)/ #=> nil
'11'.match /(0|1)(0|1)/ #=> #<MatchData "11" 1:"1" 2:"1">
'1210'.match /(0|1)(0|1)/ #=> #<MatchData "10" 1:"1" 2:"0">
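
The distributive laws can be spot-checked by brute force. The sketch below compares two patterns on every string up to a small length over the alphabet {0, 1, 2}. That is evidence rather than a proof, since only finitely many strings are tried. ALPHABET, strings_up_to and same_language? are names invented for this example.

```ruby
ALPHABET = %w[0 1 2].freeze

# Every string over ALPHABET with length 0..max_len (the empty string included).
def strings_up_to(max_len)
  (0..max_len).flat_map { |n| ALPHABET.repeated_permutation(n).map(&:join) }
end

# True if both patterns accept exactly the same strings up to max_len.
# The \A and \z anchors force each pattern to match the whole string.
def same_language?(left, right, max_len = 3)
  strings_up_to(max_len).all? do |s|
    /\A(?:#{left.source})\z/.match?(s) == /\A(?:#{right.source})\z/.match?(s)
  end
end

puts same_language?(/(0|1)(0|1)/, /00|10|01|11/)   # prints true
puts same_language?(/0(1|2)/, /01|02/)             # p(q|r) = pq|pr, prints true
puts same_language?(/(0|1)2/, /02|12/)             # (p|q)r = pr|qr, prints true
puts same_language?(/(0|1)2/, /01|02/)             # not the same, prints false
```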

The parentheses were needed, of course, because concatenation has higher precedence than alternation: without them, 0|10|1 would mean 0, 10 or 1. We can also have the empty string ε as one of our alternatives:

'moda'.match /moda|/ #=> #<MatchData "moda"> - either moda or nothing is moda
'moda'.match /mado|/ #=> #<MatchData ""> - either mado or nothing is nothing
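
Two consequences of an empty alternative are worth trying out. The pattern can never fail, and in a leftmost-first engine such as Ruby's the position of the empty alternative matters. In this sketch the matched_text helper is a made-up name.

```ruby
# Hypothetical helper: the text of the first match, or nil.
def matched_text(pattern, text)
  m = pattern.match(text)
  m && m[0]
end

p matched_text(/moda|/, 'moda')   # prints "moda"
p matched_text(/|moda/, 'moda')   # prints "": the empty alternative is tried first
p matched_text(/moda|/, 'xyz')    # prints "": never nil, the empty string always matches
```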
