Posts Tagged 'alternation'

From Regular Expression to Finite Automaton

For each regular expression — and I mean the three operators and the six recursive rules style — there is a finite automaton that accepts exactly the same strings. Since this is not a university book in mathematics, I’ll show you an inductive reasoning about this and not a formal proof.

The hypothesis is thus that for an arbitrary regular expression p, we can create a finite automaton that has exactly one start state, no paths into the start state, no paths out from the acceptance state and that accepts exactly the same strings that are matched by p.

  • The empty string ε is a regular expression corresponding to a finite automaton with a start state, a path that accepts the empty string ε and leads from the start state to an acceptance state. We’ll call this an //ε-path//.
  • The empty set Ø is the set equivalent to a regular expression that can’t match any single string — not even the empty string ε. It is the same as a two-state automaton, with no single path. One state is start and the other one acceptance. But, they are not linked.
  • A regular expression that only matches the symbol b corresponds to a finite automaton with two states: start and acceptance. There’s a path from start t acceptance, and it only accepts the symbol b.

All three finite automata above have two states. One is start and the other one is acceptance. The difference is that the first one has an ε-path from start to acceptance, the second one has no path, and the third one has a b-path. Now we’ll continue. Imagine that we have two regular expressions p and q corresponding to finite automata s and t respectively.

  • Concatenation of two regular expressions p and q means that we first match a string with p, directly followed by a string that’s matched by q. To create this finite automaton we first add ε-paths from every acceptance state in s to the start state of t. Then we deprive all acceptance states in s their acceptance status and we’ll also withdraw the start status of the start state in t.
  • Alternation of two regular expressions p and q, i.e. p|q is like a finite automaton with a new start state that has ε-paths to all start staes of s and t. The new finite automaton also has a new acceptance state that is reached with ε-paths from all acceptance states of s and t. The start and acceptance states of s and t are thus not start and acceptance state in our new automaton.
  • Kleene star is the concatenation closure. Assume that p = q*. Then s is the finite automaton we get if we take t and add two states and four paths as follows: One new state is the start state and the other one is an acceptance state. All acceptance states of ´s´ loses that status in s, but instead gets an ε-path to the new acceptance state. We add two ε-paths from the new initial state — one to the old start state and one to the new acceptance state. In addition to that, we insert one ε-path from each of the old acceptance states to the old start state.

Look at the pictures above. Then take a deep breath and feel if you can translate an arbitrary regular expression to a finite automaton. Finally assess the last picture where the regular expression (w|bb)* is depicted as a graph using the method described above. Does it feel reasonable?

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Simple Regular Expression Examples

Now, we have three operators and a small framework. After all this theory, you might wonder if it’s possible for us to solve any problems. Yes, of course we can. Here are some examples:

All binary strings with no more than one zero:

'01101'.match /1*(0|)1*/ #=> #<MatchData "011">
'0111'.match /1*(0|)1*/ #=> #<MatchData "0111">
'1101'.match /1*(0|)1*/ #=> #<MatchData "1101">
'11010'.match /1*(0|)1*/ #=> #<MatchData "1101">

All binary strings with at least one pair of consecutive zeroes:

'101001'.match /(1|0)*00(1|0)*/ #=> #<MatchData "101001">
'10101'.match /(1|0)*00(1|0)*/ #=> nil
'1010100'.match /(1|0)*00(1|0)*/ #=> #<MatchData "1010100">

All binary strings that have no pair of consecutive zeros:

'1010100'.match /1*(011*)*(0|)/ #=> #<MatchData "101010">
'101001'.match /1*(011*)*(0|)/ #=> #<MatchData "1010">
'0010101'.match /1*(011*)*(0|)/ #=> #<MatchData "0">
'0110101'.match /1*(011*)*(0|)/ #=> #<MatchData "0110101">

All binary strings ending in 01:

'110101'.match /(0|1)*01/ #=> #<MatchData "110101">
'11010'.match /(0|1)*01/ #=> #<MatchData "1101">
'1'.match /(0|1)*01/ #=> nil
'01'.match /(0|1)*01/ #=> #<MatchData "01">

All binary strings not ending in 01:

'010'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "010">
'011'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "011">
''.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "">
'1'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "1">
'01'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "0">
'101'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "10">

All binary strings that have every pair of consecutive zeroes before every pair of consecutive ones:

'0110101'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0110101">
'00101100'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0010110">
'11001011'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "110">
'1100'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "110">
'0011'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0011">

See if you can find even better regular expressions that solve these problems. Remember that there’re an infinite number of synonyms to each regular expression.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regular Expression Precedence

You might be tempted to read the following regular expression as third or fifth row:

'fifth row'.match /third|fifth row/ #=> #<MatchData "fifth row">
'third row'.match /third|fifth row/ #=> #<MatchData "third">

But unfortunately, as you can see, it’s more like either third (only) or else fifth row. This is due to something called order of operations or operator precedence. The invisible operator for concatenation has higher precedence than the alternation operator |.

To oil these wheels, we now add parentheses to our three operators. In a regular expression, the sub expression enclosed in parentheses get the highest priority:

'fifth row'.match /(third|fifth) row/ #=> #<MatchData "fifth row">
'third row'.match /(third|fifth) row/ #=> #<MatchData "third row">

Note that the parentheses are meta-characters, not literals. They won’t match anything in the subject string. And of course it’s possible to nest parentheses:

'third row'.match /(third|(four|fif)th) row/ #=> #<MatchData "third row">
'fourth row'.match /(third|(four|fif)th) row/ #=> #<MatchData "fourth row">
'fifth row'.match /(third|(four|fif)th) row/ #=> #<MatchData "fifth row">

There are three things we need to remember, to know in what order and with what operands the regular expression engine will execute the operators:

  • Operator precedence is an ordered list that tells you if one operator should be executed before another operator in a regular expression. Several operators can have the same priority. In mathematics, the terms inside the parentheses have the highest priority. Multiplication and division have a lower priority. Addition and subtraction have the lowest. This is why 6+6/(2+1) = 8.
  • Operator position indicates where the operands are located in relation to the operator. The position can be prefix, infix, or postfix. If the operator is prefix, then the operand resides to the right of the operator, as the unary minus sign e.g. -3. An infix operator has an operand on each side, as in addition 1+2. A postfix operator stands to the right of its operand, as the exclamation point that represents the faculty operator in 5!.
  • Operator associativity tells us how to group two operators on the same precedence level. An infix operator can be right-associative, left-associative or non-associative. In mathematics, the infix operations addition and subtraction have the same precedence. Since both are left-associative the following equation holds: 1-2+3 = (1-2)+3 = 2. Prefix or postfix operators are either associative or non-associative. If they are associative, we start with the operator that is closest to the operand. An operator that is non-associative can’t compete with operators of same precedence.

Here goes the table for the operators we have studied so far. Later on, there’s a complete table of all regex operators.

Operator Symbol Precedence Position Associativity
Kleene star * 1 Postfix Associative
Concatenation N/A 2 Infix Left-associative
Alternation | 3 Infix Left-associative

If you think this is hard to remember, then try to memorize the mnemonic SCA. It stands for Star-Concat-Alter, i.e. the order of precedence in regular expressions.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regular Expression Alternation

From rule number 2 and rule number 3 we can define paradigms — a number of possible patterns. This means that we add two or more languages by applying the set operator union to them. The union of the sets {a, b} and {c, d} is {a, b, c, d}. Hence, it’s all the elements that are either in one or more of the sets. In boolean logic, we call this the inclusively or. In regular expressions, it is called alternation and is written with a vertical bar |. Here are some examples:

'a'.match /a|b/ #=> #<MatchData "a"> - a is either a or b
'ab'.match /a|b/ #=> #<MatchData "a"> - leftmost chosen
'ba'.match /a|b/ #=> #<MatchData "b"> - leftmost chosen
'c'.match /a|b/ #=> nil - here we found neither a nor b

Note that most regex engines selects the leftmost alternative. There are exceptions to this rule. A regex engine based on DFA or POSIX NFA selects the longest alternative. Most regex engines are basic NFA and select the leftmost.

Can you write a regular expression that matches all binary strings of length one? The binary alphabet is { 0, 1 }. Since there aren’t a huge number of binary strings of length one, you can pretty quickly list them: { 0, 1 }. The regular expression with alternation then becomes 0|1:

'0'.match /0|1/ #=> #<MatchData "0">
'1'.match /0|1/ #=> #<MatchData "1">
'2'.match /0|1/ #=> nil
'10'.match /0|1/ #=> #<MatchData "1">

There are four binary strings of length two — {00, 01, 10, 11}. We can capture them with 00|10|01|11:

'10'.match /00|10|01|11/ #=> #<MatchData "10">
'01'.match /00|10|01|11/ #=> #<MatchData "01">
'12'.match /00|10|01|11/ #=> nil
'11'.match /00|10|01|11/ #=> #<MatchData "11">
'1210'.match /00|10|01|11/ #=> #<MatchData "10">

Maybe you didn’t notice, but we used concatenation in the regular expression above (can you see the invisible concatenation symbol between the two binary digits in the regular expression; if not, maybe you should make an appointment with an optometrist; or maybe not; not even an optometrist can help you see invisible symbols). Each of the binary strings of length two are made up of two concatenated binary strings of length one. Since concatenation has higher precedence than alternation, we didn’t need any parentheses.

Alternation is commutative: for two regular expressions p and q it holds that p|q = q|p. It is also associative: p|(q|r) = (p|q)|r. An interesting and very useful fact is that concatenation distributes over alternation. This means that for all regular expressions p, q and r it’s true that p(q|r) = pq|pr and (p|q)r = pq|pr. The consequence of that is that (0|1)(0|1) = (0|1)0|(0|1)1 = 00|10|01|11. So another way to match any binary strings of length two is:

'10'.match /(0|1)(0|1)/ #=> #<MatchData "10">
'01'.match /(0|1)(0|1)/ #=> #<MatchData "01">
'12'.match /(0|1)(0|1)/ #=> nil
'11'.match /(0|1)(0|1)/ #=> #<MatchData "11">
'1210'.match /(0|1)(0|1)/ #=> #<MatchData "10">

The brackets were needed, of course, because concatenation has higher precedence than alternation. We can also have the empty string ε as one of our alternatives:

'moda'.match /moda|/ #=> #<MatchData "moda"> - either moda or nothing is moda
'moda'.match /mado|/ #=> #<MatchData ""> - either mado or nothing is nothing

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC



Follow

Get every new post delivered to your Inbox.