From Regular Expression to Finite Automaton

For each regular expression — and I mean the three operators and the six recursive rules style — there is a finite automaton that accepts exactly the same strings. Since this is not a university book in mathematics, I’ll show you an inductive reasoning about this and not a formal proof.

The hypothesis is thus that for an arbitrary regular expression p, we can create a finite automaton that has exactly one start state, no paths into the start state, no paths out from the acceptance state and that accepts exactly the same strings that are matched by p.

  • The empty string ε is a regular expression corresponding to a finite automaton with a start state, a path that accepts the empty string ε and leads from the start state to an acceptance state. We’ll call this an //ε-path//.
  • The empty set Ø is the set equivalent to a regular expression that can’t match any single string — not even the empty string ε. It is the same as a two-state automaton, with no single path. One state is start and the other one acceptance. But, they are not linked.
  • A regular expression that only matches the symbol b corresponds to a finite automaton with two states: start and acceptance. There’s a path from start t acceptance, and it only accepts the symbol b.

All three finite automata above have two states. One is start and the other one is acceptance. The difference is that the first one has an ε-path from start to acceptance, the second one has no path, and the third one has a b-path. Now we’ll continue. Imagine that we have two regular expressions p and q corresponding to finite automata s and t respectively.

  • Concatenation of two regular expressions p and q means that we first match a string with p, directly followed by a string that’s matched by q. To create this finite automaton we first add ε-paths from every acceptance state in s to the start state of t. Then we deprive all acceptance states in s their acceptance status and we’ll also withdraw the start status of the start state in t.
  • Alternation of two regular expressions p and q, i.e. p|q is like a finite automaton with a new start state that has ε-paths to all start staes of s and t. The new finite automaton also has a new acceptance state that is reached with ε-paths from all acceptance states of s and t. The start and acceptance states of s and t are thus not start and acceptance state in our new automaton.
  • Kleene star is the concatenation closure. Assume that p = q*. Then s is the finite automaton we get if we take t and add two states and four paths as follows: One new state is the start state and the other one is an acceptance state. All acceptance states of ´s´ loses that status in s, but instead gets an ε-path to the new acceptance state. We add two ε-paths from the new initial state — one to the old start state and one to the new acceptance state. In addition to that, we insert one ε-path from each of the old acceptance states to the old start state.

Look at the pictures above. Then take a deep breath and feel if you can translate an arbitrary regular expression to a finite automaton. Finally assess the last picture where the regular expression (w|bb)* is depicted as a graph using the method described above. Does it feel reasonable?

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Simple Regular Expression Examples

Now, we have three operators and a small framework. After all this theory, you might wonder if it’s possible for us to solve any problems. Yes, of course we can. Here are some examples:

All binary strings with no more than one zero:

'01101'.match /1*(0|)1*/ #=> #<MatchData "011">
'0111'.match /1*(0|)1*/ #=> #<MatchData "0111">
'1101'.match /1*(0|)1*/ #=> #<MatchData "1101">
'11010'.match /1*(0|)1*/ #=> #<MatchData "1101">

All binary strings with at least one pair of consecutive zeroes:

'101001'.match /(1|0)*00(1|0)*/ #=> #<MatchData "101001">
'10101'.match /(1|0)*00(1|0)*/ #=> nil
'1010100'.match /(1|0)*00(1|0)*/ #=> #<MatchData "1010100">

All binary strings that have no pair of consecutive zeros:

'1010100'.match /1*(011*)*(0|)/ #=> #<MatchData "101010">
'101001'.match /1*(011*)*(0|)/ #=> #<MatchData "1010">
'0010101'.match /1*(011*)*(0|)/ #=> #<MatchData "0">
'0110101'.match /1*(011*)*(0|)/ #=> #<MatchData "0110101">

All binary strings ending in 01:

'110101'.match /(0|1)*01/ #=> #<MatchData "110101">
'11010'.match /(0|1)*01/ #=> #<MatchData "1101">
'1'.match /(0|1)*01/ #=> nil
'01'.match /(0|1)*01/ #=> #<MatchData "01">

All binary strings not ending in 01:

'010'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "010">
'011'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "011">
''.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "">
'1'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "1">
'01'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "0">
'101'.match /(0|1)*(0|11)|1|0|/ #=> #<MatchData "10">

All binary strings that have every pair of consecutive zeroes before every pair of consecutive ones:

'0110101'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0110101">
'00101100'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0010110">
'11001011'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "110">
'1100'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "110">
'0011'.match /0*(100*)*1*(011*)*(0|)/ #=> #<MatchData "0011">

See if you can find even better regular expressions that solve these problems. Remember that there’re an infinite number of synonyms to each regular expression.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regular Expression Precedence

You might be tempted to read the following regular expression as third or fifth row:

'fifth row'.match /third|fifth row/ #=> #<MatchData "fifth row">
'third row'.match /third|fifth row/ #=> #<MatchData "third">

But unfortunately, as you can see, it’s more like either third (only) or else fifth row. This is due to something called order of operations or operator precedence. The invisible operator for concatenation has higher precedence than the alternation operator |.

To oil these wheels, we now add parentheses to our three operators. In a regular expression, the sub expression enclosed in parentheses get the highest priority:

'fifth row'.match /(third|fifth) row/ #=> #<MatchData "fifth row">
'third row'.match /(third|fifth) row/ #=> #<MatchData "third row">

Note that the parentheses are meta-characters, not literals. They won’t match anything in the subject string. And of course it’s possible to nest parentheses:

'third row'.match /(third|(four|fif)th) row/ #=> #<MatchData "third row">
'fourth row'.match /(third|(four|fif)th) row/ #=> #<MatchData "fourth row">
'fifth row'.match /(third|(four|fif)th) row/ #=> #<MatchData "fifth row">

There are three things we need to remember, to know in what order and with what operands the regular expression engine will execute the operators:

  • Operator precedence is an ordered list that tells you if one operator should be executed before another operator in a regular expression. Several operators can have the same priority. In mathematics, the terms inside the parentheses have the highest priority. Multiplication and division have a lower priority. Addition and subtraction have the lowest. This is why 6+6/(2+1) = 8.
  • Operator position indicates where the operands are located in relation to the operator. The position can be prefix, infix, or postfix. If the operator is prefix, then the operand resides to the right of the operator, as the unary minus sign e.g. -3. An infix operator has an operand on each side, as in addition 1+2. A postfix operator stands to the right of its operand, as the exclamation point that represents the faculty operator in 5!.
  • Operator associativity tells us how to group two operators on the same precedence level. An infix operator can be right-associative, left-associative or non-associative. In mathematics, the infix operations addition and subtraction have the same precedence. Since both are left-associative the following equation holds: 1-2+3 = (1-2)+3 = 2. Prefix or postfix operators are either associative or non-associative. If they are associative, we start with the operator that is closest to the operand. An operator that is non-associative can’t compete with operators of same precedence.

Here goes the table for the operators we have studied so far. Later on, there’s a complete table of all regex operators.

Operator Symbol Precedence Position Associativity
Kleene star * 1 Postfix Associative
Concatenation N/A 2 Infix Left-associative
Alternation | 3 Infix Left-associative

If you think this is hard to remember, then try to memorize the mnemonic SCA. It stands for Star-Concat-Alter, i.e. the order of precedence in regular expressions.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

The Kleene Star operator

All finite languages ​​can be described by regular expressions. You can simply list the strings as an alternation string1|string2|string3 etc. But there are also some languages with an infinite number of strings that can be described by regular expressions. To achieve that, we use an operator we call Kleene star after the American mathematician Stephen Cole Kleene. If p is a regular expression, then p* is the smallest superset of p that contains ε (the empty string) and is closed under the concatenation operation. It is the set of finite length strings that can be created by concatenating strings that match the expression p. If p can match any other string than ε, then p* will match an infinite number of possible strings.

The real name of the typographic symbol (glyph) used to denote the Kleene Star is asterisk. It’s a Greek (not geek) word and it logically enough means little star. Normally, it has five or six arms, but in its original purpose — to describe birth dates in a family tree — it had seven arms. This is a very popular symbol. In Unicode, there are a score of different variants and many communities have chosen their own meaning of the asterisk. In typography it means a footnote. In musical notation, it may mean that the sustain pedal of the piano should be lifted. On our cell phones, we use the asterisk to navigate menus in touch-tone systems. So there is no world-wide interpretation of *. However, in this book it’ll always mean the operator Kleene star.

Do you want to see a simple example? The concatenation closure of one single symbol, such as a, is a* = { ε, a, aa, aaa,... }. Want to see a more academic example? The concatenation closure of the set consisting solely of the empty string ε is — well, the set consisting solely of the empty string ε* = ε. Want to see a more complicated example? (1|0)* = { ε, 0, 1, 01, 10, 001, 010, 011,... }. It may thus be different matches of the expression that concatenates. Can you write a regular expression that matches all binary strings that contain at least one zero? Or all binary strings with an even number of ones?

The operator Kleene star — pronounced /ˈkleɪniː stɑ:r/ klay-nee star — is unary, i.e. it only takes one operand. The operand is the regular expression to the left, which allows us to say that it’s a postfix operator. It has the highest priority of all operators and it is associative. The latter means that if two operators of the same priority are competing, then the operator closest to the operand wins. Since p** = p*, we say that the Kleene star is idempotent. And I want to emphasize again that p* = (p|)*, i.e. the empty string ε is always present in a closure. We’ll later on see that there is a very common — and none the less disastrous — mistake to forget this very fact.

Here are some possible answers to two the questions above:

'110'.match /(0|1)*0(0|1)*/ #=> #<MatchData "110"> - all strings with at least oe zero
'1111'.match /(0|1)*0(0|1)*/ #=> nil
'1001'.match /((10*1)|0*)*/ #=> #<MatchData "1001"> - all strings with an even number of ones
'11001'.match /((10*1)|0*)*/ #=> #<MatchData "1100">
''.match /((10*1)|0*)*/ #=> #<MatchData ""> - even the empty string has an even number of ones
'1001'.match /((10*1)|0)|/ #=> #<MatchData "1001"> - another way to express an even number of ones
'11001'.match /((10*1)|0)|/ #=> #<MatchData "11">
''.match /((10*1)|0)|/ #=> #<MatchData "">
'1'.match /((10*1)|0)|/ #=> #<MatchData "">
'010'.match /((10*1)|0)|/ #=> #<MatchData "0">
'01'.match /((10*1)|0)|/ #=> #<MatchData "0">

The positive closure operator + and the at least n operator {n,} are abstractions for expressing infinite concatenation. We’ll soon explore them in more detail.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regular Expression Alternation

From rule number 2 and rule number 3 we can define paradigms — a number of possible patterns. This means that we add two or more languages by applying the set operator union to them. The union of the sets {a, b} and {c, d} is {a, b, c, d}. Hence, it’s all the elements that are either in one or more of the sets. In boolean logic, we call this the inclusively or. In regular expressions, it is called alternation and is written with a vertical bar |. Here are some examples:

'a'.match /a|b/ #=> #<MatchData "a"> - a is either a or b
'ab'.match /a|b/ #=> #<MatchData "a"> - leftmost chosen
'ba'.match /a|b/ #=> #<MatchData "b"> - leftmost chosen
'c'.match /a|b/ #=> nil - here we found neither a nor b

Note that most regex engines selects the leftmost alternative. There are exceptions to this rule. A regex engine based on DFA or POSIX NFA selects the longest alternative. Most regex engines are basic NFA and select the leftmost.

Can you write a regular expression that matches all binary strings of length one? The binary alphabet is { 0, 1 }. Since there aren’t a huge number of binary strings of length one, you can pretty quickly list them: { 0, 1 }. The regular expression with alternation then becomes 0|1:

'0'.match /0|1/ #=> #<MatchData "0">
'1'.match /0|1/ #=> #<MatchData "1">
'2'.match /0|1/ #=> nil
'10'.match /0|1/ #=> #<MatchData "1">

There are four binary strings of length two — {00, 01, 10, 11}. We can capture them with 00|10|01|11:

'10'.match /00|10|01|11/ #=> #<MatchData "10">
'01'.match /00|10|01|11/ #=> #<MatchData "01">
'12'.match /00|10|01|11/ #=> nil
'11'.match /00|10|01|11/ #=> #<MatchData "11">
'1210'.match /00|10|01|11/ #=> #<MatchData "10">

Maybe you didn’t notice, but we used concatenation in the regular expression above (can you see the invisible concatenation symbol between the two binary digits in the regular expression; if not, maybe you should make an appointment with an optometrist; or maybe not; not even an optometrist can help you see invisible symbols). Each of the binary strings of length two are made up of two concatenated binary strings of length one. Since concatenation has higher precedence than alternation, we didn’t need any parentheses.

Alternation is commutative: for two regular expressions p and q it holds that p|q = q|p. It is also associative: p|(q|r) = (p|q)|r. An interesting and very useful fact is that concatenation distributes over alternation. This means that for all regular expressions p, q and r it’s true that p(q|r) = pq|pr and (p|q)r = pq|pr. The consequence of that is that (0|1)(0|1) = (0|1)0|(0|1)1 = 00|10|01|11. So another way to match any binary strings of length two is:

'10'.match /(0|1)(0|1)/ #=> #<MatchData "10">
'01'.match /(0|1)(0|1)/ #=> #<MatchData "01">
'12'.match /(0|1)(0|1)/ #=> nil
'11'.match /(0|1)(0|1)/ #=> #<MatchData "11">
'1210'.match /(0|1)(0|1)/ #=> #<MatchData "10">

The brackets were needed, of course, because concatenation has higher precedence than alternation. We can also have the empty string ε as one of our alternatives:

'moda'.match /moda|/ #=> #<MatchData "moda"> - either moda or nothing is moda
'moda'.match /mado|/ #=> #<MatchData ""> - either mado or nothing is nothing

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regular Expression Concatenation

Using Rule number 2 and Rule number 4, we can create regular expressions that consists of any sequence of symbols from our alphabet. Rule number 2 said that if the symbol a is in the alphabet, then a is a regular expression. Rule number 4 said that if p and q are two regular expressions, then the concatenation pq is a regular expression as well. The concatenation symbol itself is invisible. Just write the two regular expressions right after each other:

'moda'[/m/] #=> "m" – we found the substring s in the string"moda"
'moda'[/o/] #=> "o"
'moda'[/mo/] #=> "mo" - /mo/ is /m/ concatenated with /o/
'moda'[/da/] #=> "da"
'moda'[/moda/] #=> "moda" - /moda/ is /mo/ concatenated with /da/
'moda'[/mado/] #=> nil – no match, since the order was changed

There are some handy terms we usually use for parts of strings:

  • Prefix: A prefix is the substring we have left if we remove zero or more symbols from the end of a string. The strings m, mo, mod, and moda are all prefixes of the string moda. Even the empty string ε is a prefix moda.
  • Suffix: The suffix is the substring that is left if we remove zero or more symbol from the beginning of the string. The strings moda, oda, da, a, and ε are all suffixes of the string moda.
  • Substring: A substring is what we have left if we remove a prefix and a suffix from a string. Note that the prefix and/or the suffix can be ε. Substrings must still be consecutive in the original string. The strings od and moda, but not mda, are substrings of moda.

For any regular expression p, it’s true that εp = pε = p, thus we say that the empty string ε is the identity under concatenation. There is no annihilator under concatenation, i.e., there’s no regular expression 0 so that for any regular expression p it holds that 0p = p0 = 0. Concatenation is not commutative, since pq is not equal to qp, but it’s associative since for any regular expressions p and q it’s true that p(qr) = (pq)r.

If we think of concatenation as a product, then regular expressions also support exponentiation. We write the exponent enclosed in braces to the right of the regular expression:

'aaa'[/aaa/] #=> "aaa"
'aaa'[/a{3}/] #=> "aaa" – yes, the string includes 3 concatenated a
'aaa'[/a{4}/] #=> nil – no, the string doesn't include 4 a

This is obviously just syntactic sugar. All regular expressions that we can write using the exponential operator, can also be unfolded. There are more shortcuts for finite repeated concatenations:

'aa'[/a?/] #=> "a" – the optional operator written as question mark
'b'[/a?/] #=> "" – zero repeats of a matches the empty string
'aa'[/a{,2}/] #=> "aa" – at least two a
'aa'[/a{1,2}/] #=> "aa" – at least one a and at moust two a
'a'[/a{1,2}/] #=> "a"

We will soon see that the concatenation of two regular expressions are not the same as the concatenation of two strings. Remember that a regular expression corresponds to a set of strings. For example, if p = {a, b} and q = {c, d}, then pq = {ac, ad, bc, bd}

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Four more rules for Regular Expressions

Now that we have two basic base rules, we would like to add four more. Then we can build regular expressions recursively from small regular expressions. The first three of these rules describe the only three necessary operators in regular expressions. The fourth rule deals with parentheses:

  1. Alternation: If p and q are regular expressions, then will also p|q be a regular expression. The expression p|q matches the union of the strings matched by p and q. Think of it as: either p or q.
  2. Concatenation: If p and q are two regular expressions, then will also pq be a regular expression. Note that the symbol for concatenation is invisible. Some literature use + for concatenation, as in:p+q. The expression pq denotes a language consisting of all strings with a prefix matched by p, directly followed by a suffix matched by q.
  3. Closure: If p is a regular expression, then will also p* be a regular expression. This is the closure of concatenation with the expression itself. The expression p* match all strings that Completely can be divided into zero or more substrings, each of which is matched by p.
  4. Parentheses: If p is a regular expression, then will also (p) be a regular expression. This means that we can enclose an expression in parentheses without altering its meaning.

In addition to these rules we will shortly put some convenience rules for operator precedence. They are not necessary, but allow us to write shorter and more readable regular expressions. Quite soon you will also see real regular expression examples with these three operators. These six rules are sufficient for writing all possible regular expressions. All other regex operators you’ve seen are just abstractions and syntactic sugar. Or?

Do you remember George Bernard Shaw’s “The golden rule is that there are no golden rules.” and Mark Twain’s “It is a good idea to obey all the rules when you’re young just so you’ll have the strength to break them when you’re old.”? This is exactly how we should do. For now, these rules are all we need. However, in modern regex engines, there are functions like a back reference, and lookarounds. To implement these, we need more rules. But until we’re there, it will be very useful to think of regular expressions as a system consisting of only these six rules.

Regular expression is thus a mathematical theory and modern regex engines are based on a super set of that theory. With the help of the theory, we can prove the following:

  • For each regular expression, we can construct at least one deterministic finite automaton and at least one nondeterministic finite automaton, so that all three solves the same problem.
  • For every finite automaton — deterministic as well as nondeterministic — we can write a regular expression, so that both solve the same problem.

Solving a problem here means to determine whether a string is part of a language. The proofs mentioned above are not reproduced here, but they are easily accessed as they exist in every textbook on automata theory. The beauty of this analogy between regular expressions and finite automata is that I can explain several key features of regular expressions for you, with the help of graphs of finite automata. And as a matter of fact, a regex engine is really just a compiler that translates our hand-written regular expressions to computer friendly finite automata, or possibly the more advanced pushdown automata.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Alphabet, string, language…

An alphabet is a finite, nonempty set of symbols. Here are some examples of alphabets:

  • The binary alphabet {0, 1} contains only two symbols, namely 1 and 0.
  • The American Standard Code for Information Interchange, abbreviated ASCII, contains 128 symbols. Only 95 of these are printable. The others, such as Backspace (ASCII 8), are also symbols in our definition.
  • The set of the 95 printable symbols in ASCII is also an alphabet.
  • Unicode contains over 100,000 symbols – Arabic, Chinese, Latin and many other types. But even if it’s a very large number, it is still a finite number of symbols. Unicode is therefore an alphabet.
  • The set of the Cyrillic symbols in Unicode is an alphabet.
  • We can construct an alphabet consisting of a blue and a white dot. That is an alphabet consisting of two symbols, just like the binary alphabet.

Did you by any chance know that the word alphabet comes from alpha and beta, the first two letters of the 3000 year old Phoenician alphabet, and that alpha meant ox and beta meant house? However, the order of symbols does not make an alphabet unique. Thus, {a, b} is the same alphabet as {b, a}. The empty set — the set composed of zero symbols — is by definition not an alphabet.

A string is a finite (ordered) sequence of symbols chosen from an alphabet. Here are some examples of strings:

  • 0110 and 1001 are two strings from the binary alphabet.
  • The strings kadikoy are uskudar are constructed with symbols from the ASCII alphabet.
  • футбол is a string from the Cyrillic alphabet.
  • اللوز is a string from the Arabic alphabet.
  • From any alphabet, it’s possible to create an empty string, i.e. a string consisting of zero symbols. By convention, we denote that string with the character ε.

The very same symbol can occur multiple times in a string. The order is significant; gof is not the same string as fog.

A language is a countable set of strings from a fixed alphabet. The restriction is that the alphabet is finite. The language may well include an infinite number of strings. Here are some examples of languages:

  • All binary numbers ending with 0. It includes 10, 100, 110, etc.
  • All palindromes — strings that are the same forwards and backwards – constructed from the symbols of ASCII.
  • All strings with an even number of symbols from Unicode.
  • The finite set {dog, cat, bird}.
  • The empty set — the language with no strings. This alphabet is by convention denoted Ø.
  • The language {ε} consisting solely of the empty string ε.

Our focus here is on regular expressions. A regular expression is an algebraic description of an entire language. All languages can’t be described by regular expressions. No regular expression can for example describe the language consisting of all palindromes created with symbols from ASCII. I will try to convince you that regular expressions are ridiculously simple and amazingly powerful. Regular expressions are based on two basic rules:

  1. The empty string ε is a regular expression. It describes a language that has one single string, namely the empty string.
  2. If a symbol a is a member of an alphabet, then a is a regular expression. It describes a language that has the string a as its only member.

That was easy, wasn’t it? In addition to the two basic rules, we have three operators that we’ll catch in four more rules. Out of pure laziness, we have conventions for precedence between the three operators, but more on that in an upcoming blog post. First, you should try the two basic rules in IRB. We write regular expressions between two dashes. You can e.g. write the regulars expression a as /a/. IRB can help us to figure out if a string is in a language described by a particular regular expression:

'a'[/a/] #=> "a" – the string a is matched by the regular expression /a/
''[/a/] #=> nil – the empty string is not matched by the regular expression /a/
'b'[/a/] #=> nil – the string b is not matched by the regular expression /a/
'a'[//] #=> "" – the string a is not matched by the empty regular expression / /

Only when IRB returns the entire string, it’s matched by the regular expression.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

The forward tracking quantifier modifier

Because of backtracking, we might sometimes match more than we hoped for. The task below is to catch all the div tags in an HTML document and put them in a vector. Our naïve solution gives the wrong answer:

  • '<div>a</div><span>c</span><div>b</div>'.scan /<div>.*<\/div>/ #=> ["<div>a</div><span>c</span><div>b</div>"]

Quantifiers are greedy in regex. They devour as much as they can. The dot in the idiom .* matches anything except newline (remember Barbapapa?). And the asterisk means that this anything is repeated as many times as possible. Since the string ends with a “</div>” the entire substring “a</div><span>c</span><div>b” will be consumed by .*.

There are many stories about greedy people who claim more than they need. As Louis Blanc wrote already 1840 in The Organization of Work: "From each according to his abilities, to each according to his needs." We don't think that the asterisk and the dot in the above example needs to consume more than the substring "<div>a</div>". Alexander Pushkin describes in The Tale of the Fisherman and the Fish how a magic fish promises to fulfill any of the fisherman's whishes. The fisherman's wife starts asking the fish for bigger and better things -- and she gets them -- until she eventually wants to become Ruler of the Sea. Then the magic fish takes back everything he gave the fisherman's wife.

As a complement to backtracking, regex also provides forward tracking. The quantifier tries to take as little as it can and then holds to see if the whole expression can be matched. If the entire expression can't be matched, then the quantifier will capture one more letter and holds to see if it's now possible to match everything. The method is repeated until either we can match everything or we can confirm that there's no possible way to create a match. Does it matter if we use forward tracking instead of backtracking? Well, the strings that can be matched with backtracking can also be matched with forward tracking and vice versa. But, when there are multiple possible matches, forward tracking will sometimes choose a different match than backtracking does.

There is no special symbol for forward tracking quantifiers. Instead, there's modifier symbol -- written as a question mark -- that can be used right after any quantifier. While * means repeat as many times as possible, *? means repeat as few times as possible. Similarly, we can modify all the other quantifiers:

  • At least one, as few as possible: +?
  • Zero or one, preferably zero: ??
  • Between 3 and 5, as few as possible: {3.5}?

Note that the question mark that modifies quantifiers isn't the same question mark that's used as the conditional quantifier. They can even be used in conjunction, as ?? shows above.

Here are some examples. The first one is the solution to the div tag problem:

  • '<div>a</div><span>c</span><div>b</div>'.scan /<div>.*?<\/div>/ #=> ["<div>a</div>", "<div>b</div>"]
  • 'aa'[/a?/] #=> "a"
  • 'aa'[/a??/] #=> ""
  • 'aaaaa'[/a{2,4}/] #=> "aaaa"
  • 'aaaaa'[/a{2,4}?/] #=> "aa"
  • 'aaaaa'[/a{2,}/] #=> "aaaaa"
  • 'aaaaa'[/a{2,}?/] #=> "aa"
  • 'aaaaa'[/a{,4}/] #=> "aaaa"
  • 'aaaaa'[/a{,4}?/] #=> ""

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regex memorizing — here’s the pushdown automaton

Pushdown automaton that matches all strings with the same number of white and blue dots.

Pushdown automaton that matches all strings with
the same number of white and blue dots.

Formal regular expressions can be described by a finite automaton, but modern regex engines support un-regular operators. The problem with finite automata is that they don’t have any memory. Once they are in a state, they have no idea, how they got there. Wise guys invented the stack. When we add a stack to a finite automaton it becomes very powerful, but also quite complex. And of course, it’s not a finite automaton anymore – it’s a pushdown automaton. Why do I then describe the regex engines as finite automata when pushdown automata is a superset of finite automata?

The finite automaton reads a tape serially. Every new symbol read from the tape, initiates a transition in the automaton from the current state to a new state (the new state may be the same state as the current state). If we are in an accept state when we have read all the tape, then we have a match. In pushdown automata, we add a stack. For every iteration, the pushdown automaton may read a symbol from the tape, pop a symbol from the stack, or both. Depending on the current state, the symbol read from the tape and/or the symbol popped from the stack, the pushdown automaton can now initiate a state transition, push a symbol to the stack, or both. If we end up in a accept state when all the tape is read and the stack is at that time empty then we have a match.

The figure above shows a pushdown automaton that matches any input with the same number of blue and white dots – something that is impossible to describe with a finite automaton. This automaton has two states: a start state and an accept state. The four icons in the middle represents that a dot is popped or pushed from the stack. Suppose that the input is blue-white-white-blue:

  1. Read blue dot and make a transition from the start state, through the third stack icon from the top – i.e. a blue dot is pushed to the stack – and back to the start state.
  2. Read white dot and make a transition from the start state, through the first stack icon – i.e. a blue dot is popped from the stack – and back to the start state. Now the stack is empty.
  3. Read white dot and make a transition from the start state, through the second stack icon from the top – i.e. a white dot is pushed to the stack – and back to the start state.
  4. Read blue dot and make a transition from the start state, through the fourth stack icon – i.e. a white dot is popped from the stack – and to the accept state. Now the stack is empty again and all the input is read. It’s a match.

You can probably feel that the pushdown automaton gives us a tool for all kind of recursive implementations. Now that you understand how, I’ll go back to finite automata in all explanations where possible. It’s important for you to understand e.g. what backtracking and greediness does to performance and what input that will be matched. You don’t need a complicated pushdown automaton to learn that. We’ll stick to finite automaton in this book and at the same time remember that many operators in modern regex engines rely on a stack.

By the way: Did you notice the corner case bug in the pushdown automaton above? Doesn’t an empty input string have the same number of blue and white dots?

Next Page »



Follow

Get every new post delivered to your Inbox.