Archive for the 'finite automata' Category

From Regular Expression to Finite Automaton

For each regular expression (and I mean the style with three operators and six recursive rules) there is a finite automaton that accepts exactly the same strings. Since this is not a university textbook in mathematics, I’ll show you an inductive argument for this rather than a formal proof.

The hypothesis is thus that for an arbitrary regular expression p, we can create a finite automaton that has exactly one start state, no paths into the start state, no paths out from the acceptance state and that accepts exactly the same strings that are matched by p.

  • The empty string ε is a regular expression corresponding to a finite automaton with a start state and a path that accepts the empty string ε and leads from the start state to an acceptance state. We’ll call this an ε-path.
  • The empty set Ø corresponds to a regular expression that can’t match any string at all, not even the empty string ε. It is the same as a two-state automaton with no paths. One state is the start state and the other is the acceptance state, but they are not linked.
  • A regular expression that only matches the symbol b corresponds to a finite automaton with two states: start and acceptance. There’s a path from start to acceptance, and it only accepts the symbol b.

All three finite automata above have two states. One is start and the other one is acceptance. The difference is that the first one has an ε-path from start to acceptance, the second one has no path, and the third one has a b-path. Now we’ll continue. Imagine that we have two regular expressions p and q corresponding to finite automata s and t respectively.

  • Concatenation of two regular expressions p and q means that we first match a string with p, directly followed by a string that’s matched by q. To create this finite automaton we first add ε-paths from every acceptance state in s to the start state of t. Then we strip all acceptance states in s of their acceptance status, and we also withdraw the start status of the start state in t.
  • Alternation of two regular expressions p and q, i.e. p|q, corresponds to a finite automaton with a new start state that has ε-paths to the start states of s and t. The new finite automaton also has a new acceptance state that is reached by ε-paths from all acceptance states of s and t. The start and acceptance states of s and t are thus no longer start and acceptance states in the new automaton.
  • Kleene star is the closure of concatenation. Assume that p = q*. Then s is the finite automaton we get if we take t and add two states and four paths (when t has a single acceptance state) as follows: one new state is the start state and the other is the acceptance state. Every acceptance state of t loses that status in s, but instead gets an ε-path to the new acceptance state. We add two ε-paths from the new start state: one to the old start state and one to the new acceptance state. In addition, we insert one ε-path from each of the old acceptance states to the old start state.

Look at the pictures above. Then take a deep breath and ask yourself whether you could translate an arbitrary regular expression to a finite automaton. Finally, assess the last picture, where the regular expression (w|bb)* is depicted as a graph using the method described above. Does it feel reasonable?
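If you prefer code to pictures, the construction steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the class name NFA, the helper names (symbol, concat, alternate, star, accepts) and the use of None as the ε label are all invented for this sketch and do not come from any particular regex engine.

```python
from itertools import count

EPS = None            # label used for ε-paths (an assumption of this sketch)
_ids = count()        # source of fresh state numbers

class NFA:
    """One start state, a set of acceptance states, and labelled paths."""
    def __init__(self, start, accepting, paths):
        self.start = start
        self.accepting = set(accepting)
        self.paths = list(paths)          # (from_state, label, to_state)

def base(label, linked=True):
    # The three base cases share one shape: two states, at most one path.
    start, accept = next(_ids), next(_ids)
    return NFA(start, {accept}, [(start, label, accept)] if linked else [])

def epsilon():   return base(EPS)                 # ε-path
def empty_set(): return base(EPS, linked=False)   # no path at all
def symbol(b):   return base(b)                   # b-path

def concat(s, t):
    # ε-paths from every acceptance state of s to the start state of t.
    links = [(a, EPS, t.start) for a in s.accepting]
    return NFA(s.start, t.accepting, s.paths + t.paths + links)

def alternate(s, t):
    start, accept = next(_ids), next(_ids)
    links = [(start, EPS, s.start), (start, EPS, t.start)]
    links += [(a, EPS, accept) for a in s.accepting | t.accepting]
    return NFA(start, {accept}, s.paths + t.paths + links)

def star(t):
    start, accept = next(_ids), next(_ids)
    links = [(start, EPS, t.start), (start, EPS, accept)]
    for a in t.accepting:
        links += [(a, EPS, accept), (a, EPS, t.start)]
    return NFA(start, {accept}, t.paths + links)

def closure(nfa, states):
    """All states reachable from `states` along ε-paths only."""
    seen, todo = set(states), list(states)
    while todo:
        state = todo.pop()
        for src, label, dst in nfa.paths:
            if src == state and label is EPS and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen

def accepts(nfa, text):
    """Follow every possible path at once, one input symbol at a time."""
    current = closure(nfa, {nfa.start})
    for ch in text:
        moved = {dst for src, label, dst in nfa.paths
                 if src in current and label == ch}
        current = closure(nfa, moved)
    return bool(current & nfa.accepting)

# The regular expression (w|bb)* built with the rules above.
machine = star(alternate(symbol("w"), concat(symbol("b"), symbol("b"))))
```

With these definitions, accepts(machine, "wbbw") holds while accepts(machine, "bwb") does not, which is what we expect from (w|bb)*. Note that the simulation tracks a whole set of states at once, since ε-paths let the automaton be in several states simultaneously.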

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Four more rules for Regular Expressions

Now that we have two base rules, we would like to add four more. Then we can build regular expressions recursively from smaller regular expressions. The first three of these rules describe the only three necessary operators in regular expressions. The fourth rule deals with parentheses:

  1. Alternation: If p and q are regular expressions, then p|q is also a regular expression. The expression p|q matches the union of the strings matched by p and q. Think of it as: either p or q.
  2. Concatenation: If p and q are two regular expressions, then pq is also a regular expression. Note that the symbol for concatenation is invisible. Some literature uses + for concatenation, as in p+q (beware that many textbooks use + for alternation instead). The expression pq denotes a language consisting of all strings with a prefix matched by p, directly followed by a suffix matched by q.
  3. Closure: If p is a regular expression, then p* is also a regular expression. This is the closure of concatenation of the expression with itself. The expression p* matches every string that can be divided completely into zero or more substrings, each of which is matched by p.
  4. Parentheses: If p is a regular expression, then (p) is also a regular expression. This means that we can enclose an expression in parentheses without altering its meaning.
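To make the recursion tangible, here is a small Python sketch that represents each rule as its own class and matches strings by walking that structure directly. Everything in it is an assumption made for illustration: the class names are invented, and I take the two base rules to be the empty string and a single symbol. It is a plain recursive matcher, not how a real regex engine works.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty:          # base rule (assumed here): the empty string ε
    pass

@dataclass(frozen=True)
class Sym:            # base rule (assumed here): one single symbol
    char: str

@dataclass(frozen=True)
class Alt:            # rule 1: p|q
    p: object
    q: object

@dataclass(frozen=True)
class Cat:            # rule 2: pq
    p: object
    q: object

@dataclass(frozen=True)
class Star:           # rule 3: p*
    p: object

@dataclass(frozen=True)
class Group:          # rule 4: (p)
    p: object

def ends(r, text, i):
    """Return every index j such that r matches text[i:j]."""
    if isinstance(r, Empty):
        return {i}
    if isinstance(r, Sym):
        return {i + 1} if text[i:i + 1] == r.char else set()
    if isinstance(r, Alt):
        return ends(r.p, text, i) | ends(r.q, text, i)
    if isinstance(r, Cat):
        return {k for j in ends(r.p, text, i) for k in ends(r.q, text, j)}
    if isinstance(r, Star):
        # Zero repetitions always match; keep extending until nothing new appears.
        reached, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(r.p, text, j)} - reached
            reached |= frontier
        return reached
    if isinstance(r, Group):
        return ends(r.p, text, i)     # parentheses do not alter the meaning
    raise TypeError(f"not a regular expression: {r!r}")

def matches(r, text):
    """True if r matches the whole of text."""
    return len(text) in ends(r, text, 0)

# a(b|c)* written with the rules only
example = Cat(Sym("a"), Star(Group(Alt(Sym("b"), Sym("c")))))
```

Here matches(example, "abcb") is true and matches(example, "ba") is false. Notice that Group simply delegates to the expression inside it, which mirrors the fourth rule.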

In addition to these rules we will shortly add some convenience rules for operator precedence. They are not necessary, but they allow us to write shorter and more readable regular expressions. Quite soon you will also see real regular expression examples with these three operators. These six rules are sufficient for writing all possible regular expressions. All other regex operators you’ve seen are just abstractions and syntactic sugar. Or are they?

Do you remember George Bernard Shaw’s “The golden rule is that there are no golden rules.” and Mark Twain’s “It is a good idea to obey all the rules when you’re young just so you’ll have the strength to break them when you’re old.”? That is exactly what we should do. For now, these rules are all we need. However, modern regex engines have features such as backreferences and lookarounds. To implement these, we need more rules. But until we get there, it will be very useful to think of regular expressions as a system consisting of only these six rules.

Regular expressions are thus a mathematical theory, and modern regex engines are based on a superset of that theory. With the help of the theory, we can prove the following:

  • For each regular expression, we can construct at least one deterministic finite automaton and at least one nondeterministic finite automaton, so that all three solve the same problem.
  • For every finite automaton — deterministic as well as nondeterministic — we can write a regular expression, so that both solve the same problem.

Solving a problem here means determining whether a string belongs to a language. The proofs mentioned above are not reproduced here, but they are easy to find, as they appear in every textbook on automata theory. The beauty of this correspondence between regular expressions and finite automata is that I can explain several key features of regular expressions to you with the help of graphs of finite automata. And as a matter of fact, a regex engine is really just a compiler that translates our hand-written regular expressions to computer-friendly finite automata, or possibly the more advanced pushdown automata.
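As a small sanity check of the idea that an automaton and a regular expression can solve the same problem, the Python sketch below compares a hand-built deterministic finite automaton for (w|bb)* with Python’s own re module on every short string over w and b. The state names and helper functions are my own inventions for this illustration. Keep in mind that Python’s engine is a backtracking matcher, so this only shows that both decide the same language, not that they work the same way.

```python
import re
from itertools import product

# Hand-built DFA for (w|bb)*; the state names are illustrative only.
TRANSITIONS = {
    ("even", "w"): "even",
    ("even", "b"): "odd",     # one b is waiting for its partner
    ("odd", "b"): "even",     # the pair bb is complete
}
START, ACCEPTING = "even", {"even"}

def dfa_accepts(text):
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:     # no path for this symbol: reject
            return False
    return state in ACCEPTING

PATTERN = re.compile(r"(w|bb)*")

def regex_accepts(text):
    # fullmatch demands that the whole string is matched, like the automaton.
    return PATTERN.fullmatch(text) is not None

def agree_up_to(length):
    """Check both deciders on every string over {w, b} up to the given length."""
    for n in range(length + 1):
        for letters in product("wb", repeat=n):
            text = "".join(letters)
            if dfa_accepts(text) != regex_accepts(text):
                return False
    return True
```

Calling agree_up_to(8) compares the two on all 511 strings of length zero to eight. That is of course a test and not a proof, but it gives a feeling for what “solve the same problem” means.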

Regex memorizing — here’s the pushdown automaton

Pushdown automaton that matches all strings with the same number of white and blue dots.

Formal regular expressions can be described by a finite automaton, but modern regex engines support non-regular operators. The problem with finite automata is that they don’t have any memory. Once they are in a state, they have no idea how they got there. Wise guys invented the stack. When we add a stack to a finite automaton it becomes very powerful, but also quite complex. And of course, it’s not a finite automaton anymore – it’s a pushdown automaton. Why, then, do I describe regex engines as finite automata when pushdown automata are a superset of finite automata?

The finite automaton reads a tape serially. Every new symbol read from the tape initiates a transition in the automaton from the current state to a new state (the new state may be the same state as the current state). If we are in an accept state when we have read the whole tape, then we have a match. In pushdown automata, we add a stack. In every step, the pushdown automaton may read a symbol from the tape, pop a symbol from the stack, or both. Depending on the current state, the symbol read from the tape and/or the symbol popped from the stack, the pushdown automaton can now initiate a state transition, push a symbol to the stack, or both. If we end up in an accept state when the whole tape has been read and the stack is empty at that time, then we have a match.

The figure above shows a pushdown automaton that matches any input with the same number of blue and white dots – something that is impossible to describe with a finite automaton. This automaton has two states: a start state and an accept state. The four icons in the middle represent a dot being pushed to or popped from the stack. Suppose that the input is blue-white-white-blue:

  1. Read blue dot and make a transition from the start state, through the third stack icon from the top – i.e. a blue dot is pushed to the stack – and back to the start state.
  2. Read white dot and make a transition from the start state, through the first stack icon – i.e. a blue dot is popped from the stack – and back to the start state. Now the stack is empty.
  3. Read white dot and make a transition from the start state, through the second stack icon from the top – i.e. a white dot is pushed to the stack – and back to the start state.
  4. Read blue dot and make a transition from the start state, through the fourth stack icon – i.e. a white dot is popped from the stack – and to the accept state. Now the stack is empty again and all the input is read. It’s a match.

You can probably feel that the pushdown automaton gives us a tool for all kinds of recursive implementations. Now that you understand how, I’ll go back to finite automata in all explanations where possible. It’s important for you to understand, for example, what backtracking and greediness do to performance and which input will be matched. You don’t need a complicated pushdown automaton to learn that. We’ll stick to finite automata in this book and at the same time remember that many operators in modern regex engines rely on a stack.

By the way: Did you notice the corner case bug in the pushdown automaton above? Doesn’t an empty input string have the same number of blue and white dots?
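Both the walkthrough and the corner case can be tried out with a rough Python model of the automaton. Since the figure is the only specification, several things here are my assumptions: that pushing is always allowed, that a pop is possible only when the top of the stack has the opposite colour, that a pop may lead either back to the start state or on to the accept state, and that no path leaves the accept state. The names START, ACCEPT and accepts are invented for this sketch.

```python
START, ACCEPT = "start", "accept"
OPPOSITE = {"blue": "white", "white": "blue"}

def accepts(dots):
    """Simulate the two-state pushdown automaton on a sequence of dots.

    The automaton is nondeterministic, so we track every reachable
    configuration (state, stack) after each dot that is read.
    """
    configs = {(START, ())}
    for dot in dots:
        following = set()
        for state, stack in configs:
            if state != START:
                continue      # assumption: no paths out of the accept state
            # Push the dot just read and return to the start state.
            following.add((START, stack + (dot,)))
            # Pop a dot of the opposite colour, if one is on top.
            if stack and stack[-1] == OPPOSITE[dot]:
                popped = stack[:-1]
                following.add((START, popped))
                following.add((ACCEPT, popped))
        configs = following
    # Match: accept state, all input read, and an empty stack.
    return (ACCEPT, ()) in configs

walkthrough = ["blue", "white", "white", "blue"]
```

In this model accepts(walkthrough) is true, just as in the four steps above, while accepts([]) is false: with no dot to read, the automaton never leaves the start state. The number of tracked configurations can grow quickly with the input length, so treat this as a toy for short inputs only.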


