Formal regular expressions can be described by a finite automaton, but modern regex engines support un-regular operators. The problem with finite automata is that they don’t have any memory. Once they are in a state, they have no idea, how they got there. Wise guys invented the stack. When we add a stack to a finite automaton it becomes very powerful, but also quite complex. And of course, it’s not a finite automaton anymore – it’s a pushdown automaton. Why do I then describe the regex engines as finite automata when pushdown automata is a superset of finite automata?
The finite automaton reads a tape serially. Every new symbol read from the tape, initiates a transition in the automaton from the current state to a new state (the new state may be the same state as the current state). If we are in an accept state when we have read all the tape, then we have a match. In pushdown automata, we add a stack. For every iteration, the pushdown automaton may read a symbol from the tape, pop a symbol from the stack, or both. Depending on the current state, the symbol read from the tape and/or the symbol popped from the stack, the pushdown automaton can now initiate a state transition, push a symbol to the stack, or both. If we end up in a accept state when all the tape is read and the stack is at that time empty then we have a match.
The figure above shows a pushdown automaton that matches any input with the same number of blue and white dots – something that is impossible to describe with a finite automaton. This automaton has two states: a start state and an accept state. The four icons in the middle represents that a dot is popped or pushed from the stack. Suppose that the input is blue-white-white-blue:
- Read blue dot and make a transition from the start state, through the third stack icon from the top – i.e. a blue dot is pushed to the stack – and back to the start state.
- Read white dot and make a transition from the start state, through the first stack icon – i.e. a blue dot is popped from the stack – and back to the start state. Now the stack is empty.
- Read white dot and make a transition from the start state, through the second stack icon from the top – i.e. a white dot is pushed to the stack – and back to the start state.
- Read blue dot and make a transition from the start state, through the fourth stack icon – i.e. a white dot is popped from the stack – and to the accept state. Now the stack is empty again and all the input is read. It’s a match.
You can probably feel that the pushdown automaton gives us a tool for all kind of recursive implementations. Now that you understand how, I’ll go back to finite automata in all explanations where possible. It’s important for you to understand e.g. what backtracking and greediness does to performance and what input that will be matched. You don’t need a complicated pushdown automaton to learn that. We’ll stick to finite automaton in this book and at the same time remember that many operators in modern regex engines rely on a stack.
By the way: Did you notice the corner case bug in the pushdown automaton above? Doesn’t an empty input string have the same number of blue and white dots?