Archive Page 2

Four more rules for Regular Expressions

Now that we have two basic base rules, we would like to add four more. Then we can build regular expressions recursively from small regular expressions. The first three of these rules describe the only three necessary operators in regular expressions. The fourth rule deals with parentheses:

  1. Alternation: If p and q are regular expressions, then will also p|q be a regular expression. The expression p|q matches the union of the strings matched by p and q. Think of it as: either p or q.
  2. Concatenation: If p and q are two regular expressions, then will also pq be a regular expression. Note that the symbol for concatenation is invisible. Some literature use + for concatenation, as in:p+q. The expression pq denotes a language consisting of all strings with a prefix matched by p, directly followed by a suffix matched by q.
  3. Closure: If p is a regular expression, then will also p* be a regular expression. This is the closure of concatenation with the expression itself. The expression p* match all strings that Completely can be divided into zero or more substrings, each of which is matched by p.
  4. Parentheses: If p is a regular expression, then will also (p) be a regular expression. This means that we can enclose an expression in parentheses without altering its meaning.

In addition to these rules we will shortly put some convenience rules for operator precedence. They are not necessary, but allow us to write shorter and more readable regular expressions. Quite soon you will also see real regular expression examples with these three operators. These six rules are sufficient for writing all possible regular expressions. All other regex operators you’ve seen are just abstractions and syntactic sugar. Or?

Do you remember George Bernard Shaw’s “The golden rule is that there are no golden rules.” and Mark Twain’s “It is a good idea to obey all the rules when you’re young just so you’ll have the strength to break them when you’re old.”? This is exactly how we should do. For now, these rules are all we need. However, in modern regex engines, there are functions like a back reference, and lookarounds. To implement these, we need more rules. But until we’re there, it will be very useful to think of regular expressions as a system consisting of only these six rules.

Regular expression is thus a mathematical theory and modern regex engines are based on a super set of that theory. With the help of the theory, we can prove the following:

  • For each regular expression, we can construct at least one deterministic finite automaton and at least one nondeterministic finite automaton, so that all three solves the same problem.
  • For every finite automaton — deterministic as well as nondeterministic — we can write a regular expression, so that both solve the same problem.

Solving a problem here means to determine whether a string is part of a language. The proofs mentioned above are not reproduced here, but they are easily accessed as they exist in every textbook on automata theory. The beauty of this analogy between regular expressions and finite automata is that I can explain several key features of regular expressions for you, with the help of graphs of finite automata. And as a matter of fact, a regex engine is really just a compiler that translates our hand-written regular expressions to computer friendly finite automata, or possibly the more advanced pushdown automata.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Alphabet, string, language…

An alphabet is a finite, nonempty set of symbols. Here are some examples of alphabets:

  • The binary alphabet {0, 1} contains only two symbols, namely 1 and 0.
  • The American Standard Code for Information Interchange, abbreviated ASCII, contains 128 symbols. Only 95 of these are printable. The others, such as Backspace (ASCII 8), are also symbols in our definition.
  • The set of the 95 printable symbols in ASCII is also an alphabet.
  • Unicode contains over 100,000 symbols – Arabic, Chinese, Latin and many other types. But even if it’s a very large number, it is still a finite number of symbols. Unicode is therefore an alphabet.
  • The set of the Cyrillic symbols in Unicode is an alphabet.
  • We can construct an alphabet consisting of a blue and a white dot. That is an alphabet consisting of two symbols, just like the binary alphabet.

Did you by any chance know that the word alphabet comes from alpha and beta, the first two letters of the 3000 year old Phoenician alphabet, and that alpha meant ox and beta meant house? However, the order of symbols does not make an alphabet unique. Thus, {a, b} is the same alphabet as {b, a}. The empty set — the set composed of zero symbols — is by definition not an alphabet.

A string is a finite (ordered) sequence of symbols chosen from an alphabet. Here are some examples of strings:

  • 0110 and 1001 are two strings from the binary alphabet.
  • The strings kadikoy are uskudar are constructed with symbols from the ASCII alphabet.
  • футбол is a string from the Cyrillic alphabet.
  • اللوز is a string from the Arabic alphabet.
  • From any alphabet, it’s possible to create an empty string, i.e. a string consisting of zero symbols. By convention, we denote that string with the character ε.

The very same symbol can occur multiple times in a string. The order is significant; gof is not the same string as fog.

A language is a countable set of strings from a fixed alphabet. The restriction is that the alphabet is finite. The language may well include an infinite number of strings. Here are some examples of languages:

  • All binary numbers ending with 0. It includes 10, 100, 110, etc.
  • All palindromes — strings that are the same forwards and backwards – constructed from the symbols of ASCII.
  • All strings with an even number of symbols from Unicode.
  • The finite set {dog, cat, bird}.
  • The empty set — the language with no strings. This alphabet is by convention denoted Ø.
  • The language {ε} consisting solely of the empty string ε.

Our focus here is on regular expressions. A regular expression is an algebraic description of an entire language. All languages can’t be described by regular expressions. No regular expression can for example describe the language consisting of all palindromes created with symbols from ASCII. I will try to convince you that regular expressions are ridiculously simple and amazingly powerful. Regular expressions are based on two basic rules:

  1. The empty string ε is a regular expression. It describes a language that has one single string, namely the empty string.
  2. If a symbol a is a member of an alphabet, then a is a regular expression. It describes a language that has the string a as its only member.

That was easy, wasn’t it? In addition to the two basic rules, we have three operators that we’ll catch in four more rules. Out of pure laziness, we have conventions for precedence between the three operators, but more on that in an upcoming blog post. First, you should try the two basic rules in IRB. We write regular expressions between two dashes. You can e.g. write the regulars expression a as /a/. IRB can help us to figure out if a string is in a language described by a particular regular expression:

'a'[/a/] #=> "a" – the string a is matched by the regular expression /a/
''[/a/] #=> nil – the empty string is not matched by the regular expression /a/
'b'[/a/] #=> nil – the string b is not matched by the regular expression /a/
'a'[//] #=> "" – the string a is not matched by the empty regular expression / /

Only when IRB returns the entire string, it’s matched by the regular expression.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

The forward tracking quantifier modifier

Because of backtracking, we might sometimes match more than we hoped for. The task below is to catch all the div tags in an HTML document and put them in a vector. Our naïve solution gives the wrong answer:

  • '<div>a</div><span>c</span><div>b</div>'.scan /<div>.*<\/div>/ #=> ["<div>a</div><span>c</span><div>b</div>"]

Quantifiers are greedy in regex. They devour as much as they can. The dot in the idiom .* matches anything except newline (remember Barbapapa?). And the asterisk means that this anything is repeated as many times as possible. Since the string ends with a “</div>” the entire substring “a</div><span>c</span><div>b” will be consumed by .*.

There are many stories about greedy people who claim more than they need. As Louis Blanc wrote already 1840 in The Organization of Work: "From each according to his abilities, to each according to his needs." We don't think that the asterisk and the dot in the above example needs to consume more than the substring "<div>a</div>". Alexander Pushkin describes in The Tale of the Fisherman and the Fish how a magic fish promises to fulfill any of the fisherman's whishes. The fisherman's wife starts asking the fish for bigger and better things -- and she gets them -- until she eventually wants to become Ruler of the Sea. Then the magic fish takes back everything he gave the fisherman's wife.

As a complement to backtracking, regex also provides forward tracking. The quantifier tries to take as little as it can and then holds to see if the whole expression can be matched. If the entire expression can't be matched, then the quantifier will capture one more letter and holds to see if it's now possible to match everything. The method is repeated until either we can match everything or we can confirm that there's no possible way to create a match. Does it matter if we use forward tracking instead of backtracking? Well, the strings that can be matched with backtracking can also be matched with forward tracking and vice versa. But, when there are multiple possible matches, forward tracking will sometimes choose a different match than backtracking does.

There is no special symbol for forward tracking quantifiers. Instead, there's modifier symbol -- written as a question mark -- that can be used right after any quantifier. While * means repeat as many times as possible, *? means repeat as few times as possible. Similarly, we can modify all the other quantifiers:

  • At least one, as few as possible: +?
  • Zero or one, preferably zero: ??
  • Between 3 and 5, as few as possible: {3.5}?

Note that the question mark that modifies quantifiers isn't the same question mark that's used as the conditional quantifier. They can even be used in conjunction, as ?? shows above.

Here are some examples. The first one is the solution to the div tag problem:

  • '<div>a</div><span>c</span><div>b</div>'.scan /<div>.*?<\/div>/ #=> ["<div>a</div>", "<div>b</div>"]
  • 'aa'[/a?/] #=> "a"
  • 'aa'[/a??/] #=> ""
  • 'aaaaa'[/a{2,4}/] #=> "aaaa"
  • 'aaaaa'[/a{2,4}?/] #=> "aa"
  • 'aaaaa'[/a{2,}/] #=> "aaaaa"
  • 'aaaaa'[/a{2,}?/] #=> "aa"
  • 'aaaaa'[/a{,4}/] #=> "aaaa"
  • 'aaaaa'[/a{,4}?/] #=> ""

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regex memorizing — here’s the pushdown automaton

Pushdown automaton that matches all strings with the same number of white and blue dots.

Pushdown automaton that matches all strings with
the same number of white and blue dots.

Formal regular expressions can be described by a finite automaton, but modern regex engines support un-regular operators. The problem with finite automata is that they don’t have any memory. Once they are in a state, they have no idea, how they got there. Wise guys invented the stack. When we add a stack to a finite automaton it becomes very powerful, but also quite complex. And of course, it’s not a finite automaton anymore – it’s a pushdown automaton. Why do I then describe the regex engines as finite automata when pushdown automata is a superset of finite automata?

The finite automaton reads a tape serially. Every new symbol read from the tape, initiates a transition in the automaton from the current state to a new state (the new state may be the same state as the current state). If we are in an accept state when we have read all the tape, then we have a match. In pushdown automata, we add a stack. For every iteration, the pushdown automaton may read a symbol from the tape, pop a symbol from the stack, or both. Depending on the current state, the symbol read from the tape and/or the symbol popped from the stack, the pushdown automaton can now initiate a state transition, push a symbol to the stack, or both. If we end up in a accept state when all the tape is read and the stack is at that time empty then we have a match.

The figure above shows a pushdown automaton that matches any input with the same number of blue and white dots – something that is impossible to describe with a finite automaton. This automaton has two states: a start state and an accept state. The four icons in the middle represents that a dot is popped or pushed from the stack. Suppose that the input is blue-white-white-blue:

  1. Read blue dot and make a transition from the start state, through the third stack icon from the top – i.e. a blue dot is pushed to the stack – and back to the start state.
  2. Read white dot and make a transition from the start state, through the first stack icon – i.e. a blue dot is popped from the stack – and back to the start state. Now the stack is empty.
  3. Read white dot and make a transition from the start state, through the second stack icon from the top – i.e. a white dot is pushed to the stack – and back to the start state.
  4. Read blue dot and make a transition from the start state, through the fourth stack icon – i.e. a white dot is popped from the stack – and to the accept state. Now the stack is empty again and all the input is read. It’s a match.

You can probably feel that the pushdown automaton gives us a tool for all kind of recursive implementations. Now that you understand how, I’ll go back to finite automata in all explanations where possible. It’s important for you to understand e.g. what backtracking and greediness does to performance and what input that will be matched. You don’t need a complicated pushdown automaton to learn that. We’ll stick to finite automaton in this book and at the same time remember that many operators in modern regex engines rely on a stack.

By the way: Did you notice the corner case bug in the pushdown automaton above? Doesn’t an empty input string have the same number of blue and white dots?

Dynamic typing inside the Java Virtual Machine

Take a look at the following Java program. Albeit it’s dynamic typing style, it’s accepted by the Java compiler javac:

public class Dynamic {
  public static void main(String[] args) {
	Object o;
	o = System.out;
	o = System.in;
	o = 1;
	o = "Hello World";
  }
}

The method body part of the compiled JVM code looks something like this:

   0:   getstatic       #2; //Field java/lang/System.out:Ljava/io/PrintStream;
   3:   astore_1
   4:   getstatic       #3; //Field java/lang/System.in:Ljava/io/InputStream;
   7:   astore_1
   8:   iconst_1
   9:   invokestatic    #4; //Method java/lang/Integer.valueOf:(I)Ljava/lang/Integer;
   12:  astore_1
   13:  ldc     #5; //String Hello World
   15:  astore_1

At line 0, 4, 8, 9 and 13 something is pushed on the operand stack. Line 3, 7, 12 and 15 pop the top record from the operand stack and assign it to local variable #1. Line 8-9 is called boxing. Since o is a java.lang.Object and Java doesn’t allow java.lang.Object to take a primitive value, like the int 1, this value is boxed in a java.lang.Integer object at line 9. The Java compiler javac created line 9 because Java is based on static typing.

Dynamic typing means that values have types but variables don’t. In static typing, variables have types and are restricted to not refer to any random type of value. The local variable o has type java.lang.Object and can only refer to values of type java.lang.Object or its subtypes.

Unlike Java, local JVM variables are dynamically typed — NOT statically typed. We can safely remove line 9 (the conversion from int to java.lang.Integer) from the JVM code and it will still be perfectly valid JVM code. Local variable #1 will then in turn be assigned a PrintStream, an InputStream, an int and a String. This is possible because local variable #1 – like all local variables in the JVM – has no type.

Complicated logic calls for regex

When is regex the proper tool? One use case is when you try to verify a text, and the verification rules have complicated logic. I found the following Java method in real production code. It verifies that a time zone designator adheres to the ISO 8601 standard:

public boolean isTimeDesignator(String designator) {
  if (designator.equals("Z")) {
    return true;
  } else if (designator.startsWith(" ") && designator.trim().equals("UTC")) {
    return true;
  } else if (designator.startsWith("+") || designator.startsWith("-")) {
    try {
      int hour = Integer.parseInt(designator.substring(1,3));
      int minute;
      if (3 == designator.length()) {
        minute = 0;
      } else if (5 == designator.length()) {
        minute = Integer.parseInt(designator.substring(3,5));
      } else if (6 == designator.length() && ':' == designator.charAt(3)) {
        minute = Integer.parseInt(designator.substring(4,6));
      } else {
        return false;
      }
      return 0 <= hour && 24 >= hour && 0 <= minute && 59 >= minute;
    } catch (NumberFormatException e) {
      return false;
    }
  }
    
  return false;
}

Only counting the if and else, there are 7 ways the control flow can traverse this code. Adding the NumberFormatException there’s another 3. These ten routes need to be covered by tests.

In ISO 8601, global time is always represented as an offset to Coordinated Universal Time, abbreviated UTC. UTC is informally equivalent to the Greenwich Mean Time, also known as GMT. The designator is an offset in hours and minutes to UTC. Hours are mandatory, but minutes and a colon in-between minutes and hours are optional. Offset zero can be represented as a space and then ‘UCT’ or as ‘Z’. Complicated?

In regex — still in Java — it’s rather crisp to verify:

public boolean isTimeDesignator(String designator) {
  return designator.matches("Z| +UTC|[+-]([01]\\d|2[0-4])(:?[0-5]\\d)?");
}

The latter solution may seem cryptic, for people without experience in regex. But if you think about it, the first solution is cryptic for anyone not proficient in imperative coding. And if you ask me, the first solution is hard to follow for anybody. To understand a regex, you need to understand regex. That’s a tautology. With regex in your repertoire, some problem classes can be solved much more smoothly and the solutions are easier to maintain.

Regex syntax and semantics varies

Regex engines do differ in syntax and semantics. This is one reason why you can’t just find an expression with Google and use it in your code – without fully understand it.

Try for example the following in JIRB, the interactive JRuby tool:

irb(main):001:0> require 'java'
=> true
irb(main):002:0> java.util.regex.Pattern.compile("a$").matcher("a\nb").find
=> false
irb(main):003:0> "a\nb"[/a$/]
=> "a"

What happened?

The regex /a$/ matches the letter a just before something checked by a dollar sign assertion. The assert criteria is, by default, not the same in Ruby and Java. In Java, by default, the dollar sign matches at the end of the whole text and before any final line breaks. In Ruby, the dollar sign match at the end of every single line.

Here’s what the three lines of code above means:

  1. First, we need to include the Java libraries by writing require 'java'. This might not be necessary, depending on your setup.
  2. We compile a Java regex and test if part of the string of ‘a’ and ‘b’ with newline in-between can be matched. It can’t.
  3. We compile the same regex in Ruby and test if part of the string of ‘a’ and ‘b’ with newline in-between can be matched. It can.

This is just one of many examples. If you are going to use regexes in your program, you need to understand them. It’s as simple as that.

Pomodoro Technique Illustrated -- New book from The Pragmatic Programmers, LLC

Regex slides

Slides from my regex session at Turku Agile Day are now available below and at Slideshare:

If you can’t see the whole picture due to Slideshare’s embed limits, click ‘zoom’ -> ‘zoom to page’ in the Slideshare tool bar above.

My solution to Citerus/JFokus programming contest

It’s JFokus – the number one annual Java echo system event in Sweden – right now. I’ll present TDD with Regex in a session tomorrow Wednesday. However, Uppsala based consultancy company Citerus kicked off a programming contest during this conference.

The problem: You get one source string and a lot of short acronym strings. Delete acronyms from the source string until you have the shortest possible string. Use whatever programming language you like as long as it is executable on the JVM.

The example says delete the following acronyms:

  • TDD, DDD, DI, DO, OO, UI, ANT, CV, IOC, LOC, SU, VO

…from the source string:

  • VOCDIITEIOCRUDOIANTOCSLOIOCVESTAIOCVOLIOCENTSU

The first idea you probably get is to delete them in a double loop like this:

import java.util.*;
public class Stripper {
  final String [] acronyms;
  public Stripper(final String... acronyms) {
    this.acronyms = acronyms;
  }
  public String strip(final String source) {
    String result = source;
    for (int i = 0; i < source.length(); i++) {
      final String sufix = source.substring(i);
      for (final String acronym : acronyms) {
        if (sufix.startsWith(acronym)) {
          result = shortest(result, strip(source.substring(0, i)
              + sufix.substring(acronym.length())));
        }
      }
    }
    return result;
  }
  private static String shortest(final String s1, final String s2) {
    return s1.length() < s2.length() ? s1 : s2;
  }
  public static void main(String args[]) {
    final long start = System.currentTimeMillis();
    System.out.printf("%s (%d ms)\n", new Stripper("TDD", "DDD", "DI",
        "DO", "OO", "UI", "ANT", "CV", "IOC", "LOC", "SU",
        "VO").strip("VOCDIITEIOCRUDOIANTOCSLOIOCVESTAIOCVOLIOCENTSU"),
            System.currentTimeMillis() - start);
  }
}

Unfortunately, it’s slow. It took my Linux box 20 minutes to run the example. But, you may have noticed that the recursive call invokes the same query over and over again. Let’s add a cache:

import java.util.*;

public class Stripper {
  final private Map<String, String> stripCache = new HashMap<String, String>();
final String [] acronyms;

public Stripper(final String... acronyms) {
  this.acronyms = acronyms;
}
public String strip(final String source) {
  if (stripCache.containsKey(source)) return stripCache.get(source);
  String result = source;
  for (int i = 0; i < source.length(); i++) {
    final String sufix = source.substring(i);
    for (final String acronym : acronyms) {
      if (sufix.startsWith(acronym)) {
        result = shortest(result, strip(source.substring(0, i)
            + sufix.substring(acronym.length())));
      }
    }
  }
  stripCache.put(source, result);
  return result;
}
private static String shortest(final String s1, final String s2) {
  return s1.length() < s2.length() ? s1 : s2;
}
public static void main(String args[]) {
  final long start = System.currentTimeMillis();
  System.out.printf("%s (%d ms)\n", new Stripper("TDD", "DDD", "DI",
      "DO", "OO", "UI", "ANT", "CV", "IOC", "LOC", "SU", 
      "VO").strip("VOCDIITEIOCRUDOIANTOCSLOIOCVESTAIOCVOLIOCENTSU"),
          System.currentTimeMillis() - start);
  }
}

…and suddenly the example is solved in less than 50 ms. An improvement by 30.000 times!

But, it didn’t stop there. If you swap the loops, you can jump over some hopeless cases:

import java.util.*;
public class Stripper {
  final private Map<String, String> stripCache = new HashMap<String, String>();
  final String [] acronyms;
  public Stripper(final String... acronyms) {
    this.acronyms = acronyms;
  }
  public String strip(final String source) {
    if (stripCache.containsKey(source)) return stripCache.get(source);
    String result = source;
    for (final String acronym : acronyms) {
      for (int i = source.indexOf(acronym); 0 <= i && source.length() > i;
            i = source.indexOf(acronym, i + 1)) {
        result = shortest(result, strip(source.substring(0, i) 
              + source.substring(i + acronym.length())));
      }
    }
    stripCache.put(source, result);
    return result;
  }
  private static String shortest(final String s1, final String s2) {
    return s1.length() < s2.length() ? s1 : s2;
  }
  public static void main(String args[]) {
    final long start = System.currentTimeMillis();
  System.out.printf("%s (%d ms)\n", new Stripper("TDD", "DDD", "DI",
      "DO", "OO", "UI", "ANT", "CV", "IOC", "LOC", "SU", 
      "VO").strip("VOCDIITEIOCRUDOIANTOCSLOIOCVESTAIOCVOLIOCENTSU"),
          System.currentTimeMillis() - start);
  }
}

And that made it twice as fast. Now it takes less than 25 ms to solve it.

Three Regex presentations in Stockholm upcoming month

I have three Regex presentations in Stockholm during the upcoming month. Information below is in Swedish.

Tre tillfällen att förstå Regex bättre

Är du rädd för Regex? :-) Eller vill du kanske få en enkel och praktisk beskrivning av den underliggande matematiken? Vill du förstå och kunna använda Regex? Då ska du besöka någon av mina tre sessioner den närmaste månaden:

Fokus är på att förstå och bli bra på – fokus är inte på Regex-syntax.

Obs! Ingen av sessioner förutsätter förkunskaper om Regex.
Pomodorotekniken på svenska

Pomodoro-boken på svenska

Nu finns den svenska boken Pomodorotekniken. Det är alltså en översättning av den amerikanska boken (som även finns på tyska, japanska, kinesiska och koreanska). Du kan t ex köpa den på Adlibris: http://www.adlibris.com/se/product.aspx?isbn=9144070373

« Previous PageNext Page »



Follow

Get every new post delivered to your Inbox.